[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n\n/.vscode\n.DS_Store\n/model\n/logs\n/data\nrsync_exclude.txt\n/scripts/train.sh\n/scripts/send_to_server.sh\n/data_p\n/model_tmp\n/figs"
  },
  {
    "path": ".gitmodules",
    "content": "[submodule \"warp-transducer\"]\n\tpath = warp-transducer\n\turl = https://github.com/noahchalifour/warp-transducer.git\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2019 Noah Chalifour\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# RNN-Transducer Speech Recognition\n\nEnd-to-end speech recognition using RNN-Transducer in Tensorflow 2.0\n\n## Overview\n\nThis speech recognition model is based off Google's [Streaming End-to-end Speech Recognition For Mobile Devices](https://arxiv.org/pdf/1811.06621.pdf) research paper and is implemented in Python 3 using Tensorflow 2.0\n\n## Setup Your Environment\n\nTo setup your environment, run the following command:\n\n```\ngit clone --recurse https://github.com/noahchalifour/rnnt-speech-recognition.git\ncd rnnt-speech-recognition\npip install tensorflow==2.2.0 # or tensorflow-gpu==2.2.0 for GPU support\npip install -r requirements.txt\n./scripts/build_rnnt.sh # to setup the rnnt loss\n```\n\n## Common Voice\n\nYou can find and download the Common Voice dataset [here](https://voice.mozilla.org/en/datasets)\n\n### Convert all MP3s to WAVs\n\nBefore you can train a model on the Common Voice dataset, you must first convert all the audio mp3 filetypes to wavs. Do so by running the following command:\n\n> **_NOTE:_** Make sure you have `ffmpeg` installed on your computer, as it uses that to convert mp3 to wav\n\n```\n./scripts/common_voice_convert.sh <data_dir> <# of threads>\npython scripts/remove_missing_samples.py \\\n    --data_dir <data_dir> \\\n    --replace_old\n```\n\n### Preprocessing dataset\n\nAfter converting all the mp3s to wavs you need to preprocess the dataset, you can do so by running the following command:\n\n```\npython preprocess_common_voice.py \\\n    --data_dir <data_dir> \\\n    --output_dir <preprocessed_dir>\n```\n\n### Training a model\n\n<!-- #### Training on Host -->\n\nTo train a simple model, run the following command:\n\n```\npython run_rnnt.py \\\n    --mode train \\\n    --data_dir <path to data directory>\n```\n\n<!-- #### Training in Docker Container\n\n[View Image](https://hub.docker.com/r/noahchalifour/rnnt-speech-recognition)\n\nYou can also train your model in a docker container based on the Tensorflow docker image.\n\n> **_NOTE:_** Specify all your paramters in ALL CAPS as environment variables when training in a docker container.\n\nTo run the model using a CPU only, run the following command:\n\n```\ndocker run -d --name rnnt-speech-recognition \\\n    -v <path to local data>:/rnnt-speech-recognition/data \\\n    -v <path to save model locally>:/rnnt-speech-recognition/model \\\n    -e MODE=train \\\n    -e DATA_DIR=./data \\\n    -e OUTPUT_DIR=./model \\\n    noahchalifour/rnnt-speech-recognition\n```\n\nTo run the model using a GPU you must run the following command with the added `--cap-add SYS_ADMIN`, and `--gpus <gpus>`:\n\n```\ndocker run -d --name rnnt-speech-recognition \\\n    --cap-add SYS_ADMIN \\\n    --gpus <gpus> \\\n    -v <path to local data>:/rnnt-speech-recognition/data \\\n    -v <path to save model locally>:/rnnt-speech-recognition/model \\\n    -e MODE=train \\\n    -e DATA_DIR=./data \\\n    -e OUTPUT_DIR=./model \\\n    noahchalifour/rnnt-speech-recognition\n``` -->\n"
  },
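The conversion step in the README above targets 16 kHz, mono, 16-bit PCM WAVs (the exact flags `scripts/common_voice_convert.sh` passes to `ffmpeg`). A minimal sketch, assuming only `scipy` from requirements.txt, for spot-checking a converted clips directory before preprocessing; the `<data_dir>` placeholder mirrors the README:

```
import os

from scipy.io import wavfile


def check_clips(clips_dir):
    """Warn about any WAV that is not 16 kHz mono, which the pipeline expects."""
    for fn in os.listdir(clips_dir):
        if not fn.endswith('.wav'):
            continue
        sr, data = wavfile.read(os.path.join(clips_dir, fn))
        if sr != 16000 or data.ndim != 1:
            print('Unexpected format: {} (sr={}, shape={})'.format(fn, sr, data.shape))


check_clips(os.path.join('<data_dir>', 'clips'))  # placeholder path from the README
```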
  {
    "path": "__init__.py",
    "content": ""
  },
  {
    "path": "cmake/warp-rnnt-cmakelist.txt",
    "content": "IF (APPLE)\n    cmake_minimum_required(VERSION 3.4)\nELSE()\n    cmake_minimum_required(VERSION 2.8)\nENDIF()\n\nproject(rnnt_release)\n\nIF (NOT APPLE)\n    set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -O2\")\nENDIF()\n\nIF (APPLE)\n    set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -O2\")\n    add_definitions(-DAPPLE)\nENDIF()\n\ninclude_directories(include)\n\nFIND_PACKAGE(CUDA)\nMESSAGE(STATUS \"cuda found ${CUDA_FOUND}\")\n\noption(USE_NAIVE_KERNEL \"use naive alpha-beta kernel\" OFF)\noption(DEBUG_TIME \"output kernel time\" OFF)\noption(DEBUG_KERNEL \"output alpha beta\" OFF)\nif (USE_NAIVE_KERNEL)\n    add_definitions(-DUSE_NAIVE_KERNEL)\nendif()\nif (DEBUG_TIME)\n    add_definitions(-DDEBUG_TIME)\nendif()\nif (DEBUG_KERNEL)\n    add_definitions(-DDEBUG_KERNEL)\nendif()\n\noption(WITH_GPU \"compile warp-rnnt with cuda.\" ${CUDA_FOUND})\noption(WITH_OMP \"compile warp-rnnt with openmp.\" ON)\n\nif(NOT WITH_OMP)\n    add_definitions(-DRNNT_DISABLE_OMP)\nendif()\nif (WITH_OMP)\n    set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -fopenmp\")\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -Xcompiler -fopenmp\")\nendif()\n\n# need to be at least 30 or __shfl_down in reduce wont compile\nset(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2\")\nset(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_35,code=sm_35\")\n\nset(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_50,code=sm_50\")\nset(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_52,code=sm_52\")\nIF(CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 5)\n  SET(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES\")\nENDIF()\n\nIF (CUDA_VERSION GREATER 7.6)\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_60,code=sm_60\")\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_61,code=sm_61\")\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_62,code=sm_62\")\nENDIF()\n\nIF (CUDA_VERSION GREATER 8.9)\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_70,code=sm_70\")\nENDIF()\n\nIF (CUDA_VERSION GREATER 9.9)\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} -gencode arch=compute_75,code=sm_75\")\nENDIF()\n\nif (NOT APPLE)\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS} --std=c++11\")\n    set(CUDA_NVCC_FLAGS \"${CUDA_NVCC_FLAGS}\")\nENDIF()\n\n\nIF (APPLE)\n    EXEC_PROGRAM(uname ARGS -v  OUTPUT_VARIABLE DARWIN_VERSION)\n    STRING(REGEX MATCH \"[0-9]+\" DARWIN_VERSION ${DARWIN_VERSION})\n    MESSAGE(STATUS \"DARWIN_VERSION=${DARWIN_VERSION}\")\n\n    #for el capitain have to use rpath\n\n    IF (DARWIN_VERSION LESS 15)\n        set(CMAKE_SKIP_RPATH TRUE)\n    ENDIF ()\n\nELSE()\n    #always skip for linux\n    set(CMAKE_SKIP_RPATH TRUE)\nENDIF()\n\n\nIF (WITH_GPU)\n\n    MESSAGE(STATUS \"Building shared library with GPU support\")\n    set(CUDA_curand_LIBRARY \"/usr/local/cuda/lib64/libcurand.so.10\")\n\n    CUDA_ADD_LIBRARY(warprnnt SHARED src/rnnt_entrypoint.cu)\n    IF (!Torch_FOUND)\n        TARGET_LINK_LIBRARIES(warprnnt ${CUDA_curand_LIBRARY})\n    ENDIF()\n\n    cuda_add_executable(test_time_gpu tests/test_time.cu tests/random.cpp )\n    TARGET_LINK_LIBRARIES(test_time_gpu warprnnt ${CUDA_curand_LIBRARY})\n    SET_TARGET_PROPERTIES(test_time_gpu PROPERTIES COMPILE_FLAGS \"${CMAKE_CXX_FLAGS} --std=c++11\")\n\n    cuda_add_executable(test_gpu tests/test_gpu.cu tests/random.cpp )\n    TARGET_LINK_LIBRARIES(test_gpu warprnnt ${CUDA_curand_LIBRARY})\n    SET_TARGET_PROPERTIES(test_gpu 
PROPERTIES COMPILE_FLAGS \"${CMAKE_CXX_FLAGS} --std=c++11\")\n\nELSE()\n    MESSAGE(STATUS \"Building shared library with no GPU support\")\n\n    if (NOT APPLE)\n        set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -O2\")\n    ENDIF()\n\n    ADD_LIBRARY(warprnnt SHARED src/rnnt_entrypoint.cpp)\n\nENDIF()\n\n\nadd_executable(test_cpu tests/test_cpu.cpp tests/random.cpp )\nTARGET_LINK_LIBRARIES(test_cpu warprnnt)\nSET_TARGET_PROPERTIES(test_cpu PROPERTIES COMPILE_FLAGS \"${CMAKE_CXX_FLAGS} --std=c++11\")\n\nadd_executable(test_time tests/test_time.cpp tests/random.cpp )\nTARGET_LINK_LIBRARIES(test_time warprnnt)\nSET_TARGET_PROPERTIES(test_time PROPERTIES COMPILE_FLAGS \"${CMAKE_CXX_FLAGS} --std=c++11\")\n\nINSTALL(TARGETS warprnnt\n        RUNTIME DESTINATION \"bin\"\n        LIBRARY DESTINATION \"lib\"\n        ARCHIVE DESTINATION \"lib\")\n\nINSTALL(FILES include/rnnt.h DESTINATION \"include\")\n"
  },
  {
    "path": "debug/debug_dataset.py",
    "content": "from argparse import ArgumentParser\nimport os\nimport json\nimport sys\nimport tensorflow as tf\n\nFILE_DIR = os.path.dirname(os.path.realpath(__file__))\nsys.path.append(os.path.join(FILE_DIR, '..'))\n\nfrom utils import preprocessing\n\n\ndef check_for_invalid_values(inp, labels):\n\n    tf.debugging.check_numerics(inp['mel_specs'],\n        message='mel_specs has invalid value.')\n\n    return inp, labels\n\n\ndef check_empty(inp, labels):\n\n    tf.debugging.assert_none_equal(\n        tf.size(inp['mel_specs']), 0,\n        message='mel_specs is empty tensor.')\n\n    tf.debugging.assert_none_equal(\n        tf.size(inp['pred_inp']), 0,\n        message='pred_inp is empty tensor.')\n\n    tf.debugging.assert_none_equal(\n        tf.size(inp['spec_lengths']), 0,\n        message='spec_lengths is empty tensor.')\n\n    tf.debugging.assert_none_equal(\n        tf.size(inp['label_lengths']), 0,\n        message='label_lengths is empty tensor.')\n\n    tf.debugging.assert_none_equal(\n        tf.size(labels), 0,\n        message='labels is empty tensor.')\n\n    return inp, labels\n\n\ndef get_dataset(data_dir, \n                name, \n                batch_size,\n                n_epochs):\n\n    dataset = preprocessing.load_dataset(data_dir, name)\n    dataset = dataset.padded_batch(\n        batch_size, padded_shapes=({\n            'mel_specs': [-1, -1], \n            'pred_inp': [-1],\n            'spec_lengths': [],\n            'label_lengths': []\n        }, [-1]))\n\n    dataset = dataset.repeat(n_epochs)\n\n    with open(os.path.join(data_dir, '{}-specs.json'.format(name)), 'r') as f:\n        dataset_specs = json.load(f)\n\n    return dataset, dataset_specs\n\n\ndef main(args):\n\n    dataset, dataset_specs = get_dataset(\n        args.data_dir, args.split,\n        batch_size=1, n_epochs=1)\n\n    dataset.map(check_for_invalid_values)\n    dataset.map(check_empty)\n\n    for _ in dataset:\n        pass\n    \n    print('All checks passed.')\n\n\ndef parse_args():\n\n    ap = ArgumentParser()\n\n    ap.add_argument('-d', '--data_dir', type=str, required=True,\n        help='Path to preprocessed dataset.')\n    ap.add_argument('-s', '--split', type=str, default='train',\n        help='Name of dataset split to inspect.')\n\n    return ap.parse_args()\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n    main(args)"
  },
  {
    "path": "debug/get_common_voice_stats.py",
    "content": "from argparse import ArgumentParser\nfrom scipy.io.wavfile import read as read_wav\nimport glob\nimport os\n\n\ndef main(args):\n\n    max_length = 0\n    min_length = 0\n    total_length = 0\n    count = 0\n\n    with open(os.path.join(args.data_dir, args.split + '.tsv'), 'r') as f:\n        next(f)\n        for line in f:\n\n            line_split = line.split('\\t')\n            audio_fn = line_split[1]\n\n            filepath = os.path.join(args.data_dir, 'clips', audio_fn[:-4] + '.wav')\n\n            sr, data = read_wav(filepath)\n\n            length = len(data) / sr\n\n            if length > max_length:\n                max_length = length\n            if length < min_length or min_length == 0:\n                min_length = length\n\n            total_length += length\n            count += 1\n\n    avg_length = total_length / count\n\n    print('Total: {:.4f} s'.format(total_length))\n    print('Min length: {:.4f} s'.format(min_length))\n    print('Max length: {:.4f} s'.format(max_length))\n    print('Average length: {:.4f} s'.format(avg_length))\n\n\ndef parse_args():\n\n    ap = ArgumentParser()\n\n    ap.add_argument('-d', '--data_dir', required=True, type=str,\n        help='Directory of common voice dataset.')\n    ap.add_argument('-s', '--split', type=str, default='train',\n        help='Split to get statistics for.')\n\n    return ap.parse_args()\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n    main(args)"
  },
  {
    "path": "hparams.py",
    "content": "from tensorboard.plugins.hparams import api as hp\n\nHP_TOKEN_TYPE = hp.HParam('token_type', hp.Discrete(['word-piece', 'character']))\nHP_VOCAB_SIZE = hp.HParam('vocab_size', hp.Discrete([2**12]))\n\n# Preprocessing Hparams\nHP_MEL_BINS = hp.HParam('mel_bins', hp.Discrete([80]))\nHP_FRAME_LENGTH = hp.HParam('frame_length', hp.Discrete([0.025]))\nHP_FRAME_STEP = hp.HParam('frame_step', hp.Discrete([0.01]))\nHP_HERTZ_LOW = hp.HParam('hertz_low', hp.Discrete([125.0]))\nHP_HERTZ_HIGH = hp.HParam('hertz_high', hp.Discrete([7600.0]))\nHP_DOWNSAMPLE_FACTOR = hp.HParam('downsample_factor', hp.Discrete([3]))\n\n# Model Hparams\nHP_EMBEDDING_SIZE = hp.HParam('embedding_size', hp.Discrete([500]))\nHP_ENCODER_LAYERS = hp.HParam('encoder_layers', hp.Discrete([8]))\nHP_ENCODER_SIZE = hp.HParam('encoder_size', hp.Discrete([2048]))\nHP_PROJECTION_SIZE = hp.HParam('projection_size', hp.Discrete([640]))\nHP_TIME_REDUCT_INDEX = hp.HParam('time_reduction_index', hp.Discrete([1]))\nHP_TIME_REDUCT_FACTOR = hp.HParam('time_reduction_factor', hp.Discrete([2]))\nHP_PRED_NET_LAYERS = hp.HParam('pred_net_layers', hp.Discrete([2]))\nHP_PRED_NET_SIZE = hp.HParam('pred_net_size', hp.Discrete([2048]))\nHP_JOINT_NET_SIZE = hp.HParam('joint_net_size', hp.Discrete([640]))\nHP_DROPOUT = hp.HParam('dropout', hp.Discrete([0]))\n\n# HP_EMBEDDING_SIZE = hp.HParam('embedding_size', hp.Discrete([32]))\n# HP_ENCODER_LAYERS = hp.HParam('encoder_layers', hp.Discrete([4]))\n# HP_ENCODER_SIZE = hp.HParam('encoder_size', hp.Discrete([20]))\n# HP_PROJECTION_SIZE = hp.HParam('projection_size', hp.Discrete([50]))\n# HP_TIME_REDUCT_INDEX = hp.HParam('time_reduction_index', hp.Discrete([1]))\n# HP_TIME_REDUCT_FACTOR = hp.HParam('time_reduction_factor', hp.Discrete([2]))\n# HP_PRED_NET_LAYERS = hp.HParam('pred_net_layers', hp.Discrete([2]))\n# HP_PRED_NET_SIZE = hp.HParam('pred_net_size', hp.Discrete([100]))\n# HP_JOINT_NET_SIZE = hp.HParam('joint_net_size', hp.Discrete([50]))\n# HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0.2]))\n\nHP_LEARNING_RATE = hp.HParam('learning_rate', hp.Discrete([1e-4]))\n\nMETRIC_TRAIN_LOSS = 'train_loss'\nMETRIC_TRAIN_ACCURACY = 'train_accuracy'\nMETRIC_EVAL_LOSS = 'eval_loss'\nMETRIC_EVAL_ACCURACY = 'eval_accuracy'\nMETRIC_EVAL_CER = 'eval_cer'\nMETRIC_EVAL_WER = 'eval_wer'\nMETRIC_ACCURACY = 'accuracy'\nMETRIC_CER = 'cer'\nMETRIC_WER = 'wer'\n"
  },
  {
    "path": "model.py",
    "content": "import re\nimport os\nimport tensorflow as tf\n\nfrom hparams import *\n\n\nclass TimeReduction(tf.keras.layers.Layer):\n\n    def __init__(self,\n                 reduction_factor,\n                 batch_size=None,\n                 **kwargs):\n\n        super(TimeReduction, self).__init__(**kwargs)\n\n        self.reduction_factor = reduction_factor\n        self.batch_size = batch_size\n\n    def call(self, inputs):\n\n        input_shape = tf.shape(inputs)\n\n        batch_size = self.batch_size\n        if batch_size is None:\n            batch_size = input_shape[0]\n\n        max_time = input_shape[1]\n        num_units = inputs.get_shape().as_list()[-1]\n\n        outputs = inputs\n\n        paddings = [[0, 0], [0, tf.math.floormod(max_time, self.reduction_factor)], [0, 0]]\n        outputs = tf.pad(outputs, paddings)\n\n        return tf.reshape(outputs, (batch_size, -1, num_units * self.reduction_factor))\n\n\ndef encoder(specs_shape,\n            num_layers,\n            d_model,\n            proj_size,\n            reduction_index,\n            reduction_factor,\n            dropout,\n            stateful=False,\n            initializer=None,\n            dtype=tf.float32):\n\n    batch_size = None\n    if stateful:\n        batch_size = 1\n\n    mel_specs = tf.keras.Input(shape=specs_shape, batch_size=batch_size,\n        dtype=tf.float32)\n\n    norm_mel_specs = tf.keras.layers.BatchNormalization()(mel_specs)\n\n    lstm_cell = lambda: tf.compat.v1.nn.rnn_cell.LSTMCell(d_model,\n        num_proj=proj_size, initializer=initializer, dtype=dtype)\n\n    outputs = norm_mel_specs\n\n    for i in range(num_layers):\n\n        rnn_layer = tf.keras.layers.RNN(lstm_cell(),\n            return_sequences=True, stateful=stateful)\n\n        outputs = rnn_layer(outputs)\n        outputs = tf.keras.layers.Dropout(dropout)(outputs)\n        outputs = tf.keras.layers.LayerNormalization(dtype=dtype)(outputs)\n\n        if i == reduction_index:\n            # outputs = tf.keras.layers.Conv1D(proj_size,\n            #     kernel_size=reduction_factor,\n            #     strides=reduction_factor)(outputs)\n            outputs = TimeReduction(reduction_factor,\n                batch_size=batch_size)(outputs)\n\n    return tf.keras.Model(inputs=[mel_specs], outputs=[outputs],\n        name='encoder')\n\n\ndef prediction_network(vocab_size,\n                       embedding_size,\n                       num_layers,\n                       layer_size,\n                       proj_size,\n                       dropout,\n                       stateful=False,\n                       initializer=None,\n                       dtype=tf.float32):\n\n    batch_size = None\n    if stateful:\n        batch_size = 1\n\n    inputs = tf.keras.Input(shape=[None], batch_size=batch_size,\n        dtype=tf.float32)\n\n    embed = tf.keras.layers.Embedding(vocab_size, embedding_size)(inputs)\n\n    rnn_cell = lambda: tf.compat.v1.nn.rnn_cell.LSTMCell(layer_size,\n        num_proj=proj_size, initializer=initializer, dtype=dtype)\n\n    outputs = embed\n\n    for _ in range(num_layers):\n\n        outputs = tf.keras.layers.RNN(rnn_cell(),\n            return_sequences=True)(outputs)\n        outputs = tf.keras.layers.Dropout(dropout)(outputs)\n        outputs = tf.keras.layers.LayerNormalization(dtype=dtype)(outputs)\n\n    return tf.keras.Model(inputs=[inputs], outputs=[outputs],\n        name='prediction_network')\n\n\ndef build_keras_model(hparams,\n                      stateful=False,\n            
          initializer=None,\n                      dtype=tf.float32):\n\n    specs_shape = [None, hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]]\n\n    batch_size = None\n    if stateful:\n        batch_size = 1\n\n    mel_specs = tf.keras.Input(shape=specs_shape, batch_size=batch_size,\n        dtype=tf.float32, name='mel_specs')\n    pred_inp = tf.keras.Input(shape=[None], batch_size=batch_size,\n        dtype=tf.float32, name='pred_inp')\n\n    inp_enc = encoder(\n        specs_shape=specs_shape,\n        num_layers=hparams[HP_ENCODER_LAYERS.name],\n        d_model=hparams[HP_ENCODER_SIZE.name],\n        proj_size=hparams[HP_PROJECTION_SIZE.name],\n        dropout=hparams[HP_DROPOUT.name],\n        reduction_index=hparams[HP_TIME_REDUCT_INDEX.name],\n        reduction_factor=hparams[HP_TIME_REDUCT_FACTOR.name],\n        stateful=stateful,\n        initializer=initializer,\n        dtype=dtype)(mel_specs)\n\n    pred_outputs = prediction_network(\n        vocab_size=hparams[HP_VOCAB_SIZE.name],\n        embedding_size=hparams[HP_EMBEDDING_SIZE.name],\n        num_layers=hparams[HP_PRED_NET_LAYERS.name],\n        layer_size=hparams[HP_PRED_NET_SIZE.name],\n        proj_size=hparams[HP_PROJECTION_SIZE.name],\n        dropout=hparams[HP_DROPOUT.name],\n        stateful=stateful,\n        initializer=initializer,\n        dtype=dtype)(pred_inp)\n\n    joint_inp = (\n        tf.expand_dims(inp_enc, axis=2) +                 # [B, T, V] => [B, T, 1, V]\n        tf.expand_dims(pred_outputs, axis=1))             # [B, U, V] => [B, 1, U, V]\n\n    joint_outputs = tf.keras.layers.Dense(hparams[HP_JOINT_NET_SIZE.name],\n        kernel_initializer=initializer, activation='tanh')(joint_inp)\n\n    outputs = tf.keras.layers.Dense(hparams[HP_VOCAB_SIZE.name],\n        kernel_initializer=initializer)(joint_outputs)\n\n    return tf.keras.Model(inputs=[mel_specs, pred_inp],\n        outputs=[outputs], name='transducer')\n"
  },
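A quick sketch of what `TimeReduction` above does to a batch: the time axis is padded to a multiple of `reduction_factor`, then adjacent frames are stacked along the feature axis, trading time resolution for width. The shapes are illustrative:

```
import tensorflow as tf

from model import TimeReduction

x = tf.random.normal([2, 7, 4])          # [batch, time, features]
y = TimeReduction(reduction_factor=2)(x)

# Time is padded 7 -> 8, then pairs of frames are concatenated:
# [2, 7, 4] -> [2, 4, 8]
print(y.shape)
```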
  {
    "path": "preprocess_common_voice.py",
    "content": "from absl import app, logging, flags\nimport os\nimport json\nimport tensorflow as tf\n\nfrom utils import preprocessing, encoding\nfrom utils.data import common_voice\nfrom hparams import *\n\n\nFLAGS = flags.FLAGS\n\nflags.DEFINE_string(\n    'data_dir', None,\n    'Directory to read Common Voice data from.')\nflags.DEFINE_string(\n    'output_dir', './data',\n    'Directory to save preprocessed data.')\nflags.DEFINE_integer(\n    'max_length', 0,\n    'Max audio length in seconds.')\n\n\ndef write_dataset(dataset, name):\n\n    filepath = os.path.join(FLAGS.output_dir,\n        '{}.tfrecord'.format(name))\n\n    writer = tf.data.experimental.TFRecordWriter(filepath)\n    writer.write(dataset)\n\n    logging.info('Wrote {} dataset to {}'.format(\n        name, filepath))\n\n\ndef main(_):\n\n    hparams = {\n\n        HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],\n        HP_VOCAB_SIZE: HP_VOCAB_SIZE.domain.values[0],\n\n        # Preprocessing\n        HP_MEL_BINS: HP_MEL_BINS.domain.values[0],\n        HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],\n        HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],\n        HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],\n        HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],\n        HP_DOWNSAMPLE_FACTOR: HP_DOWNSAMPLE_FACTOR.domain.values[0]\n\n    }\n\n    _hparams = {k.name: v for k, v in hparams.items()}\n\n    texts_gen = common_voice.texts_generator(FLAGS.data_dir)\n\n    encoder_fn, decoder_fn, vocab_size = encoding.get_encoder(\n        encoder_dir=FLAGS.output_dir,\n        hparams=_hparams,\n        texts_generator=texts_gen)\n    _hparams[HP_VOCAB_SIZE.name] = vocab_size\n\n    train_dataset = common_voice.load_dataset(\n        FLAGS.data_dir, 'train')\n    dev_dataset = common_voice.load_dataset(\n        FLAGS.data_dir, 'dev')\n    test_dataset = common_voice.load_dataset(\n        FLAGS.data_dir, 'test')\n\n    train_dataset = preprocessing.preprocess_dataset(\n        train_dataset,\n        encoder_fn=encoder_fn,\n        hparams=_hparams,\n        max_length=FLAGS.max_length,\n        save_plots=True)\n    write_dataset(train_dataset, 'train')\n\n    dev_dataset = preprocessing.preprocess_dataset(\n        dev_dataset,\n        encoder_fn=encoder_fn,\n        hparams=_hparams,\n        max_length=FLAGS.max_length)\n    write_dataset(dev_dataset, 'dev')\n\n    test_dataset = preprocessing.preprocess_dataset(\n        test_dataset,\n        encoder_fn=encoder_fn,\n        hparams=_hparams,\n        max_length=FLAGS.max_length)\n    write_dataset(test_dataset, 'test')\n\n\nif __name__ == '__main__':\n\n    flags.mark_flag_as_required('data_dir')\n\n    app.run(main)\n"
  },
  {
    "path": "preprocess_librispeech.py",
    "content": "from absl import app, logging, flags\nimport os\nimport json\nimport tensorflow as tf\n\nfrom utils import preprocessing, encoding\nfrom utils.data import librispeech\nfrom hparams import *\n\n\nFLAGS = flags.FLAGS\n\nflags.DEFINE_string(\n    'data_dir', None,\n    'Directory to read Librispeech data from.')\nflags.DEFINE_string(\n    'output_dir', './data',\n    'Directory to save preprocessed data.')\nflags.DEFINE_integer(\n    'max_length', 0,\n    'Max audio length in seconds.')\n\n\ndef write_dataset(dataset, name):\n\n    filepath = os.path.join(FLAGS.output_dir,\n        '{}.tfrecord'.format(name))\n\n    writer = tf.data.experimental.TFRecordWriter(filepath)\n    writer.write(dataset)\n\n    logging.info('Wrote {} dataset to {}'.format(\n        name, filepath))\n\n\ndef main(_):\n\n    hparams = {\n\n        HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],\n        HP_VOCAB_SIZE: HP_VOCAB_SIZE.domain.values[0],\n\n        # Preprocessing\n        HP_MEL_BINS: HP_MEL_BINS.domain.values[0],\n        HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],\n        HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],\n        HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],\n        HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],\n        HP_DOWNSAMPLE_FACTOR: HP_DOWNSAMPLE_FACTOR.domain.values[0]\n\n    }\n\n    train_splits = [\n        'dev-clean'\n    ]\n\n    dev_splits = [\n        'dev-clean'\n    ]\n\n    test_splits = [\n        'dev-clean'\n    ]\n\n    # train_splits = [\n    #     'train-clean-100',\n    #     'train-clean-360',\n    #     'train-other-500'\n    # ]\n\n    # dev_splits = [\n    #     'dev-clean',\n    #     'dev-other'\n    # ]\n\n    # test_splits = [\n    #     'test-clean',\n    #     'test-other'\n    # ]\n\n    _hparams = {k.name: v for k, v in hparams.items()}\n\n    texts_gen = librispeech.texts_generator(FLAGS.data_dir,\n        split_names=train_splits)\n\n    encoder_fn, decoder_fn, vocab_size = encoding.get_encoder(\n        encoder_dir=FLAGS.output_dir,\n        hparams=_hparams,\n        texts_generator=texts_gen)\n    _hparams[HP_VOCAB_SIZE.name] = vocab_size\n\n    train_dataset = librispeech.load_dataset(\n        FLAGS.data_dir, train_splits)\n    dev_dataset = librispeech.load_dataset(\n        FLAGS.data_dir, dev_splits)\n    test_dataset = librispeech.load_dataset(\n        FLAGS.data_dir, test_splits)\n\n    train_dataset = preprocessing.preprocess_dataset(\n        train_dataset,\n        encoder_fn=encoder_fn,\n        hparams=_hparams,\n        max_length=FLAGS.max_length,\n        save_plots=True)\n    write_dataset(train_dataset, 'train')\n\n    dev_dataset = preprocessing.preprocess_dataset(\n        dev_dataset,\n        encoder_fn=encoder_fn,\n        hparams=_hparams,\n        max_length=FLAGS.max_length)\n    write_dataset(dev_dataset, 'dev')\n\n    test_dataset = preprocessing.preprocess_dataset(\n        test_dataset,\n        encoder_fn=encoder_fn,\n        hparams=_hparams,\n        max_length=FLAGS.max_length)\n    write_dataset(test_dataset, 'test')\n\n\nif __name__ == '__main__':\n\n    flags.mark_flag_as_required('data_dir')\n\n    app.run(main)\n"
  },
  {
    "path": "quantize_model.py",
    "content": "from argparse import ArgumentParser\nimport os\nimport tensorflow as tf\n\nfrom utils import model as model_utils\n\n\ndef main(args):\n\n    hparams = model_utils.load_hparams(args.model_dir)\n    model, _ = model_utils.load_model(args.model_dir, hparams,\n        stateful=True)\n\n    model.summary()\n\n    converter = tf.lite.TFLiteConverter.from_keras_model(model)\n    converter.experimental_new_converter = True\n    # converter.experimental_new_quantizer = True\n    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,\n                                           tf.lite.OpsSet.SELECT_TF_OPS]\n    # converter.optimizations = [tf.lite.Optimize.DEFAULT]\n\n    tflite_quant_model = converter.convert()\n\n    tflite_dir = os.path.join(args.model_dir, 'tflite')\n    os.makedirs(tflite_dir, exist_ok=True)\n\n    with open(os.path.join(tflite_dir, 'model.tflite'), 'wb') as f:\n        f.write(tflite_quant_model)\n\ndef parse_args():\n\n    ap = ArgumentParser()\n\n    ap.add_argument('-m', '--model_dir', type=str, default='./model',\n        help='Directory of model.')\n\n    return ap.parse_args()\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n    main(args)"
  },
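`quantize_model.py` only writes the `.tflite` flatbuffer; a minimal sketch, using the standard `tf.lite.Interpreter` API, for confirming that the file the script writes actually loads and runs. Since the exported model has dynamic time dimensions, the shapes reported by `get_input_details()` may need `resize_tensor_input` before invoking:

```
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='./model/tflite/model.tflite')
interpreter.allocate_tensors()

# Feed zeros shaped like each declared input, then run one invocation.
for detail in interpreter.get_input_details():
    interpreter.set_tensor(detail['index'],
        np.zeros(detail['shape'], dtype=detail['dtype']))
interpreter.invoke()

out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out['index']).shape)
```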
  {
    "path": "requirements.txt",
    "content": "pydub>=0.23.1\nscipy>=1.3.1\ntqdm\ntensorflow-datasets\nsoundfile\nlibrosa\nmatplotlib\n"
  },
  {
    "path": "run_rnnt.py",
    "content": "from absl import flags, logging, app\nfrom tensorboard.plugins.hparams import api as hp\nfrom tensorflow.keras.mixed_precision import experimental as mixed_precision\nfrom datetime import datetime\nimport json\nimport re\nimport os\nimport time\nimport shutil\n\nimport tensorflow as tf\ntf.get_logger().setLevel('WARNING')\ntf.autograph.set_verbosity(0)\n# tf.random.set_seed(1234)\n\nfrom utils import preprocessing, vocabulary, encoding, \\\n    metrics, decoding\nfrom utils.loss import get_loss_fn\nfrom utils import model as model_utils\nfrom model import build_keras_model\nfrom hparams import *\n\nFLAGS = flags.FLAGS\n\n# Required flags\nflags.DEFINE_enum(\n    'mode', None,\n    ['train', 'eval', 'test'],\n    'Mode to run.')\nflags.DEFINE_string(\n    'data_dir', None,\n    'Input data directory.')\n\n# Optional flags\nflags.DEFINE_string(\n    'tb_log_dir', './logs',\n    'Directory to save Tensorboard logs.')\nflags.DEFINE_string(\n    'output_dir', './model',\n    'Directory to save model.')\nflags.DEFINE_string(\n    'checkpoint', None,\n    'Checkpoint to restore from.')\nflags.DEFINE_integer(\n    'batch_size', 32,\n    'Training batch size.')\nflags.DEFINE_integer(\n    'n_epochs', 1000,\n    'Number of training epochs.')\nflags.DEFINE_integer(\n    'steps_per_log', 1,\n    'Number of steps between each log.')\nflags.DEFINE_integer(\n    'steps_per_checkpoint', 1000,\n    'Number of steps between eval and checkpoint.')\nflags.DEFINE_integer(\n    'eval_size', None,\n    'Max number of samples to use for eval.')\nflags.DEFINE_list(\n    'gpus', None,\n    'GPUs to run training on.')\nflags.DEFINE_bool(\n    'fp16_run', False,\n    'Run using 16-bit precision instead of 32-bit.')\n\ndef get_dataset(data_dir,\n                name,\n                batch_size,\n                n_epochs,\n                strategy=None,\n                max_size=None):\n\n    dataset = preprocessing.load_dataset(data_dir, name)\n\n    if max_size is not None:\n        dataset = dataset.take(max_size)\n\n    dataset = dataset.padded_batch(\n        batch_size, padded_shapes=(\n            [-1, -1], [-1], [], [],\n            [-1]\n        )\n    )\n\n    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)\n\n    if strategy is not None:\n        dataset = strategy.experimental_distribute_dataset(dataset)\n\n    return dataset\n\n\ndef configure_environment(gpu_names,\n                          fp16_run):\n\n    if fp16_run:\n        print('Using 16-bit float precision.')\n        policy = mixed_precision.Policy('mixed_float16')\n        mixed_precision.set_policy(policy)\n\n    gpus = tf.config.experimental.list_physical_devices('GPU')\n\n    if gpu_names is not None and len(gpu_names) > 0:\n        gpus = [x for x in gpus if x.name[len('/physical_device:'):] in gpu_names]\n\n    if gpus:\n        try:\n            for gpu in gpus:\n                tf.config.experimental.set_memory_growth(gpu, True)\n            # tf.config.experimental.set_virtual_device_configuration(\n            #     gpus[0],\n            #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096),\n            #         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])\n            logical_gpus = tf.config.experimental.list_logical_devices('GPU')\n            print(len(gpus), \"Physical GPU,\", len(logical_gpus), \"Logical GPUs\")\n        except RuntimeError as e:\n            logging.warn(str(e))\n\n    if len(gpus) > 1:\n        print('Running multi gpu: {}'.format(', 
'.join(gpu_names)))\n        strategy = tf.distribute.MirroredStrategy(\n            devices=gpu_names)\n    else:\n        device = gpus[0].name[len('/physical_device:'):]\n        print('Running single gpu: {}'.format(device))\n        strategy = tf.distribute.OneDeviceStrategy(\n            device=device)\n\n    dtype = tf.float16 if fp16_run else tf.float32\n\n    return strategy, dtype\n\n\ndef setup_hparams(log_dir,\n                  checkpoint):\n\n    if checkpoint is not None:\n\n        checkpoint_dir = os.path.dirname(os.path.realpath(checkpoint))\n        hparams = model_utils.load_hparams(checkpoint_dir)\n\n        tb_hparams = {}\n        tb_keys = [\n            HP_TOKEN_TYPE,\n            HP_MEL_BINS,\n            HP_FRAME_LENGTH,\n            HP_FRAME_STEP,\n            HP_HERTZ_LOW,\n            HP_HERTZ_HIGH,\n            HP_DOWNSAMPLE_FACTOR,\n            HP_EMBEDDING_SIZE,\n            HP_ENCODER_LAYERS,\n            HP_ENCODER_SIZE,\n            HP_PROJECTION_SIZE,\n            HP_TIME_REDUCT_FACTOR,\n            HP_TIME_REDUCT_INDEX,\n            HP_PRED_NET_LAYERS,\n            HP_PRED_NET_SIZE,\n            HP_JOINT_NET_SIZE,\n            HP_DROPOUT,\n            HP_LEARNING_RATE\n        ]\n\n        for k, v in hparams.items():\n            for tb_key in tb_keys:\n                if k == tb_key.name:\n                    tb_hparams[tb_key] = v\n\n    else:\n\n        tb_hparams = {\n\n            HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],\n\n            # Preprocessing\n            HP_MEL_BINS: HP_MEL_BINS.domain.values[0],\n            HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],\n            HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],\n            HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],\n            HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],\n            HP_DOWNSAMPLE_FACTOR: HP_DOWNSAMPLE_FACTOR.domain.values[0],\n\n            # Model\n            HP_EMBEDDING_SIZE: HP_EMBEDDING_SIZE.domain.values[0],\n            HP_ENCODER_LAYERS: HP_ENCODER_LAYERS.domain.values[0],\n            HP_ENCODER_SIZE: HP_ENCODER_SIZE.domain.values[0],\n            HP_PROJECTION_SIZE: HP_PROJECTION_SIZE.domain.values[0],\n            HP_TIME_REDUCT_INDEX: HP_TIME_REDUCT_INDEX.domain.values[0],\n            HP_TIME_REDUCT_FACTOR: HP_TIME_REDUCT_FACTOR.domain.values[0],\n            HP_PRED_NET_LAYERS: HP_PRED_NET_LAYERS.domain.values[0],\n            HP_PRED_NET_SIZE: HP_PRED_NET_SIZE.domain.values[0],\n            HP_JOINT_NET_SIZE: HP_JOINT_NET_SIZE.domain.values[0],\n            HP_DROPOUT: HP_DROPOUT.domain.values[0],\n\n            HP_LEARNING_RATE: HP_LEARNING_RATE.domain.values[0]\n\n        }\n\n    with tf.summary.create_file_writer(os.path.join(log_dir, 'hparams_tuning')).as_default():\n        hp.hparams_config(\n            hparams=[\n                HP_TOKEN_TYPE,\n                HP_VOCAB_SIZE,\n                HP_ENCODER_LAYERS,\n                HP_ENCODER_SIZE,\n                HP_PROJECTION_SIZE,\n                HP_TIME_REDUCT_INDEX,\n                HP_TIME_REDUCT_FACTOR,\n                HP_PRED_NET_LAYERS,\n                HP_PRED_NET_SIZE,\n                HP_JOINT_NET_SIZE,\n                HP_DROPOUT\n            ],\n            metrics=[\n                hp.Metric(METRIC_ACCURACY, display_name='Accuracy'),\n                hp.Metric(METRIC_WER, display_name='WER'),\n            ],\n        )\n\n    return {k.name: v for k, v in tb_hparams.items()}, tb_hparams\n\n\ndef run_metrics(inputs,\n                y_true,\n                
metrics,\n                strategy=None):\n\n    return {\n        metric_fn.__name__: metric_fn(inputs, y_true)\n        for metric_fn in metrics}\n\n\ndef run_training(model,\n                 optimizer,\n                 loss_fn,\n                 train_dataset,\n                 batch_size,\n                 n_epochs,\n                 checkpoint_template,\n                 hparams,\n                 noise=0,\n                 # noise=0.075,\n                 strategy=None,\n                 steps_per_log=None,\n                 steps_per_checkpoint=None,\n                 eval_dataset=None,\n                 train_metrics=[],\n                 eval_metrics=[],\n                 fp16_run=False):\n\n    feat_size = hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]\n\n    @tf.function(input_signature=[[\n        tf.TensorSpec(shape=[None, None, feat_size], dtype=tf.float32),\n        tf.TensorSpec(shape=[None, None], dtype=tf.int32),\n        tf.TensorSpec(shape=[None], dtype=tf.int32),\n        tf.TensorSpec(shape=[None], dtype=tf.int32),\n        tf.TensorSpec(shape=[None, None], dtype=tf.int32)]])\n    def train_step(dist_inputs):\n        def step_fn(inputs):\n\n            (mel_specs, pred_inp,\n             spec_lengths, label_lengths, labels) = inputs\n            if noise > 0:\n                mel_specs += tf.random.normal([mel_specs.shape[-1]],\n                    mean=0, stddev=noise)\n\n            with tf.GradientTape() as tape:\n                outputs = model([mel_specs, pred_inp],\n                    training=True)\n\n                rnnt_loss = loss_fn(labels, outputs,\n                    spec_lengths, label_lengths)\n\n                if fp16_run:\n                    rnnt_loss = optimizer.get_scaled_loss(rnnt_loss)\n\n                loss = tf.reduce_sum(rnnt_loss) * (1. 
/ batch_size)\n\n            if train_metrics is not None:\n                metric_results = run_metrics(mel_specs, labels,\n                    metrics=train_metrics, strategy=strategy)\n\n            gradients = tape.gradient(loss, model.trainable_variables)\n            if fp16_run:\n                gradients = optimizer.get_unscaled_gradients(gradients)\n\n            optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n\n            return rnnt_loss, metric_results\n\n        loss, metrics_results = strategy.run(step_fn, args=(dist_inputs,))\n        loss = strategy.reduce(\n            tf.distribute.ReduceOp.MEAN, loss, axis=0)\n        metrics_results = {name: strategy.reduce(\n            tf.distribute.ReduceOp.MEAN, result, axis=0) for name, result in metrics_results.items()}\n\n        return loss, metrics_results\n\n    def checkpoint_model():\n\n        eval_start_time = time.time()\n\n        eval_loss, eval_metrics_results = run_evaluate(\n            model=model,\n            optimizer=optimizer,\n            loss_fn=loss_fn,\n            eval_dataset=eval_dataset,\n            batch_size=batch_size,\n            hparams=hparams,\n            strategy=strategy,\n            metrics=eval_metrics)\n\n        validation_log_str = 'VALIDATION RESULTS: Time: {:.4f}, Loss: {:.4f}'.format(\n            time.time() - eval_start_time, eval_loss)\n        for metric_name, metric_result in eval_metrics_results.items():\n            validation_log_str += ', {}: {:.4f}'.format(metric_name, metric_result)\n        print(validation_log_str)\n\n        tf.summary.scalar(METRIC_EVAL_LOSS, eval_loss, step=global_step)\n        if 'Accuracy' in eval_metrics_results:\n            tf.summary.scalar(METRIC_EVAL_ACCURACY, eval_metrics_results['Accuracy'], step=global_step)\n        if 'WER' in eval_metrics_results:\n            tf.summary.scalar(METRIC_EVAL_WER, eval_metrics_results['WER'], step=global_step)\n\n        checkpoint_filepath = checkpoint_template.format(\n            step=global_step, val_loss=eval_loss)\n        print('Saving checkpoint {}'.format(checkpoint_filepath))\n        model.save_weights(checkpoint_filepath)\n\n\n    with strategy.scope():\n\n        print('Starting training.')\n\n        global_step = 0\n\n        for epoch in range(n_epochs):\n\n            loss_object = tf.keras.metrics.Mean()\n            metric_objects = {fn.__name__: tf.keras.metrics.Mean() for fn in train_metrics}\n\n            for batch, inputs in enumerate(train_dataset):\n\n                if global_step % steps_per_checkpoint == 0:\n                    if eval_dataset is not None:\n                        checkpoint_model()\n\n                start_time = time.time()\n\n                loss, metrics_results = train_step(inputs)\n\n                step_time = time.time() - start_time\n\n                loss_object(loss)\n                for metric_name, metric_result in metrics_results.items():\n                    metric_objects[metric_name](metric_result)\n\n                if global_step % steps_per_log == 0:\n                    log_str = 'Epoch: {}, Batch: {}, Global Step: {}, Step Time: {:.4f}, Loss: {:.4f}'.format(\n                        epoch, batch, global_step, step_time, loss_object.result())\n                    for metric_name, metric_object in metric_objects.items():\n                        log_str += ', {}: {:.4f}'.format(metric_name, metric_object.result())\n                    print(log_str)\n\n                    tf.summary.scalar(METRIC_TRAIN_LOSS, 
loss_object.result(), step=global_step)\n                    if 'Accuracy' in metric_objects:\n                        tf.summary.scalar(METRIC_TRAIN_ACCURACY, metric_objects['Accuracy'].result(), step=global_step)\n\n                global_step += 1\n\n            epoch_end_log_str = 'EPOCH RESULTS: Loss: {:.4f}'.format(loss_object.result())\n            for metric_name, metric_object in metric_objects.items():\n                epoch_end_log_str += ', {}: {:.4f}'.format(metric_name, metric_object.result())\n            print(epoch_end_log_str)\n\n        checkpoint_model()\n\n\ndef run_evaluate(model,\n                 optimizer,\n                 loss_fn,\n                 eval_dataset,\n                 batch_size,\n                 strategy,\n                 hparams,\n                 metrics=[],\n                 fp16_run=False):\n\n    feat_size = hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]\n\n    @tf.function(input_signature=[[\n        tf.TensorSpec(shape=[None, None, feat_size], dtype=tf.float32),\n        tf.TensorSpec(shape=[None, None], dtype=tf.int32),\n        tf.TensorSpec(shape=[None], dtype=tf.int32),\n        tf.TensorSpec(shape=[None], dtype=tf.int32),\n        tf.TensorSpec(shape=[None, None], dtype=tf.int32)]])\n    def eval_step(dist_inputs):\n        def step_fn(inputs):\n            (mel_specs, pred_inp,\n             spec_lengths, label_lengths, labels) = inputs\n            outputs = model([mel_specs, pred_inp],\n                training=False)\n\n            loss = loss_fn(labels, outputs,\n                spec_lengths=spec_lengths,\n                label_lengths=label_lengths)\n\n            if fp16_run:\n                loss = optimizer.get_scaled_loss(loss)\n\n            if metrics is not None:\n                metric_results = run_metrics(mel_specs, labels,\n                    metrics=metrics, strategy=strategy)\n\n            return loss, metric_results\n\n        loss, metrics_results = strategy.run(step_fn, args=(dist_inputs,))\n        loss = strategy.reduce(\n            tf.distribute.ReduceOp.MEAN, loss, axis=0)\n        metrics_results = {name: strategy.reduce(\n            tf.distribute.ReduceOp.MEAN, result, axis=0) for name, result in metrics_results.items()}\n\n        return loss, metrics_results\n\n    print('Performing evaluation.')\n\n    loss_object = tf.keras.metrics.Mean()\n    metric_objects = {fn.__name__: tf.keras.metrics.Mean() for fn in metrics}\n\n    for batch, inputs in enumerate(eval_dataset):\n\n        loss, metrics_results = eval_step(inputs)\n\n        loss_object(loss)\n        for metric_name, metric_result in metrics_results.items():\n            metric_objects[metric_name](metric_result)\n\n    metrics_final_results = {name: metric_object.result() for name, metric_object in metric_objects.items()}\n\n    return loss_object.result(), metrics_final_results\n\n\ndef main(_):\n\n    strategy, dtype = configure_environment(\n        gpu_names=FLAGS.gpus,\n        fp16_run=FLAGS.fp16_run)\n\n    hparams, tb_hparams = setup_hparams(\n        log_dir=FLAGS.tb_log_dir,\n        checkpoint=FLAGS.checkpoint)\n\n    os.makedirs(FLAGS.output_dir, exist_ok=True)\n\n    if FLAGS.checkpoint is None:\n        encoder_dir = FLAGS.data_dir\n    else:\n        encoder_dir = os.path.dirname(os.path.realpath(FLAGS.checkpoint))\n\n    shutil.copy(\n        os.path.join(encoder_dir, 'encoder.subwords'),\n        os.path.join(FLAGS.output_dir, 'encoder.subwords'))\n\n    encoder_fn, idx_to_text, vocab_size = 
encoding.get_encoder(\n        encoder_dir=FLAGS.output_dir,\n        hparams=hparams)\n\n    if HP_VOCAB_SIZE.name not in hparams:\n        hparams[HP_VOCAB_SIZE.name] = vocab_size\n\n    with strategy.scope():\n\n        model = build_keras_model(hparams,\n            dtype=dtype)\n\n        if FLAGS.checkpoint is not None:\n            model.load_weights(FLAGS.checkpoint)\n            logging.info('Restored weights from {}.'.format(FLAGS.checkpoint))\n\n        model_utils.save_hparams(hparams, FLAGS.output_dir)\n\n        optimizer = tf.keras.optimizers.SGD(hparams[HP_LEARNING_RATE.name],\n            momentum=0.9)\n\n        if FLAGS.fp16_run:\n            optimizer = mixed_precision.LossScaleOptimizer(optimizer,\n                loss_scale='dynamic')\n\n    logging.info('Using {} encoder with vocab size: {}'.format(\n        hparams[HP_TOKEN_TYPE.name], vocab_size))\n\n    loss_fn = get_loss_fn(\n        reduction_factor=hparams[HP_TIME_REDUCT_FACTOR.name])\n\n    decode_fn = decoding.greedy_decode_fn(model, hparams)\n\n    accuracy_fn = metrics.build_accuracy_fn(decode_fn)\n    wer_fn = metrics.build_wer_fn(decode_fn, idx_to_text)\n\n    encoder = model.layers[2]\n    prediction_network = model.layers[3]\n\n    encoder.summary()\n    prediction_network.summary()\n\n    model.summary()\n\n    dev_dataset = None\n    if FLAGS.eval_size != 0:\n        dev_dataset = get_dataset(FLAGS.data_dir, 'dev',\n            batch_size=FLAGS.batch_size, n_epochs=FLAGS.n_epochs,\n            strategy=strategy, max_size=FLAGS.eval_size)\n\n    log_dir = os.path.join(FLAGS.tb_log_dir,\n        datetime.now().strftime('%Y%m%d-%H%M%S'))\n\n    with tf.summary.create_file_writer(log_dir).as_default():\n\n        hp.hparams(tb_hparams)\n\n        if FLAGS.mode == 'train':\n\n            train_dataset = get_dataset(FLAGS.data_dir, 'train',\n                batch_size=FLAGS.batch_size, n_epochs=FLAGS.n_epochs,\n                strategy=strategy)\n\n            os.makedirs(FLAGS.output_dir, exist_ok=True)\n            checkpoint_template = os.path.join(FLAGS.output_dir,\n                'checkpoint_{step}_{val_loss:.4f}.hdf5')\n\n            run_training(\n                model=model,\n                optimizer=optimizer,\n                loss_fn=loss_fn,\n                train_dataset=train_dataset,\n                batch_size=FLAGS.batch_size,\n                n_epochs=FLAGS.n_epochs,\n                checkpoint_template=checkpoint_template,\n                hparams=hparams,\n                strategy=strategy,\n                steps_per_log=FLAGS.steps_per_log,\n                steps_per_checkpoint=FLAGS.steps_per_checkpoint,\n                eval_dataset=dev_dataset,\n                train_metrics=[],\n                eval_metrics=[accuracy_fn, wer_fn])\n\n        elif FLAGS.mode == 'eval' or FLAGS.mode == 'test':\n\n            if FLAGS.checkpoint is None:\n                raise Exception('You must provide a checkpoint to perform eval.')\n\n            if FLAGS.mode == 'test':\n                dataset = get_dataset(FLAGS.data_dir, 'test',\n                    batch_size=FLAGS.batch_size, n_epochs=FLAGS.n_epochs)\n            else:\n                dataset = dev_dataset\n\n            eval_start_time = time.time()\n\n            eval_loss, eval_metrics_results = run_evaluate(\n                model=model,\n                optimizer=optimizer,\n                loss_fn=loss_fn,\n                eval_dataset=dataset,\n                batch_size=FLAGS.batch_size,\n                hparams=hparams,\n            
    strategy=strategy,\n                metrics=[accuracy_fn, wer_fn])\n\n            validation_log_str = 'VALIDATION RESULTS: Time: {:.4f}, Loss: {:.4f}'.format(\n                time.time() - eval_start_time, eval_loss)\n            for metric_name, metric_result in eval_metrics_results.items():\n                validation_log_str += ', {}: {:.4f}'.format(metric_name, metric_result)\n\n            print(validation_log_str)\n\n\nif __name__ == '__main__':\n\n    # tf.config.experimental_run_functions_eagerly(True)\n\n    flags.mark_flag_as_required('mode')\n    flags.mark_flag_as_required('data_dir')\n\n    app.run(main)\n"
  },
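`train_step` in `run_rnnt.py` above is traced against a fixed five-tensor input signature. A small sketch of one dummy batch in that exact layout (mel specs, blank-prefixed prediction input, the two length vectors, labels), handy for smoke-testing the step without a preprocessed dataset; the sizes are illustrative, and `feat_size` uses the hparams.py defaults:

```
import tensorflow as tf

batch, time, label_len = 2, 50, 10
feat_size = 80 * 3  # HP_MEL_BINS * HP_DOWNSAMPLE_FACTOR defaults

dummy_batch = [
    tf.zeros([batch, time, feat_size], tf.float32),  # mel_specs
    tf.zeros([batch, label_len], tf.int32),          # pred_inp
    tf.fill([batch], time),                          # spec_lengths (int32)
    tf.fill([batch], label_len),                     # label_lengths (int32)
    tf.zeros([batch, label_len], tf.int32),          # labels
]
# train_step(dummy_batch)  # callable only inside run_training's scope
```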
  {
    "path": "scripts/build_rnnt.sh",
    "content": "cp cmake/warp-rnnt-cmakelist.txt warp-transducer/CMakeLists.txt\n\ncd warp-transducer\n\nmkdir build\ncd build\n\nCC=gcc-4.8 CXX=g++-4.8 cmake \\\n    -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME ..\nmake\ncd ../tensorflow_binding\n\npython setup.py install\ncd ../../\n"
  },
  {
    "path": "scripts/common_voice_convert.sh",
    "content": "#!/bin/bash\n\nOIFS=\"$IFS\"\nIFS=$'\\n'\n\nFORMAT=.mp3\nDATA_DIR=\"$1\"\nN=${2:-1}\n\nmkdir -p $DATA_DIR\n\nFILES=$(ls \"$DATA_DIR\" | grep $FORMAT)\n\nthread () {\n    local FILE_N=$1\n    FILENAME=\"${FILE_N:0:${#FILE_N}-4}\"\n    ffmpeg -i $DATA_DIR/$FILE_N -acodec pcm_s16le -ac 1 -ar 16000 $DATA_DIR/$FILENAME.wav\n    rm $DATA_DIR/$FILE_N\n}\nfor FILE in $FILES\ndo\n   ((i=i%N)); ((i++==0)) && wait\n   thread \"$FILE\" & \ndone\n\nIFS=\"$OIFS\"\n"
  },
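The script above shells out to `ffmpeg` directly. As an alternative sketch, `pydub` (already pinned in requirements.txt, and itself a wrapper around ffmpeg) can do the same conversion from Python with the same target format (16 kHz, mono, 16-bit PCM); `convert_clip` is a hypothetical helper, not part of the repo:

```
import os

from pydub import AudioSegment  # still requires ffmpeg on the PATH


def convert_clip(mp3_path):
    """Convert one MP3 to a 16 kHz mono 16-bit WAV next to it."""
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    audio.export(mp3_path[:-4] + '.wav', format='wav')
    os.remove(mp3_path)  # the shell script also removes the source MP3
```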
  {
    "path": "scripts/remove_missing_samples.py",
    "content": "from argparse import ArgumentParser\nimport os\n\n\ndef remove_missing(data_dir, fname, replace_old=True):\n\n    clips_dir = os.path.join(data_dir, 'clips')\n\n    old_filepath = os.path.join(data_dir, '{}.tsv'.format(fname))\n    new_filepath = os.path.join(data_dir, '{}-tmp.tsv'.format(fname))\n\n    with open(old_filepath, 'r') as old_f:\n        with open(new_filepath, 'w') as new_f:\n            new_f.write(next(old_f))\n            for line in old_f:\n                audio_fn = line.split('\\t')[1][:-4] + '.wav'\n                if os.path.exists(os.path.join(clips_dir, audio_fn)):\n                    new_f.write(line)\n\n    if replace_old:\n        os.remove(old_filepath)\n        os.rename(new_filepath, old_filepath)\n\n\ndef main(args):\n\n    tsv_files = ['dev', 'invalidated', 'other', \n                 'test', 'train', 'validated']\n\n    for _file in tsv_files:\n        remove_missing(args.data_dir, _file,\n            replace_old=args.replace_old)\n\n    print('Done.')\n\n\ndef parse_args():\n\n    ap = ArgumentParser()\n\n    ap.add_argument('--data_dir', required=True, type=str,\n        help='Path to common voice data directory.')\n    ap.add_argument('--replace_old', type=bool, default=False,\n        help='Replace old tsv files with updated ones.')\n    \n    return ap.parse_args()\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n    main(args)"
  },
  {
    "path": "streaming_transcribe.py",
    "content": "from argparse import ArgumentParser\nimport os\nimport time\nimport pyaudio\n\nimport tensorflow as tf\ntf.get_logger().setLevel('ERROR')\ntf.autograph.set_verbosity(0)\n\nfrom utils import preprocessing, encoding, decoding\nfrom utils import model as model_utils\nfrom model import build_keras_model\nfrom hparams import *\n\n\nSAMPLE_RATE = 16000\nNUM_CHANNELS = 1\nCHUNK_SIZE = 1024\n\nLAST_OUTPUT = ''\n\n\ndef main(args):\n\n    model_dir = os.path.dirname(os.path.realpath(args.checkpoint))\n\n    hparams = model_utils.load_hparams(model_dir)\n\n    _, tok_to_text, vocab_size = encoding.get_encoder(\n        encoder_dir=model_dir,\n        hparams=hparams)\n    hparams[HP_VOCAB_SIZE.name] = vocab_size\n\n    model = build_keras_model(hparams, stateful=True)\n    model.load_weights(args.checkpoint)\n\n    decoder_fn = decoding.greedy_decode_fn(model, hparams)\n\n    p = pyaudio.PyAudio()\n\n    def listen_callback(in_data, frame_count, time_info, status):\n        global LAST_OUTPUT\n\n        audio = tf.io.decode_raw(in_data, out_type=tf.float32)\n\n        log_melspec = preprocessing.preprocess_audio(\n            audio=audio,\n            sample_rate=SAMPLE_RATE,\n            hparams=hparams)\n        log_melspec = tf.expand_dims(log_melspec, axis=0)\n\n        decoded = decoder_fn(log_melspec)[0]\n\n        transcription = LAST_OUTPUT + tok_to_text(decoded)\\\n            .numpy().decode('utf8')\n\n        if transcription != LAST_OUTPUT:\n            LAST_OUTPUT = transcription\n            print(transcription)\n\n        return in_data, pyaudio.paContinue\n\n    stream = p.open(\n        format=pyaudio.paFloat32,\n        channels=NUM_CHANNELS,\n        rate=SAMPLE_RATE,\n        input=True,\n        frames_per_buffer=CHUNK_SIZE,\n        stream_callback=listen_callback)\n\n    print('Listening...')\n\n    stream.start_stream()\n\n    while stream.is_active():\n        time.sleep(0.1)\n\n    stream.stop_stream()\n    stream.close()\n\n    p.terminate()\n\n\ndef parse_args():\n\n    ap = ArgumentParser()\n\n    ap.add_argument('--checkpoint', type=str, required=True,\n        help='Checkpoint to load.')\n\n    return ap.parse_args()\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n    main(args)\n"
  },
  {
    "path": "transcribe_file.py",
    "content": "from argparse import ArgumentParser\nimport os\n\nimport tensorflow as tf\ntf.get_logger().setLevel('ERROR')\ntf.autograph.set_verbosity(0)\n\nfrom utils import preprocessing, encoding, decoding\nfrom utils import model as model_utils\nfrom model import build_keras_model\nfrom hparams import *\n\n\ndef main(args):\n\n    model_dir = os.path.dirname(os.path.realpath(args.checkpoint))\n\n    hparams = model_utils.load_hparams(model_dir)\n\n    encode_fn, tok_to_text, vocab_size = encoding.get_encoder(\n        encoder_dir=model_dir,\n        hparams=hparams)\n    hparams[HP_VOCAB_SIZE.name] = vocab_size\n\n    model = build_keras_model(hparams)\n    model.load_weights(args.checkpoint)\n\n    audio, sr = preprocessing.tf_load_audio(args.input)\n\n    log_melspec = preprocessing.preprocess_audio(\n        audio=audio,\n        sample_rate=sr,\n        hparams=hparams)\n    log_melspec = tf.expand_dims(log_melspec, axis=0)\n\n    decoder_fn = decoding.greedy_decode_fn(model, hparams)\n\n    decoded = decoder_fn(log_melspec)[0]\n    transcription = tok_to_text(decoded)\n\n    print('Transcription:', transcription.numpy().decode('utf8'))\n\n\ndef parse_args():\n\n    ap = ArgumentParser()\n\n    ap.add_argument('--checkpoint', type=str, required=True,\n        help='Checkpoint to load.')\n    ap.add_argument('-i', '--input', type=str, required=True,\n        help='Wav file to transcribe.')\n\n    return ap.parse_args()\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n    main(args)\n"
  },
  {
    "path": "utils/__init__.py",
    "content": ""
  },
  {
    "path": "utils/data/__init__.py",
    "content": "from . import common_voice"
  },
  {
    "path": "utils/data/common_voice.py",
    "content": "import os\nimport tensorflow as tf\n\nfrom .. import preprocessing\n\n\ndef tf_parse_line(line, data_dir):\n\n    line_split = tf.strings.split(line, '\\t')\n\n    audio_fn = line_split[1]\n    transcription = line_split[2]\n\n    audio_filepath = tf.strings.join([data_dir, 'clips', audio_fn], '/')\n    wav_filepath = tf.strings.substr(audio_filepath, 0, tf.strings.length(audio_filepath) - 4) + '.wav'\n\n    audio, sr = preprocessing.tf_load_audio(wav_filepath)\n\n    return audio, sr, transcription\n\n\ndef load_dataset(base_path, name):\n\n    filepath = os.path.join(base_path, '{}.tsv'.format(name))\n\n    dataset = tf.data.TextLineDataset([filepath])\n\n    dataset = dataset.skip(1)\n    dataset = dataset.map(lambda line: tf_parse_line(line, base_path),\n        num_parallel_calls=tf.data.experimental.AUTOTUNE)\n\n    return dataset\n\n\ndef texts_generator(base_path):\n\n    # split_names = ['dev', 'train', 'test']\n    split_names = ['train']\n\n    for split_name in split_names:\n        with open(os.path.join(base_path, '{}.tsv'.format(split_name)), 'r') as f:\n            for line in f:\n                transcription = line.split('\\t')[2]\n                yield transcription"
  },
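  {
    "path": "examples/common_voice_paths.py",
    "content": "# Illustrative sketch, not part of the original project: it demonstrates\n# how utils/data/common_voice.tf_parse_line derives the .wav path from the\n# .mp3 filename stored in a Common Voice TSV row. The row and the data\n# directory below are invented for the example.\nimport tensorflow as tf\n\nline = tf.constant('client_123\\tcommon_voice_en_1.mp3\\thello world')\nline_split = tf.strings.split(line, '\\t')\n\naudio_fn = line_split[1]\ntranscription = line_split[2]\n\naudio_filepath = tf.strings.join(['/data/cv', 'clips', audio_fn], '/')\n# Strip the 4-character '.mp3' suffix and append '.wav'\nwav_filepath = tf.strings.substr(\n    audio_filepath, 0, tf.strings.length(audio_filepath) - 4) + '.wav'\n\nprint(wav_filepath.numpy())   # b'/data/cv/clips/common_voice_en_1.wav'\nprint(transcription.numpy())  # b'hello world'\n"
  },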
  {
    "path": "utils/data/librispeech.py",
    "content": "import os\nimport tensorflow as tf\nimport soundfile as sf\n\n\ndef load_audio(filepath):\n\n    return sf.read(filepath)\n\n\ndef tf_load_audio(filepath):\n\n    return tf.py_function(\n        lambda x: load_audio(x.numpy()),\n        inp=[filepath],\n        Tout=[tf.float32, tf.int32])\n\n\ndef tf_file_exists(filepath):\n\n    return tf.py_function(\n        lambda x: os.path.exists(x.numpy()),\n        inp=[filepath],\n        Tout=tf.bool)\n\n\ndef tf_parse_line(line, data_dir, split_names):\n\n    line_split = tf.strings.split(line, ' ')\n\n    audio_fn = line_split[0]\n    transcription = tf.py_function(\n        lambda x: b' '.join(x.numpy()).decode('utf8'),\n        inp=[line_split[1:]],\n        Tout=tf.string)\n\n    speaker_id, chapter_id, _ = tf.unstack(tf.strings.split(audio_fn, '-'), 3)\n\n    all_fps = tf.map_fn(\n        lambda split_name: tf.strings.join([data_dir, split_name, speaker_id, chapter_id, audio_fn], '/') + '.flac',\n        tf.constant(split_names))\n\n    audio_filepath_idx = tf.where(\n        tf.map_fn(tf_file_exists, all_fps, dtype=tf.bool))[0][0]\n    audio_filepath = all_fps[audio_filepath_idx]\n\n    audio, sr = tf_load_audio(audio_filepath)\n\n    return audio, sr, transcription\n\n\ndef get_transcript_files(base_path, split_names):\n\n    transcript_files = []\n\n    for split_name in split_names:\n        for speaker_id in os.listdir(f'{base_path}/{split_name}'):\n            if speaker_id == '.DS_Store': continue\n            for chapter_id in os.listdir(f'{base_path}/{split_name}/{speaker_id}'):\n                if chapter_id == '.DS_Store': continue\n                transcript_files.append(f'{base_path}/{split_name}/{speaker_id}/{chapter_id}/{speaker_id}-{chapter_id}.trans.txt')\n\n    return transcript_files\n\n\ndef load_dataset(base_path, split_names):\n\n    transcript_filepaths = get_transcript_files(base_path, split_names)\n\n    dataset = tf.data.TextLineDataset(transcript_filepaths)\n    dataset = dataset.map(lambda line: tf_parse_line(line, base_path, split_names),\n        num_parallel_calls=tf.data.experimental.AUTOTUNE)\n\n    return dataset\n\n\ndef texts_generator(base_path, split_names):\n\n    transcript_filepaths = get_transcript_files(base_path, split_names)\n    for fp in transcript_filepaths:\n        with open(fp, 'r') as f:\n            for line in f:\n                line = line.strip('\\n')\n                transcription = ' '.join(line.split(' ')[1:])\n                yield transcription"
  },
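  {
    "path": "examples/librispeech_parse.py",
    "content": "# Illustrative sketch, not part of the original project: it splits a\n# LibriSpeech transcript line the way utils/data/librispeech.tf_parse_line\n# does (using tf.strings.reduce_join instead of the py_function join).\n# The line content is invented for the example.\nimport tensorflow as tf\n\nline = tf.constant('1089-134686-0000 HELLO WORLD')\nline_split = tf.strings.split(line, ' ')\n\naudio_fn = line_split[0]\ntranscription = tf.strings.reduce_join(line_split[1:], separator=' ')\n\nspeaker_id, chapter_id, _ = tf.unstack(tf.strings.split(audio_fn, '-'), 3)\n\nprint(speaker_id.numpy(), chapter_id.numpy())  # b'1089' b'134686'\nprint(transcription.numpy())                   # b'HELLO WORLD'\n"
  },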
  {
    "path": "utils/decoding.py",
    "content": "import tensorflow as tf\n\nfrom hparams import *\n\n\ndef joint(model, f, g):\n\n    dense_1 = model.layers[-2]\n    dense_2 = model.layers[-1]\n\n    joint_inp = (\n        tf.expand_dims(f, axis=2) +                 # [B, T, V] => [B, T, 1, V]\n        tf.expand_dims(g[:, -1, :], axis=1))        # [B, U, V] => [B, 1, U, V]\n\n    outputs = dense_1(joint_inp)\n    outputs = dense_2(outputs)\n\n    return outputs[:, 0, 0, :]\n\n\ndef greedy_decode_fn(model, hparams):\n\n    # NOTE: Only the first input is decoded\n\n    encoder = model.layers[2]\n    prediction_network = model.layers[3]\n\n    start_token = tf.constant([0])\n\n    feat_size = hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]\n\n    @tf.function(input_signature=[\n        tf.TensorSpec(shape=[None, None, feat_size], dtype=tf.float32),\n        tf.TensorSpec(shape=[], dtype=tf.int32)])\n    def greedy_decode(inputs, max_length=None):\n\n        inputs = tf.expand_dims(inputs[0], axis=0)\n\n        encoded = encoder(inputs, training=False)\n        enc_length = tf.shape(encoded)[1]\n\n        i_0 = tf.constant(0)\n        outputs_0 = tf.expand_dims(start_token, axis=0)\n        max_reached_0 = tf.constant(False)\n\n        time_cond = lambda i, outputs, max_reached: tf.logical_and(\n            i < enc_length, tf.logical_not(max_reached))\n\n        def time_step_body(i, outputs, max_reached):\n\n            inp_enc = tf.expand_dims(encoded[:, i, :],\n                axis=1)\n\n            _outputs_0 = outputs\n            _max_reached_0 = max_reached\n            dec_end_0 = tf.constant(False)\n\n            dec_cond = lambda _outputs, _max_reached, dec_end: tf.logical_and(\n                tf.logical_not(dec_end), tf.logical_not(_max_reached))\n\n            def dec_step_body(_outputs, _max_reached, dec_end):\n\n                pred_out = prediction_network(_outputs,\n                    training=False)\n                preds = joint(model, inp_enc, pred_out)[0]\n                preds = tf.nn.log_softmax(preds)\n\n                predicted_id = tf.cast(\n                    tf.argmax(preds, axis=-1), dtype=tf.int32)\n\n                if predicted_id == 0:\n                    dec_end = True\n                else:\n                    _outputs = tf.concat([_outputs, [[predicted_id]]],\n                        axis=1)\n\n                if max_length is not None and tf.shape(_outputs)[1] >= max_length + 1:\n                    _max_reached = True\n\n                return _outputs, _max_reached, dec_end\n\n            _outputs, _max_reached, _ = tf.while_loop(\n                dec_cond, dec_step_body,\n                loop_vars=[_outputs_0, _max_reached_0, dec_end_0],\n                shape_invariants=[\n                    tf.TensorShape([1, None]),\n                    _max_reached_0.get_shape(),\n                    dec_end_0.get_shape()\n                ])\n\n            return i + 1, _outputs, _max_reached\n\n        _, outputs, _ = tf.while_loop(\n            time_cond, time_step_body,\n            loop_vars=[i_0, outputs_0, max_reached_0],\n            shape_invariants=[\n                i_0.get_shape(),\n                tf.TensorShape([1, None]),\n                max_reached_0.get_shape()\n            ])\n\n        final_outputs = outputs[:, 1:]\n        # output_ids = tf.argmax(final_outputs, axis=-1)\n\n        # return tf.cast(output_ids, dtype=tf.int32)\n        return tf.cast(final_outputs, dtype=tf.int32)\n\n    return greedy_decode\n\n\n# def greedy_decode():\n\n#     # NOTE: Only 
the first input is decoded\n#     y_pred = y_pred[0]\n\n#     # Add blank at end for decoding\n#     pred_len = tf.shape(y_pred)[0]\n#     y_pred = tf.concat([y_pred,\n#                         tf.fill([pred_len, 1], 0)],\n#         axis=1)\n\n#     def _loop_body(_y_pred, _decoded):\n\n#         first_blank_idx = tf.cast(tf.where(\n#             tf.equal(_y_pred[0], 0)), dtype=tf.int32)\n#         has_blank = tf.not_equal(tf.size(first_blank_idx), 0)\n\n#         dec_idx = first_blank_idx[0][0]\n\n#         decoded = _y_pred[0][:dec_idx]\n#         n_dec = tf.shape(decoded)[0]\n\n#         _decoded = tf.concat([_decoded, decoded],\n#             axis=0)\n\n#         return _y_pred[1:, n_dec:], _decoded\n\n#     decoded_0 = tf.constant([], dtype=tf.int32)\n\n#     _, decoded = tf.while_loop(\n#         lambda _y_pred, _decoded: tf.not_equal(tf.size(_y_pred), 0),\n#         _loop_body,\n#         [y_pred, decoded_0],\n#         shape_invariants=[tf.TensorShape([None, None]), tf.TensorShape([None])],\n#         name='greedy_decode')\n\n#     return tf.expand_dims(decoded, axis=0)\n\n\n# a = tf.constant([\n#     [\n#         [1, 4, 4, 4, 4, 3, 2],\n#         [0, 0, 0, 0, 0, 0, 0],\n#         [0, 0, 1, 0, 0, 0, 0],\n#         [0, 0, 0, 4, 1, 4, 0]\n#     ]\n# ])\n\n# tf.config.experimental_run_functions_eagerly(True)\n\n# a = tf.zeros((4, 100, 80))\n\n# print(a)\n\n# import sys\n# import os\n\n# FILE_DIR = os.path.dirname(os.path.realpath(__file__))\n# sys.path = [os.path.join(FILE_DIR, '..')] + sys.path\n\n# from model import build_keras_model\n# from hparams import *\n\n# hparams = {\n\n#     HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],\n\n#     # Preprocessing\n#     HP_MEL_BINS: HP_MEL_BINS.domain.values[0],\n#     HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],\n#     HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],\n#     HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],\n#     HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],\n\n#     # Model\n#     HP_EMBEDDING_SIZE: HP_EMBEDDING_SIZE.domain.values[0],\n#     HP_ENCODER_LAYERS: HP_ENCODER_LAYERS.domain.values[0],\n#     HP_ENCODER_SIZE: HP_ENCODER_SIZE.domain.values[0],\n#     HP_PROJECTION_SIZE: HP_PROJECTION_SIZE.domain.values[0],\n#     HP_TIME_REDUCT_INDEX: HP_TIME_REDUCT_INDEX.domain.values[0],\n#     HP_TIME_REDUCT_FACTOR: HP_TIME_REDUCT_FACTOR.domain.values[0],\n#     HP_PRED_NET_LAYERS: HP_PRED_NET_LAYERS.domain.values[0],\n#     HP_PRED_NET_SIZE: HP_PRED_NET_SIZE.domain.values[0],\n#     HP_JOINT_NET_SIZE: HP_JOINT_NET_SIZE.domain.values[0],\n\n#     HP_LEARNING_RATE: HP_LEARNING_RATE.domain.values[0]\n\n# }\n\n# hparams = {k.name: v for k, v in hparams.items()}\n# hparams['vocab_size'] = 73\n\n# model = build_keras_model(hparams)\n\n# greedy_decode = greedy_decode_fn(model)\n\n# print(greedy_decode(a, max_length=20))\n"
  },
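  {
    "path": "examples/joint_broadcast.py",
    "content": "# Illustrative sketch, not part of the original project: it shows the\n# broadcast used by utils/decoding.joint to combine a single encoder frame\n# with the last prediction-network step. The shapes are toy values; in\n# practice the trailing dimension is the joint-network input size.\nimport tensorflow as tf\n\nf = tf.random.normal((1, 1, 8))   # [B, T=1, V]: one encoder frame\ng = tf.random.normal((1, 4, 8))   # [B, U, V]: prediction-network outputs\n\njoint_inp = (\n    tf.expand_dims(f, axis=2) +               # [1, 1, 1, 8]\n    tf.expand_dims(g[:, -1, :], axis=1))      # [1, 1, 8], broadcasts over T\n\nprint(joint_inp.shape)  # (1, 1, 1, 8)\n"
  },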
  {
    "path": "utils/encoding.py",
    "content": "import os\nimport tensorflow_datasets as tfds\nimport tensorflow as tf\n\nfrom hparams import *\nfrom . import vocabulary, preprocessing\n\n\ndef build_lookup_table(keys, values=None, default_value=-1):\n\n    if values is None:\n        values = tf.range(len(keys))\n\n    kv_init = tf.lookup.KeyValueTensorInitializer(\n        keys=keys, values=values)\n\n    return tf.lookup.StaticHashTable(kv_init,\n        default_value=default_value)\n\n\ndef wordpiece_encode(text, encoder):\n\n    return tf.constant(encoder.encode(text.numpy()),\n        dtype=tf.int32)\n\n\ndef tf_wordpiece_encode(text, encoder):\n\n    return tf.py_function(lambda x: wordpiece_encode(x, encoder),\n        inp=[text], Tout=tf.int32)\n\n\ndef wordpiece_decode(ids, encoder):\n\n    return tf.constant(encoder.decode(ids.numpy()))\n\n\ndef tf_wordpiece_decode(ids, encoder):\n\n    return tf.py_function(lambda x: wordpiece_decode(x, encoder),\n        inp=[ids], Tout=[tf.string])[0]\n\n\ndef tf_vocab_encode(text, vocab_table):\n\n    tokens = tf.strings.bytes_split(text)\n\n    return vocab_table.lookup(tokens)\n\n\ndef get_encoder(encoder_dir,\n                hparams,\n                texts_generator=None):\n\n    def preprocessed_gen():\n        if texts_generator is None:\n            return\n        for x in texts_generator:\n            yield preprocessing.normalize_text(x)\n\n    if hparams[HP_TOKEN_TYPE.name] == 'character':\n\n        vocab = vocabulary.init_vocab()\n        vocab_table = build_lookup_table(vocab,\n            default_value=0)\n\n        vocab_size = len(vocab)\n\n        encoder_fn = lambda text: tf_vocab_encode(text, vocab_table)\n        decoder_fn = None\n\n    elif hparams[HP_TOKEN_TYPE.name] == 'word-piece':\n\n        encoder_filename = 'encoder'\n        encoder_filepath = os.path.join(encoder_dir, encoder_filename)\n\n        if os.path.exists('{}.subwords'.format(encoder_filepath)):\n            \n            encoder = tfds.core.features.text.SubwordTextEncoder.load_from_file(encoder_filepath)\n        else:\n            encoder = tfds.core.features.text.SubwordTextEncoder.build_from_corpus(\n                corpus_generator=preprocessed_gen(),\n                target_vocab_size=hparams[HP_VOCAB_SIZE.name])\n            os.makedirs(encoder_dir, exist_ok=True)\n            encoder.save_to_file(encoder_filepath)\n\n        vocab_size = encoder.vocab_size\n\n        encoder_fn = lambda text: tf_wordpiece_encode(text, encoder)\n        decoder_fn = lambda ids: tf_wordpiece_decode(ids, encoder)\n\n    return encoder_fn, decoder_fn, vocab_size\n"
  },
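  {
    "path": "examples/char_encoding.py",
    "content": "# Illustrative sketch, not part of the original project: it runs the\n# character-level lookup used by utils/encoding.get_encoder on a toy\n# string. Assumes the repo root is on PYTHONPATH (utils/encoding.py also\n# pulls in hparams.py and tensorflow_datasets); the input text is invented.\nimport tensorflow as tf\n\nfrom utils import vocabulary\nfrom utils.encoding import build_lookup_table\n\nvocab = vocabulary.init_vocab()  # blank, space, <s>, </s>, a-z, apostrophe\ntable = build_lookup_table(vocab, default_value=0)\n\ntokens = tf.strings.bytes_split(tf.constant('hi there'))\nids = table.lookup(tokens)\n\nprint(ids.numpy())  # one integer id per character\n"
  },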
  {
    "path": "utils/loss.py",
    "content": "from absl import logging\nimport tensorflow as tf\n\n_has_loss_func = False\ntry:\n    from warprnnt_tensorflow import rnnt_loss\n    _has_loss_func = True\nexcept ImportError:\n    pass\n\n\ndef get_loss_fn(reduction_factor):\n\n    def _fallback_loss(y_true,\n                       y_pred,\n                       spec_lengths,\n                       label_lengths):\n        logging.info('RNN-T loss function not found.')\n        return y_pred\n\n    if not _has_loss_func:\n        return _fallback_loss\n\n    def _loss_fn(y_true,\n                 y_pred,\n                 spec_lengths,\n                 label_lengths):\n        y_true = tf.cast(y_true, dtype=tf.int32)\n        if not tf.test.is_built_with_cuda():\n            y_pred = tf.nn.log_softmax(y_pred)\n        spec_lengths = tf.cast(\n            tf.math.ceil(spec_lengths / reduction_factor),\n            dtype=tf.int32)\n        loss = rnnt_loss(y_pred, y_true,\n            spec_lengths, label_lengths)\n        return loss\n\n    return _loss_fn\n"
  },
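  {
    "path": "examples/loss_lengths.py",
    "content": "# Illustrative sketch, not part of the original project: it shows the\n# length adjustment done inside utils/loss.get_loss_fn when the encoder's\n# time-reduction layer shortens the spectrogram. The lengths are invented.\nimport tensorflow as tf\n\nreduction_factor = 2\nspec_lengths = tf.constant([99, 100], dtype=tf.int32)\n\nreduced = tf.cast(\n    tf.math.ceil(spec_lengths / reduction_factor),\n    dtype=tf.int32)\n\nprint(reduced.numpy())  # [50 50]\n"
  },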
  {
    "path": "utils/metrics.py",
    "content": "import tensorflow as tf\n\nfrom . import decoding\n\n\ndef error_rate(y_true, decoded):\n\n    y_true_shape = tf.shape(y_true)\n    decoded_shape = tf.shape(decoded)\n\n    max_length = tf.maximum(y_true_shape[-1], decoded_shape[-1])\n\n    if y_true.dtype == tf.string:\n        truth = string_to_sparse(y_true)\n    else:\n        truth = tf.sparse.from_dense(y_true)\n\n    if decoded.dtype == tf.string:\n        hypothesis = string_to_sparse(decoded)\n    else:\n        hypothesis = tf.sparse.from_dense(decoded)\n\n    err = tf.edit_distance(hypothesis, truth, normalize=False)\n    err_norm = err / tf.cast(max_length, dtype=tf.float32)\n\n    return err_norm\n\n\ndef string_to_sparse(str_tensor):\n\n    orig_shape = tf.cast(tf.shape(str_tensor), dtype=tf.int64)\n    str_tensor = tf.squeeze(str_tensor, axis=0)\n\n    indices = tf.concat([tf.zeros((orig_shape[-1], 1), dtype=tf.int64),\n                         tf.expand_dims(tf.range(0, orig_shape[-1]), axis=-1)],\n        axis=1)\n\n    return tf.SparseTensor(indices=indices, values=str_tensor,\n        dense_shape=orig_shape)\n\n\ndef token_error_rate(y_true, decoded, tok_fn, idx_to_text):\n\n    text_true = idx_to_text(y_true)\n    text_pred = idx_to_text(decoded)\n\n    text_true.set_shape(())\n    text_pred.set_shape(())\n\n    tok_true = tok_fn(text_true)\n    tok_pred = tok_fn(text_pred)\n\n    tok_true = tf.expand_dims(tok_true, axis=0)\n    tok_pred = tf.expand_dims(tok_pred, axis=0)\n\n    return error_rate(tok_true, tok_pred)\n\n\ndef build_accuracy_fn(decode_fn):\n\n    def Accuracy(inputs, y_true):\n\n        # Decode functions only returns first result\n        y_true = tf.expand_dims(y_true[0], axis=0)\n\n        max_length = tf.shape(y_true)[1]\n\n        decoded = decode_fn(inputs,\n            max_length=max_length)\n\n        return 1 - error_rate(y_true, decoded)\n\n    return Accuracy\n\n\ndef build_wer_fn(decode_fn, idx_to_text):\n\n    def WER(inputs, y_true):\n\n        # Decode functions only returns first result\n        y_true = y_true[0]\n\n        max_length = tf.shape(y_true)[0]\n\n        decoded = decode_fn(inputs,\n            max_length=max_length)[0]\n\n        return token_error_rate(y_true, decoded,\n            tok_fn=lambda t: tf.strings.split(t, sep=' '),\n            idx_to_text=idx_to_text)\n\n    return WER\n"
  },
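  {
    "path": "examples/error_rate_demo.py",
    "content": "# Illustrative sketch, not part of the original project: it evaluates\n# utils/metrics.error_rate on toy label sequences. The ids are invented;\n# note that 0 (the blank id) is dropped by tf.sparse.from_dense, so the\n# toy sequences avoid it.\nimport tensorflow as tf\n\nfrom utils.metrics import error_rate\n\ny_true = tf.constant([[1, 2, 3, 4]], dtype=tf.int32)\ndecoded = tf.constant([[1, 2, 5, 4]], dtype=tf.int32)  # one substitution\n\n# One edit over a maximum length of 4 -> 0.25\nprint(error_rate(y_true, decoded).numpy())\n"
  },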
  {
    "path": "utils/model.py",
    "content": "from absl import logging\nimport os\nimport json\nimport re\n\nfrom model import build_keras_model\n\n\ndef load_hparams(model_dir):\n\n    with open(os.path.join(model_dir, 'hparams.json'), 'r') as f:\n        return json.load(f)\n\n\ndef save_hparams(hparams, model_dir):\n\n    with open(os.path.join(model_dir, 'hparams.json'), 'w') as f:\n        json.dump(hparams, f)\n\n\n"
  },
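  {
    "path": "examples/hparams_roundtrip.py",
    "content": "# Illustrative sketch, not part of the original project: it round-trips a\n# hyperparameter dict through utils/model.save_hparams / load_hparams.\n# The keys and values are invented; a real run uses the names from hparams.py.\nimport tempfile\n\nfrom utils import model as model_utils\n\nhparams = {'vocab_size': 73, 'encoder_layers': 8}\nmodel_dir = tempfile.mkdtemp()\n\nmodel_utils.save_hparams(hparams, model_dir)\nassert model_utils.load_hparams(model_dir) == hparams\n"
  },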
  {
    "path": "utils/preprocessing.py",
    "content": "import glob\nimport os\nimport librosa.display\nimport librosa\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\nimport numpy as np\n\nfrom hparams import *\n\n\ndef tf_load_audio(path, pre_emphasis=0.97):\n\n    audio_raw = tf.io.read_file(path)\n\n    audio, sr = tf.audio.decode_wav(audio_raw)\n\n    if tf.rank(audio) > 1:\n        audio = audio[:, 0]\n\n    return audio, sr\n\n\ndef normalize_text(text):\n\n    text = text.lower()\n    text = text.replace('\"', '')\n\n    return text\n\n\ndef tf_normalize_text(text):\n\n    return tf.py_function(\n        lambda x: normalize_text(x.numpy().decode('utf8')),\n        inp=[text],\n        Tout=tf.string)\n\n\ndef print_tensor(t, template='{}'):\n\n    return tf.py_function(\n        lambda x: print(template.format(x.numpy())),\n        inp=[t],\n        Tout=[])\n\n\ndef compute_mel_spectrograms(audio_arr,\n                             sample_rate,\n                             n_mel_bins,\n                             frame_length,\n                             frame_step,\n                             hertz_low,\n                             hertz_high):\n\n    sample_rate_f = tf.cast(sample_rate, dtype=tf.float32)\n\n    frame_length = tf.cast(tf.round(sample_rate_f * frame_length), dtype=tf.int32)\n    frame_step = tf.cast(tf.round(sample_rate_f * frame_step), dtype=tf.int32)\n\n    stfts = tf.signal.stft(audio_arr,\n                           frame_length=frame_length,\n                           frame_step=frame_step)\n\n    mag_specs = tf.abs(stfts)\n    num_spec_bins = tf.shape(mag_specs)[-1]\n\n    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(\n        num_mel_bins=n_mel_bins, num_spectrogram_bins=num_spec_bins,\n        sample_rate=sample_rate_f,\n        lower_edge_hertz=hertz_low,\n        upper_edge_hertz=hertz_high)\n\n    mel_specs = tf.tensordot(mag_specs, linear_to_mel_weight_matrix, 1)\n    mel_specs.set_shape(mag_specs.shape[:-1].concatenate(\n        linear_to_mel_weight_matrix.shape[-1:]))\n\n    log_mel_specs = tf.math.log(mel_specs + 1e-6)\n    log_mel_specs -= (tf.reduce_mean(log_mel_specs, axis=0) + 1e-8)\n\n    return log_mel_specs\n\n\ndef downsample_spec(mel_spec, n=3):\n\n    spec_shape = tf.shape(mel_spec)\n    spec_length, feat_size = spec_shape[0], spec_shape[1]\n\n    trimmed_length = (spec_length // n) * n\n\n    trimmed_spec = mel_spec[:trimmed_length]\n    spec_sampled = tf.reshape(trimmed_spec, (-1, feat_size * n))\n\n    return spec_sampled\n\n\ndef load_dataset(data_dir, name):\n\n    filenames = glob.glob(os.path.join(data_dir,\n        '{}.tfrecord'.format(name)))\n\n    raw_dataset = tf.data.TFRecordDataset(filenames)\n\n    parsed_dataset = raw_dataset.map(parse_example,\n        num_parallel_calls=tf.data.experimental.AUTOTUNE)\n\n    return parsed_dataset\n\n\ndef parse_example(serialized_example):\n\n    parse_dict = {\n        'mel_specs': tf.io.FixedLenFeature([], tf.string),\n        'pred_inp': tf.io.FixedLenFeature([], tf.string),\n        'spec_lengths': tf.io.FixedLenFeature([], tf.string),\n        'label_lengths': tf.io.FixedLenFeature([], tf.string),\n        'labels': tf.io.FixedLenFeature([], tf.string),\n    }\n\n    example = tf.io.parse_single_example(serialized_example, parse_dict)\n\n    mel_specs = tf.io.parse_tensor(example['mel_specs'], out_type=tf.float32)\n    pred_inp = tf.io.parse_tensor(example['pred_inp'], out_type=tf.int32)\n    spec_lengths = tf.io.parse_tensor(example['spec_lengths'], out_type=tf.int32)\n    
label_lengths = tf.io.parse_tensor(example['label_lengths'], out_type=tf.int32)\n\n    labels = tf.io.parse_tensor(example['labels'], out_type=tf.int32)\n\n    return (mel_specs, pred_inp, spec_lengths, label_lengths, labels)\n\n\ndef serialize_example(mel_specs,\n                      pred_inp,\n                      spec_lengths,\n                      label_lengths,\n                      labels):\n\n    def _bytes_feature(value):\n        \"\"\"Returns a bytes_list from a string / byte.\"\"\"\n        if isinstance(value, type(tf.constant(0))): # if value ist tensor\n            value = value.numpy() # get value of tensor\n        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))\n\n    mel_specs_s = tf.io.serialize_tensor(mel_specs)\n    pred_inp_s = tf.io.serialize_tensor(pred_inp)\n    spec_lengths_s = tf.io.serialize_tensor(spec_lengths)\n    label_lengths_s = tf.io.serialize_tensor(label_lengths)\n\n    labels_s = tf.io.serialize_tensor(labels)\n\n    feature = {\n        'mel_specs': _bytes_feature(mel_specs_s),\n        'pred_inp': _bytes_feature(pred_inp_s),\n        'spec_lengths': _bytes_feature(spec_lengths_s),\n        'label_lengths': _bytes_feature(label_lengths_s),\n        'labels': _bytes_feature(labels_s)\n    }\n\n    example = tf.train.Example(features=tf.train.Features(feature=feature))\n\n    return example.SerializeToString()\n\ndef tf_serialize_example(mel_specs,\n                         pred_inp,\n                         spec_lengths,\n                         label_lengths,\n                         labels):\n\n    tf_string = tf.py_function(\n        serialize_example,\n        (mel_specs, pred_inp, spec_lengths, label_lengths, labels),\n        tf.string)\n\n    return tf.reshape(tf_string, ())\n\n\ndef preprocess_text(text, encoder_fn, vocab_size):\n\n    norm_text = tf_normalize_text(text)\n    enc_text = encoder_fn(norm_text)\n    enc_padded = tf.concat([[0], enc_text], axis=0)\n\n    return enc_text, enc_padded\n\n\ndef plot_spec(spec, sr, transcription, name):\n\n    spec_db = librosa.amplitude_to_db(spec, ref=np.max)\n\n    plt.figure(figsize=(12,4))\n    librosa.display.specshow(spec_db, sr=sr,\n        x_axis='time', y_axis='mel',\n        hop_length=sr * 0.01)\n    plt.colorbar(format='%+02.0f dB')\n    plt.savefig('figs/{}.png'.format(name))\n    plt.clf()\n\n\ndef tf_plot_spec(spec, sr, transcription, name):\n\n    spec_t = tf.transpose(spec)\n\n    return tf.py_function(\n        lambda _spec, _sr, trans: plot_spec(\n            _spec.numpy(), _sr.numpy(),\n            trans.numpy().decode('utf8'),\n            name\n        ),\n        inp=[spec_t, sr, transcription],\n        Tout=[])\n\n\ndef plot_audio(audio_arr, sr, trans, name):\n\n    with open('figs/trans.txt', 'a') as f:\n        f.write('{} {}\\n'.format(name, trans))\n\n    t = np.linspace(0, audio_arr.shape[0] / sr,\n        num=audio_arr.shape[0])\n\n    plt.figure(1)\n    plt.plot(t, audio_arr)\n    plt.savefig('figs/{}.png'.format(name))\n    plt.clf()\n\n\ndef tf_plot_audio(audio_arr, sr, trans, name):\n\n    return tf.py_function(\n        lambda _audio, _sr, _trans: plot_audio(\n            _audio.numpy(), _sr.numpy(),\n            _trans.numpy(), name\n        ),\n        inp=[audio_arr, sr, trans],\n        Tout=[])\n\n\ndef preprocess_audio(audio,\n                     sample_rate,\n                     hparams):\n\n    log_melspec = compute_mel_spectrograms(\n        audio_arr=audio,\n        sample_rate=sample_rate,\n        
n_mel_bins=hparams[HP_MEL_BINS.name],\n        frame_length=hparams[HP_FRAME_LENGTH.name],\n        frame_step=hparams[HP_FRAME_STEP.name],\n        hertz_low=hparams[HP_HERTZ_LOW.name],\n        hertz_high=hparams[HP_HERTZ_HIGH.name])\n\n    downsampled_spec = downsample_spec(log_melspec)\n\n    return downsampled_spec\n\n\ndef preprocess_dataset(dataset,\n                       encoder_fn,\n                       hparams,\n                       max_length=0,\n                       save_plots=False):\n\n    _dataset = dataset\n\n    if max_length > 0:\n        _dataset = _dataset.filter(lambda audio, sr, trans: (\n            tf.shape(audio)[0] <= sr * tf.constant(max_length)))\n\n    if save_plots:\n        os.makedirs('figs', exist_ok=True)\n        for i, (audio_arr, sr, trans) in enumerate(_dataset.take(5)):\n            tf_plot_audio(audio_arr, sr, trans, 'audio_{}'.format(i))\n\n    _dataset = _dataset.map(lambda audio, sr, trans: (\n        preprocess_audio(\n            audio=audio,\n            sample_rate=sr,\n            hparams=hparams),\n        sr,\n        *preprocess_text(trans,\n            encoder_fn=encoder_fn,\n            vocab_size=hparams[HP_VOCAB_SIZE.name]),\n        trans\n    ), num_parallel_calls=tf.data.experimental.AUTOTUNE)\n\n    if save_plots:\n        for i, (log_melspec, sr, _, _, trans) in enumerate(_dataset.take(5)):\n            tf_plot_spec(log_melspec, sr, trans, 'input_{}'.format(i))\n\n    _dataset = _dataset.map(\n        lambda log_melspec, sr, labels, pred_inp, trans: (\n            log_melspec, pred_inp,\n            tf.shape(log_melspec)[0], tf.shape(labels)[0],\n            labels\n        ),\n        num_parallel_calls=tf.data.experimental.AUTOTUNE)\n\n    _dataset = _dataset.map(tf_serialize_example)\n\n    return _dataset\n"
  },
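  {
    "path": "examples/downsample_demo.py",
    "content": "# Illustrative sketch, not part of the original project: it shows the\n# frame stacking done by utils/preprocessing.downsample_spec, which packs\n# every n consecutive spectrogram frames into one wider feature vector and\n# trims any leftover frames. The shapes are toy values.\nimport tensorflow as tf\n\nfrom utils.preprocessing import downsample_spec\n\nspec = tf.reshape(tf.range(14, dtype=tf.float32), (7, 2))  # 7 frames, 2 features\nstacked = downsample_spec(spec, n=3)\n\nprint(stacked.shape)  # (2, 6): 6 usable frames, stacked 3 at a time\n"
  },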
  {
    "path": "utils/vocabulary.py",
    "content": "def init_vocab():\n\n    alphabet = \"abcdefghijklmnopqrstuvwxyz'\"\n    alphabet_c = ['', ' ', '<s>', '</s>'] + [c for c in alphabet]\n\n    return alphabet_c\n\n\ndef load_vocab(filepath):\n\n    vocab = []\n\n    with open(filepath, 'r') as f:\n        for line in f:\n            line = line.strip().strip('\\n')\n            if line == '<blank>':\n                line = ''\n            elif line == '<space>':\n                line = ' '\n            vocab.append(line)\n\n    return vocab\n\n\ndef save_vocab(vocab, filepath):\n\n    with open(filepath, 'w') as f:\n        for c in vocab:\n            if c == '':\n                c = '<blank>'\n            elif c == ' ':\n                c = '<space>'\n            f.write('{}\\n'.format(c))"
  }
]