Full Code of noahchalifour/rnnt-speech-recognition for AI

master a685904d71b1 cached

31 files

76.3 KB

19.5k tokens

80 symbols

1 requests

Download .txt

Repository: noahchalifour/rnnt-speech-recognition
Branch: master
Commit: a685904d71b1
Files: 31
Total size: 76.3 KB

Directory structure:
gitextract_zm_qbej6/

├── .gitignore
├── .gitmodules
├── LICENSE
├── README.md
├── __init__.py
├── cmake/
│   └── warp-rnnt-cmakelist.txt
├── debug/
│   ├── debug_dataset.py
│   └── get_common_voice_stats.py
├── hparams.py
├── model.py
├── preprocess_common_voice.py
├── preprocess_librispeech.py
├── quantize_model.py
├── requirements.txt
├── run_rnnt.py
├── scripts/
│   ├── build_rnnt.sh
│   ├── common_voice_convert.sh
│   └── remove_missing_samples.py
├── streaming_transcribe.py
├── transcribe_file.py
└── utils/
    ├── __init__.py
    ├── data/
    │   ├── __init__.py
    │   ├── common_voice.py
    │   └── librispeech.py
    ├── decoding.py
    ├── encoding.py
    ├── loss.py
    ├── metrics.py
    ├── model.py
    ├── preprocessing.py
    └── vocabulary.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

/.vscode
.DS_Store
/model
/logs
/data
rsync_exclude.txt
/scripts/train.sh
/scripts/send_to_server.sh
/data_p
/model_tmp
/figs

================================================
FILE: .gitmodules
================================================
[submodule "warp-transducer"]
	path = warp-transducer
	url = https://github.com/noahchalifour/warp-transducer.git


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2019 Noah Chalifour

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# RNN-Transducer Speech Recognition

End-to-end speech recognition using RNN-Transducer in Tensorflow 2.0

## Overview

This speech recognition model is based off Google's [Streaming End-to-end Speech Recognition For Mobile Devices](https://arxiv.org/pdf/1811.06621.pdf) research paper and is implemented in Python 3 using Tensorflow 2.0

## Setup Your Environment

To setup your environment, run the following command:

```
git clone --recurse https://github.com/noahchalifour/rnnt-speech-recognition.git
cd rnnt-speech-recognition
pip install tensorflow==2.2.0 # or tensorflow-gpu==2.2.0 for GPU support
pip install -r requirements.txt
./scripts/build_rnnt.sh # to setup the rnnt loss
```

## Common Voice

You can find and download the Common Voice dataset [here](https://voice.mozilla.org/en/datasets)

### Convert all MP3s to WAVs

Before you can train a model on the Common Voice dataset, you must first convert all the audio mp3 filetypes to wavs. Do so by running the following command:

> **_NOTE:_** Make sure you have `ffmpeg` installed on your computer, as it uses that to convert mp3 to wav

```
./scripts/common_voice_convert.sh <data_dir> <# of threads>
python scripts/remove_missing_samples.py \
    --data_dir <data_dir> \
    --replace_old
```

### Preprocessing dataset

After converting all the mp3s to wavs you need to preprocess the dataset, you can do so by running the following command:

```
python preprocess_common_voice.py \
    --data_dir <data_dir> \
    --output_dir <preprocessed_dir>
```

### Training a model

<!-- #### Training on Host -->

To train a simple model, run the following command:

```
python run_rnnt.py \
    --mode train \
    --data_dir <path to data directory>
```

<!-- #### Training in Docker Container

[View Image](https://hub.docker.com/r/noahchalifour/rnnt-speech-recognition)

You can also train your model in a docker container based on the Tensorflow docker image.

> **_NOTE:_** Specify all your paramters in ALL CAPS as environment variables when training in a docker container.

To run the model using a CPU only, run the following command:

```
docker run -d --name rnnt-speech-recognition \
    -v <path to local data>:/rnnt-speech-recognition/data \
    -v <path to save model locally>:/rnnt-speech-recognition/model \
    -e MODE=train \
    -e DATA_DIR=./data \
    -e OUTPUT_DIR=./model \
    noahchalifour/rnnt-speech-recognition
```

To run the model using a GPU you must run the following command with the added `--cap-add SYS_ADMIN`, and `--gpus <gpus>`:

```
docker run -d --name rnnt-speech-recognition \
    --cap-add SYS_ADMIN \
    --gpus <gpus> \
    -v <path to local data>:/rnnt-speech-recognition/data \
    -v <path to save model locally>:/rnnt-speech-recognition/model \
    -e MODE=train \
    -e DATA_DIR=./data \
    -e OUTPUT_DIR=./model \
    noahchalifour/rnnt-speech-recognition
``` -->


================================================
FILE: __init__.py
================================================


================================================
FILE: cmake/warp-rnnt-cmakelist.txt
================================================
IF (APPLE)
    cmake_minimum_required(VERSION 3.4)
ELSE()
    cmake_minimum_required(VERSION 2.8)
ENDIF()

project(rnnt_release)

IF (NOT APPLE)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
ENDIF()

IF (APPLE)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -O2")
    add_definitions(-DAPPLE)
ENDIF()

include_directories(include)

FIND_PACKAGE(CUDA)
MESSAGE(STATUS "cuda found ${CUDA_FOUND}")

option(USE_NAIVE_KERNEL "use naive alpha-beta kernel" OFF)
option(DEBUG_TIME "output kernel time" OFF)
option(DEBUG_KERNEL "output alpha beta" OFF)
if (USE_NAIVE_KERNEL)
    add_definitions(-DUSE_NAIVE_KERNEL)
endif()
if (DEBUG_TIME)
    add_definitions(-DDEBUG_TIME)
endif()
if (DEBUG_KERNEL)
    add_definitions(-DDEBUG_KERNEL)
endif()

option(WITH_GPU "compile warp-rnnt with cuda." ${CUDA_FOUND})
option(WITH_OMP "compile warp-rnnt with openmp." ON)

if(NOT WITH_OMP)
    add_definitions(-DRNNT_DISABLE_OMP)
endif()
if (WITH_OMP)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -Xcompiler -fopenmp")
endif()

# need to be at least 30 or __shfl_down in reduce wont compile
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_35,code=sm_35")

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_50,code=sm_50")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_52,code=sm_52")
IF(CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 5)
  SET(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES")
ENDIF()

IF (CUDA_VERSION GREATER 7.6)
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_60,code=sm_60")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_61,code=sm_61")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_62,code=sm_62")
ENDIF()

IF (CUDA_VERSION GREATER 8.9)
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_70,code=sm_70")
ENDIF()

IF (CUDA_VERSION GREATER 9.9)
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_75,code=sm_75")
ENDIF()

if (NOT APPLE)
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} --std=c++11")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS}")
ENDIF()


IF (APPLE)
    EXEC_PROGRAM(uname ARGS -v  OUTPUT_VARIABLE DARWIN_VERSION)
    STRING(REGEX MATCH "[0-9]+" DARWIN_VERSION ${DARWIN_VERSION})
    MESSAGE(STATUS "DARWIN_VERSION=${DARWIN_VERSION}")

    #for el capitain have to use rpath

    IF (DARWIN_VERSION LESS 15)
        set(CMAKE_SKIP_RPATH TRUE)
    ENDIF ()

ELSE()
    #always skip for linux
    set(CMAKE_SKIP_RPATH TRUE)
ENDIF()


IF (WITH_GPU)

    MESSAGE(STATUS "Building shared library with GPU support")
    set(CUDA_curand_LIBRARY "/usr/local/cuda/lib64/libcurand.so.10")

    CUDA_ADD_LIBRARY(warprnnt SHARED src/rnnt_entrypoint.cu)
    IF (!Torch_FOUND)
        TARGET_LINK_LIBRARIES(warprnnt ${CUDA_curand_LIBRARY})
    ENDIF()

    cuda_add_executable(test_time_gpu tests/test_time.cu tests/random.cpp )
    TARGET_LINK_LIBRARIES(test_time_gpu warprnnt ${CUDA_curand_LIBRARY})
    SET_TARGET_PROPERTIES(test_time_gpu PROPERTIES COMPILE_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11")

    cuda_add_executable(test_gpu tests/test_gpu.cu tests/random.cpp )
    TARGET_LINK_LIBRARIES(test_gpu warprnnt ${CUDA_curand_LIBRARY})
    SET_TARGET_PROPERTIES(test_gpu PROPERTIES COMPILE_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11")

ELSE()
    MESSAGE(STATUS "Building shared library with no GPU support")

    if (NOT APPLE)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -O2")
    ENDIF()

    ADD_LIBRARY(warprnnt SHARED src/rnnt_entrypoint.cpp)

ENDIF()


add_executable(test_cpu tests/test_cpu.cpp tests/random.cpp )
TARGET_LINK_LIBRARIES(test_cpu warprnnt)
SET_TARGET_PROPERTIES(test_cpu PROPERTIES COMPILE_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11")

add_executable(test_time tests/test_time.cpp tests/random.cpp )
TARGET_LINK_LIBRARIES(test_time warprnnt)
SET_TARGET_PROPERTIES(test_time PROPERTIES COMPILE_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11")

INSTALL(TARGETS warprnnt
        RUNTIME DESTINATION "bin"
        LIBRARY DESTINATION "lib"
        ARCHIVE DESTINATION "lib")

INSTALL(FILES include/rnnt.h DESTINATION "include")


================================================
FILE: debug/debug_dataset.py
================================================
from argparse import ArgumentParser
import os
import json
import sys
import tensorflow as tf

FILE_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.append(os.path.join(FILE_DIR, '..'))

from utils import preprocessing


def check_for_invalid_values(inp, labels):

    tf.debugging.check_numerics(inp['mel_specs'],
        message='mel_specs has invalid value.')

    return inp, labels


def check_empty(inp, labels):

    tf.debugging.assert_none_equal(
        tf.size(inp['mel_specs']), 0,
        message='mel_specs is empty tensor.')

    tf.debugging.assert_none_equal(
        tf.size(inp['pred_inp']), 0,
        message='pred_inp is empty tensor.')

    tf.debugging.assert_none_equal(
        tf.size(inp['spec_lengths']), 0,
        message='spec_lengths is empty tensor.')

    tf.debugging.assert_none_equal(
        tf.size(inp['label_lengths']), 0,
        message='label_lengths is empty tensor.')

    tf.debugging.assert_none_equal(
        tf.size(labels), 0,
        message='labels is empty tensor.')

    return inp, labels


def get_dataset(data_dir, 
                name, 
                batch_size,
                n_epochs):

    dataset = preprocessing.load_dataset(data_dir, name)
    dataset = dataset.padded_batch(
        batch_size, padded_shapes=({
            'mel_specs': [-1, -1], 
            'pred_inp': [-1],
            'spec_lengths': [],
            'label_lengths': []
        }, [-1]))

    dataset = dataset.repeat(n_epochs)

    with open(os.path.join(data_dir, '{}-specs.json'.format(name)), 'r') as f:
        dataset_specs = json.load(f)

    return dataset, dataset_specs


def main(args):

    dataset, dataset_specs = get_dataset(
        args.data_dir, args.split,
        batch_size=1, n_epochs=1)

    dataset.map(check_for_invalid_values)
    dataset.map(check_empty)

    for _ in dataset:
        pass
    
    print('All checks passed.')


def parse_args():

    ap = ArgumentParser()

    ap.add_argument('-d', '--data_dir', type=str, required=True,
        help='Path to preprocessed dataset.')
    ap.add_argument('-s', '--split', type=str, default='train',
        help='Name of dataset split to inspect.')

    return ap.parse_args()


if __name__ == '__main__':

    args = parse_args()
    main(args)

================================================
FILE: debug/get_common_voice_stats.py
================================================
from argparse import ArgumentParser
from scipy.io.wavfile import read as read_wav
import glob
import os


def main(args):

    max_length = 0
    min_length = 0
    total_length = 0
    count = 0

    with open(os.path.join(args.data_dir, args.split + '.tsv'), 'r') as f:
        next(f)
        for line in f:

            line_split = line.split('\t')
            audio_fn = line_split[1]

            filepath = os.path.join(args.data_dir, 'clips', audio_fn[:-4] + '.wav')

            sr, data = read_wav(filepath)

            length = len(data) / sr

            if length > max_length:
                max_length = length
            if length < min_length or min_length == 0:
                min_length = length

            total_length += length
            count += 1

    avg_length = total_length / count

    print('Total: {:.4f} s'.format(total_length))
    print('Min length: {:.4f} s'.format(min_length))
    print('Max length: {:.4f} s'.format(max_length))
    print('Average length: {:.4f} s'.format(avg_length))


def parse_args():

    ap = ArgumentParser()

    ap.add_argument('-d', '--data_dir', required=True, type=str,
        help='Directory of common voice dataset.')
    ap.add_argument('-s', '--split', type=str, default='train',
        help='Split to get statistics for.')

    return ap.parse_args()


if __name__ == '__main__':

    args = parse_args()
    main(args)

================================================
FILE: hparams.py
================================================
from tensorboard.plugins.hparams import api as hp

HP_TOKEN_TYPE = hp.HParam('token_type', hp.Discrete(['word-piece', 'character']))
HP_VOCAB_SIZE = hp.HParam('vocab_size', hp.Discrete([2**12]))

# Preprocessing Hparams
HP_MEL_BINS = hp.HParam('mel_bins', hp.Discrete([80]))
HP_FRAME_LENGTH = hp.HParam('frame_length', hp.Discrete([0.025]))
HP_FRAME_STEP = hp.HParam('frame_step', hp.Discrete([0.01]))
HP_HERTZ_LOW = hp.HParam('hertz_low', hp.Discrete([125.0]))
HP_HERTZ_HIGH = hp.HParam('hertz_high', hp.Discrete([7600.0]))
HP_DOWNSAMPLE_FACTOR = hp.HParam('downsample_factor', hp.Discrete([3]))

# Model Hparams
HP_EMBEDDING_SIZE = hp.HParam('embedding_size', hp.Discrete([500]))
HP_ENCODER_LAYERS = hp.HParam('encoder_layers', hp.Discrete([8]))
HP_ENCODER_SIZE = hp.HParam('encoder_size', hp.Discrete([2048]))
HP_PROJECTION_SIZE = hp.HParam('projection_size', hp.Discrete([640]))
HP_TIME_REDUCT_INDEX = hp.HParam('time_reduction_index', hp.Discrete([1]))
HP_TIME_REDUCT_FACTOR = hp.HParam('time_reduction_factor', hp.Discrete([2]))
HP_PRED_NET_LAYERS = hp.HParam('pred_net_layers', hp.Discrete([2]))
HP_PRED_NET_SIZE = hp.HParam('pred_net_size', hp.Discrete([2048]))
HP_JOINT_NET_SIZE = hp.HParam('joint_net_size', hp.Discrete([640]))
HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0]))

# HP_EMBEDDING_SIZE = hp.HParam('embedding_size', hp.Discrete([32]))
# HP_ENCODER_LAYERS = hp.HParam('encoder_layers', hp.Discrete([4]))
# HP_ENCODER_SIZE = hp.HParam('encoder_size', hp.Discrete([20]))
# HP_PROJECTION_SIZE = hp.HParam('projection_size', hp.Discrete([50]))
# HP_TIME_REDUCT_INDEX = hp.HParam('time_reduction_index', hp.Discrete([1]))
# HP_TIME_REDUCT_FACTOR = hp.HParam('time_reduction_factor', hp.Discrete([2]))
# HP_PRED_NET_LAYERS = hp.HParam('pred_net_layers', hp.Discrete([2]))
# HP_PRED_NET_SIZE = hp.HParam('pred_net_size', hp.Discrete([100]))
# HP_JOINT_NET_SIZE = hp.HParam('joint_net_size', hp.Discrete([50]))
# HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0.2]))

HP_LEARNING_RATE = hp.HParam('learning_rate', hp.Discrete([1e-4]))

METRIC_TRAIN_LOSS = 'train_loss'
METRIC_TRAIN_ACCURACY = 'train_accuracy'
METRIC_EVAL_LOSS = 'eval_loss'
METRIC_EVAL_ACCURACY = 'eval_accuracy'
METRIC_EVAL_CER = 'eval_cer'
METRIC_EVAL_WER = 'eval_wer'
METRIC_ACCURACY = 'accuracy'
METRIC_CER = 'cer'
METRIC_WER = 'wer'


================================================
FILE: model.py
================================================
import re
import os
import tensorflow as tf

from hparams import *


class TimeReduction(tf.keras.layers.Layer):

    def __init__(self,
                 reduction_factor,
                 batch_size=None,
                 **kwargs):

        super(TimeReduction, self).__init__(**kwargs)

        self.reduction_factor = reduction_factor
        self.batch_size = batch_size

    def call(self, inputs):

        input_shape = tf.shape(inputs)

        batch_size = self.batch_size
        if batch_size is None:
            batch_size = input_shape[0]

        max_time = input_shape[1]
        num_units = inputs.get_shape().as_list()[-1]

        outputs = inputs

        paddings = [[0, 0], [0, tf.math.floormod(max_time, self.reduction_factor)], [0, 0]]
        outputs = tf.pad(outputs, paddings)

        return tf.reshape(outputs, (batch_size, -1, num_units * self.reduction_factor))


def encoder(specs_shape,
            num_layers,
            d_model,
            proj_size,
            reduction_index,
            reduction_factor,
            dropout,
            stateful=False,
            initializer=None,
            dtype=tf.float32):

    batch_size = None
    if stateful:
        batch_size = 1

    mel_specs = tf.keras.Input(shape=specs_shape, batch_size=batch_size,
        dtype=tf.float32)

    norm_mel_specs = tf.keras.layers.BatchNormalization()(mel_specs)

    lstm_cell = lambda: tf.compat.v1.nn.rnn_cell.LSTMCell(d_model,
        num_proj=proj_size, initializer=initializer, dtype=dtype)

    outputs = norm_mel_specs

    for i in range(num_layers):

        rnn_layer = tf.keras.layers.RNN(lstm_cell(),
            return_sequences=True, stateful=stateful)

        outputs = rnn_layer(outputs)
        outputs = tf.keras.layers.Dropout(dropout)(outputs)
        outputs = tf.keras.layers.LayerNormalization(dtype=dtype)(outputs)

        if i == reduction_index:
            # outputs = tf.keras.layers.Conv1D(proj_size,
            #     kernel_size=reduction_factor,
            #     strides=reduction_factor)(outputs)
            outputs = TimeReduction(reduction_factor,
                batch_size=batch_size)(outputs)

    return tf.keras.Model(inputs=[mel_specs], outputs=[outputs],
        name='encoder')


def prediction_network(vocab_size,
                       embedding_size,
                       num_layers,
                       layer_size,
                       proj_size,
                       dropout,
                       stateful=False,
                       initializer=None,
                       dtype=tf.float32):

    batch_size = None
    if stateful:
        batch_size = 1

    inputs = tf.keras.Input(shape=[None], batch_size=batch_size,
        dtype=tf.float32)

    embed = tf.keras.layers.Embedding(vocab_size, embedding_size)(inputs)

    rnn_cell = lambda: tf.compat.v1.nn.rnn_cell.LSTMCell(layer_size,
        num_proj=proj_size, initializer=initializer, dtype=dtype)

    outputs = embed

    for _ in range(num_layers):

        outputs = tf.keras.layers.RNN(rnn_cell(),
            return_sequences=True)(outputs)
        outputs = tf.keras.layers.Dropout(dropout)(outputs)
        outputs = tf.keras.layers.LayerNormalization(dtype=dtype)(outputs)

    return tf.keras.Model(inputs=[inputs], outputs=[outputs],
        name='prediction_network')


def build_keras_model(hparams,
                      stateful=False,
                      initializer=None,
                      dtype=tf.float32):

    specs_shape = [None, hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]]

    batch_size = None
    if stateful:
        batch_size = 1

    mel_specs = tf.keras.Input(shape=specs_shape, batch_size=batch_size,
        dtype=tf.float32, name='mel_specs')
    pred_inp = tf.keras.Input(shape=[None], batch_size=batch_size,
        dtype=tf.float32, name='pred_inp')

    inp_enc = encoder(
        specs_shape=specs_shape,
        num_layers=hparams[HP_ENCODER_LAYERS.name],
        d_model=hparams[HP_ENCODER_SIZE.name],
        proj_size=hparams[HP_PROJECTION_SIZE.name],
        dropout=hparams[HP_DROPOUT.name],
        reduction_index=hparams[HP_TIME_REDUCT_INDEX.name],
        reduction_factor=hparams[HP_TIME_REDUCT_FACTOR.name],
        stateful=stateful,
        initializer=initializer,
        dtype=dtype)(mel_specs)

    pred_outputs = prediction_network(
        vocab_size=hparams[HP_VOCAB_SIZE.name],
        embedding_size=hparams[HP_EMBEDDING_SIZE.name],
        num_layers=hparams[HP_PRED_NET_LAYERS.name],
        layer_size=hparams[HP_PRED_NET_SIZE.name],
        proj_size=hparams[HP_PROJECTION_SIZE.name],
        dropout=hparams[HP_DROPOUT.name],
        stateful=stateful,
        initializer=initializer,
        dtype=dtype)(pred_inp)

    joint_inp = (
        tf.expand_dims(inp_enc, axis=2) +                 # [B, T, V] => [B, T, 1, V]
        tf.expand_dims(pred_outputs, axis=1))             # [B, U, V] => [B, 1, U, V]

    joint_outputs = tf.keras.layers.Dense(hparams[HP_JOINT_NET_SIZE.name],
        kernel_initializer=initializer, activation='tanh')(joint_inp)

    outputs = tf.keras.layers.Dense(hparams[HP_VOCAB_SIZE.name],
        kernel_initializer=initializer)(joint_outputs)

    return tf.keras.Model(inputs=[mel_specs, pred_inp],
        outputs=[outputs], name='transducer')


================================================
FILE: preprocess_common_voice.py
================================================
from absl import app, logging, flags
import os
import json
import tensorflow as tf

from utils import preprocessing, encoding
from utils.data import common_voice
from hparams import *


FLAGS = flags.FLAGS

flags.DEFINE_string(
    'data_dir', None,
    'Directory to read Common Voice data from.')
flags.DEFINE_string(
    'output_dir', './data',
    'Directory to save preprocessed data.')
flags.DEFINE_integer(
    'max_length', 0,
    'Max audio length in seconds.')


def write_dataset(dataset, name):

    filepath = os.path.join(FLAGS.output_dir,
        '{}.tfrecord'.format(name))

    writer = tf.data.experimental.TFRecordWriter(filepath)
    writer.write(dataset)

    logging.info('Wrote {} dataset to {}'.format(
        name, filepath))


def main(_):

    hparams = {

        HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],
        HP_VOCAB_SIZE: HP_VOCAB_SIZE.domain.values[0],

        # Preprocessing
        HP_MEL_BINS: HP_MEL_BINS.domain.values[0],
        HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],
        HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],
        HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],
        HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],
        HP_DOWNSAMPLE_FACTOR: HP_DOWNSAMPLE_FACTOR.domain.values[0]

    }

    _hparams = {k.name: v for k, v in hparams.items()}

    texts_gen = common_voice.texts_generator(FLAGS.data_dir)

    encoder_fn, decoder_fn, vocab_size = encoding.get_encoder(
        encoder_dir=FLAGS.output_dir,
        hparams=_hparams,
        texts_generator=texts_gen)
    _hparams[HP_VOCAB_SIZE.name] = vocab_size

    train_dataset = common_voice.load_dataset(
        FLAGS.data_dir, 'train')
    dev_dataset = common_voice.load_dataset(
        FLAGS.data_dir, 'dev')
    test_dataset = common_voice.load_dataset(
        FLAGS.data_dir, 'test')

    train_dataset = preprocessing.preprocess_dataset(
        train_dataset,
        encoder_fn=encoder_fn,
        hparams=_hparams,
        max_length=FLAGS.max_length,
        save_plots=True)
    write_dataset(train_dataset, 'train')

    dev_dataset = preprocessing.preprocess_dataset(
        dev_dataset,
        encoder_fn=encoder_fn,
        hparams=_hparams,
        max_length=FLAGS.max_length)
    write_dataset(dev_dataset, 'dev')

    test_dataset = preprocessing.preprocess_dataset(
        test_dataset,
        encoder_fn=encoder_fn,
        hparams=_hparams,
        max_length=FLAGS.max_length)
    write_dataset(test_dataset, 'test')


if __name__ == '__main__':

    flags.mark_flag_as_required('data_dir')

    app.run(main)


================================================
FILE: preprocess_librispeech.py
================================================
from absl import app, logging, flags
import os
import json
import tensorflow as tf

from utils import preprocessing, encoding
from utils.data import librispeech
from hparams import *


FLAGS = flags.FLAGS

flags.DEFINE_string(
    'data_dir', None,
    'Directory to read Librispeech data from.')
flags.DEFINE_string(
    'output_dir', './data',
    'Directory to save preprocessed data.')
flags.DEFINE_integer(
    'max_length', 0,
    'Max audio length in seconds.')


def write_dataset(dataset, name):

    filepath = os.path.join(FLAGS.output_dir,
        '{}.tfrecord'.format(name))

    writer = tf.data.experimental.TFRecordWriter(filepath)
    writer.write(dataset)

    logging.info('Wrote {} dataset to {}'.format(
        name, filepath))


def main(_):

    hparams = {

        HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],
        HP_VOCAB_SIZE: HP_VOCAB_SIZE.domain.values[0],

        # Preprocessing
        HP_MEL_BINS: HP_MEL_BINS.domain.values[0],
        HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],
        HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],
        HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],
        HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],
        HP_DOWNSAMPLE_FACTOR: HP_DOWNSAMPLE_FACTOR.domain.values[0]

    }

    train_splits = [
        'dev-clean'
    ]

    dev_splits = [
        'dev-clean'
    ]

    test_splits = [
        'dev-clean'
    ]

    # train_splits = [
    #     'train-clean-100',
    #     'train-clean-360',
    #     'train-other-500'
    # ]

    # dev_splits = [
    #     'dev-clean',
    #     'dev-other'
    # ]

    # test_splits = [
    #     'test-clean',
    #     'test-other'
    # ]

    _hparams = {k.name: v for k, v in hparams.items()}

    texts_gen = librispeech.texts_generator(FLAGS.data_dir,
        split_names=train_splits)

    encoder_fn, decoder_fn, vocab_size = encoding.get_encoder(
        encoder_dir=FLAGS.output_dir,
        hparams=_hparams,
        texts_generator=texts_gen)
    _hparams[HP_VOCAB_SIZE.name] = vocab_size

    train_dataset = librispeech.load_dataset(
        FLAGS.data_dir, train_splits)
    dev_dataset = librispeech.load_dataset(
        FLAGS.data_dir, dev_splits)
    test_dataset = librispeech.load_dataset(
        FLAGS.data_dir, test_splits)

    train_dataset = preprocessing.preprocess_dataset(
        train_dataset,
        encoder_fn=encoder_fn,
        hparams=_hparams,
        max_length=FLAGS.max_length,
        save_plots=True)
    write_dataset(train_dataset, 'train')

    dev_dataset = preprocessing.preprocess_dataset(
        dev_dataset,
        encoder_fn=encoder_fn,
        hparams=_hparams,
        max_length=FLAGS.max_length)
    write_dataset(dev_dataset, 'dev')

    test_dataset = preprocessing.preprocess_dataset(
        test_dataset,
        encoder_fn=encoder_fn,
        hparams=_hparams,
        max_length=FLAGS.max_length)
    write_dataset(test_dataset, 'test')


if __name__ == '__main__':

    flags.mark_flag_as_required('data_dir')

    app.run(main)


================================================
FILE: quantize_model.py
================================================
from argparse import ArgumentParser
import os
import tensorflow as tf

from utils import model as model_utils


def main(args):

    hparams = model_utils.load_hparams(args.model_dir)
    model, _ = model_utils.load_model(args.model_dir, hparams,
        stateful=True)

    model.summary()

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.experimental_new_converter = True
    # converter.experimental_new_quantizer = True
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                           tf.lite.OpsSet.SELECT_TF_OPS]
    # converter.optimizations = [tf.lite.Optimize.DEFAULT]

    tflite_quant_model = converter.convert()

    tflite_dir = os.path.join(args.model_dir, 'tflite')
    os.makedirs(tflite_dir, exist_ok=True)

    with open(os.path.join(tflite_dir, 'model.tflite'), 'wb') as f:
        f.write(tflite_quant_model)

def parse_args():

    ap = ArgumentParser()

    ap.add_argument('-m', '--model_dir', type=str, default='./model',
        help='Directory of model.')

    return ap.parse_args()


if __name__ == '__main__':

    args = parse_args()
    main(args)

================================================
FILE: requirements.txt
================================================
pydub>=0.23.1
scipy>=1.3.1
tqdm
tensorflow-datasets
soundfile
librosa
matplotlib


================================================
FILE: run_rnnt.py
================================================
from absl import flags, logging, app
from tensorboard.plugins.hparams import api as hp
from tensorflow.keras.mixed_precision import experimental as mixed_precision
from datetime import datetime
import json
import re
import os
import time
import shutil

import tensorflow as tf
tf.get_logger().setLevel('WARNING')
tf.autograph.set_verbosity(0)
# tf.random.set_seed(1234)

from utils import preprocessing, vocabulary, encoding, \
    metrics, decoding
from utils.loss import get_loss_fn
from utils import model as model_utils
from model import build_keras_model
from hparams import *

FLAGS = flags.FLAGS

# Required flags
flags.DEFINE_enum(
    'mode', None,
    ['train', 'eval', 'test'],
    'Mode to run.')
flags.DEFINE_string(
    'data_dir', None,
    'Input data directory.')

# Optional flags
flags.DEFINE_string(
    'tb_log_dir', './logs',
    'Directory to save Tensorboard logs.')
flags.DEFINE_string(
    'output_dir', './model',
    'Directory to save model.')
flags.DEFINE_string(
    'checkpoint', None,
    'Checkpoint to restore from.')
flags.DEFINE_integer(
    'batch_size', 32,
    'Training batch size.')
flags.DEFINE_integer(
    'n_epochs', 1000,
    'Number of training epochs.')
flags.DEFINE_integer(
    'steps_per_log', 1,
    'Number of steps between each log.')
flags.DEFINE_integer(
    'steps_per_checkpoint', 1000,
    'Number of steps between eval and checkpoint.')
flags.DEFINE_integer(
    'eval_size', None,
    'Max number of samples to use for eval.')
flags.DEFINE_list(
    'gpus', None,
    'GPUs to run training on.')
flags.DEFINE_bool(
    'fp16_run', False,
    'Run using 16-bit precision instead of 32-bit.')

def get_dataset(data_dir,
                name,
                batch_size,
                n_epochs,
                strategy=None,
                max_size=None):

    dataset = preprocessing.load_dataset(data_dir, name)

    if max_size is not None:
        dataset = dataset.take(max_size)

    dataset = dataset.padded_batch(
        batch_size, padded_shapes=(
            [-1, -1], [-1], [], [],
            [-1]
        )
    )

    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

    if strategy is not None:
        dataset = strategy.experimental_distribute_dataset(dataset)

    return dataset


def configure_environment(gpu_names,
                          fp16_run):

    if fp16_run:
        print('Using 16-bit float precision.')
        policy = mixed_precision.Policy('mixed_float16')
        mixed_precision.set_policy(policy)

    gpus = tf.config.experimental.list_physical_devices('GPU')

    if gpu_names is not None and len(gpu_names) > 0:
        gpus = [x for x in gpus if x.name[len('/physical_device:'):] in gpu_names]

    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            # tf.config.experimental.set_virtual_device_configuration(
            #     gpus[0],
            #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096),
            #         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            logging.warn(str(e))

    if len(gpus) > 1:
        print('Running multi gpu: {}'.format(', '.join(gpu_names)))
        strategy = tf.distribute.MirroredStrategy(
            devices=gpu_names)
    else:
        device = gpus[0].name[len('/physical_device:'):]
        print('Running single gpu: {}'.format(device))
        strategy = tf.distribute.OneDeviceStrategy(
            device=device)

    dtype = tf.float16 if fp16_run else tf.float32

    return strategy, dtype


def setup_hparams(log_dir,
                  checkpoint):

    if checkpoint is not None:

        checkpoint_dir = os.path.dirname(os.path.realpath(checkpoint))
        hparams = model_utils.load_hparams(checkpoint_dir)

        tb_hparams = {}
        tb_keys = [
            HP_TOKEN_TYPE,
            HP_MEL_BINS,
            HP_FRAME_LENGTH,
            HP_FRAME_STEP,
            HP_HERTZ_LOW,
            HP_HERTZ_HIGH,
            HP_DOWNSAMPLE_FACTOR,
            HP_EMBEDDING_SIZE,
            HP_ENCODER_LAYERS,
            HP_ENCODER_SIZE,
            HP_PROJECTION_SIZE,
            HP_TIME_REDUCT_FACTOR,
            HP_TIME_REDUCT_INDEX,
            HP_PRED_NET_LAYERS,
            HP_PRED_NET_SIZE,
            HP_JOINT_NET_SIZE,
            HP_DROPOUT,
            HP_LEARNING_RATE
        ]

        for k, v in hparams.items():
            for tb_key in tb_keys:
                if k == tb_key.name:
                    tb_hparams[tb_key] = v

    else:

        tb_hparams = {

            HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],

            # Preprocessing
            HP_MEL_BINS: HP_MEL_BINS.domain.values[0],
            HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],
            HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],
            HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],
            HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],
            HP_DOWNSAMPLE_FACTOR: HP_DOWNSAMPLE_FACTOR.domain.values[0],

            # Model
            HP_EMBEDDING_SIZE: HP_EMBEDDING_SIZE.domain.values[0],
            HP_ENCODER_LAYERS: HP_ENCODER_LAYERS.domain.values[0],
            HP_ENCODER_SIZE: HP_ENCODER_SIZE.domain.values[0],
            HP_PROJECTION_SIZE: HP_PROJECTION_SIZE.domain.values[0],
            HP_TIME_REDUCT_INDEX: HP_TIME_REDUCT_INDEX.domain.values[0],
            HP_TIME_REDUCT_FACTOR: HP_TIME_REDUCT_FACTOR.domain.values[0],
            HP_PRED_NET_LAYERS: HP_PRED_NET_LAYERS.domain.values[0],
            HP_PRED_NET_SIZE: HP_PRED_NET_SIZE.domain.values[0],
            HP_JOINT_NET_SIZE: HP_JOINT_NET_SIZE.domain.values[0],
            HP_DROPOUT: HP_DROPOUT.domain.values[0],

            HP_LEARNING_RATE: HP_LEARNING_RATE.domain.values[0]

        }

    with tf.summary.create_file_writer(os.path.join(log_dir, 'hparams_tuning')).as_default():
        hp.hparams_config(
            hparams=[
                HP_TOKEN_TYPE,
                HP_VOCAB_SIZE,
                HP_ENCODER_LAYERS,
                HP_ENCODER_SIZE,
                HP_PROJECTION_SIZE,
                HP_TIME_REDUCT_INDEX,
                HP_TIME_REDUCT_FACTOR,
                HP_PRED_NET_LAYERS,
                HP_PRED_NET_SIZE,
                HP_JOINT_NET_SIZE,
                HP_DROPOUT
            ],
            metrics=[
                hp.Metric(METRIC_ACCURACY, display_name='Accuracy'),
                hp.Metric(METRIC_WER, display_name='WER'),
            ],
        )

    return {k.name: v for k, v in tb_hparams.items()}, tb_hparams


def run_metrics(inputs,
                y_true,
                metrics,
                strategy=None):

    return {
        metric_fn.__name__: metric_fn(inputs, y_true)
        for metric_fn in metrics}


def run_training(model,
                 optimizer,
                 loss_fn,
                 train_dataset,
                 batch_size,
                 n_epochs,
                 checkpoint_template,
                 hparams,
                 noise=0,
                 # noise=0.075,
                 strategy=None,
                 steps_per_log=None,
                 steps_per_checkpoint=None,
                 eval_dataset=None,
                 train_metrics=[],
                 eval_metrics=[],
                 fp16_run=False):

    feat_size = hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]

    @tf.function(input_signature=[[
        tf.TensorSpec(shape=[None, None, feat_size], dtype=tf.float32),
        tf.TensorSpec(shape=[None, None], dtype=tf.int32),
        tf.TensorSpec(shape=[None], dtype=tf.int32),
        tf.TensorSpec(shape=[None], dtype=tf.int32),
        tf.TensorSpec(shape=[None, None], dtype=tf.int32)]])
    def train_step(dist_inputs):
        def step_fn(inputs):

            (mel_specs, pred_inp,
             spec_lengths, label_lengths, labels) = inputs
            if noise > 0:
                mel_specs += tf.random.normal([mel_specs.shape[-1]],
                    mean=0, stddev=noise)

            with tf.GradientTape() as tape:
                outputs = model([mel_specs, pred_inp],
                    training=True)

                rnnt_loss = loss_fn(labels, outputs,
                    spec_lengths, label_lengths)

                if fp16_run:
                    rnnt_loss = optimizer.get_scaled_loss(rnnt_loss)

                loss = tf.reduce_sum(rnnt_loss) * (1. / batch_size)

            if train_metrics is not None:
                metric_results = run_metrics(mel_specs, labels,
                    metrics=train_metrics, strategy=strategy)

            gradients = tape.gradient(loss, model.trainable_variables)
            if fp16_run:
                gradients = optimizer.get_unscaled_gradients(gradients)

            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            return rnnt_loss, metric_results

        loss, metrics_results = strategy.run(step_fn, args=(dist_inputs,))
        loss = strategy.reduce(
            tf.distribute.ReduceOp.MEAN, loss, axis=0)
        metrics_results = {name: strategy.reduce(
            tf.distribute.ReduceOp.MEAN, result, axis=0) for name, result in metrics_results.items()}

        return loss, metrics_results

    def checkpoint_model():

        eval_start_time = time.time()

        eval_loss, eval_metrics_results = run_evaluate(
            model=model,
            optimizer=optimizer,
            loss_fn=loss_fn,
            eval_dataset=eval_dataset,
            batch_size=batch_size,
            hparams=hparams,
            strategy=strategy,
            metrics=eval_metrics)

        validation_log_str = 'VALIDATION RESULTS: Time: {:.4f}, Loss: {:.4f}'.format(
            time.time() - eval_start_time, eval_loss)
        for metric_name, metric_result in eval_metrics_results.items():
            validation_log_str += ', {}: {:.4f}'.format(metric_name, metric_result)
        print(validation_log_str)

        tf.summary.scalar(METRIC_EVAL_LOSS, eval_loss, step=global_step)
        if 'Accuracy' in eval_metrics_results:
            tf.summary.scalar(METRIC_EVAL_ACCURACY, eval_metrics_results['Accuracy'], step=global_step)
        if 'WER' in eval_metrics_results:
            tf.summary.scalar(METRIC_EVAL_WER, eval_metrics_results['WER'], step=global_step)

        checkpoint_filepath = checkpoint_template.format(
            step=global_step, val_loss=eval_loss)
        print('Saving checkpoint {}'.format(checkpoint_filepath))
        model.save_weights(checkpoint_filepath)


    with strategy.scope():

        print('Starting training.')

        global_step = 0

        for epoch in range(n_epochs):

            loss_object = tf.keras.metrics.Mean()
            metric_objects = {fn.__name__: tf.keras.metrics.Mean() for fn in train_metrics}

            for batch, inputs in enumerate(train_dataset):

                if global_step % steps_per_checkpoint == 0:
                    if eval_dataset is not None:
                        checkpoint_model()

                start_time = time.time()

                loss, metrics_results = train_step(inputs)

                step_time = time.time() - start_time

                loss_object(loss)
                for metric_name, metric_result in metrics_results.items():
                    metric_objects[metric_name](metric_result)

                if global_step % steps_per_log == 0:
                    log_str = 'Epoch: {}, Batch: {}, Global Step: {}, Step Time: {:.4f}, Loss: {:.4f}'.format(
                        epoch, batch, global_step, step_time, loss_object.result())
                    for metric_name, metric_object in metric_objects.items():
                        log_str += ', {}: {:.4f}'.format(metric_name, metric_object.result())
                    print(log_str)

                    tf.summary.scalar(METRIC_TRAIN_LOSS, loss_object.result(), step=global_step)
                    if 'Accuracy' in metric_objects:
                        tf.summary.scalar(METRIC_TRAIN_ACCURACY, metric_objects['Accuracy'].result(), step=global_step)

                global_step += 1

            epoch_end_log_str = 'EPOCH RESULTS: Loss: {:.4f}'.format(loss_object.result())
            for metric_name, metric_object in metric_objects.items():
                epoch_end_log_str += ', {}: {:.4f}'.format(metric_name, metric_object.result())
            print(epoch_end_log_str)

        checkpoint_model()


def run_evaluate(model,
                 optimizer,
                 loss_fn,
                 eval_dataset,
                 batch_size,
                 strategy,
                 hparams,
                 metrics=[],
                 fp16_run=False):

    feat_size = hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]

    @tf.function(input_signature=[[
        tf.TensorSpec(shape=[None, None, feat_size], dtype=tf.float32),
        tf.TensorSpec(shape=[None, None], dtype=tf.int32),
        tf.TensorSpec(shape=[None], dtype=tf.int32),
        tf.TensorSpec(shape=[None], dtype=tf.int32),
        tf.TensorSpec(shape=[None, None], dtype=tf.int32)]])
    def eval_step(dist_inputs):
        def step_fn(inputs):
            (mel_specs, pred_inp,
             spec_lengths, label_lengths, labels) = inputs
            outputs = model([mel_specs, pred_inp],
                training=False)

            loss = loss_fn(labels, outputs,
                spec_lengths=spec_lengths,
                label_lengths=label_lengths)

            if fp16_run:
                loss = optimizer.get_scaled_loss(loss)

            if metrics is not None:
                metric_results = run_metrics(mel_specs, labels,
                    metrics=metrics, strategy=strategy)

            return loss, metric_results

        loss, metrics_results = strategy.run(step_fn, args=(dist_inputs,))
        loss = strategy.reduce(
            tf.distribute.ReduceOp.MEAN, loss, axis=0)
        metrics_results = {name: strategy.reduce(
            tf.distribute.ReduceOp.MEAN, result, axis=0) for name, result in metrics_results.items()}

        return loss, metrics_results

    print('Performing evaluation.')

    loss_object = tf.keras.metrics.Mean()
    metric_objects = {fn.__name__: tf.keras.metrics.Mean() for fn in metrics}

    for batch, inputs in enumerate(eval_dataset):

        loss, metrics_results = eval_step(inputs)

        loss_object(loss)
        for metric_name, metric_result in metrics_results.items():
            metric_objects[metric_name](metric_result)

    metrics_final_results = {name: metric_object.result() for name, metric_object in metric_objects.items()}

    return loss_object.result(), metrics_final_results


def main(_):

    strategy, dtype = configure_environment(
        gpu_names=FLAGS.gpus,
        fp16_run=FLAGS.fp16_run)

    hparams, tb_hparams = setup_hparams(
        log_dir=FLAGS.tb_log_dir,
        checkpoint=FLAGS.checkpoint)

    os.makedirs(FLAGS.output_dir, exist_ok=True)

    if FLAGS.checkpoint is None:
        encoder_dir = FLAGS.data_dir
    else:
        encoder_dir = os.path.dirname(os.path.realpath(FLAGS.checkpoint))

    shutil.copy(
        os.path.join(encoder_dir, 'encoder.subwords'),
        os.path.join(FLAGS.output_dir, 'encoder.subwords'))

    encoder_fn, idx_to_text, vocab_size = encoding.get_encoder(
        encoder_dir=FLAGS.output_dir,
        hparams=hparams)

    if HP_VOCAB_SIZE.name not in hparams:
        hparams[HP_VOCAB_SIZE.name] = vocab_size

    with strategy.scope():

        model = build_keras_model(hparams,
            dtype=dtype)

        if FLAGS.checkpoint is not None:
            model.load_weights(FLAGS.checkpoint)
            logging.info('Restored weights from {}.'.format(FLAGS.checkpoint))

        model_utils.save_hparams(hparams, FLAGS.output_dir)

        optimizer = tf.keras.optimizers.SGD(hparams[HP_LEARNING_RATE.name],
            momentum=0.9)

        if FLAGS.fp16_run:
            optimizer = mixed_precision.LossScaleOptimizer(optimizer,
                loss_scale='dynamic')

    logging.info('Using {} encoder with vocab size: {}'.format(
        hparams[HP_TOKEN_TYPE.name], vocab_size))

    loss_fn = get_loss_fn(
        reduction_factor=hparams[HP_TIME_REDUCT_FACTOR.name])

    decode_fn = decoding.greedy_decode_fn(model, hparams)

    accuracy_fn = metrics.build_accuracy_fn(decode_fn)
    wer_fn = metrics.build_wer_fn(decode_fn, idx_to_text)

    encoder = model.layers[2]
    prediction_network = model.layers[3]

    encoder.summary()
    prediction_network.summary()

    model.summary()

    dev_dataset = None
    if FLAGS.eval_size != 0:
        dev_dataset = get_dataset(FLAGS.data_dir, 'dev',
            batch_size=FLAGS.batch_size, n_epochs=FLAGS.n_epochs,
            strategy=strategy, max_size=FLAGS.eval_size)

    log_dir = os.path.join(FLAGS.tb_log_dir,
        datetime.now().strftime('%Y%m%d-%H%M%S'))

    with tf.summary.create_file_writer(log_dir).as_default():

        hp.hparams(tb_hparams)

        if FLAGS.mode == 'train':

            train_dataset = get_dataset(FLAGS.data_dir, 'train',
                batch_size=FLAGS.batch_size, n_epochs=FLAGS.n_epochs,
                strategy=strategy)

            os.makedirs(FLAGS.output_dir, exist_ok=True)
            checkpoint_template = os.path.join(FLAGS.output_dir,
                'checkpoint_{step}_{val_loss:.4f}.hdf5')

            run_training(
                model=model,
                optimizer=optimizer,
                loss_fn=loss_fn,
                train_dataset=train_dataset,
                batch_size=FLAGS.batch_size,
                n_epochs=FLAGS.n_epochs,
                checkpoint_template=checkpoint_template,
                hparams=hparams,
                strategy=strategy,
                steps_per_log=FLAGS.steps_per_log,
                steps_per_checkpoint=FLAGS.steps_per_checkpoint,
                eval_dataset=dev_dataset,
                train_metrics=[],
                eval_metrics=[accuracy_fn, wer_fn])

        elif FLAGS.mode == 'eval' or FLAGS.mode == 'test':

            if FLAGS.checkpoint is None:
                raise Exception('You must provide a checkpoint to perform eval.')

            if FLAGS.mode == 'test':
                dataset = get_dataset(FLAGS.data_dir, 'test',
                    batch_size=FLAGS.batch_size, n_epochs=FLAGS.n_epochs)
            else:
                dataset = dev_dataset

            eval_start_time = time.time()

            eval_loss, eval_metrics_results = run_evaluate(
                model=model,
                optimizer=optimizer,
                loss_fn=loss_fn,
                eval_dataset=dataset,
                batch_size=FLAGS.batch_size,
                hparams=hparams,
                strategy=strategy,
                metrics=[accuracy_fn, wer_fn])

            validation_log_str = 'VALIDATION RESULTS: Time: {:.4f}, Loss: {:.4f}'.format(
                time.time() - eval_start_time, eval_loss)
            for metric_name, metric_result in eval_metrics_results.items():
                validation_log_str += ', {}: {:.4f}'.format(metric_name, metric_result)

            print(validation_log_str)


if __name__ == '__main__':

    # tf.config.experimental_run_functions_eagerly(True)

    flags.mark_flag_as_required('mode')
    flags.mark_flag_as_required('data_dir')

    app.run(main)


================================================
FILE: scripts/build_rnnt.sh
================================================
cp cmake/warp-rnnt-cmakelist.txt warp-transducer/CMakeLists.txt

cd warp-transducer

mkdir build
cd build

CC=gcc-4.8 CXX=g++-4.8 cmake \
    -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME ..
make
cd ../tensorflow_binding

python setup.py install
cd ../../


================================================
FILE: scripts/common_voice_convert.sh
================================================
#!/bin/bash

OIFS="$IFS"
IFS=$'\n'

FORMAT=.mp3
DATA_DIR="$1"
N=${2:-1}

mkdir -p $DATA_DIR

FILES=$(ls "$DATA_DIR" | grep $FORMAT)

thread () {
    local FILE_N=$1
    FILENAME="${FILE_N:0:${#FILE_N}-4}"
    ffmpeg -i $DATA_DIR/$FILE_N -acodec pcm_s16le -ac 1 -ar 16000 $DATA_DIR/$FILENAME.wav
    rm $DATA_DIR/$FILE_N
}
for FILE in $FILES
do
   ((i=i%N)); ((i++==0)) && wait
   thread "$FILE" & 
done

IFS="$OIFS"


================================================
FILE: scripts/remove_missing_samples.py
================================================
from argparse import ArgumentParser
import os


def remove_missing(data_dir, fname, replace_old=True):

    clips_dir = os.path.join(data_dir, 'clips')

    old_filepath = os.path.join(data_dir, '{}.tsv'.format(fname))
    new_filepath = os.path.join(data_dir, '{}-tmp.tsv'.format(fname))

    with open(old_filepath, 'r') as old_f:
        with open(new_filepath, 'w') as new_f:
            new_f.write(next(old_f))
            for line in old_f:
                audio_fn = line.split('\t')[1][:-4] + '.wav'
                if os.path.exists(os.path.join(clips_dir, audio_fn)):
                    new_f.write(line)

    if replace_old:
        os.remove(old_filepath)
        os.rename(new_filepath, old_filepath)


def main(args):

    tsv_files = ['dev', 'invalidated', 'other', 
                 'test', 'train', 'validated']

    for _file in tsv_files:
        remove_missing(args.data_dir, _file,
            replace_old=args.replace_old)

    print('Done.')


def parse_args():

    ap = ArgumentParser()

    ap.add_argument('--data_dir', required=True, type=str,
        help='Path to common voice data directory.')
    ap.add_argument('--replace_old', type=bool, default=False,
        help='Replace old tsv files with updated ones.')
    
    return ap.parse_args()


if __name__ == '__main__':

    args = parse_args()
    main(args)

================================================
FILE: streaming_transcribe.py
================================================
from argparse import ArgumentParser
import os
import time
import pyaudio

import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

from utils import preprocessing, encoding, decoding
from utils import model as model_utils
from model import build_keras_model
from hparams import *


SAMPLE_RATE = 16000
NUM_CHANNELS = 1
CHUNK_SIZE = 1024

LAST_OUTPUT = ''


def main(args):

    model_dir = os.path.dirname(os.path.realpath(args.checkpoint))

    hparams = model_utils.load_hparams(model_dir)

    _, tok_to_text, vocab_size = encoding.get_encoder(
        encoder_dir=model_dir,
        hparams=hparams)
    hparams[HP_VOCAB_SIZE.name] = vocab_size

    model = build_keras_model(hparams, stateful=True)
    model.load_weights(args.checkpoint)

    decoder_fn = decoding.greedy_decode_fn(model, hparams)

    p = pyaudio.PyAudio()

    def listen_callback(in_data, frame_count, time_info, status):
        global LAST_OUTPUT

        audio = tf.io.decode_raw(in_data, out_type=tf.float32)

        log_melspec = preprocessing.preprocess_audio(
            audio=audio,
            sample_rate=SAMPLE_RATE,
            hparams=hparams)
        log_melspec = tf.expand_dims(log_melspec, axis=0)

        decoded = decoder_fn(log_melspec)[0]

        transcription = LAST_OUTPUT + tok_to_text(decoded)\
            .numpy().decode('utf8')

        if transcription != LAST_OUTPUT:
            LAST_OUTPUT = transcription
            print(transcription)

        return in_data, pyaudio.paContinue

    stream = p.open(
        format=pyaudio.paFloat32,
        channels=NUM_CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK_SIZE,
        stream_callback=listen_callback)

    print('Listening...')

    stream.start_stream()

    while stream.is_active():
        time.sleep(0.1)

    stream.stop_stream()
    stream.close()

    p.terminate()


def parse_args():

    ap = ArgumentParser()

    ap.add_argument('--checkpoint', type=str, required=True,
        help='Checkpoint to load.')

    return ap.parse_args()


if __name__ == '__main__':

    args = parse_args()
    main(args)


================================================
FILE: transcribe_file.py
================================================
from argparse import ArgumentParser
import os

import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

from utils import preprocessing, encoding, decoding
from utils import model as model_utils
from model import build_keras_model
from hparams import *


def main(args):

    model_dir = os.path.dirname(os.path.realpath(args.checkpoint))

    hparams = model_utils.load_hparams(model_dir)

    encode_fn, tok_to_text, vocab_size = encoding.get_encoder(
        encoder_dir=model_dir,
        hparams=hparams)
    hparams[HP_VOCAB_SIZE.name] = vocab_size

    model = build_keras_model(hparams)
    model.load_weights(args.checkpoint)

    audio, sr = preprocessing.tf_load_audio(args.input)

    log_melspec = preprocessing.preprocess_audio(
        audio=audio,
        sample_rate=sr,
        hparams=hparams)
    log_melspec = tf.expand_dims(log_melspec, axis=0)

    decoder_fn = decoding.greedy_decode_fn(model, hparams)

    decoded = decoder_fn(log_melspec)[0]
    transcription = tok_to_text(decoded)

    print('Transcription:', transcription.numpy().decode('utf8'))


def parse_args():

    ap = ArgumentParser()

    ap.add_argument('--checkpoint', type=str, required=True,
        help='Checkpoint to load.')
    ap.add_argument('-i', '--input', type=str, required=True,
        help='Wav file to transcribe.')

    return ap.parse_args()


if __name__ == '__main__':

    args = parse_args()
    main(args)


================================================
FILE: utils/__init__.py
================================================


================================================
FILE: utils/data/__init__.py
================================================
from . import common_voice

================================================
FILE: utils/data/common_voice.py
================================================
import os
import tensorflow as tf

from .. import preprocessing


def tf_parse_line(line, data_dir):

    line_split = tf.strings.split(line, '\t')

    audio_fn = line_split[1]
    transcription = line_split[2]

    audio_filepath = tf.strings.join([data_dir, 'clips', audio_fn], '/')
    wav_filepath = tf.strings.substr(audio_filepath, 0, tf.strings.length(audio_filepath) - 4) + '.wav'

    audio, sr = preprocessing.tf_load_audio(wav_filepath)

    return audio, sr, transcription


def load_dataset(base_path, name):

    filepath = os.path.join(base_path, '{}.tsv'.format(name))

    dataset = tf.data.TextLineDataset([filepath])

    dataset = dataset.skip(1)
    dataset = dataset.map(lambda line: tf_parse_line(line, base_path),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)

    return dataset


def texts_generator(base_path):

    # split_names = ['dev', 'train', 'test']
    split_names = ['train']

    for split_name in split_names:
        with open(os.path.join(base_path, '{}.tsv'.format(split_name)), 'r') as f:
            for line in f:
                transcription = line.split('\t')[2]
                yield transcription

================================================
FILE: utils/data/librispeech.py
================================================
import os
import tensorflow as tf
import soundfile as sf


def load_audio(filepath):

    return sf.read(filepath)


def tf_load_audio(filepath):

    return tf.py_function(
        lambda x: load_audio(x.numpy()),
        inp=[filepath],
        Tout=[tf.float32, tf.int32])


def tf_file_exists(filepath):

    return tf.py_function(
        lambda x: os.path.exists(x.numpy()),
        inp=[filepath],
        Tout=tf.bool)


def tf_parse_line(line, data_dir, split_names):

    line_split = tf.strings.split(line, ' ')

    audio_fn = line_split[0]
    transcription = tf.py_function(
        lambda x: b' '.join(x.numpy()).decode('utf8'),
        inp=[line_split[1:]],
        Tout=tf.string)

    speaker_id, chapter_id, _ = tf.unstack(tf.strings.split(audio_fn, '-'), 3)

    all_fps = tf.map_fn(
        lambda split_name: tf.strings.join([data_dir, split_name, speaker_id, chapter_id, audio_fn], '/') + '.flac',
        tf.constant(split_names))

    audio_filepath_idx = tf.where(
        tf.map_fn(tf_file_exists, all_fps, dtype=tf.bool))[0][0]
    audio_filepath = all_fps[audio_filepath_idx]

    audio, sr = tf_load_audio(audio_filepath)

    return audio, sr, transcription


def get_transcript_files(base_path, split_names):

    transcript_files = []

    for split_name in split_names:
        for speaker_id in os.listdir(f'{base_path}/{split_name}'):
            if speaker_id == '.DS_Store': continue
            for chapter_id in os.listdir(f'{base_path}/{split_name}/{speaker_id}'):
                if chapter_id == '.DS_Store': continue
                transcript_files.append(f'{base_path}/{split_name}/{speaker_id}/{chapter_id}/{speaker_id}-{chapter_id}.trans.txt')

    return transcript_files


def load_dataset(base_path, split_names):

    transcript_filepaths = get_transcript_files(base_path, split_names)

    dataset = tf.data.TextLineDataset(transcript_filepaths)
    dataset = dataset.map(lambda line: tf_parse_line(line, base_path, split_names),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)

    return dataset


def texts_generator(base_path, split_names):

    transcript_filepaths = get_transcript_files(base_path, split_names)
    for fp in transcript_filepaths:
        with open(fp, 'r') as f:
            for line in f:
                line = line.strip('\n')
                transcription = ' '.join(line.split(' ')[1:])
                yield transcription

================================================
FILE: utils/decoding.py
================================================
import tensorflow as tf

from hparams import *


def joint(model, f, g):

    dense_1 = model.layers[-2]
    dense_2 = model.layers[-1]

    joint_inp = (
        tf.expand_dims(f, axis=2) +                 # [B, T, V] => [B, T, 1, V]
        tf.expand_dims(g[:, -1, :], axis=1))        # [B, U, V] => [B, 1, U, V]

    outputs = dense_1(joint_inp)
    outputs = dense_2(outputs)

    return outputs[:, 0, 0, :]


def greedy_decode_fn(model, hparams):

    # NOTE: Only the first input is decoded

    encoder = model.layers[2]
    prediction_network = model.layers[3]

    start_token = tf.constant([0])

    feat_size = hparams[HP_MEL_BINS.name] * hparams[HP_DOWNSAMPLE_FACTOR.name]

    @tf.function(input_signature=[
        tf.TensorSpec(shape=[None, None, feat_size], dtype=tf.float32),
        tf.TensorSpec(shape=[], dtype=tf.int32)])
    def greedy_decode(inputs, max_length=None):

        inputs = tf.expand_dims(inputs[0], axis=0)

        encoded = encoder(inputs, training=False)
        enc_length = tf.shape(encoded)[1]

        i_0 = tf.constant(0)
        outputs_0 = tf.expand_dims(start_token, axis=0)
        max_reached_0 = tf.constant(False)

        time_cond = lambda i, outputs, max_reached: tf.logical_and(
            i < enc_length, tf.logical_not(max_reached))

        def time_step_body(i, outputs, max_reached):

            inp_enc = tf.expand_dims(encoded[:, i, :],
                axis=1)

            _outputs_0 = outputs
            _max_reached_0 = max_reached
            dec_end_0 = tf.constant(False)

            dec_cond = lambda _outputs, _max_reached, dec_end: tf.logical_and(
                tf.logical_not(dec_end), tf.logical_not(_max_reached))

            def dec_step_body(_outputs, _max_reached, dec_end):

                pred_out = prediction_network(_outputs,
                    training=False)
                preds = joint(model, inp_enc, pred_out)[0]
                preds = tf.nn.log_softmax(preds)

                predicted_id = tf.cast(
                    tf.argmax(preds, axis=-1), dtype=tf.int32)

                if predicted_id == 0:
                    dec_end = True
                else:
                    _outputs = tf.concat([_outputs, [[predicted_id]]],
                        axis=1)

                if max_length is not None and tf.shape(_outputs)[1] >= max_length + 1:
                    _max_reached = True

                return _outputs, _max_reached, dec_end

            _outputs, _max_reached, _ = tf.while_loop(
                dec_cond, dec_step_body,
                loop_vars=[_outputs_0, _max_reached_0, dec_end_0],
                shape_invariants=[
                    tf.TensorShape([1, None]),
                    _max_reached_0.get_shape(),
                    dec_end_0.get_shape()
                ])

            return i + 1, _outputs, _max_reached

        _, outputs, _ = tf.while_loop(
            time_cond, time_step_body,
            loop_vars=[i_0, outputs_0, max_reached_0],
            shape_invariants=[
                i_0.get_shape(),
                tf.TensorShape([1, None]),
                max_reached_0.get_shape()
            ])

        final_outputs = outputs[:, 1:]
        # output_ids = tf.argmax(final_outputs, axis=-1)

        # return tf.cast(output_ids, dtype=tf.int32)
        return tf.cast(final_outputs, dtype=tf.int32)

    return greedy_decode


# def greedy_decode():

#     # NOTE: Only the first input is decoded
#     y_pred = y_pred[0]

#     # Add blank at end for decoding
#     pred_len = tf.shape(y_pred)[0]
#     y_pred = tf.concat([y_pred,
#                         tf.fill([pred_len, 1], 0)],
#         axis=1)

#     def _loop_body(_y_pred, _decoded):

#         first_blank_idx = tf.cast(tf.where(
#             tf.equal(_y_pred[0], 0)), dtype=tf.int32)
#         has_blank = tf.not_equal(tf.size(first_blank_idx), 0)

#         dec_idx = first_blank_idx[0][0]

#         decoded = _y_pred[0][:dec_idx]
#         n_dec = tf.shape(decoded)[0]

#         _decoded = tf.concat([_decoded, decoded],
#             axis=0)

#         return _y_pred[1:, n_dec:], _decoded

#     decoded_0 = tf.constant([], dtype=tf.int32)

#     _, decoded = tf.while_loop(
#         lambda _y_pred, _decoded: tf.not_equal(tf.size(_y_pred), 0),
#         _loop_body,
#         [y_pred, decoded_0],
#         shape_invariants=[tf.TensorShape([None, None]), tf.TensorShape([None])],
#         name='greedy_decode')

#     return tf.expand_dims(decoded, axis=0)


# a = tf.constant([
#     [
#         [1, 4, 4, 4, 4, 3, 2],
#         [0, 0, 0, 0, 0, 0, 0],
#         [0, 0, 1, 0, 0, 0, 0],
#         [0, 0, 0, 4, 1, 4, 0]
#     ]
# ])

# tf.config.experimental_run_functions_eagerly(True)

# a = tf.zeros((4, 100, 80))

# print(a)

# import sys
# import os

# FILE_DIR = os.path.dirname(os.path.realpath(__file__))
# sys.path = [os.path.join(FILE_DIR, '..')] + sys.path

# from model import build_keras_model
# from hparams import *

# hparams = {

#     HP_TOKEN_TYPE: HP_TOKEN_TYPE.domain.values[1],

#     # Preprocessing
#     HP_MEL_BINS: HP_MEL_BINS.domain.values[0],
#     HP_FRAME_LENGTH: HP_FRAME_LENGTH.domain.values[0],
#     HP_FRAME_STEP: HP_FRAME_STEP.domain.values[0],
#     HP_HERTZ_LOW: HP_HERTZ_LOW.domain.values[0],
#     HP_HERTZ_HIGH: HP_HERTZ_HIGH.domain.values[0],

#     # Model
#     HP_EMBEDDING_SIZE: HP_EMBEDDING_SIZE.domain.values[0],
#     HP_ENCODER_LAYERS: HP_ENCODER_LAYERS.domain.values[0],
#     HP_ENCODER_SIZE: HP_ENCODER_SIZE.domain.values[0],
#     HP_PROJECTION_SIZE: HP_PROJECTION_SIZE.domain.values[0],
#     HP_TIME_REDUCT_INDEX: HP_TIME_REDUCT_INDEX.domain.values[0],
#     HP_TIME_REDUCT_FACTOR: HP_TIME_REDUCT_FACTOR.domain.values[0],
#     HP_PRED_NET_LAYERS: HP_PRED_NET_LAYERS.domain.values[0],
#     HP_PRED_NET_SIZE: HP_PRED_NET_SIZE.domain.values[0],
#     HP_JOINT_NET_SIZE: HP_JOINT_NET_SIZE.domain.values[0],

#     HP_LEARNING_RATE: HP_LEARNING_RATE.domain.values[0]

# }

# hparams = {k.name: v for k, v in hparams.items()}
# hparams['vocab_size'] = 73

# model = build_keras_model(hparams)

# greedy_decode = greedy_decode_fn(model)

# print(greedy_decode(a, max_length=20))


================================================
FILE: utils/encoding.py
================================================
import os
import tensorflow_datasets as tfds
import tensorflow as tf

from hparams import *
from . import vocabulary, preprocessing


def build_lookup_table(keys, values=None, default_value=-1):

    if values is None:
        values = tf.range(len(keys))

    kv_init = tf.lookup.KeyValueTensorInitializer(
        keys=keys, values=values)

    return tf.lookup.StaticHashTable(kv_init,
        default_value=default_value)


def wordpiece_encode(text, encoder):

    return tf.constant(encoder.encode(text.numpy()),
        dtype=tf.int32)


def tf_wordpiece_encode(text, encoder):

    return tf.py_function(lambda x: wordpiece_encode(x, encoder),
        inp=[text], Tout=tf.int32)


def wordpiece_decode(ids, encoder):

    return tf.constant(encoder.decode(ids.numpy()))


def tf_wordpiece_decode(ids, encoder):

    return tf.py_function(lambda x: wordpiece_decode(x, encoder),
        inp=[ids], Tout=[tf.string])[0]


def tf_vocab_encode(text, vocab_table):

    tokens = tf.strings.bytes_split(text)

    return vocab_table.lookup(tokens)


def get_encoder(encoder_dir,
                hparams,
                texts_generator=None):

    def preprocessed_gen():
        if texts_generator is None:
            return
        for x in texts_generator:
            yield preprocessing.normalize_text(x)

    if hparams[HP_TOKEN_TYPE.name] == 'character':

        vocab = vocabulary.init_vocab()
        vocab_table = build_lookup_table(vocab,
            default_value=0)

        vocab_size = len(vocab)

        encoder_fn = lambda text: tf_vocab_encode(text, vocab_table)
        decoder_fn = None

    elif hparams[HP_TOKEN_TYPE.name] == 'word-piece':

        encoder_filename = 'encoder'
        encoder_filepath = os.path.join(encoder_dir, encoder_filename)

        if os.path.exists('{}.subwords'.format(encoder_filepath)):
            
            encoder = tfds.core.features.text.SubwordTextEncoder.load_from_file(encoder_filepath)
        else:
            encoder = tfds.core.features.text.SubwordTextEncoder.build_from_corpus(
                corpus_generator=preprocessed_gen(),
                target_vocab_size=hparams[HP_VOCAB_SIZE.name])
            os.makedirs(encoder_dir, exist_ok=True)
            encoder.save_to_file(encoder_filepath)

        vocab_size = encoder.vocab_size

        encoder_fn = lambda text: tf_wordpiece_encode(text, encoder)
        decoder_fn = lambda ids: tf_wordpiece_decode(ids, encoder)

    return encoder_fn, decoder_fn, vocab_size


================================================
FILE: utils/loss.py
================================================
from absl import logging
import tensorflow as tf

_has_loss_func = False
try:
    from warprnnt_tensorflow import rnnt_loss
    _has_loss_func = True
except ImportError:
    pass


def get_loss_fn(reduction_factor):

    def _fallback_loss(y_true,
                       y_pred,
                       spec_lengths,
                       label_lengths):
        logging.info('RNN-T loss function not found.')
        return y_pred

    if not _has_loss_func:
        return _fallback_loss

    def _loss_fn(y_true,
                 y_pred,
                 spec_lengths,
                 label_lengths):
        y_true = tf.cast(y_true, dtype=tf.int32)
        if not tf.test.is_built_with_cuda():
            y_pred = tf.nn.log_softmax(y_pred)
        spec_lengths = tf.cast(
            tf.math.ceil(spec_lengths / reduction_factor),
            dtype=tf.int32)
        loss = rnnt_loss(y_pred, y_true,
            spec_lengths, label_lengths)
        return loss

    return _loss_fn


================================================
FILE: utils/metrics.py
================================================
import tensorflow as tf

from . import decoding


def error_rate(y_true, decoded):

    y_true_shape = tf.shape(y_true)
    decoded_shape = tf.shape(decoded)

    max_length = tf.maximum(y_true_shape[-1], decoded_shape[-1])

    if y_true.dtype == tf.string:
        truth = string_to_sparse(y_true)
    else:
        truth = tf.sparse.from_dense(y_true)

    if decoded.dtype == tf.string:
        hypothesis = string_to_sparse(decoded)
    else:
        hypothesis = tf.sparse.from_dense(decoded)

    err = tf.edit_distance(hypothesis, truth, normalize=False)
    err_norm = err / tf.cast(max_length, dtype=tf.float32)

    return err_norm


def string_to_sparse(str_tensor):

    orig_shape = tf.cast(tf.shape(str_tensor), dtype=tf.int64)
    str_tensor = tf.squeeze(str_tensor, axis=0)

    indices = tf.concat([tf.zeros((orig_shape[-1], 1), dtype=tf.int64),
                         tf.expand_dims(tf.range(0, orig_shape[-1]), axis=-1)],
        axis=1)

    return tf.SparseTensor(indices=indices, values=str_tensor,
        dense_shape=orig_shape)


def token_error_rate(y_true, decoded, tok_fn, idx_to_text):

    text_true = idx_to_text(y_true)
    text_pred = idx_to_text(decoded)

    text_true.set_shape(())
    text_pred.set_shape(())

    tok_true = tok_fn(text_true)
    tok_pred = tok_fn(text_pred)

    tok_true = tf.expand_dims(tok_true, axis=0)
    tok_pred = tf.expand_dims(tok_pred, axis=0)

    return error_rate(tok_true, tok_pred)


def build_accuracy_fn(decode_fn):

    def Accuracy(inputs, y_true):

        # Decode functions only returns first result
        y_true = tf.expand_dims(y_true[0], axis=0)

        max_length = tf.shape(y_true)[1]

        decoded = decode_fn(inputs,
            max_length=max_length)

        return 1 - error_rate(y_true, decoded)

    return Accuracy


def build_wer_fn(decode_fn, idx_to_text):

    def WER(inputs, y_true):

        # Decode functions only returns first result
        y_true = y_true[0]

        max_length = tf.shape(y_true)[0]

        decoded = decode_fn(inputs,
            max_length=max_length)[0]

        return token_error_rate(y_true, decoded,
            tok_fn=lambda t: tf.strings.split(t, sep=' '),
            idx_to_text=idx_to_text)

    return WER


================================================
FILE: utils/model.py
================================================
from absl import logging
import os
import json
import re

from model import build_keras_model


def load_hparams(model_dir):

    with open(os.path.join(model_dir, 'hparams.json'), 'r') as f:
        return json.load(f)


def save_hparams(hparams, model_dir):

    with open(os.path.join(model_dir, 'hparams.json'), 'w') as f:
        json.dump(hparams, f)




================================================
FILE: utils/preprocessing.py
================================================
import glob
import os
import librosa.display
import librosa
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

from hparams import *


def tf_load_audio(path, pre_emphasis=0.97):

    audio_raw = tf.io.read_file(path)

    audio, sr = tf.audio.decode_wav(audio_raw)

    if tf.rank(audio) > 1:
        audio = audio[:, 0]

    return audio, sr


def normalize_text(text):

    text = text.lower()
    text = text.replace('"', '')

    return text


def tf_normalize_text(text):

    return tf.py_function(
        lambda x: normalize_text(x.numpy().decode('utf8')),
        inp=[text],
        Tout=tf.string)


def print_tensor(t, template='{}'):

    return tf.py_function(
        lambda x: print(template.format(x.numpy())),
        inp=[t],
        Tout=[])


def compute_mel_spectrograms(audio_arr,
                             sample_rate,
                             n_mel_bins,
                             frame_length,
                             frame_step,
                             hertz_low,
                             hertz_high):

    sample_rate_f = tf.cast(sample_rate, dtype=tf.float32)

    frame_length = tf.cast(tf.round(sample_rate_f * frame_length), dtype=tf.int32)
    frame_step = tf.cast(tf.round(sample_rate_f * frame_step), dtype=tf.int32)

    stfts = tf.signal.stft(audio_arr,
                           frame_length=frame_length,
                           frame_step=frame_step)

    mag_specs = tf.abs(stfts)
    num_spec_bins = tf.shape(mag_specs)[-1]

    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mel_bins, num_spectrogram_bins=num_spec_bins,
        sample_rate=sample_rate_f,
        lower_edge_hertz=hertz_low,
        upper_edge_hertz=hertz_high)

    mel_specs = tf.tensordot(mag_specs, linear_to_mel_weight_matrix, 1)
    mel_specs.set_shape(mag_specs.shape[:-1].concatenate(
        linear_to_mel_weight_matrix.shape[-1:]))

    log_mel_specs = tf.math.log(mel_specs + 1e-6)
    log_mel_specs -= (tf.reduce_mean(log_mel_specs, axis=0) + 1e-8)

    return log_mel_specs


def downsample_spec(mel_spec, n=3):

    spec_shape = tf.shape(mel_spec)
    spec_length, feat_size = spec_shape[0], spec_shape[1]

    trimmed_length = (spec_length // n) * n

    trimmed_spec = mel_spec[:trimmed_length]
    spec_sampled = tf.reshape(trimmed_spec, (-1, feat_size * n))

    return spec_sampled


def load_dataset(data_dir, name):

    filenames = glob.glob(os.path.join(data_dir,
        '{}.tfrecord'.format(name)))

    raw_dataset = tf.data.TFRecordDataset(filenames)

    parsed_dataset = raw_dataset.map(parse_example,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)

    return parsed_dataset


def parse_example(serialized_example):

    parse_dict = {
        'mel_specs': tf.io.FixedLenFeature([], tf.string),
        'pred_inp': tf.io.FixedLenFeature([], tf.string),
        'spec_lengths': tf.io.FixedLenFeature([], tf.string),
        'label_lengths': tf.io.FixedLenFeature([], tf.string),
        'labels': tf.io.FixedLenFeature([], tf.string),
    }

    example = tf.io.parse_single_example(serialized_example, parse_dict)

    mel_specs = tf.io.parse_tensor(example['mel_specs'], out_type=tf.float32)
    pred_inp = tf.io.parse_tensor(example['pred_inp'], out_type=tf.int32)
    spec_lengths = tf.io.parse_tensor(example['spec_lengths'], out_type=tf.int32)
    label_lengths = tf.io.parse_tensor(example['label_lengths'], out_type=tf.int32)

    labels = tf.io.parse_tensor(example['labels'], out_type=tf.int32)

    return (mel_specs, pred_inp, spec_lengths, label_lengths, labels)


def serialize_example(mel_specs,
                      pred_inp,
                      spec_lengths,
                      label_lengths,
                      labels):

    def _bytes_feature(value):
        """Returns a bytes_list from a string / byte."""
        if isinstance(value, type(tf.constant(0))): # if value ist tensor
            value = value.numpy() # get value of tensor
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    mel_specs_s = tf.io.serialize_tensor(mel_specs)
    pred_inp_s = tf.io.serialize_tensor(pred_inp)
    spec_lengths_s = tf.io.serialize_tensor(spec_lengths)
    label_lengths_s = tf.io.serialize_tensor(label_lengths)

    labels_s = tf.io.serialize_tensor(labels)

    feature = {
        'mel_specs': _bytes_feature(mel_specs_s),
        'pred_inp': _bytes_feature(pred_inp_s),
        'spec_lengths': _bytes_feature(spec_lengths_s),
        'label_lengths': _bytes_feature(label_lengths_s),
        'labels': _bytes_feature(labels_s)
    }

    example = tf.train.Example(features=tf.train.Features(feature=feature))

    return example.SerializeToString()

def tf_serialize_example(mel_specs,
                         pred_inp,
                         spec_lengths,
                         label_lengths,
                         labels):

    tf_string = tf.py_function(
        serialize_example,
        (mel_specs, pred_inp, spec_lengths, label_lengths, labels),
        tf.string)

    return tf.reshape(tf_string, ())


def preprocess_text(text, encoder_fn, vocab_size):

    norm_text = tf_normalize_text(text)
    enc_text = encoder_fn(norm_text)
    enc_padded = tf.concat([[0], enc_text], axis=0)

    return enc_text, enc_padded


def plot_spec(spec, sr, transcription, name):

    spec_db = librosa.amplitude_to_db(spec, ref=np.max)

    plt.figure(figsize=(12,4))
    librosa.display.specshow(spec_db, sr=sr,
        x_axis='time', y_axis='mel',
        hop_length=sr * 0.01)
    plt.colorbar(format='%+02.0f dB')
    plt.savefig('figs/{}.png'.format(name))
    plt.clf()


def tf_plot_spec(spec, sr, transcription, name):

    spec_t = tf.transpose(spec)

    return tf.py_function(
        lambda _spec, _sr, trans: plot_spec(
            _spec.numpy(), _sr.numpy(),
            trans.numpy().decode('utf8'),
            name
        ),
        inp=[spec_t, sr, transcription],
        Tout=[])


def plot_audio(audio_arr, sr, trans, name):

    with open('figs/trans.txt', 'a') as f:
        f.write('{} {}\n'.format(name, trans))

    t = np.linspace(0, audio_arr.shape[0] / sr,
        num=audio_arr.shape[0])

    plt.figure(1)
    plt.plot(t, audio_arr)
    plt.savefig('figs/{}.png'.format(name))
    plt.clf()


def tf_plot_audio(audio_arr, sr, trans, name):

    return tf.py_function(
        lambda _audio, _sr, _trans: plot_audio(
            _audio.numpy(), _sr.numpy(),
            _trans.numpy(), name
        ),
        inp=[audio_arr, sr, trans],
        Tout=[])


def preprocess_audio(audio,
                     sample_rate,
                     hparams):

    log_melspec = compute_mel_spectrograms(
        audio_arr=audio,
        sample_rate=sample_rate,
        n_mel_bins=hparams[HP_MEL_BINS.name],
        frame_length=hparams[HP_FRAME_LENGTH.name],
        frame_step=hparams[HP_FRAME_STEP.name],
        hertz_low=hparams[HP_HERTZ_LOW.name],
        hertz_high=hparams[HP_HERTZ_HIGH.name])

    downsampled_spec = downsample_spec(log_melspec)

    return downsampled_spec


def preprocess_dataset(dataset,
                       encoder_fn,
                       hparams,
                       max_length=0,
                       save_plots=False):

    _dataset = dataset

    if max_length > 0:
        _dataset = _dataset.filter(lambda audio, sr, trans: (
            tf.shape(audio)[0] <= sr * tf.constant(max_length)))

    if save_plots:
        os.makedirs('figs', exist_ok=True)
        for i, (audio_arr, sr, trans) in enumerate(_dataset.take(5)):
            tf_plot_audio(audio_arr, sr, trans, 'audio_{}'.format(i))

    _dataset = _dataset.map(lambda audio, sr, trans: (
        preprocess_audio(
            audio=audio,
            sample_rate=sr,
            hparams=hparams),
        sr,
        *preprocess_text(trans,
            encoder_fn=encoder_fn,
            vocab_size=hparams[HP_VOCAB_SIZE.name]),
        trans
    ), num_parallel_calls=tf.data.experimental.AUTOTUNE)

    if save_plots:
        for i, (log_melspec, sr, _, _, trans) in enumerate(_dataset.take(5)):
            tf_plot_spec(log_melspec, sr, trans, 'input_{}'.format(i))

    _dataset = _dataset.map(
        lambda log_melspec, sr, labels, pred_inp, trans: (
            log_melspec, pred_inp,
            tf.shape(log_melspec)[0], tf.shape(labels)[0],
            labels
        ),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)

    _dataset = _dataset.map(tf_serialize_example)

    return _dataset


================================================
FILE: utils/vocabulary.py
================================================
def init_vocab():

    alphabet = "abcdefghijklmnopqrstuvwxyz'"
    alphabet_c = ['', ' ', '<s>', '</s>'] + [c for c in alphabet]

    return alphabet_c


def load_vocab(filepath):

    vocab = []

    with open(filepath, 'r') as f:
        for line in f:
            line = line.strip().strip('\n')
            if line == '<blank>':
                line = ''
            elif line == '<space>':
                line = ' '
            vocab.append(line)

    return vocab


def save_vocab(vocab, filepath):

    with open(filepath, 'w') as f:
        for c in vocab:
            if c == '':
                c = '<blank>'
            elif c == ' ':
                c = '<space>'
            f.write('{}\n'.format(c))

Download .txt

gitextract_zm_qbej6/

├── .gitignore
├── .gitmodules
├── LICENSE
├── README.md
├── __init__.py
├── cmake/
│   └── warp-rnnt-cmakelist.txt
├── debug/
│   ├── debug_dataset.py
│   └── get_common_voice_stats.py
├── hparams.py
├── model.py
├── preprocess_common_voice.py
├── preprocess_librispeech.py
├── quantize_model.py
├── requirements.txt
├── run_rnnt.py
├── scripts/
│   ├── build_rnnt.sh
│   ├── common_voice_convert.sh
│   └── remove_missing_samples.py
├── streaming_transcribe.py
├── transcribe_file.py
└── utils/
    ├── __init__.py
    ├── data/
    │   ├── __init__.py
    │   ├── common_voice.py
    │   └── librispeech.py
    ├── decoding.py
    ├── encoding.py
    ├── loss.py
    ├── metrics.py
    ├── model.py
    ├── preprocessing.py
    └── vocabulary.py

Download .txt

SYMBOL INDEX (80 symbols across 19 files)

FILE: debug/debug_dataset.py
  function check_for_invalid_values (line 13) | def check_for_invalid_values(inp, labels):
  function check_empty (line 21) | def check_empty(inp, labels):
  function get_dataset (line 46) | def get_dataset(data_dir,
  function main (line 68) | def main(args):
  function parse_args (line 83) | def parse_args():

FILE: debug/get_common_voice_stats.py
  function main (line 7) | def main(args):
  function parse_args (line 43) | def parse_args():

FILE: model.py
  class TimeReduction (line 8) | class TimeReduction(tf.keras.layers.Layer):
    method __init__ (line 10) | def __init__(self,
    method call (line 20) | def call(self, inputs):
  function encoder (line 39) | def encoder(specs_shape,
  function prediction_network (line 84) | def prediction_network(vocab_size,
  function build_keras_model (line 119) | def build_keras_model(hparams,

FILE: preprocess_common_voice.py
  function write_dataset (line 24) | def write_dataset(dataset, name):
  function main (line 36) | def main(_):

FILE: preprocess_librispeech.py
  function write_dataset (line 24) | def write_dataset(dataset, name):
  function main (line 36) | def main(_):

FILE: quantize_model.py
  function main (line 8) | def main(args):
  function parse_args (line 31) | def parse_args():

FILE: run_rnnt.py
  function get_dataset (line 66) | def get_dataset(data_dir,
  function configure_environment (line 93) | def configure_environment(gpu_names,
  function setup_hparams (line 134) | def setup_hparams(log_dir,
  function run_metrics (line 223) | def run_metrics(inputs,
  function run_training (line 233) | def run_training(model,
  function run_evaluate (line 380) | def run_evaluate(model,
  function main (line 444) | def main(_):

FILE: scripts/remove_missing_samples.py
  function remove_missing (line 5) | def remove_missing(data_dir, fname, replace_old=True):
  function main (line 25) | def main(args):
  function parse_args (line 37) | def parse_args():

FILE: streaming_transcribe.py
  function main (line 23) | def main(args):
  function parse_args (line 84) | def parse_args():

FILE: transcribe_file.py
  function main (line 14) | def main(args):
  function parse_args (line 44) | def parse_args():

FILE: utils/data/common_voice.py
  function tf_parse_line (line 7) | def tf_parse_line(line, data_dir):
  function load_dataset (line 22) | def load_dataset(base_path, name):
  function texts_generator (line 35) | def texts_generator(base_path):

FILE: utils/data/librispeech.py
  function load_audio (line 6) | def load_audio(filepath):
  function tf_load_audio (line 11) | def tf_load_audio(filepath):
  function tf_file_exists (line 19) | def tf_file_exists(filepath):
  function tf_parse_line (line 27) | def tf_parse_line(line, data_dir, split_names):
  function get_transcript_files (line 52) | def get_transcript_files(base_path, split_names):
  function load_dataset (line 66) | def load_dataset(base_path, split_names):
  function texts_generator (line 77) | def texts_generator(base_path, split_names):

FILE: utils/decoding.py
  function joint (line 6) | def joint(model, f, g):
  function greedy_decode_fn (line 21) | def greedy_decode_fn(model, hparams):

FILE: utils/encoding.py
  function build_lookup_table (line 9) | def build_lookup_table(keys, values=None, default_value=-1):
  function wordpiece_encode (line 21) | def wordpiece_encode(text, encoder):
  function tf_wordpiece_encode (line 27) | def tf_wordpiece_encode(text, encoder):
  function wordpiece_decode (line 33) | def wordpiece_decode(ids, encoder):
  function tf_wordpiece_decode (line 38) | def tf_wordpiece_decode(ids, encoder):
  function tf_vocab_encode (line 44) | def tf_vocab_encode(text, vocab_table):
  function get_encoder (line 51) | def get_encoder(encoder_dir,

FILE: utils/loss.py
  function get_loss_fn (line 12) | def get_loss_fn(reduction_factor):

FILE: utils/metrics.py
  function error_rate (line 6) | def error_rate(y_true, decoded):
  function string_to_sparse (line 29) | def string_to_sparse(str_tensor):
  function token_error_rate (line 42) | def token_error_rate(y_true, decoded, tok_fn, idx_to_text):
  function build_accuracy_fn (line 59) | def build_accuracy_fn(decode_fn):
  function build_wer_fn (line 76) | def build_wer_fn(decode_fn, idx_to_text):

FILE: utils/model.py
  function load_hparams (line 9) | def load_hparams(model_dir):
  function save_hparams (line 15) | def save_hparams(hparams, model_dir):

FILE: utils/preprocessing.py
  function tf_load_audio (line 12) | def tf_load_audio(path, pre_emphasis=0.97):
  function normalize_text (line 24) | def normalize_text(text):
  function tf_normalize_text (line 32) | def tf_normalize_text(text):
  function print_tensor (line 40) | def print_tensor(t, template='{}'):
  function compute_mel_spectrograms (line 48) | def compute_mel_spectrograms(audio_arr,
  function downsample_spec (line 84) | def downsample_spec(mel_spec, n=3):
  function load_dataset (line 97) | def load_dataset(data_dir, name):
  function parse_example (line 110) | def parse_example(serialized_example):
  function serialize_example (line 132) | def serialize_example(mel_specs,
  function tf_serialize_example (line 163) | def tf_serialize_example(mel_specs,
  function preprocess_text (line 177) | def preprocess_text(text, encoder_fn, vocab_size):
  function plot_spec (line 186) | def plot_spec(spec, sr, transcription, name):
  function tf_plot_spec (line 199) | def tf_plot_spec(spec, sr, transcription, name):
  function plot_audio (line 213) | def plot_audio(audio_arr, sr, trans, name):
  function tf_plot_audio (line 227) | def tf_plot_audio(audio_arr, sr, trans, name):
  function preprocess_audio (line 238) | def preprocess_audio(audio,
  function preprocess_dataset (line 256) | def preprocess_dataset(dataset,

FILE: utils/vocabulary.py
  function init_vocab (line 1) | def init_vocab():
  function load_vocab (line 9) | def load_vocab(filepath):
  function save_vocab (line 25) | def save_vocab(vocab, filepath):

Download .json

Condensed preview — 31 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (83K chars).

[
  {
    "path": ".gitignore",
    "chars": 1329,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": ".gitmodules",
    "chars": 114,
    "preview": "[submodule \"warp-transducer\"]\n\tpath = warp-transducer\n\turl = https://github.com/noahchalifour/warp-transducer.git\n"
  },
  {
    "path": "LICENSE",
    "chars": 1071,
    "preview": "MIT License\n\nCopyright (c) 2019 Noah Chalifour\n\nPermission is hereby granted, free of charge, to any person obtaining a "
  },
  {
    "path": "README.md",
    "chars": 2879,
    "preview": "# RNN-Transducer Speech Recognition\n\nEnd-to-end speech recognition using RNN-Transducer in Tensorflow 2.0\n\n## Overview\n\n"
  },
  {
    "path": "__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "cmake/warp-rnnt-cmakelist.txt",
    "chars": 4247,
    "preview": "IF (APPLE)\n    cmake_minimum_required(VERSION 3.4)\nELSE()\n    cmake_minimum_required(VERSION 2.8)\nENDIF()\n\nproject(rnnt_"
  },
  {
    "path": "debug/debug_dataset.py",
    "chars": 2278,
    "preview": "from argparse import ArgumentParser\nimport os\nimport json\nimport sys\nimport tensorflow as tf\n\nFILE_DIR = os.path.dirname"
  },
  {
    "path": "debug/get_common_voice_stats.py",
    "chars": 1401,
    "preview": "from argparse import ArgumentParser\nfrom scipy.io.wavfile import read as read_wav\nimport glob\nimport os\n\n\ndef main(args)"
  },
  {
    "path": "hparams.py",
    "chars": 2321,
    "preview": "from tensorboard.plugins.hparams import api as hp\n\nHP_TOKEN_TYPE = hp.HParam('token_type', hp.Discrete(['word-piece', 'c"
  },
  {
    "path": "model.py",
    "chars": 5327,
    "preview": "import re\nimport os\nimport tensorflow as tf\n\nfrom hparams import *\n\n\nclass TimeReduction(tf.keras.layers.Layer):\n\n    de"
  },
  {
    "path": "preprocess_common_voice.py",
    "chars": 2578,
    "preview": "from absl import app, logging, flags\nimport os\nimport json\nimport tensorflow as tf\n\nfrom utils import preprocessing, enc"
  },
  {
    "path": "preprocess_librispeech.py",
    "chars": 3033,
    "preview": "from absl import app, logging, flags\nimport os\nimport json\nimport tensorflow as tf\n\nfrom utils import preprocessing, enc"
  },
  {
    "path": "quantize_model.py",
    "chars": 1161,
    "preview": "from argparse import ArgumentParser\nimport os\nimport tensorflow as tf\n\nfrom utils import model as model_utils\n\n\ndef main"
  },
  {
    "path": "requirements.txt",
    "chars": 81,
    "preview": "pydub>=0.23.1\nscipy>=1.3.1\ntqdm\ntensorflow-datasets\nsoundfile\nlibrosa\nmatplotlib\n"
  },
  {
    "path": "run_rnnt.py",
    "chars": 19601,
    "preview": "from absl import flags, logging, app\nfrom tensorboard.plugins.hparams import api as hp\nfrom tensorflow.keras.mixed_preci"
  },
  {
    "path": "scripts/build_rnnt.sh",
    "chars": 245,
    "preview": "cp cmake/warp-rnnt-cmakelist.txt warp-transducer/CMakeLists.txt\n\ncd warp-transducer\n\nmkdir build\ncd build\n\nCC=gcc-4.8 CX"
  },
  {
    "path": "scripts/common_voice_convert.sh",
    "chars": 416,
    "preview": "#!/bin/bash\n\nOIFS=\"$IFS\"\nIFS=$'\\n'\n\nFORMAT=.mp3\nDATA_DIR=\"$1\"\nN=${2:-1}\n\nmkdir -p $DATA_DIR\n\nFILES=$(ls \"$DATA_DIR\" | gr"
  },
  {
    "path": "scripts/remove_missing_samples.py",
    "chars": 1347,
    "preview": "from argparse import ArgumentParser\nimport os\n\n\ndef remove_missing(data_dir, fname, replace_old=True):\n\n    clips_dir = "
  },
  {
    "path": "streaming_transcribe.py",
    "chars": 2154,
    "preview": "from argparse import ArgumentParser\nimport os\nimport time\nimport pyaudio\n\nimport tensorflow as tf\ntf.get_logger().setLev"
  },
  {
    "path": "transcribe_file.py",
    "chars": 1453,
    "preview": "from argparse import ArgumentParser\nimport os\n\nimport tensorflow as tf\ntf.get_logger().setLevel('ERROR')\ntf.autograph.se"
  },
  {
    "path": "utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "utils/data/__init__.py",
    "chars": 26,
    "preview": "from . import common_voice"
  },
  {
    "path": "utils/data/common_voice.py",
    "chars": 1158,
    "preview": "import os\nimport tensorflow as tf\n\nfrom .. import preprocessing\n\n\ndef tf_parse_line(line, data_dir):\n\n    line_split = t"
  },
  {
    "path": "utils/data/librispeech.py",
    "chars": 2414,
    "preview": "import os\nimport tensorflow as tf\nimport soundfile as sf\n\n\ndef load_audio(filepath):\n\n    return sf.read(filepath)\n\n\ndef"
  },
  {
    "path": "utils/decoding.py",
    "chars": 6159,
    "preview": "import tensorflow as tf\n\nfrom hparams import *\n\n\ndef joint(model, f, g):\n\n    dense_1 = model.layers[-2]\n    dense_2 = m"
  },
  {
    "path": "utils/encoding.py",
    "chars": 2497,
    "preview": "import os\nimport tensorflow_datasets as tfds\nimport tensorflow as tf\n\nfrom hparams import *\nfrom . import vocabulary, pr"
  },
  {
    "path": "utils/loss.py",
    "chars": 988,
    "preview": "from absl import logging\nimport tensorflow as tf\n\n_has_loss_func = False\ntry:\n    from warprnnt_tensorflow import rnnt_l"
  },
  {
    "path": "utils/metrics.py",
    "chars": 2249,
    "preview": "import tensorflow as tf\n\nfrom . import decoding\n\n\ndef error_rate(y_true, decoded):\n\n    y_true_shape = tf.shape(y_true)\n"
  },
  {
    "path": "utils/model.py",
    "chars": 359,
    "preview": "from absl import logging\nimport os\nimport json\nimport re\n\nfrom model import build_keras_model\n\n\ndef load_hparams(model_d"
  },
  {
    "path": "utils/preprocessing.py",
    "chars": 8555,
    "preview": "import glob\nimport os\nimport librosa.display\nimport librosa\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\nimpo"
  },
  {
    "path": "utils/vocabulary.py",
    "chars": 715,
    "preview": "def init_vocab():\n\n    alphabet = \"abcdefghijklmnopqrstuvwxyz'\"\n    alphabet_c = ['', ' ', '<s>', '</s>'] + [c for c in "
  }
]

About this extraction

This page contains the full source code of the noahchalifour/rnnt-speech-recognition GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 31 files (76.3 KB), approximately 19.5k tokens, and a symbol index with 80 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo