[
  {
    "path": ".gitignore",
    "content": "*.a\n*.o\n*.so\n*.whl\nbenchmark_lstm\nbenchmark_gru\nhaste_lstm\nhaste_gru\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "# ChangeLog\n\n## 0.4.0 (2020-04-13)\n### Added\n- New layer normalized GRU layer (`LayerNormGRU`).\n- New IndRNN layer.\n- CPU support for all PyTorch layers.\n- Support for building PyTorch API on Windows.\n- Added `state` argument to PyTorch layers to specify initial state.\n- Added weight transforms to TensorFlow API (see docs for details).\n- Added `get_weights` method to extract weights from RNN layers (TensorFlow).\n- Added `to_native_weights` and `from_native_weights` to PyTorch API for `LSTM` and `GRU` layers.\n- Validation tests to check for correctness.\n\n### Changed\n- Performance improvements to GRU layer.\n- BREAKING CHANGE: PyTorch layers default to CPU instead of GPU.\n- BREAKING CHANGE: `h` must not be transposed before passing it to `gru::BackwardPass::Iterate`.\n\n### Fixed\n- Multi-GPU training with TensorFlow caused by invalid sharing of `cublasHandle_t`.\n\n## 0.3.0 (2020-03-09)\n### Added\n- PyTorch support.\n- New layer normalized LSTM layer (`LayerNormLSTM`).\n- New fused layer normalization layer.\n\n### Fixed\n- Occasional uninitialized memory use in TensorFlow LSTM implementation.\n\n## 0.2.0 (2020-02-12)\n### Added\n- New time-fused API for LSTM (`lstm::ForwardPass::Run`, `lstm::BackwardPass::Run`).\n- Benchmarking code to evaluate the performance of an implementation.\n\n### Changed\n- Performance improvements to existing iterative LSTM API.\n- BREAKING CHANGE: `h` must not be transposed before passing it to `lstm::BackwardPass::Iterate`.\n- BREAKING CHANGE: `dv` does not need to be allocated and `v` must be passed instead to `lstm::BackwardPass::Iterate`.\n\n## 0.1.0 (2020-01-29)\n### Added\n- Initial release of Haste.\n"
  },
  {
    "path": "LICENSE",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright 2020 LMNT, Inc.\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "Makefile",
    "content": "AR ?= ar\nCXX ?= g++\nNVCC ?= nvcc -ccbin $(CXX)\nPYTHON ?= python\n\nifeq ($(OS),Windows_NT)\nLIBHASTE := haste.lib\nCUDA_HOME ?= $(CUDA_PATH)\nAR := lib\nAR_FLAGS := /nologo /out:$(LIBHASTE)\nNVCC_FLAGS := -x cu -Xcompiler \"/MD\"\nelse\nLIBHASTE := libhaste.a\nCUDA_HOME ?= /usr/local/cuda\nAR ?= ar\nAR_FLAGS := -crv $(LIBHASTE)\nNVCC_FLAGS := -std=c++11 -x cu -Xcompiler -fPIC\nendif\n\nLOCAL_CFLAGS := -I/usr/include/eigen3 -I$(CUDA_HOME)/include -Ilib -O3\nLOCAL_LDFLAGS := -L$(CUDA_HOME)/lib64 -L. -lcudart -lcublas\nGPU_ARCH_FLAGS := -gencode arch=compute_37,code=compute_37 -gencode arch=compute_60,code=compute_60 -gencode arch=compute_70,code=compute_70\n\n# Small enough project that we can just recompile all the time.\n.PHONY: all haste haste_tf haste_pytorch libhaste_tf examples benchmarks clean\n\nall: haste haste_tf haste_pytorch examples benchmarks\n\nhaste:\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/lstm_forward_gpu.cu.cc -o lib/lstm_forward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/lstm_backward_gpu.cu.cc -o lib/lstm_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/gru_forward_gpu.cu.cc -o lib/gru_forward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/gru_backward_gpu.cu.cc -o lib/gru_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_forward_gpu.cu.cc -o lib/layer_norm_forward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_backward_gpu.cu.cc -o lib/layer_norm_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_lstm_forward_gpu.cu.cc -o lib/layer_norm_lstm_forward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_lstm_backward_gpu.cu.cc -o lib/layer_norm_lstm_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_gru_forward_gpu.cu.cc -o lib/layer_norm_gru_forward_gpu.o 
$(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_gru_backward_gpu.cu.cc -o lib/layer_norm_gru_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/indrnn_backward_gpu.cu.cc -o lib/indrnn_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/indrnn_forward_gpu.cu.cc -o lib/indrnn_forward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_indrnn_forward_gpu.cu.cc -o lib/layer_norm_indrnn_forward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(NVCC) $(GPU_ARCH_FLAGS) -c lib/layer_norm_indrnn_backward_gpu.cu.cc -o lib/layer_norm_indrnn_backward_gpu.o $(NVCC_FLAGS) $(LOCAL_CFLAGS)\n\t$(AR) $(AR_FLAGS) lib/*.o\n\nlibhaste_tf: haste\n\t$(eval TF_CFLAGS := $(shell $(PYTHON) -c 'import tensorflow as tf; print(\" \".join(tf.sysconfig.get_compile_flags()))'))\n\t$(eval TF_LDFLAGS := $(shell $(PYTHON) -c 'import tensorflow as tf; print(\" \".join(tf.sysconfig.get_link_flags()))'))\n\t$(CXX) -std=c++11 -c frameworks/tf/lstm.cc -o frameworks/tf/lstm.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/gru.cc -o frameworks/tf/gru.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/layer_norm.cc -o frameworks/tf/layer_norm.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/layer_norm_gru.cc -o frameworks/tf/layer_norm_gru.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/layer_norm_indrnn.cc -o frameworks/tf/layer_norm_indrnn.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/layer_norm_lstm.cc -o frameworks/tf/layer_norm_lstm.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/indrnn.cc -o frameworks/tf/indrnn.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -std=c++11 -c frameworks/tf/support.cc -o frameworks/tf/support.o $(LOCAL_CFLAGS) $(TF_CFLAGS) -fPIC\n\t$(CXX) -shared frameworks/tf/*.o libhaste.a -o frameworks/tf/libhaste_tf.so 
$(LOCAL_LDFLAGS) $(TF_LDFLAGS) -fPIC\n\n# Dependencies handled by setup.py\nhaste_tf:\n\t@$(eval TMP := $(shell mktemp -d))\n\t@cp -r . $(TMP)\n\t@cat build/common.py build/setup.tf.py > $(TMP)/setup.py\n\t@(cd $(TMP); $(PYTHON) setup.py -q bdist_wheel)\n\t@cp $(TMP)/dist/*.whl .\n\t@rm -rf $(TMP)\n\n# Dependencies handled by setup.py\nhaste_pytorch:\n\t@$(eval TMP := $(shell mktemp -d))\n\t@cp -r . $(TMP)\n\t@cat build/common.py build/setup.pytorch.py > $(TMP)/setup.py\n\t@(cd $(TMP); $(PYTHON) setup.py -q bdist_wheel)\n\t@cp $(TMP)/dist/*.whl .\n\t@rm -rf $(TMP)\n\ndist:\n\t@$(eval TMP := $(shell mktemp -d))\n\t@cp -r . $(TMP)\n\t@cp build/MANIFEST.in $(TMP)\n\t@cat build/common.py build/setup.tf.py > $(TMP)/setup.py\n\t@(cd $(TMP); $(PYTHON) setup.py -q sdist)\n\t@cp $(TMP)/dist/*.tar.gz .\n\t@rm -rf $(TMP)\n\t@$(eval TMP := $(shell mktemp -d))\n\t@cp -r . $(TMP)\n\t@cp build/MANIFEST.in $(TMP)\n\t@cat build/common.py build/setup.pytorch.py > $(TMP)/setup.py\n\t@(cd $(TMP); $(PYTHON) setup.py -q sdist)\n\t@cp $(TMP)/dist/*.tar.gz .\n\t@rm -rf $(TMP)\n\nexamples: haste\n\t$(CXX) -std=c++11 examples/lstm.cc $(LIBHASTE) $(LOCAL_CFLAGS) $(LOCAL_LDFLAGS) -o haste_lstm -Wno-ignored-attributes\n\t$(CXX) -std=c++11 examples/gru.cc $(LIBHASTE) $(LOCAL_CFLAGS) $(LOCAL_LDFLAGS) -o haste_gru -Wno-ignored-attributes\n\nbenchmarks: haste\n\t$(CXX) -std=c++11 benchmarks/benchmark_lstm.cc $(LIBHASTE) $(LOCAL_CFLAGS) $(LOCAL_LDFLAGS) -o benchmark_lstm -Wno-ignored-attributes -lcudnn\n\t$(CXX) -std=c++11 benchmarks/benchmark_gru.cc $(LIBHASTE) $(LOCAL_CFLAGS) $(LOCAL_LDFLAGS) -o benchmark_gru -Wno-ignored-attributes -lcudnn\n\nclean:\n\trm -fr benchmark_lstm benchmark_gru haste_lstm haste_gru haste_*.whl haste_*.tar.gz\n\tfind . \\( -iname '*.o' -o -iname '*.so' -o -iname '*.a' -o -iname '*.lib' \\) -delete\n"
  },
  {
    "path": "README.md",
    "content": "<div align=\"center\">\n  <img src=\"https://lmnt.com/assets/haste-logo_social_media.png\">\n</div>\n\n--------------------------------------------------------------------------------\n[![GitHub release (latest SemVer including pre-releases)](https://img.shields.io/github/v/release/lmnt-com/haste?include_prereleases)](https://github.com/lmnt-com/haste/releases) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1hzYhcyvbXYMAUwa3515BszSkhx1UUFSt) [![GitHub](https://img.shields.io/github/license/lmnt-com/haste)](LICENSE)\n\n**We're hiring!**\nIf you like what we're building here, [come join us at LMNT](https://explore.lmnt.com).\n\nHaste is a CUDA implementation of fused RNN layers with built-in [DropConnect](http://proceedings.mlr.press/v28/wan13.html) and [Zoneout](https://arxiv.org/abs/1606.01305) regularization. These layers are exposed through C++ and Python APIs for easy integration into your own projects or machine learning frameworks.\n\nWhich RNN types are supported?\n- [GRU](https://en.wikipedia.org/wiki/Gated_recurrent_unit)\n- [IndRNN](http://arxiv.org/abs/1803.04831)\n- [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)\n- [Layer Normalized GRU](https://arxiv.org/abs/1607.06450)\n- [Layer Normalized LSTM](https://arxiv.org/abs/1607.06450)\n\nWhat's included in this project?\n- a standalone C++ API (`libhaste`)\n- a TensorFlow Python API (`haste_tf`)\n- a PyTorch API (`haste_pytorch`)\n- examples for writing your own custom C++ inference / training code using `libhaste`\n- benchmarking programs to evaluate the performance of RNN implementations\n\nFor questions or feedback about Haste, please open an issue on GitHub or send us an email at [haste@lmnt.com](mailto:haste@lmnt.com).\n\n## Install\nHere's what you'll need to get started:\n- a [CUDA Compute Capability](https://developer.nvidia.com/cuda-gpus) 3.7+ GPU (required)\n- [CUDA 
Toolkit](https://developer.nvidia.com/cuda-toolkit) 10.0+ (required)\n- [TensorFlow GPU](https://www.tensorflow.org/install/gpu) 1.14+ or 2.0+ for TensorFlow integration (optional)\n- [PyTorch](https://pytorch.org) 1.3+ for PyTorch integration (optional)\n- [Eigen 3](http://eigen.tuxfamily.org/) to build the C++ examples (optional)\n- [cuDNN Developer Library](https://developer.nvidia.com/rdp/cudnn-archive) to build benchmarking programs (optional)\n\nOnce you have the prerequisites, you can install with pip or by building the source code.\n\n### Using pip\n```\npip install haste_pytorch\npip install haste_tf\n```\n\n### Building from source\n```\nmake               # Build everything\nmake haste         # Build C++ API\nmake haste_tf      # Build TensorFlow API\nmake haste_pytorch # Build PyTorch API\nmake examples\nmake benchmarks\n```\n\nIf you built the TensorFlow or PyTorch API, install it with `pip`:\n```\npip install haste_tf-*.whl\npip install haste_pytorch-*.whl\n```\n\nIf the CUDA Toolkit that you're building against is not in `/usr/local/cuda`, you must specify the\n`$CUDA_HOME` environment variable before running make:\n```\nCUDA_HOME=/usr/local/cuda-10.2 make\n```\n\n## Performance\nOur LSTM and GRU benchmarks indicate that Haste has the fastest publicly available implementation for nearly all problem sizes. 
The following charts show our LSTM results, but the GRU results are qualitatively similar.\n<table>\n  <tr><td><img src=\"https://lmnt.com/assets/haste/benchmark/report_n=16_c=128.png\"></td><td><img src=\"https://lmnt.com/assets/haste/benchmark/report_n=32_c=256.png\"></td></tr>\n  <tr></tr>\n  <tr><td><img src=\"https://lmnt.com/assets/haste/benchmark/report_n=64_c=128.png\"></td><td><img src=\"https://lmnt.com/assets/haste/benchmark/report_n=128_c=256.png\"></td></tr>\n</table>\n\nHere is our complete LSTM benchmark result grid:\n<br>\n[`N=1 C=64`](https://lmnt.com/assets/haste/benchmark/report_n=1_c=64.png)\n[`N=1 C=128`](https://lmnt.com/assets/haste/benchmark/report_n=1_c=128.png)\n[`N=1 C=256`](https://lmnt.com/assets/haste/benchmark/report_n=1_c=256.png)\n[`N=1 C=512`](https://lmnt.com/assets/haste/benchmark/report_n=1_c=512.png)\n<br>\n[`N=32 C=64`](https://lmnt.com/assets/haste/benchmark/report_n=32_c=64.png)\n[`N=32 C=128`](https://lmnt.com/assets/haste/benchmark/report_n=32_c=128.png)\n[`N=32 C=256`](https://lmnt.com/assets/haste/benchmark/report_n=32_c=256.png)\n[`N=32 C=512`](https://lmnt.com/assets/haste/benchmark/report_n=32_c=512.png)\n<br>\n[`N=64 C=64`](https://lmnt.com/assets/haste/benchmark/report_n=64_c=64.png)\n[`N=64 C=128`](https://lmnt.com/assets/haste/benchmark/report_n=64_c=128.png)\n[`N=64 C=256`](https://lmnt.com/assets/haste/benchmark/report_n=64_c=256.png)\n[`N=64 C=512`](https://lmnt.com/assets/haste/benchmark/report_n=64_c=512.png)\n<br>\n[`N=128 C=64`](https://lmnt.com/assets/haste/benchmark/report_n=128_c=64.png)\n[`N=128 C=128`](https://lmnt.com/assets/haste/benchmark/report_n=128_c=128.png)\n[`N=128 C=256`](https://lmnt.com/assets/haste/benchmark/report_n=128_c=256.png)\n[`N=128 C=512`](https://lmnt.com/assets/haste/benchmark/report_n=128_c=512.png)\n\n## Documentation\n### TensorFlow API\n```python\nimport haste_tf as haste\n\ngru_layer = haste.GRU(num_units=256, direction='bidirectional', zoneout=0.1, 
dropout=0.05)\nindrnn_layer = haste.IndRNN(num_units=256, direction='bidirectional', zoneout=0.1)\nlstm_layer = haste.LSTM(num_units=256, direction='bidirectional', zoneout=0.1, dropout=0.05)\nnorm_gru_layer = haste.LayerNormGRU(num_units=256, direction='bidirectional', zoneout=0.1, dropout=0.05)\nnorm_lstm_layer = haste.LayerNormLSTM(num_units=256, direction='bidirectional', zoneout=0.1, dropout=0.05)\n\n# `x` is a tensor with shape [N,T,C]\nx = tf.random.normal([5, 25, 128])\n\ny, state = gru_layer(x, training=True)\ny, state = indrnn_layer(x, training=True)\ny, state = lstm_layer(x, training=True)\ny, state = norm_gru_layer(x, training=True)\ny, state = norm_lstm_layer(x, training=True)\n```\n\nThe TensorFlow Python API is documented in [`docs/tf/haste_tf.md`](docs/tf/haste_tf.md).\n\n### PyTorch API\n```python\nimport torch\nimport haste_pytorch as haste\n\ngru_layer = haste.GRU(input_size=128, hidden_size=256, zoneout=0.1, dropout=0.05)\nindrnn_layer = haste.IndRNN(input_size=128, hidden_size=256, zoneout=0.1)\nlstm_layer = haste.LSTM(input_size=128, hidden_size=256, zoneout=0.1, dropout=0.05)\nnorm_gru_layer = haste.LayerNormGRU(input_size=128, hidden_size=256, zoneout=0.1, dropout=0.05)\nnorm_lstm_layer = haste.LayerNormLSTM(input_size=128, hidden_size=256, zoneout=0.1, dropout=0.05)\n\ngru_layer.cuda()\nindrnn_layer.cuda()\nlstm_layer.cuda()\nnorm_gru_layer.cuda()\nnorm_lstm_layer.cuda()\n\n# `x` is a CUDA tensor with shape [T,N,C]\nx = torch.rand([25, 5, 128]).cuda()\n\ny, state = gru_layer(x)\ny, state = indrnn_layer(x)\ny, state = lstm_layer(x)\ny, state = norm_gru_layer(x)\ny, state = norm_lstm_layer(x)\n```\n\nThe PyTorch API is documented in [`docs/pytorch/haste_pytorch.md`](docs/pytorch/haste_pytorch.md).\n\n### C++ API\nThe C++ API is documented in [`lib/haste/*.h`](lib/haste/) and there are code samples in [`examples/`](examples/).\n\n## Code layout\n- [`benchmarks/`](benchmarks): programs to evaluate performance of RNN implementations\n- 
[`docs/tf/`](docs/tf): API reference documentation for `haste_tf`\n- [`docs/pytorch/`](docs/pytorch): API reference documentation for `haste_pytorch`\n- [`examples/`](examples): examples for writing your own C++ inference / training code using `libhaste`\n- [`frameworks/tf/`](frameworks/tf): TensorFlow Python API and custom op code\n- [`frameworks/pytorch/`](frameworks/pytorch): PyTorch API and custom op code\n- [`lib/`](lib): CUDA kernels and C++ API\n- [`validation/`](validation): scripts to validate output and gradients of RNN layers\n\n## Implementation notes\n- the GRU implementation is based on `1406.1078v1` (same as cuDNN) rather than `1406.1078v3`\n- Zoneout on LSTM cells is applied to the hidden state only, and not the cell state\n- the layer normalized LSTM implementation uses [these equations](https://github.com/lmnt-com/haste/issues/1)\n\n## References\n1. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. _Neural Computation_, _9_(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735\n1. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. _arXiv:1406.1078 [cs, stat]_. http://arxiv.org/abs/1406.1078.\n1. Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., & Fergus, R. (2013). Regularization of Neural Networks using DropConnect. In _International Conference on Machine Learning_ (pp. 1058–1066). Presented at the International Conference on Machine Learning. http://proceedings.mlr.press/v28/wan13.html.\n1. Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., et al. (2017). Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. _arXiv:1606.01305 [cs]_. http://arxiv.org/abs/1606.01305.\n1. Ba, J., Kiros, J.R., & Hinton, G.E. (2016). Layer Normalization. _arXiv:1607.06450 [cs, stat]_. https://arxiv.org/abs/1607.06450.\n1. Li, S., Li, W., Cook, C., Zhu, C., & Gao, Y. 
(2018). Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN. _arXiv:1803.04831 [cs]_. http://arxiv.org/abs/1803.04831.\n\n## Citing this work\nTo cite this work, please use the following BibTeX entry:\n```\n@misc{haste2020,\n  title  = {Haste: a fast, simple, and open RNN library},\n  author = {Sharvil Nanavati},\n  year   = 2020,\n  month  = \"Jan\",\n  howpublished = {\\url{https://github.com/lmnt-com/haste/}},\n}\n```\n\n## License\n[Apache 2.0](LICENSE)\n"
  },
  {
    "path": "benchmarks/benchmark_gru.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <Eigen/Dense>\n#include <cassert>\n#include <cmath>\n#include <cstdio>\n#include <cstdlib>\n#include <ctime>\n#include <cuda.h>\n#include <cuda_runtime_api.h>\n#include <cudnn.h>\n#include <getopt.h>\n#include <iostream>\n#include <string>\n#include <unsupported/Eigen/CXX11/Tensor>\n#include <vector>\n\n#include \"../examples/device_ptr.h\"\n#include \"cudnn_wrappers.h\"\n#include \"haste.h\"\n\nusing haste::v0::gru::BackwardPass;\nusing haste::v0::gru::ForwardPass;\nusing std::string;\n\nusing Tensor1 = Eigen::Tensor<float, 1>;\nusing Tensor2 = Eigen::Tensor<float, 2>;\nusing Tensor3 = Eigen::Tensor<float, 3>;\n\nstatic constexpr int DEFAULT_SAMPLE_SIZE = 10;\nstatic constexpr int DEFAULT_TIME_STEPS = 50;\n\nstatic cudnnHandle_t g_cudnn_handle;\nstatic cublasHandle_t g_blas_handle;\n\nfloat TimeLoop(std::function<void()> fn, int iterations) {\n  cudaEvent_t start, stop;\n  cudaEventCreate(&start);\n  cudaEventCreate(&stop);\n  cudaEventRecord(start);\n  for (int i = 0; i < iterations; ++i)\n    fn();\n  float elapsed_ms;\n  cudaEventRecord(stop);\n  cudaEventSynchronize(stop);\n  cudaEventElapsedTime(&elapsed_ms, start, stop);\n  cudaEventDestroy(start);\n  cudaEventDestroy(stop);\n  return elapsed_ms / iterations;\n}\n\nfloat CudnnInference(\n 
   int sample_size,\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& bx,\n    const Tensor1& br,\n    const Tensor3& x) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor2> h_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> c_dev(batch_size * hidden_size);\n  device_ptr<Tensor3> y_dev(time_steps * batch_size * hidden_size);\n  device_ptr<Tensor2> h_out_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> c_out_dev(batch_size * hidden_size);\n\n  h_dev.zero();\n  c_dev.zero();\n\n  // Descriptors all the way down. Nice.\n  RnnDescriptor<float> rnn_descriptor(g_cudnn_handle, hidden_size, CUDNN_GRU);\n\n  TensorDescriptorArray<float> x_descriptors(time_steps, { batch_size, input_size, 1 });\n  TensorDescriptorArray<float> y_descriptors(time_steps, { batch_size, hidden_size, 1 });\n\n  auto h_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n  auto c_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n  auto h_out_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n  auto c_out_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n\n  size_t workspace_size;\n  cudnnGetRNNWorkspaceSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      time_steps,\n      &x_descriptors,\n      &workspace_size);\n  auto workspace_dev = device_ptr<Tensor1>::NewByteSized(workspace_size);\n\n  size_t w_count;\n  cudnnGetRNNParamsSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      *&x_descriptors,\n      &w_count,\n      CUDNN_DATA_FLOAT);\n\n  auto w_dev = device_ptr<Tensor1>::NewByteSized(w_count);\n  FilterDescriptor<float> w_descriptor(w_dev.Size());\n\n  float ms = TimeLoop([&]() {\n    cudnnRNNForwardInference(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        
&x_descriptors,\n        x_dev.data,\n        *h_descriptor,\n        h_dev.data,\n        *c_descriptor,\n        c_dev.data,\n        *w_descriptor,\n        w_dev.data,\n        &y_descriptors,\n        y_dev.data,\n        *h_out_descriptor,\n        h_out_dev.data,\n        *c_out_descriptor,\n        c_out_dev.data,\n        workspace_dev.data,\n        workspace_size);\n  }, sample_size);\n  return ms;\n}\n\nfloat CudnnTrain(\n    int sample_size,\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& bx,\n    const Tensor1& br,\n    const Tensor3& x,\n    const Tensor3& dh) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  device_ptr<Tensor3> y_dev(time_steps * batch_size * hidden_size);\n  device_ptr<Tensor3> dy_dev(time_steps * batch_size * hidden_size);\n  device_ptr<Tensor2> dhy_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dcy_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> hx_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> cx_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dhx_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dcx_dev(batch_size * hidden_size);\n\n  RnnDescriptor<float> rnn_descriptor(g_cudnn_handle, hidden_size, CUDNN_GRU);\n  TensorDescriptorArray<float> y_descriptors(time_steps, { batch_size, hidden_size, 1 });\n  TensorDescriptorArray<float> dy_descriptors(time_steps, { batch_size, hidden_size, 1 });\n  TensorDescriptorArray<float> dx_descriptors(time_steps, { batch_size, input_size, 1 });\n\n  TensorDescriptor<float> dhy_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> dcy_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> hx_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> cx_descriptor({ 1, batch_size, hidden_size });\n  
TensorDescriptor<float> dhx_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> dcx_descriptor({ 1, batch_size, hidden_size });\n\n  size_t workspace_size = 0;\n  cudnnGetRNNWorkspaceSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      time_steps,\n      &dx_descriptors,\n      &workspace_size);\n  auto workspace_dev = device_ptr<Tensor1>::NewByteSized(workspace_size);\n\n  size_t w_count = 0;\n  cudnnGetRNNParamsSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      *&dx_descriptors,\n      &w_count,\n      CUDNN_DATA_FLOAT);\n\n  auto w_dev = device_ptr<Tensor1>::NewByteSized(w_count);\n  FilterDescriptor<float> w_descriptor(w_dev.Size());\n\n  size_t reserve_size = 0;\n  cudnnGetRNNTrainingReserveSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      time_steps,\n      &dx_descriptors,\n      &reserve_size);\n  auto reserve_dev = device_ptr<Tensor1>::NewByteSized(reserve_size);\n\n  float ms = TimeLoop([&]() {\n    cudnnRNNForwardTraining(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &dx_descriptors,\n        dx_dev.data,\n        *hx_descriptor,\n        hx_dev.data,\n        *cx_descriptor,\n        cx_dev.data,\n        *w_descriptor,\n        w_dev.data,\n        &y_descriptors,\n        y_dev.data,\n        *dhy_descriptor,\n        dhy_dev.data,\n        *dcy_descriptor,\n        dcy_dev.data,\n        workspace_dev.data,\n        workspace_size,\n        reserve_dev.data,\n        reserve_size);\n\n    cudnnRNNBackwardData(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &y_descriptors,\n        y_dev.data,\n        &dy_descriptors,\n        dy_dev.data,\n        *dhy_descriptor,\n        dhy_dev.data,\n        *dcy_descriptor,\n        dcy_dev.data,\n        *w_descriptor,\n        w_dev.data,\n        *hx_descriptor,\n        hx_dev.data,\n        *cx_descriptor,\n        cx_dev.data,\n        &dx_descriptors,\n        dx_dev.data,\n        
*dhx_descriptor,\n        dhx_dev.data,\n        *dcx_descriptor,\n        dcx_dev.data,\n        workspace_dev.data,\n        workspace_size,\n        reserve_dev.data,\n        reserve_size);\n\n    cudnnRNNBackwardWeights(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &dx_descriptors,\n        dx_dev.data,\n        *hx_descriptor,\n        hx_dev.data,\n        &y_descriptors,\n        y_dev.data,\n        workspace_dev.data,\n        workspace_size,\n        *w_descriptor,\n        w_dev.data,\n        reserve_dev.data,\n        reserve_size);\n  }, sample_size);\n  return ms;\n}\n\nfloat HasteInference(\n    int sample_size,\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& bx,\n    const Tensor1& br,\n    const Tensor3& x) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> bx_dev(bx);\n  device_ptr<Tensor1> br_dev(br);\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor3> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor2> tmp_Wx_dev(time_steps * batch_size * hidden_size * 3);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 3);\n\n  h_dev.zero();\n\n  // Settle down the GPU and off we go!\n  cudaDeviceSynchronize();\n  float ms = TimeLoop([&]() {\n    ForwardPass<float> forward(\n        false,\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    forward.Run(\n        time_steps,\n        W_dev.data,\n        R_dev.data,\n        bx_dev.data,\n        br_dev.data,\n        x_dev.data,\n        h_dev.data,\n        nullptr,\n        tmp_Wx_dev.data,\n        tmp_Rh_dev.data,\n        0.0f,\n        nullptr);\n  }, sample_size);\n  return ms;\n}\n\nfloat HasteTrain(\n    int sample_size,\n   
 const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& bx,\n    const Tensor1& br,\n    const Tensor3& x,\n    const Tensor3& dh) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor3> x_dev(x);\n  device_ptr<Tensor3> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> v_dev(time_steps * batch_size * hidden_size * 4);\n  device_ptr<Tensor2> tmp_Wx_dev(time_steps * batch_size * hidden_size * 3);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 3);\n\n  device_ptr<Tensor2> W_t_dev(W);\n  device_ptr<Tensor2> R_t_dev(R);\n  device_ptr<Tensor1> bx_dev(bx);\n  device_ptr<Tensor1> br_dev(br);\n  device_ptr<Tensor3> x_t_dev(x);\n\n  // These gradients should actually come \"from above\" but we're just allocating\n  // a bunch of uninitialized memory and passing it in.\n  device_ptr<Tensor3> dh_new_dev(dh);\n\n  device_ptr<Tensor3> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dW_dev(input_size * hidden_size * 3);\n  device_ptr<Tensor2> dR_dev(hidden_size * hidden_size * 3);\n  device_ptr<Tensor2> dbx_dev(hidden_size * 3);\n  device_ptr<Tensor2> dbr_dev(hidden_size * 3);\n  device_ptr<Tensor2> dh_dev(batch_size * hidden_size);\n  device_ptr<Tensor3> dp_dev(time_steps * batch_size * hidden_size * 3);\n  device_ptr<Tensor3> dq_dev(time_steps * batch_size * hidden_size * 3);\n\n  ForwardPass<float> forward(\n      true,\n      batch_size,\n      input_size,\n      hidden_size,\n      g_blas_handle);\n\n  BackwardPass<float> backward(\n      batch_size,\n      input_size,\n      hidden_size,\n      g_blas_handle);\n\n  static const float alpha = 1.0f;\n  static const float beta = 0.0f;\n\n  cudaDeviceSynchronize();\n  float ms = TimeLoop([&]() {\n    forward.Run(\n        time_steps,\n        W_dev.data,\n      
  R_dev.data,\n        bx_dev.data,\n        br_dev.data,\n        x_dev.data,\n        h_dev.data,\n        v_dev.data,\n        tmp_Wx_dev.data,\n        tmp_Rh_dev.data,\n        0.0f,\n        nullptr);\n\n    // Haste needs `x`, `W`, and `R` to be transposed between the forward\n    // pass and backward pass. Add these transposes in here to get a fair\n    // measurement of the overall time it takes to run an entire training\n    // loop.\n    cublasSgeam(\n        g_blas_handle,\n        CUBLAS_OP_T, CUBLAS_OP_N,\n        batch_size * time_steps, input_size,\n        &alpha,\n        x_dev.data, input_size,\n        &beta,\n        x_dev.data, batch_size * time_steps,\n        x_t_dev.data, batch_size * time_steps);\n\n    cublasSgeam(\n        g_blas_handle,\n        CUBLAS_OP_T, CUBLAS_OP_N,\n        input_size, hidden_size * 3,\n        &alpha,\n        W_dev.data, hidden_size * 3,\n        &beta,\n        W_dev.data, input_size,\n        W_t_dev.data, input_size);\n\n    cublasSgeam(\n        g_blas_handle,\n        CUBLAS_OP_T, CUBLAS_OP_N,\n        hidden_size, hidden_size * 3,\n        &alpha,\n        R_dev.data, hidden_size * 3,\n        &beta,\n        R_dev.data, hidden_size,\n        R_t_dev.data, hidden_size);\n\n    backward.Run(\n        time_steps,\n        W_t_dev.data,\n        R_t_dev.data,\n        bx_dev.data,\n        br_dev.data,\n        x_t_dev.data,\n        h_dev.data,\n        v_dev.data,\n        dh_new_dev.data,\n        dx_dev.data,\n        dW_dev.data,\n        dR_dev.data,\n        dbx_dev.data,\n        dbr_dev.data,\n        dh_dev.data,\n        dp_dev.data,\n        dq_dev.data,\n        nullptr);\n  }, sample_size);\n  return ms;\n}\n\nvoid usage(const char* name) {\n  printf(\"Usage: %s [OPTION]...\\n\", name);\n  printf(\"  -h, --help\\n\");\n  printf(\"  -i, --implementation IMPL <haste|cudnn> (default: haste)\\n\");\n  printf(\"  -m, --mode MODE           <inference|training> (default: training)\\n\");\n  printf(\"  
-s, --sample_size NUM     number of runs to average over (default: %d)\\n\",\n      DEFAULT_SAMPLE_SIZE);\n  printf(\"  -t, --time_steps NUM      number of time steps in RNN (default: %d)\\n\",\n      DEFAULT_TIME_STEPS);\n}\n\nint main(int argc, char* const* argv) {\n  srand(time(0));\n\n  cudnnCreate(&g_cudnn_handle);\n  cublasCreate(&g_blas_handle);\n\n  static struct option long_options[] = {\n    { \"help\", no_argument, 0, 'h' },\n    { \"implementation\", required_argument, 0, 'i' },\n    { \"mode\", required_argument, 0, 'm' },\n    { \"sample_size\", required_argument, 0, 's' },\n    { \"time_steps\", required_argument, 0, 't' },\n    { 0, 0, 0, 0 }\n  };\n\n  int c;\n  int opt_index;\n  bool inference_flag = false;\n  bool haste_flag = true;\n  int sample_size = DEFAULT_SAMPLE_SIZE;\n  int time_steps = DEFAULT_TIME_STEPS;\n  while ((c = getopt_long(argc, argv, \"hi:m:s:t:\", long_options, &opt_index)) != -1)\n    switch (c) {\n      case 'h':\n        usage(argv[0]);\n        return 0;\n      case 'i':\n        if (optarg[0] == 'c' || optarg[0] == 'C')\n          haste_flag = false;\n        break;\n      case 'm':\n        if (optarg[0] == 'i' || optarg[0] == 'I')\n          inference_flag = true;\n        break;\n      case 's':\n        sscanf(optarg, \"%d\", &sample_size);\n        break;\n      case 't':\n        sscanf(optarg, \"%d\", &time_steps);\n        break;\n    }\n\n  printf(\"# Benchmark configuration:\\n\");\n  printf(\"#   Mode: %s\\n\", inference_flag ? \"inference\" : \"training\");\n  printf(\"#   Implementation: %s\\n\", haste_flag ? 
\"Haste\" : \"cuDNN\");\n  printf(\"#   Sample size: %d\\n\", sample_size);\n  printf(\"#   Time steps: %d\\n\", time_steps);\n  printf(\"#\\n\");\n  printf(\"# batch_size,hidden_size,input_size,time_ms\\n\");\n\n  for (const int N : { 1, 16, 32, 64, 128 }) {\n    for (const int H : { 128, 256, 512, 768, 1024, 1536, 2048, 3072, 4096 }) {\n      for (const int C : { 64, 128, 256, 512 }) {\n        Tensor2 W(H * 3, C);\n        Tensor2 R(H * 3, H);\n        Tensor1 bx(H * 3);\n        Tensor1 br(H * 3);\n        Tensor3 x(C, N, time_steps);\n        Tensor3 dh(H, N, time_steps + 1);\n\n        float ms;\n        if (inference_flag) {\n          if (haste_flag)\n            ms = HasteInference(sample_size, W, R, bx, br, x);\n          else\n            ms = CudnnInference(sample_size, W, R, bx, br, x);\n        } else {\n          if (haste_flag)\n            ms = HasteTrain(sample_size, W, R, bx, br, x, dh);\n          else\n            ms = CudnnTrain(sample_size, W, R, bx, br, x, dh);\n        }\n        printf(\"%d,%d,%d,%f\\n\", N, H, C, ms);\n      }\n    }\n  }\n\n  cublasDestroy(g_blas_handle);\n  cudnnDestroy(g_cudnn_handle);\n  return 0;\n}\n"
  },
  {
    "path": "benchmarks/benchmark_lstm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <Eigen/Dense>\n#include <cassert>\n#include <cmath>\n#include <cstdio>\n#include <cstdlib>\n#include <ctime>\n#include <cuda.h>\n#include <cuda_runtime_api.h>\n#include <cudnn.h>\n#include <functional>\n#include <getopt.h>\n#include <iostream>\n#include <string>\n#include <unsupported/Eigen/CXX11/Tensor>\n#include <vector>\n\n#include \"../examples/device_ptr.h\"\n#include \"cudnn_wrappers.h\"\n#include \"haste.h\"\n\nusing haste::v0::lstm::BackwardPass;\nusing haste::v0::lstm::ForwardPass;\nusing std::string;\n\nusing Tensor1 = Eigen::Tensor<float, 1>;\nusing Tensor2 = Eigen::Tensor<float, 2>;\nusing Tensor3 = Eigen::Tensor<float, 3>;\n\nstatic constexpr int DEFAULT_SAMPLE_SIZE = 10;\nstatic constexpr int DEFAULT_TIME_STEPS = 50;\n\nstatic cudnnHandle_t g_cudnn_handle;\nstatic cublasHandle_t g_blas_handle;\n\nfloat TimeLoop(std::function<void()> fn, int iterations) {\n  cudaEvent_t start, stop;\n  cudaEventCreate(&start);\n  cudaEventCreate(&stop);\n  cudaEventRecord(start);\n  for (int i = 0; i < iterations; ++i)\n    fn();\n  float elapsed_ms;\n  cudaEventRecord(stop);\n  cudaEventSynchronize(stop);\n  cudaEventElapsedTime(&elapsed_ms, start, stop);\n  cudaEventDestroy(start);\n  cudaEventDestroy(stop);\n  return elapsed_ms / iterations;\n}\n\nfloat 
CudnnInference(\n    int sample_size,\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& b,\n    const Tensor3& x) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor2> h_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> c_dev(batch_size * hidden_size);\n  device_ptr<Tensor3> y_dev(time_steps * batch_size * hidden_size);\n  device_ptr<Tensor2> h_out_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> c_out_dev(batch_size * hidden_size);\n\n  h_dev.zero();\n  c_dev.zero();\n\n  // Descriptors all the way down. Nice.\n  RnnDescriptor<float> rnn_descriptor(g_cudnn_handle, hidden_size, CUDNN_LSTM);\n\n  TensorDescriptorArray<float> x_descriptors(time_steps, { batch_size, input_size, 1 });\n  TensorDescriptorArray<float> y_descriptors(time_steps, { batch_size, hidden_size, 1 });\n\n  auto h_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n  auto c_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n  auto h_out_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n  auto c_out_descriptor = TensorDescriptor<float>({ 1, batch_size, hidden_size });\n\n  size_t workspace_size;\n  cudnnGetRNNWorkspaceSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      time_steps,\n      &x_descriptors,\n      &workspace_size);\n  auto workspace_dev = device_ptr<Tensor1>::NewByteSized(workspace_size);\n\n  size_t w_count;\n  cudnnGetRNNParamsSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      *&x_descriptors,\n      &w_count,\n      CUDNN_DATA_FLOAT);\n\n  auto w_dev = device_ptr<Tensor1>::NewByteSized(w_count);\n  FilterDescriptor<float> w_descriptor(w_dev.Size());\n\n  float ms = TimeLoop([&]() {\n    cudnnRNNForwardInference(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &x_descriptors,\n 
       x_dev.data,\n        *h_descriptor,\n        h_dev.data,\n        *c_descriptor,\n        c_dev.data,\n        *w_descriptor,\n        w_dev.data,\n        &y_descriptors,\n        y_dev.data,\n        *h_out_descriptor,\n        h_out_dev.data,\n        *c_out_descriptor,\n        c_out_dev.data,\n        workspace_dev.data,\n        workspace_size);\n  }, sample_size);\n  return ms;\n}\n\nfloat CudnnTrain(\n    int sample_size,\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& b,\n    const Tensor3& x,\n    const Tensor3& dh,\n    const Tensor3& dc) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  device_ptr<Tensor3> y_dev(time_steps * batch_size * hidden_size);\n  device_ptr<Tensor3> dy_dev(time_steps * batch_size * hidden_size);\n  device_ptr<Tensor2> dhy_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dcy_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> hx_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> cx_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dhx_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dcx_dev(batch_size * hidden_size);\n\n  RnnDescriptor<float> rnn_descriptor(g_cudnn_handle, hidden_size, CUDNN_LSTM);\n  TensorDescriptorArray<float> y_descriptors(time_steps, { batch_size, hidden_size, 1 });\n  TensorDescriptorArray<float> dy_descriptors(time_steps, { batch_size, hidden_size, 1 });\n  TensorDescriptorArray<float> dx_descriptors(time_steps, { batch_size, input_size, 1 });\n\n  TensorDescriptor<float> dhy_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> dcy_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> hx_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> cx_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> 
dhx_descriptor({ 1, batch_size, hidden_size });\n  TensorDescriptor<float> dcx_descriptor({ 1, batch_size, hidden_size });\n\n  size_t workspace_size = 0;\n  cudnnGetRNNWorkspaceSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      time_steps,\n      &dx_descriptors,\n      &workspace_size);\n  auto workspace_dev = device_ptr<Tensor1>::NewByteSized(workspace_size);\n\n  size_t w_count = 0;\n  cudnnGetRNNParamsSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      *&dx_descriptors,\n      &w_count,\n      CUDNN_DATA_FLOAT);\n\n  auto w_dev = device_ptr<Tensor1>::NewByteSized(w_count);\n  FilterDescriptor<float> w_descriptor(w_dev.Size());\n\n  size_t reserve_size = 0;\n  cudnnGetRNNTrainingReserveSize(\n      g_cudnn_handle,\n      *rnn_descriptor,\n      time_steps,\n      &dx_descriptors,\n      &reserve_size);\n  auto reserve_dev = device_ptr<Tensor1>::NewByteSized(reserve_size);\n\n  float ms = TimeLoop([&]() {\n    cudnnRNNForwardTraining(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &dx_descriptors,\n        dx_dev.data,\n        *hx_descriptor,\n        hx_dev.data,\n        *cx_descriptor,\n        cx_dev.data,\n        *w_descriptor,\n        w_dev.data,\n        &y_descriptors,\n        y_dev.data,\n        *dhy_descriptor,\n        dhy_dev.data,\n        *dcy_descriptor,\n        dcy_dev.data,\n        workspace_dev.data,\n        workspace_size,\n        reserve_dev.data,\n        reserve_size);\n\n    cudnnRNNBackwardData(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &y_descriptors,\n        y_dev.data,\n        &dy_descriptors,\n        dy_dev.data,\n        *dhy_descriptor,\n        dhy_dev.data,\n        *dcy_descriptor,\n        dcy_dev.data,\n        *w_descriptor,\n        w_dev.data,\n        *hx_descriptor,\n        hx_dev.data,\n        *cx_descriptor,\n        cx_dev.data,\n        &dx_descriptors,\n        dx_dev.data,\n        *dhx_descriptor,\n        
dhx_dev.data,\n        *dcx_descriptor,\n        dcx_dev.data,\n        workspace_dev.data,\n        workspace_size,\n        reserve_dev.data,\n        reserve_size);\n\n    cudnnRNNBackwardWeights(\n        g_cudnn_handle,\n        *rnn_descriptor,\n        time_steps,\n        &dx_descriptors,\n        dx_dev.data,\n        *hx_descriptor,\n        hx_dev.data,\n        &y_descriptors,\n        y_dev.data,\n        workspace_dev.data,\n        workspace_size,\n        *w_descriptor,\n        w_dev.data,\n        reserve_dev.data,\n        reserve_size);\n  }, sample_size);\n  return ms;\n}\n\nfloat HasteInference(\n    int sample_size,\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& b,\n    const Tensor3& x) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> b_dev(b);\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor3> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> c_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> v_dev(time_steps * batch_size * hidden_size * 4);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 4);\n\n  h_dev.zero();\n  c_dev.zero();\n\n  // Settle down the GPU and off we go!\n  cudaDeviceSynchronize();\n  float ms = TimeLoop([&]() {\n    ForwardPass<float> forward(\n        false,\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    forward.Run(\n        time_steps,\n        W_dev.data,\n        R_dev.data,\n        b_dev.data,\n        x_dev.data,\n        h_dev.data,\n        c_dev.data,\n        v_dev.data,\n        tmp_Rh_dev.data,\n        0.0f,\n        nullptr);\n  }, sample_size);\n  return ms;\n}\n\nfloat HasteTrain(\n    int sample_size,\n    const Tensor2& W,\n    
const Tensor2& R,\n    const Tensor1& b,\n    const Tensor3& x,\n    const Tensor3& dh,\n    const Tensor3& dc) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  Eigen::array<int, 3> transpose_x({ 1, 2, 0 });\n  Tensor3 x_t = x.shuffle(transpose_x);\n\n  Eigen::array<int, 2> transpose({ 1, 0 });\n  Tensor2 W_t = W.shuffle(transpose);\n  Tensor2 R_t = R.shuffle(transpose);\n\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor3> x_dev(x);\n  device_ptr<Tensor3> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> c_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> v_dev(time_steps * batch_size * hidden_size * 4);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 4);\n\n  device_ptr<Tensor2> W_t_dev(W_t);\n  device_ptr<Tensor2> R_t_dev(R_t);\n  device_ptr<Tensor1> b_dev(b);\n  device_ptr<Tensor3> x_t_dev(x_t);\n\n  // These gradients should actually come \"from above\" but we're just allocating\n  // a bunch of uninitialized memory and passing it in.\n  device_ptr<Tensor3> dh_new_dev(dh);\n  device_ptr<Tensor3> dc_new_dev(dc);\n\n  device_ptr<Tensor3> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dW_dev(input_size * hidden_size * 4);\n  device_ptr<Tensor2> dR_dev(hidden_size * hidden_size * 4);\n  device_ptr<Tensor2> db_dev(hidden_size * 4);\n  device_ptr<Tensor2> dh_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor2> dc_dev((time_steps + 1) * batch_size * hidden_size);\n\n  dW_dev.zero();\n  dR_dev.zero();\n  db_dev.zero();\n  dh_dev.zero();\n  dc_dev.zero();\n\n  ForwardPass<float> forward(\n      true,\n      batch_size,\n      input_size,\n      hidden_size,\n      g_blas_handle);\n\n  BackwardPass<float> backward(\n      batch_size,\n      input_size,\n      hidden_size,\n      g_blas_handle);\n\n  static 
const float alpha = 1.0f;\n  static const float beta = 0.0f;\n\n  cudaDeviceSynchronize();\n  float ms = TimeLoop([&]() {\n    forward.Run(\n        time_steps,\n        W_dev.data,\n        R_dev.data,\n        b_dev.data,\n        x_dev.data,\n        h_dev.data,\n        c_dev.data,\n        v_dev.data,\n        tmp_Rh_dev.data,\n        0.0f,\n        nullptr);\n\n    // Haste needs `x`, `W`, and `R` to be transposed between the forward\n    // pass and backward pass. Add these transposes in here to get a fair\n    // measurement of the overall time it takes to run an entire training\n    // loop.\n    cublasSgeam(\n        g_blas_handle,\n        CUBLAS_OP_T, CUBLAS_OP_N,\n        batch_size * time_steps, input_size,\n        &alpha,\n        x_dev.data, input_size,\n        &beta,\n        x_dev.data, batch_size * time_steps,\n        x_t_dev.data, batch_size * time_steps);\n\n    cublasSgeam(\n        g_blas_handle,\n        CUBLAS_OP_T, CUBLAS_OP_N,\n        input_size, hidden_size * 4,\n        &alpha,\n        W_dev.data, hidden_size * 4,\n        &beta,\n        W_dev.data, input_size,\n        W_t_dev.data, input_size);\n\n    cublasSgeam(\n        g_blas_handle,\n        CUBLAS_OP_T, CUBLAS_OP_N,\n        hidden_size, hidden_size * 4,\n        &alpha,\n        R_dev.data, hidden_size * 4,\n        &beta,\n        R_dev.data, hidden_size,\n        R_t_dev.data, hidden_size);\n\n    backward.Run(\n        time_steps,\n        W_t_dev.data,\n        R_t_dev.data,\n        b_dev.data,\n        x_t_dev.data,\n        h_dev.data,\n        c_dev.data,\n        dh_new_dev.data,\n        dc_new_dev.data,\n        dx_dev.data,\n        dW_dev.data,\n        dR_dev.data,\n        db_dev.data,\n        dh_dev.data,\n        dc_dev.data,\n        v_dev.data,\n        nullptr);\n  }, sample_size);\n  return ms;\n}\n\nvoid usage(const char* name) {\n  printf(\"Usage: %s [OPTION]...\\n\", name);\n  printf(\"  -h, --help\\n\");\n  printf(\"  -i, --implementation IMPL 
<haste|cudnn> (default: haste)\\n\");\n  printf(\"  -m, --mode MODE           <inference|training> (default: training)\\n\");\n  printf(\"  -s, --sample_size NUM     number of runs to average over (default: %d)\\n\",\n      DEFAULT_SAMPLE_SIZE);\n  printf(\"  -t, --time_steps NUM      number of time steps in RNN (default: %d)\\n\",\n      DEFAULT_TIME_STEPS);\n}\n\nint main(int argc, char* const* argv) {\n  srand(time(0));\n\n  cudnnCreate(&g_cudnn_handle);\n  cublasCreate(&g_blas_handle);\n\n  static struct option long_options[] = {\n    { \"help\", no_argument, 0, 'h' },\n    { \"implementation\", required_argument, 0, 'i' },\n    { \"mode\", required_argument, 0, 'm' },\n    { \"sample_size\", required_argument, 0, 's' },\n    { \"time_steps\", required_argument, 0, 't' },\n    { 0, 0, 0, 0 }\n  };\n\n  int c;\n  int opt_index;\n  bool inference_flag = false;\n  bool haste_flag = true;\n  int sample_size = DEFAULT_SAMPLE_SIZE;\n  int time_steps = DEFAULT_TIME_STEPS;\n  while ((c = getopt_long(argc, argv, \"hi:m:s:t:\", long_options, &opt_index)) != -1)\n    switch (c) {\n      case 'h':\n        usage(argv[0]);\n        return 0;\n      case 'i':\n        if (optarg[0] == 'c' || optarg[0] == 'C')\n          haste_flag = false;\n        break;\n      case 'm':\n        if (optarg[0] == 'i' || optarg[0] == 'I')\n          inference_flag = true;\n        break;\n      case 's':\n        sscanf(optarg, \"%d\", &sample_size);\n        break;\n      case 't':\n        sscanf(optarg, \"%d\", &time_steps);\n        break;\n    }\n\n  printf(\"# Benchmark configuration:\\n\");\n  printf(\"#   Mode: %s\\n\", inference_flag ? \"inference\" : \"training\");\n  printf(\"#   Implementation: %s\\n\", haste_flag ? 
\"Haste\" : \"cuDNN\");\n  printf(\"#   Sample size: %d\\n\", sample_size);\n  printf(\"#   Time steps: %d\\n\", time_steps);\n  printf(\"#\\n\");\n  printf(\"# batch_size,hidden_size,input_size,time_ms\\n\");\n\n  for (const int N : { 1, 16, 32, 64, 128 }) {\n    for (const int H : { 128, 256, 512, 768, 1024, 1536, 2048, 3072, 4096 }) {\n      for (const int C : { 64, 128, 256, 512 }) {\n        Tensor2 W(H * 4, C);\n        Tensor2 R(H * 4, H);\n        Tensor1 b(H * 4);\n        Tensor3 x(C, N, time_steps);\n        Tensor3 dh(H, N, time_steps + 1);\n        Tensor3 dc(H, N, time_steps + 1);\n\n        float ms;\n        if (inference_flag) {\n          if (haste_flag)\n            ms = HasteInference(sample_size, W, R, b, x);\n          else\n            ms = CudnnInference(sample_size, W, R, b, x);\n        } else {\n          if (haste_flag)\n            ms = HasteTrain(sample_size, W, R, b, x, dh, dc);\n          else\n            ms = CudnnTrain(sample_size, W, R, b, x, dh, dc);\n        }\n        printf(\"%d,%d,%d,%f\\n\", N, H, C, ms);\n      }\n    }\n  }\n\n  cublasDestroy(g_blas_handle);\n  cudnnDestroy(g_cudnn_handle);\n  return 0;\n}\n"
  },
  {
    "path": "benchmarks/cudnn_wrappers.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cassert>\n#include <cudnn.h>\n#include <vector>\n\ntemplate<typename T>\nstruct CudnnDataType {};\n\ntemplate<>\nstruct CudnnDataType<float> {\n  static constexpr auto value = CUDNN_DATA_FLOAT;\n};\n\ntemplate<>\nstruct CudnnDataType<double> {\n  static constexpr auto value = CUDNN_DATA_DOUBLE;\n};\n\ntemplate<typename T>\nclass TensorDescriptor {\n  public:\n    TensorDescriptor(const std::vector<int>& dims) {\n      std::vector<int> strides;\n      int stride = 1;\n      for (int i = dims.size() - 1; i >= 0; --i) {\n        strides.insert(strides.begin(), stride);\n        stride *= dims[i];\n      }\n      cudnnCreateTensorDescriptor(&descriptor_);\n      cudnnSetTensorNdDescriptor(descriptor_, CudnnDataType<T>::value, dims.size(), &dims[0], &strides[0]);\n    }\n\n    ~TensorDescriptor() {\n      cudnnDestroyTensorDescriptor(descriptor_);\n    }\n\n    cudnnTensorDescriptor_t& operator*() {\n      return descriptor_;\n    }\n\n  private:\n    cudnnTensorDescriptor_t descriptor_;\n};\n\n\ntemplate<typename T>\nclass TensorDescriptorArray {\n  public:\n    TensorDescriptorArray(int count, const std::vector<int>& dims) {\n      std::vector<int> strides;\n      int stride = 1;\n      for (int i = dims.size() - 1; i >= 0; --i) 
{\n        strides.insert(strides.begin(), stride);\n        stride *= dims[i];\n      }\n      for (int i = 0; i < count; ++i) {\n        cudnnTensorDescriptor_t descriptor;\n        cudnnCreateTensorDescriptor(&descriptor);\n        cudnnSetTensorNdDescriptor(descriptor, CudnnDataType<T>::value, dims.size(), &dims[0], &strides[0]);\n        descriptors_.push_back(descriptor);\n      }\n    }\n\n    ~TensorDescriptorArray() {\n      for (auto& desc : descriptors_)\n        cudnnDestroyTensorDescriptor(desc);\n    }\n\n    cudnnTensorDescriptor_t* operator&() {\n      return &descriptors_[0];\n    }\n\n  private:\n    std::vector<cudnnTensorDescriptor_t> descriptors_;\n};\n\nclass DropoutDescriptor {\n  public:\n    DropoutDescriptor(const cudnnHandle_t& handle) {\n      cudnnCreateDropoutDescriptor(&descriptor_);\n      cudnnSetDropoutDescriptor(descriptor_, handle, 0.0f, nullptr, 0, 0LL);\n    }\n\n    ~DropoutDescriptor() {\n      cudnnDestroyDropoutDescriptor(descriptor_);\n    }\n\n    cudnnDropoutDescriptor_t& operator*() {\n      return descriptor_;\n    }\n\n  private:\n    cudnnDropoutDescriptor_t descriptor_;\n};\n\ntemplate<typename T>\nclass RnnDescriptor {\n  public:\n    RnnDescriptor(const cudnnHandle_t& handle, int size, cudnnRNNMode_t algorithm) : dropout_(handle) {\n      cudnnCreateRNNDescriptor(&descriptor_);\n      cudnnSetRNNDescriptor(\n          handle,\n          descriptor_,\n          size,\n          1,\n          *dropout_,\n          CUDNN_LINEAR_INPUT,\n          CUDNN_UNIDIRECTIONAL,\n          algorithm,\n          CUDNN_RNN_ALGO_STANDARD,\n          CudnnDataType<T>::value);\n    }\n\n    ~RnnDescriptor() {\n      cudnnDestroyRNNDescriptor(descriptor_);\n    }\n\n    cudnnRNNDescriptor_t& operator*() {\n      return descriptor_;\n    }\n\n  private:\n    cudnnRNNDescriptor_t descriptor_;\n    DropoutDescriptor dropout_;\n};\n\ntemplate<typename T>\nclass FilterDescriptor {\n  public:\n    FilterDescriptor(const size_t size) {\n     
 int filter_dim[] = { (int)size, 1, 1 };\n      cudnnCreateFilterDescriptor(&descriptor_);\n      cudnnSetFilterNdDescriptor(descriptor_, CudnnDataType<T>::value, CUDNN_TENSOR_NCHW, 3, filter_dim);\n    }\n\n    ~FilterDescriptor() {\n      cudnnDestroyFilterDescriptor(descriptor_);\n    }\n\n    cudnnFilterDescriptor_t& operator*() {\n      return descriptor_;\n    }\n\n  private:\n    cudnnFilterDescriptor_t descriptor_;\n};\n"
  },
  {
    "path": "benchmarks/report.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#    http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport argparse\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport os\n\n\ndef extract(x, predicate):\n  return np.array(list(filter(predicate, x)))\n\n\ndef main(args):\n  np.set_printoptions(suppress=True)\n\n  A = np.loadtxt(args.A, delimiter=',')\n  B = np.loadtxt(args.B, delimiter=',')\n\n  faster = 1.0 - A[:,-1] / B[:,-1]\n\n  print(f'A is faster than B by:')\n  print(f'  mean:   {np.mean(faster)*100:7.4}%')\n  print(f'  std:    {np.std(faster)*100:7.4}%')\n  print(f'  median: {np.median(faster)*100:7.4}%')\n  print(f'  min:    {np.min(faster)*100:7.4}%')\n  print(f'  max:    {np.max(faster)*100:7.4}%')\n\n  for batch_size in np.unique(A[:,0]):\n    for input_size in np.unique(A[:,2]):\n      a = extract(A, lambda x: x[0] == batch_size and x[2] == input_size)\n      b = extract(B, lambda x: x[0] == batch_size and x[2] == input_size)\n      fig, ax = plt.subplots(dpi=200)\n      ax.set_xticks(a[:,1])\n      ax.set_xticklabels(a[:,1].astype(np.int32), rotation=60)\n      ax.tick_params(axis='y', which='both', length=0)\n      ax.spines['top'].set_visible(False)\n      ax.spines['right'].set_visible(False)\n      plt.title(f'batch size={int(batch_size)}, input size={int(input_size)}')\n      plt.plot(a[:,1], a[:,-1], color=args.color[0])\n      
plt.plot(a[:,1], b[:,-1], color=args.color[1])\n      plt.xlabel('hidden size')\n      plt.ylabel('time (ms)')\n      plt.legend(args.name, frameon=False)\n      plt.tight_layout()\n      if args.save:\n        os.makedirs(args.save[0], exist_ok=True)\n        plt.savefig(f'{args.save[0]}/report_n={int(batch_size)}_c={int(input_size)}.png', dpi=200)\n      else:\n        plt.show()\n      plt.close(fig)\n\n\nif __name__ == '__main__':\n  parser = argparse.ArgumentParser()\n  parser.add_argument('--name', nargs=2, default=['A', 'B'])\n  parser.add_argument('--color', nargs=2, default=['#1f77b4', '#2ca02c'])\n  parser.add_argument('--save', nargs=1, default=None)\n  parser.add_argument('A')\n  parser.add_argument('B')\n  main(parser.parse_args())\n"
  },
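The `faster = 1.0 - A[:,-1] / B[:,-1]` statistic in `benchmarks/report.py` reads as "the fraction of B's runtime that A saves." A minimal stand-alone sketch of the same formula, with hypothetical timings (the numbers below are made up):

```python
# The speedup statistic used in benchmarks/report.py:
# faster = 1 - time_A / time_B, i.e. the fraction of B's runtime saved by A.
# These timings (milliseconds) are hypothetical.
a_times = [1.0, 2.0, 4.0]
b_times = [2.0, 2.5, 4.0]

faster = [round(1.0 - a / b, 4) for a, b in zip(a_times, b_times)]
print(faster)  # [0.5, 0.2, 0.0]
```

A value of 0.5 means A takes half of B's time; 0.0 means they are equally fast; a negative value would mean A is slower.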
  {
    "path": "build/MANIFEST.in",
    "content": "include Makefile\ninclude frameworks/tf/*.h\ninclude frameworks/tf/*.cc\ninclude frameworks/pytorch/*.h\ninclude frameworks/pytorch/*.cc\ninclude lib/*.cc\ninclude lib/*.h\ninclude lib/haste/*.h\n"
  },
  {
    "path": "build/common.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\nVERSION = '0.5.0-rc0'\nDESCRIPTION = 'Haste: a fast, simple, and open RNN library.'\nAUTHOR = 'LMNT, Inc.'\nAUTHOR_EMAIL = 'haste@lmnt.com'\nURL = 'https://haste.lmnt.com'\nLICENSE = 'Apache 2.0'\nCLASSIFIERS = [\n  'Development Status :: 4 - Beta',\n  'Intended Audience :: Developers',\n  'Intended Audience :: Education',\n  'Intended Audience :: Science/Research',\n  'License :: OSI Approved :: Apache Software License',\n  'Programming Language :: Python :: 3.6',\n  'Programming Language :: Python :: 3.7',\n  'Programming Language :: Python :: 3.8',\n  'Topic :: Scientific/Engineering :: Mathematics',\n  'Topic :: Software Development :: Libraries :: Python Modules',\n  'Topic :: Software Development :: Libraries',\n]\n\n\n"
  },
  {
    "path": "build/setup.pytorch.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport os\nimport sys\n\nfrom glob import glob\nfrom platform import platform\nfrom torch.utils import cpp_extension\nfrom setuptools import setup\nfrom setuptools.dist import Distribution\n\n\nclass BuildHaste(cpp_extension.BuildExtension):\n  def run(self):\n    os.system('make haste')\n    super().run()\n\n\nbase_path = os.path.dirname(os.path.realpath(__file__))\nif 'Windows' in platform():\n  CUDA_HOME = os.environ.get('CUDA_HOME', os.environ.get('CUDA_PATH'))\n  extra_args = []\nelse:\n  CUDA_HOME = os.environ.get('CUDA_HOME', '/usr/local/cuda')\n  extra_args = ['-Wno-sign-compare']\n\nwith open(f'frameworks/pytorch/_version.py', 'wt') as f:\n  f.write(f'__version__ = \"{VERSION}\"')\n\nextension = cpp_extension.CUDAExtension(\n    'haste_pytorch_lib',\n    sources = glob('frameworks/pytorch/*.cc'),\n    extra_compile_args = extra_args,\n    include_dirs = [os.path.join(base_path, 'lib'), os.path.join(CUDA_HOME, 'include')],\n    libraries = ['haste'],\n    library_dirs = ['.'])\n\nsetup(name = 'haste_pytorch',\n    version = VERSION,\n    description = DESCRIPTION,\n    long_description = open('README.md', 'r',encoding='utf-8').read(),\n    long_description_content_type = 'text/markdown',\n    author = AUTHOR,\n    author_email = AUTHOR_EMAIL,\n    url = 
URL,\n    license = LICENSE,\n    keywords = 'pytorch machine learning rnn lstm gru custom op',\n    packages = ['haste_pytorch'],\n    package_dir = { 'haste_pytorch': 'frameworks/pytorch' },\n    install_requires = [],\n    ext_modules = [extension],\n    cmdclass = { 'build_ext': BuildHaste },\n    classifiers = CLASSIFIERS)\n"
  },
  {
    "path": "build/setup.tf.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport os\nimport sys\n\nfrom setuptools import setup\nfrom setuptools.dist import Distribution\nfrom distutils.command.build import build as _build\n\n\nclass BinaryDistribution(Distribution):\n  \"\"\"This class is needed in order to create OS specific wheels.\"\"\"\n  def has_ext_modules(self):\n    return True\n\n\nclass BuildHaste(_build):\n  def run(self):\n    os.system('make libhaste_tf')\n    super().run()\n\n\nwith open(f'frameworks/tf/_version.py', 'wt') as f:\n  f.write(f'__version__ = \"{VERSION}\"')\n\nsetup(name = 'haste_tf',\n    version = VERSION,\n    description = DESCRIPTION,\n    long_description = open('README.md', 'r').read(),\n    long_description_content_type = 'text/markdown',\n    author = AUTHOR,\n    author_email = AUTHOR_EMAIL,\n    url = URL,\n    license = LICENSE,\n    keywords = 'tensorflow machine learning rnn lstm gru custom op',\n    packages = ['haste_tf'],\n    package_dir = { 'haste_tf': 'frameworks/tf' },\n    package_data = { 'haste_tf': ['*.so'] },\n    install_requires = [],\n    zip_safe = False,\n    distclass = BinaryDistribution,\n    cmdclass = { 'build': BuildHaste },\n    classifiers = CLASSIFIERS)\n"
  },
  {
    "path": "docs/pytorch/haste_pytorch/GRU.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_pytorch.GRU\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"add_module\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"buffers\"/>\n<meta itemprop=\"property\" content=\"children\"/>\n<meta itemprop=\"property\" content=\"cpu\"/>\n<meta itemprop=\"property\" content=\"cuda\"/>\n<meta itemprop=\"property\" content=\"double\"/>\n<meta itemprop=\"property\" content=\"eval\"/>\n<meta itemprop=\"property\" content=\"extra_repr\"/>\n<meta itemprop=\"property\" content=\"float\"/>\n<meta itemprop=\"property\" content=\"forward\"/>\n<meta itemprop=\"property\" content=\"from_native_weights\"/>\n<meta itemprop=\"property\" content=\"half\"/>\n<meta itemprop=\"property\" content=\"load_state_dict\"/>\n<meta itemprop=\"property\" content=\"modules\"/>\n<meta itemprop=\"property\" content=\"named_buffers\"/>\n<meta itemprop=\"property\" content=\"named_children\"/>\n<meta itemprop=\"property\" content=\"named_modules\"/>\n<meta itemprop=\"property\" content=\"named_parameters\"/>\n<meta itemprop=\"property\" content=\"parameters\"/>\n<meta itemprop=\"property\" content=\"register_backward_hook\"/>\n<meta itemprop=\"property\" content=\"register_buffer\"/>\n<meta itemprop=\"property\" content=\"register_forward_hook\"/>\n<meta itemprop=\"property\" content=\"register_forward_pre_hook\"/>\n<meta itemprop=\"property\" content=\"register_parameter\"/>\n<meta itemprop=\"property\" content=\"requires_grad_\"/>\n<meta itemprop=\"property\" content=\"reset_parameters\"/>\n<meta itemprop=\"property\" content=\"share_memory\"/>\n<meta itemprop=\"property\" content=\"state_dict\"/>\n<meta itemprop=\"property\" content=\"to\"/>\n<meta itemprop=\"property\" 
content=\"to_native_weights\"/>\n<meta itemprop=\"property\" content=\"train\"/>\n<meta itemprop=\"property\" content=\"type\"/>\n<meta itemprop=\"property\" content=\"zero_grad\"/>\n</div>\n\n# haste_pytorch.GRU\n\n<!-- Insert buttons and diff -->\n\n\n## Class `GRU`\n\nGated Recurrent Unit layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis GRU layer offers a fused, GPU-accelerated PyTorch op for inference\nand training. There are two commonly-used variants of GRU cells. This one\nimplements 1406.1078v1 which applies the reset gate to the hidden state\nafter matrix multiplication. cuDNN also implements this variant. The other\nvariant, 1406.1078v3, applies the reset gate before matrix multiplication\nand is currently unsupported.\n\nThis layer has built-in support for DropConnect and Zoneout, which are\nboth techniques used to regularize RNNs.\n\nSee [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\nSee [from_native_weights](#from_native_weights) and\n[to_native_weights](#to_native_weights) for compatibility with PyTorch GRUs.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    input_size,\n    hidden_size,\n    batch_first=False,\n    dropout=0.0,\n    zoneout=0.0\n)\n```\n\nInitialize the parameters of the GRU layer.\n\n\n#### Arguments:\n\n\n* <b>`input_size`</b>: int, the feature dimension of the input.\n* <b>`hidden_size`</b>: int, the feature dimension of the output.\n* <b>`batch_first`</b>: (optional) bool, if `True`, then the input and output\n  tensors are provided as `(batch, seq, feature)`.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization.\n\n\n#### Variables:\n\n\n* <b>`kernel`</b>: the input projection weight matrix. Dimensions\n  (input_size, hidden_size * 3) with `z,r,h` gate layout. 
Initialized\n  with Xavier uniform initialization.\n* <b>`recurrent_kernel`</b>: the recurrent projection weight matrix. Dimensions\n  (hidden_size, hidden_size * 3) with `z,r,h` gate layout. Initialized\n  with orthogonal initialization.\n* <b>`bias`</b>: the input projection bias vector. Dimensions (hidden_size * 3) with\n  `z,r,h` gate layout. Initialized to zeros.\n* <b>`recurrent_bias`</b>: the recurrent projection bias vector. Dimensions\n  (hidden_size * 3) with `z,r,h` gate layout. Initialized to zeros.\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    *input,\n    **kwargs\n)\n```\n\nCall self as a function.\n\n\n<h3 id=\"add_module\"><code><a name=\"add_module\">add_module</a></code></h3>\n\n``` python\nadd_module(\n    name,\n    module\n)\n```\n\nAdds a child module to the current module.\n\nThe module can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the child module. The child module can be\n    accessed from this module using the given name\nmodule (Module): child module to be added to the module.\n\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(fn)\n```\n\nApplies ``fn`` recursively to every submodule (as returned by ``.children()``)\nas well as self. 
Typical use includes initializing the parameters of a model\n(see also :ref:`nn-init-doc`).\n\n#### Args:\n\nfn (:class:`Module` -> None): function to be applied to each submodule\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> def init_weights(m):\n    >>>     print(m)\n    >>>     if type(m) == nn.Linear:\n    >>>         m.weight.data.fill_(1.0)\n    >>>         print(m.weight)\n    >>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))\n    >>> net.apply(init_weights)\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    ```\n\n<h3 id=\"buffers\"><code><a name=\"buffers\">buffers</a></code></h3>\n\n``` python\nbuffers(recurse=True)\n```\n\nReturns an iterator over module buffers.\n\n\n#### Args:\n\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. 
Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`torch.Tensor`</b>: module buffer\n\nExample::\n\n    ```\n    >>> for buf in model.buffers():\n    >>>     print(type(buf.data), buf.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"children\"><code><a name=\"children\">children</a></code></h3>\n\n``` python\nchildren()\n```\n\nReturns an iterator over immediate children modules.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a child module\n\n<h3 id=\"cpu\"><code><a name=\"cpu\">cpu</a></code></h3>\n\n``` python\ncpu()\n```\n\nMoves all model parameters and buffers to the CPU.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"cuda\"><code><a name=\"cuda\">cuda</a></code></h3>\n\n``` python\ncuda(device=None)\n```\n\nMoves all model parameters and buffers to the GPU.\n\nThis also makes associated parameters and buffers different objects. So\nit should be called before constructing optimizer if the module will\nlive on GPU while being optimized.\n\n#### Arguments:\n\ndevice (int, optional): if specified, all parameters will be\n    copied to that device\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"double\"><code><a name=\"double\">double</a></code></h3>\n\n``` python\ndouble()\n```\n\nCasts all floating point parameters and buffers to ``double`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"eval\"><code><a name=\"eval\">eval</a></code></h3>\n\n``` python\neval()\n```\n\nSets the module in evaluation mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. 
:class:`Dropout`, :class:`BatchNorm`,\netc.\n\nThis is equivalent with :meth:`self.train(False) <torch.nn.Module.train>`.\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"extra_repr\"><code><a name=\"extra_repr\">extra_repr</a></code></h3>\n\n``` python\nextra_repr()\n```\n\nSet the extra representation of the module\n\nTo print customized extra information, you should reimplement\nthis method in your own modules. Both single-line and multi-line\nstrings are acceptable.\n\n<h3 id=\"float\"><code><a name=\"float\">float</a></code></h3>\n\n``` python\nfloat()\n```\n\nCasts all floating point parameters and buffers to float datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"forward\"><code><a name=\"forward\">forward</a></code></h3>\n\n``` python\nforward(\n    input,\n    state=None,\n    lengths=None\n)\n```\n\nRuns a forward pass of the GRU layer.\n\n\n#### Arguments:\n\n\n* <b>`input`</b>: Tensor, a batch of input sequences to pass through the GRU.\n  Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n  `False`, otherwise (batch_size, seq_len, input_size).\n* <b>`state`</b>: (optional) Tensor, the initial hidden state for this batch of\n  sequences. Dimensions (1, batch_size, hidden_size).\n* <b>`lengths`</b>: (optional) Tensor, list of sequence lengths for each batch\n  element. Dimension (batch_size). This argument may be omitted if\n  all batch elements are unpadded and have the same sequence length.\n\n\n#### Returns:\n\n\n* <b>`output`</b>: Tensor, the output of the GRU layer. Dimensions\n  (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n  or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n  that if `lengths` was specified, the `output` tensor will not be\n  masked. It's the caller's responsibility to either not use the invalid\n  entries or to mask them out before using them.\n* <b>`h_n`</b>: the hidden state for the last sequence item. 
Dimensions\n  (1, batch_size, hidden_size).\n\n<h3 id=\"from_native_weights\"><code><a name=\"from_native_weights\">from_native_weights</a></code></h3>\n\n``` python\nfrom_native_weights(\n    weight_ih_l0,\n    weight_hh_l0,\n    bias_ih_l0,\n    bias_hh_l0\n)\n```\n\nCopies and converts the provided PyTorch GRU weights into this layer.\n\n\n#### Arguments:\n\n\n* <b>`weight_ih_l0`</b>: Parameter, the input-hidden weights of the PyTorch GRU layer.\n* <b>`weight_hh_l0`</b>: Parameter, the hidden-hidden weights of the PyTorch GRU layer.\n* <b>`bias_ih_l0`</b>: Parameter, the input-hidden bias of the PyTorch GRU layer.\n* <b>`bias_hh_l0`</b>: Parameter, the hidden-hidden bias of the PyTorch GRU layer.\n\n<h3 id=\"half\"><code><a name=\"half\">half</a></code></h3>\n\n``` python\nhalf()\n```\n\nCasts all floating point parameters and buffers to ``half`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"load_state_dict\"><code><a name=\"load_state_dict\">load_state_dict</a></code></h3>\n\n``` python\nload_state_dict(\n    state_dict,\n    strict=True\n)\n```\n\nCopies parameters and buffers from :attr:`state_dict` into\nthis module and its descendants. If :attr:`strict` is ``True``, then\nthe keys of :attr:`state_dict` must exactly match the keys returned\nby this module's :meth:`~torch.nn.Module.state_dict` function.\n\n#### Arguments:\n\nstate_dict (dict): a dict containing parameters and\n    persistent buffers.\nstrict (bool, optional): whether to strictly enforce that the keys\n    in :attr:`state_dict` match the keys returned by this module's\n    :meth:`~torch.nn.Module.state_dict` function. 
Default: ``True``\n\n\n\n#### Returns:\n\n``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields:\n    * **missing_keys** is a list of str containing the missing keys\n    * **unexpected_keys** is a list of str containing the unexpected keys\n\n\n<h3 id=\"modules\"><code><a name=\"modules\">modules</a></code></h3>\n\n``` python\nmodules()\n```\n\nReturns an iterator over all modules in the network.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a module in the network\n\n\n#### Note:\n\nDuplicate modules are returned only once. In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    1 -> Linear(in_features=2, out_features=2, bias=True)\n\n<h3 id=\"named_buffers\"><code><a name=\"named_buffers\">named_buffers</a></code></h3>\n\n``` python\nnamed_buffers(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module buffers, yielding both the\nname of the buffer as well as the buffer itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all buffer names.\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. 
Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, torch.Tensor)`</b>: Tuple containing the name and buffer\n\nExample::\n\n    ```\n    >>> for name, buf in self.named_buffers():\n    >>>    if name in ['running_var']:\n    >>>        print(buf.size())\n    ```\n\n<h3 id=\"named_children\"><code><a name=\"named_children\">named_children</a></code></h3>\n\n``` python\nnamed_children()\n```\n\nReturns an iterator over immediate children modules, yielding both\nthe name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple containing a name and child module\n\nExample::\n\n    ```\n    >>> for name, module in model.named_children():\n    >>>     if name in ['conv4', 'conv5']:\n    >>>         print(module)\n    ```\n\n<h3 id=\"named_modules\"><code><a name=\"named_modules\">named_modules</a></code></h3>\n\n``` python\nnamed_modules(\n    memo=None,\n    prefix=''\n)\n```\n\nReturns an iterator over all modules in the network, yielding\nboth the name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple of name and module\n\n\n#### Note:\n\nDuplicate modules are returned only once. 
In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.named_modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> ('', Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    ))\n    1 -> ('0', Linear(in_features=2, out_features=2, bias=True))\n\n<h3 id=\"named_parameters\"><code><a name=\"named_parameters\">named_parameters</a></code></h3>\n\n``` python\nnamed_parameters(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module parameters, yielding both the\nname of the parameter as well as the parameter itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all parameter names.\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, Parameter)`</b>: Tuple containing the name and parameter\n\nExample::\n\n    ```\n    >>> for name, param in self.named_parameters():\n    >>>    if name in ['bias']:\n    >>>        print(param.size())\n    ```\n\n<h3 id=\"parameters\"><code><a name=\"parameters\">parameters</a></code></h3>\n\n``` python\nparameters(recurse=True)\n```\n\nReturns an iterator over module parameters.\n\nThis is typically passed to an optimizer.\n\n#### Args:\n\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. 
Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`Parameter`</b>: module parameter\n\nExample::\n\n    ```\n    >>> for param in model.parameters():\n    >>>     print(type(param.data), param.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"register_backward_hook\"><code><a name=\"register_backward_hook\">register_backward_hook</a></code></h3>\n\n``` python\nregister_backward_hook(hook)\n```\n\nRegisters a backward hook on the module.\n\nThe hook will be called every time the gradients with respect to module\ninputs are computed. The hook should have the following signature::\n\n    hook(module, grad_input, grad_output) -> Tensor or None\n\nThe :attr:`grad_input` and :attr:`grad_output` may be tuples if the\nmodule has multiple inputs or outputs. The hook should not modify its\narguments, but it can optionally return a new gradient with respect to\ninput that will be used in place of :attr:`grad_input` in subsequent\ncomputations.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n.. warning ::\n\n    The current implementation will not have the presented behavior\n    for complex :class:`Module` that perform many operations.\n    In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only\n    contain the gradients for a subset of the inputs and outputs.\n    For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`\n    directly on a specific input or output to get the required gradients.\n\n<h3 id=\"register_buffer\"><code><a name=\"register_buffer\">register_buffer</a></code></h3>\n\n``` python\nregister_buffer(\n    name,\n    tensor\n)\n```\n\nAdds a persistent buffer to the module.\n\nThis is typically used to register a buffer that should not to be\nconsidered a model parameter. 
For example, BatchNorm's ``running_mean``\nis not a parameter, but is part of the persistent state.\n\nBuffers can be accessed as attributes using given names.\n\n#### Args:\n\nname (string): name of the buffer. The buffer can be accessed\n    from this module using the given name\ntensor (Tensor): buffer to be registered.\n\n\nExample::\n\n    ```\n    >>> self.register_buffer('running_mean', torch.zeros(num_features))\n    ```\n\n<h3 id=\"register_forward_hook\"><code><a name=\"register_forward_hook\">register_forward_hook</a></code></h3>\n\n``` python\nregister_forward_hook(hook)\n```\n\nRegisters a forward hook on the module.\n\nThe hook will be called every time after :func:`forward` has computed an output.\nIt should have the following signature::\n\n    hook(module, input, output) -> None or modified output\n\nThe hook can modify the output. It can modify the input inplace but\nit will not have effect on forward since this is called after\n:func:`forward` is called.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_forward_pre_hook\"><code><a name=\"register_forward_pre_hook\">register_forward_pre_hook</a></code></h3>\n\n``` python\nregister_forward_pre_hook(hook)\n```\n\nRegisters a forward pre-hook on the module.\n\nThe hook will be called every time before :func:`forward` is invoked.\nIt should have the following signature::\n\n    hook(module, input) -> None or modified input\n\nThe hook can modify the input. User can either return a tuple or a\nsingle modified value in the hook. 
We will wrap the value into a tuple\nif a single value is returned(unless that value is already a tuple).\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_parameter\"><code><a name=\"register_parameter\">register_parameter</a></code></h3>\n\n``` python\nregister_parameter(\n    name,\n    param\n)\n```\n\nAdds a parameter to the module.\n\nThe parameter can be accessed as an attribute using given name.\n\n#### Args:\n\nname (string): name of the parameter. The parameter can be accessed\n    from this module using the given name\nparam (Parameter): parameter to be added to the module.\n\n\n<h3 id=\"requires_grad_\"><code><a name=\"requires_grad_\">requires_grad_</a></code></h3>\n\n``` python\nrequires_grad_(requires_grad=True)\n```\n\nChange if autograd should record operations on parameters in this\nmodule.\n\nThis method sets the parameters' :attr:`requires_grad` attributes\nin-place.\n\nThis method is helpful for freezing part of the module for finetuning\nor training parts of a model individually (e.g., GAN training).\n\n#### Args:\n\nrequires_grad (bool): whether autograd should record operations on\n                      parameters in this module. Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"reset_parameters\"><code><a name=\"reset_parameters\">reset_parameters</a></code></h3>\n\n``` python\nreset_parameters()\n```\n\nResets this layer's parameters to their initial values.\n\n\n<h3 id=\"share_memory\"><code><a name=\"share_memory\">share_memory</a></code></h3>\n\n``` python\nshare_memory()\n```\n\n\n\n\n<h3 id=\"state_dict\"><code><a name=\"state_dict\">state_dict</a></code></h3>\n\n``` python\nstate_dict(\n    destination=None,\n    prefix='',\n    keep_vars=False\n)\n```\n\nReturns a dictionary containing a whole state of the module.\n\nBoth parameters and persistent buffers (e.g. 
running averages) are\nincluded. Keys are corresponding parameter and buffer names.\n\n#### Returns:\n\n\n* <b>`dict`</b>:     a dictionary containing a whole state of the module\n\nExample::\n\n    ```\n    >>> module.state_dict().keys()\n    ['bias', 'weight']\n    ```\n\n<h3 id=\"to\"><code><a name=\"to\">to</a></code></h3>\n\n``` python\nto(\n    *args,\n    **kwargs\n)\n```\n\nMoves and/or casts the parameters and buffers.\n\nThis can be called as\n\n.. function:: to(device=None, dtype=None, non_blocking=False)\n\n.. function:: to(dtype, non_blocking=False)\n\n.. function:: to(tensor, non_blocking=False)\n\nIts signature is similar to :meth:`torch.Tensor.to`, but only accepts\nfloating point desired :attr:`dtype` s. In addition, this method will\nonly cast the floating point parameters and buffers to :attr:`dtype`\n(if given). The integral parameters and buffers will be moved\n:attr:`device`, if that is given, but with dtypes unchanged. When\n:attr:`non_blocking` is set, it tries to convert/move asynchronously\nwith respect to the host if possible, e.g., moving CPU Tensors with\npinned memory to CUDA devices.\n\nSee below for examples.\n\n.. 
note::\n    This method modifies the module in-place.\n\n#### Args:\n\ndevice (:class:`torch.device`): the desired device of the parameters\n    and buffers in this module\ndtype (:class:`torch.dtype`): the desired floating point type of\n    the floating point parameters and buffers in this module\ntensor (torch.Tensor): Tensor whose dtype and device are the desired\n    dtype and device for all parameters and buffers in this module\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> linear = nn.Linear(2, 2)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]])\n    >>> linear.to(torch.double)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]], dtype=torch.float64)\n    >>> gpu1 = torch.device(\"cuda:1\")\n    >>> linear.to(gpu1, dtype=torch.half, non_blocking=True)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')\n    >>> cpu = torch.device(\"cpu\")\n    >>> linear.to(cpu)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16)\n    ```\n\n<h3 id=\"to_native_weights\"><code><a name=\"to_native_weights\">to_native_weights</a></code></h3>\n\n``` python\nto_native_weights()\n```\n\nConverts Haste GRU weights to native PyTorch GRU weights.\n\n\n#### Returns:\n\n\n* <b>`weight_ih_l0`</b>: Parameter, the input-hidden weights of the GRU layer.\n* <b>`weight_hh_l0`</b>: Parameter, the hidden-hidden weights of the GRU layer.\n* <b>`bias_ih_l0`</b>: Parameter, the input-hidden bias of the GRU layer.\n* <b>`bias_hh_l0`</b>: Parameter, the hidden-hidden bias of the GRU 
layer.\n\n<h3 id=\"train\"><code><a name=\"train\">train</a></code></h3>\n\n``` python\ntrain(mode=True)\n```\n\nSets the module in training mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,\netc.\n\n#### Args:\n\nmode (bool): whether to set training mode (``True``) or evaluation\n             mode (``False``). Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"type\"><code><a name=\"type\">type</a></code></h3>\n\n``` python\ntype(dst_type)\n```\n\nCasts all parameters and buffers to :attr:`dst_type`.\n\n\n#### Arguments:\n\ndst_type (type or string): the desired type\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"zero_grad\"><code><a name=\"zero_grad\">zero_grad</a></code></h3>\n\n``` python\nzero_grad()\n```\n\nSets gradients of all model parameters to zero.\n\n\n\n\n"
  },
  {
    "path": "docs/pytorch/haste_pytorch/IndRNN.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_pytorch.IndRNN\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"add_module\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"buffers\"/>\n<meta itemprop=\"property\" content=\"children\"/>\n<meta itemprop=\"property\" content=\"cpu\"/>\n<meta itemprop=\"property\" content=\"cuda\"/>\n<meta itemprop=\"property\" content=\"double\"/>\n<meta itemprop=\"property\" content=\"eval\"/>\n<meta itemprop=\"property\" content=\"extra_repr\"/>\n<meta itemprop=\"property\" content=\"float\"/>\n<meta itemprop=\"property\" content=\"forward\"/>\n<meta itemprop=\"property\" content=\"half\"/>\n<meta itemprop=\"property\" content=\"load_state_dict\"/>\n<meta itemprop=\"property\" content=\"modules\"/>\n<meta itemprop=\"property\" content=\"named_buffers\"/>\n<meta itemprop=\"property\" content=\"named_children\"/>\n<meta itemprop=\"property\" content=\"named_modules\"/>\n<meta itemprop=\"property\" content=\"named_parameters\"/>\n<meta itemprop=\"property\" content=\"parameters\"/>\n<meta itemprop=\"property\" content=\"register_backward_hook\"/>\n<meta itemprop=\"property\" content=\"register_buffer\"/>\n<meta itemprop=\"property\" content=\"register_forward_hook\"/>\n<meta itemprop=\"property\" content=\"register_forward_pre_hook\"/>\n<meta itemprop=\"property\" content=\"register_parameter\"/>\n<meta itemprop=\"property\" content=\"requires_grad_\"/>\n<meta itemprop=\"property\" content=\"reset_parameters\"/>\n<meta itemprop=\"property\" content=\"share_memory\"/>\n<meta itemprop=\"property\" content=\"state_dict\"/>\n<meta itemprop=\"property\" content=\"to\"/>\n<meta itemprop=\"property\" content=\"train\"/>\n<meta itemprop=\"property\" content=\"type\"/>\n<meta 
itemprop=\"property\" content=\"zero_grad\"/>\n</div>\n\n# haste_pytorch.IndRNN\n\n<!-- Insert buttons and diff -->\n\n\n## Class `IndRNN`\n\nIndependently Recurrent Neural Network layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis layer offers a fused, GPU-accelerated PyTorch op for inference and\ntraining. It also supports Zoneout regularization.\n\nSee [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    input_size,\n    hidden_size,\n    batch_first=False,\n    zoneout=0.0\n)\n```\n\nInitialize the parameters of the IndRNN layer.\n\n\n#### Arguments:\n\n\n* <b>`input_size`</b>: int, the feature dimension of the input.\n* <b>`hidden_size`</b>: int, the feature dimension of the output.\n* <b>`batch_first`</b>: (optional) bool, if `True`, then the input and output\n  tensors are provided as `(batch, seq, feature)`.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization.\n\n\n#### Variables:\n\n\n* <b>`kernel`</b>: the input projection weight matrix. Dimensions\n  (input_size, hidden_size). Initialized with Xavier uniform\n  initialization.\n* <b>`recurrent_scale`</b>: the recurrent scale weight vector. Dimensions\n  (hidden_size). Initialized uniformly in [-0.5, 0.5]. Note that this\n  initialization scheme is different than in the original authors'\n  implementation. See https://github.com/lmnt-com/haste/issues/7 for\n  details.\n* <b>`bias`</b>: the RNN bias vector. Dimensions (hidden_size). 
Initialized to zeros.\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    *input,\n    **kwargs\n)\n```\n\nCall self as a function.\n\n\n<h3 id=\"add_module\"><code><a name=\"add_module\">add_module</a></code></h3>\n\n``` python\nadd_module(\n    name,\n    module\n)\n```\n\nAdds a child module to the current module.\n\nThe module can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the child module. The child module can be\n    accessed from this module using the given name\nmodule (Module): child module to be added to the module.\n\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(fn)\n```\n\nApplies ``fn`` recursively to every submodule (as returned by ``.children()``)\nas well as self. Typical use includes initializing the parameters of a model\n(see also :ref:`nn-init-doc`).\n\n#### Args:\n\nfn (:class:`Module` -> None): function to be applied to each submodule\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> def init_weights(m):\n    >>>     print(m)\n    >>>     if type(m) == nn.Linear:\n    >>>         m.weight.data.fill_(1.0)\n    >>>         print(m.weight)\n    >>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))\n    >>> net.apply(init_weights)\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    ```\n\n<h3 id=\"buffers\"><code><a 
name=\"buffers\">buffers</a></code></h3>\n\n``` python\nbuffers(recurse=True)\n```\n\nReturns an iterator over module buffers.\n\n\n#### Args:\n\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`torch.Tensor`</b>: module buffer\n\nExample::\n\n    ```\n    >>> for buf in model.buffers():\n    >>>     print(type(buf.data), buf.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"children\"><code><a name=\"children\">children</a></code></h3>\n\n``` python\nchildren()\n```\n\nReturns an iterator over immediate children modules.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a child module\n\n<h3 id=\"cpu\"><code><a name=\"cpu\">cpu</a></code></h3>\n\n``` python\ncpu()\n```\n\nMoves all model parameters and buffers to the CPU.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"cuda\"><code><a name=\"cuda\">cuda</a></code></h3>\n\n``` python\ncuda(device=None)\n```\n\nMoves all model parameters and buffers to the GPU.\n\nThis also makes associated parameters and buffers different objects. So\nit should be called before constructing optimizer if the module will\nlive on GPU while being optimized.\n\n#### Arguments:\n\ndevice (int, optional): if specified, all parameters will be\n    copied to that device\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"double\"><code><a name=\"double\">double</a></code></h3>\n\n``` python\ndouble()\n```\n\nCasts all floating point parameters and buffers to ``double`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"eval\"><code><a name=\"eval\">eval</a></code></h3>\n\n``` python\neval()\n```\n\nSets the module in evaluation mode.\n\nThis has any effect only on certain modules. 
See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,\netc.\n\nThis is equivalent to :meth:`self.train(False) <torch.nn.Module.train>`.\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"extra_repr\"><code><a name=\"extra_repr\">extra_repr</a></code></h3>\n\n``` python\nextra_repr()\n```\n\nSet the extra representation of the module\n\nTo print customized extra information, you should reimplement\nthis method in your own modules. Both single-line and multi-line\nstrings are acceptable.\n\n<h3 id=\"float\"><code><a name=\"float\">float</a></code></h3>\n\n``` python\nfloat()\n```\n\nCasts all floating point parameters and buffers to float datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"forward\"><code><a name=\"forward\">forward</a></code></h3>\n\n``` python\nforward(\n    input,\n    state=None,\n    lengths=None\n)\n```\n\nRuns a forward pass of the IndRNN layer.\n\n\n#### Arguments:\n\n\n* <b>`input`</b>: Tensor, a batch of input sequences to pass through the IndRNN.\n  Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n  `False`, otherwise (batch_size, seq_len, input_size).\n* <b>`state`</b>: (optional) Tensor, the initial state for each batch element in\n  `input`. Dimensions (1, batch_size, hidden_size). Defaults to zeros.\n* <b>`lengths`</b>: (optional) Tensor, list of sequence lengths for each batch\n  element. Dimension (batch_size). This argument may be omitted if\n  all batch elements are unpadded and have the same sequence length.\n\n\n#### Returns:\n\n\n* <b>`output`</b>: Tensor, the output of the IndRNN layer. Dimensions\n  (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n  or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n  that if `lengths` was specified, the `output` tensor will not be\n  masked. 
It's the caller's responsibility to either not use the invalid\n  entries or to mask them out before using them.\n* <b>`state`</b>: the hidden state for the last sequence item. Dimensions\n  (1, batch_size, hidden_size).\n\n<h3 id=\"half\"><code><a name=\"half\">half</a></code></h3>\n\n``` python\nhalf()\n```\n\nCasts all floating point parameters and buffers to ``half`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"load_state_dict\"><code><a name=\"load_state_dict\">load_state_dict</a></code></h3>\n\n``` python\nload_state_dict(\n    state_dict,\n    strict=True\n)\n```\n\nCopies parameters and buffers from :attr:`state_dict` into\nthis module and its descendants. If :attr:`strict` is ``True``, then\nthe keys of :attr:`state_dict` must exactly match the keys returned\nby this module's :meth:`~torch.nn.Module.state_dict` function.\n\n#### Arguments:\n\nstate_dict (dict): a dict containing parameters and\n    persistent buffers.\nstrict (bool, optional): whether to strictly enforce that the keys\n    in :attr:`state_dict` match the keys returned by this module's\n    :meth:`~torch.nn.Module.state_dict` function. Default: ``True``\n\n\n\n#### Returns:\n\n``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields:\n    * **missing_keys** is a list of str containing the missing keys\n    * **unexpected_keys** is a list of str containing the unexpected keys\n\n\n<h3 id=\"modules\"><code><a name=\"modules\">modules</a></code></h3>\n\n``` python\nmodules()\n```\n\nReturns an iterator over all modules in the network.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a module in the network\n\n\n#### Note:\n\nDuplicate modules are returned only once. 
In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    1 -> Linear(in_features=2, out_features=2, bias=True)\n\n<h3 id=\"named_buffers\"><code><a name=\"named_buffers\">named_buffers</a></code></h3>\n\n``` python\nnamed_buffers(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module buffers, yielding both the\nname of the buffer as well as the buffer itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all buffer names.\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, torch.Tensor)`</b>: Tuple containing the name and buffer\n\nExample::\n\n    ```\n    >>> for name, buf in self.named_buffers():\n    >>>    if name in ['running_var']:\n    >>>        print(buf.size())\n    ```\n\n<h3 id=\"named_children\"><code><a name=\"named_children\">named_children</a></code></h3>\n\n``` python\nnamed_children()\n```\n\nReturns an iterator over immediate children modules, yielding both\nthe name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple containing a name and child module\n\nExample::\n\n    ```\n    >>> for name, module in model.named_children():\n    >>>     if name in ['conv4', 'conv5']:\n    >>>         print(module)\n    ```\n\n<h3 id=\"named_modules\"><code><a name=\"named_modules\">named_modules</a></code></h3>\n\n``` python\nnamed_modules(\n    memo=None,\n    prefix=''\n)\n```\n\nReturns an iterator over all modules in the network, yielding\nboth the name of the module as well as the module 
itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple of name and module\n\n\n#### Note:\n\nDuplicate modules are returned only once. In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.named_modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> ('', Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    ))\n    1 -> ('0', Linear(in_features=2, out_features=2, bias=True))\n\n<h3 id=\"named_parameters\"><code><a name=\"named_parameters\">named_parameters</a></code></h3>\n\n``` python\nnamed_parameters(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module parameters, yielding both the\nname of the parameter as well as the parameter itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all parameter names.\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, Parameter)`</b>: Tuple containing the name and parameter\n\nExample::\n\n    ```\n    >>> for name, param in self.named_parameters():\n    >>>    if name in ['bias']:\n    >>>        print(param.size())\n    ```\n\n<h3 id=\"parameters\"><code><a name=\"parameters\">parameters</a></code></h3>\n\n``` python\nparameters(recurse=True)\n```\n\nReturns an iterator over module parameters.\n\nThis is typically passed to an optimizer.\n\n#### Args:\n\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. 
Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`Parameter`</b>: module parameter\n\nExample::\n\n    ```\n    >>> for param in model.parameters():\n    >>>     print(type(param.data), param.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"register_backward_hook\"><code><a name=\"register_backward_hook\">register_backward_hook</a></code></h3>\n\n``` python\nregister_backward_hook(hook)\n```\n\nRegisters a backward hook on the module.\n\nThe hook will be called every time the gradients with respect to module\ninputs are computed. The hook should have the following signature::\n\n    hook(module, grad_input, grad_output) -> Tensor or None\n\nThe :attr:`grad_input` and :attr:`grad_output` may be tuples if the\nmodule has multiple inputs or outputs. The hook should not modify its\narguments, but it can optionally return a new gradient with respect to\ninput that will be used in place of :attr:`grad_input` in subsequent\ncomputations.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n.. warning ::\n\n    The current implementation will not have the presented behavior\n    for complex :class:`Module` that perform many operations.\n    In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only\n    contain the gradients for a subset of the inputs and outputs.\n    For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`\n    directly on a specific input or output to get the required gradients.\n\n<h3 id=\"register_buffer\"><code><a name=\"register_buffer\">register_buffer</a></code></h3>\n\n``` python\nregister_buffer(\n    name,\n    tensor\n)\n```\n\nAdds a persistent buffer to the module.\n\nThis is typically used to register a buffer that should not to be\nconsidered a model parameter. 
For example, BatchNorm's ``running_mean``\nis not a parameter, but is part of the persistent state.\n\nBuffers can be accessed as attributes using given names.\n\n#### Args:\n\nname (string): name of the buffer. The buffer can be accessed\n    from this module using the given name\ntensor (Tensor): buffer to be registered.\n\n\nExample::\n\n    ```\n    >>> self.register_buffer('running_mean', torch.zeros(num_features))\n    ```\n\n<h3 id=\"register_forward_hook\"><code><a name=\"register_forward_hook\">register_forward_hook</a></code></h3>\n\n``` python\nregister_forward_hook(hook)\n```\n\nRegisters a forward hook on the module.\n\nThe hook will be called every time after :func:`forward` has computed an output.\nIt should have the following signature::\n\n    hook(module, input, output) -> None or modified output\n\nThe hook can modify the output. It can modify the input inplace but\nit will not have effect on forward since this is called after\n:func:`forward` is called.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_forward_pre_hook\"><code><a name=\"register_forward_pre_hook\">register_forward_pre_hook</a></code></h3>\n\n``` python\nregister_forward_pre_hook(hook)\n```\n\nRegisters a forward pre-hook on the module.\n\nThe hook will be called every time before :func:`forward` is invoked.\nIt should have the following signature::\n\n    hook(module, input) -> None or modified input\n\nThe hook can modify the input. User can either return a tuple or a\nsingle modified value in the hook. 
We will wrap the value into a tuple\nif a single value is returned (unless that value is already a tuple).\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_parameter\"><code><a name=\"register_parameter\">register_parameter</a></code></h3>\n\n``` python\nregister_parameter(\n    name,\n    param\n)\n```\n\nAdds a parameter to the module.\n\nThe parameter can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the parameter. The parameter can be accessed\n    from this module using the given name\nparam (Parameter): parameter to be added to the module.\n\n\n<h3 id=\"requires_grad_\"><code><a name=\"requires_grad_\">requires_grad_</a></code></h3>\n\n``` python\nrequires_grad_(requires_grad=True)\n```\n\nChange if autograd should record operations on parameters in this\nmodule.\n\nThis method sets the parameters' :attr:`requires_grad` attributes\nin-place.\n\nThis method is helpful for freezing part of the module for finetuning\nor training parts of a model individually (e.g., GAN training).\n\n#### Args:\n\nrequires_grad (bool): whether autograd should record operations on\n                      parameters in this module. Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"reset_parameters\"><code><a name=\"reset_parameters\">reset_parameters</a></code></h3>\n\n``` python\nreset_parameters()\n```\n\nResets this layer's parameters to their initial values.\n\n\n<h3 id=\"share_memory\"><code><a name=\"share_memory\">share_memory</a></code></h3>\n\n``` python\nshare_memory()\n```\n\n\n\n\n<h3 id=\"state_dict\"><code><a name=\"state_dict\">state_dict</a></code></h3>\n\n``` python\nstate_dict(\n    destination=None,\n    prefix='',\n    keep_vars=False\n)\n```\n\nReturns a dictionary containing a whole state of the module.\n\nBoth parameters and persistent buffers (e.g. running averages) are\nincluded. 
Keys are corresponding parameter and buffer names.\n\n#### Returns:\n\n\n* <b>`dict`</b>:     a dictionary containing a whole state of the module\n\nExample::\n\n    ```\n    >>> module.state_dict().keys()\n    ['bias', 'weight']\n    ```\n\n<h3 id=\"to\"><code><a name=\"to\">to</a></code></h3>\n\n``` python\nto(\n    *args,\n    **kwargs\n)\n```\n\nMoves and/or casts the parameters and buffers.\n\nThis can be called as\n\n.. function:: to(device=None, dtype=None, non_blocking=False)\n\n.. function:: to(dtype, non_blocking=False)\n\n.. function:: to(tensor, non_blocking=False)\n\nIts signature is similar to :meth:`torch.Tensor.to`, but only accepts\nfloating point desired :attr:`dtype` s. In addition, this method will\nonly cast the floating point parameters and buffers to :attr:`dtype`\n(if given). The integral parameters and buffers will be moved\n:attr:`device`, if that is given, but with dtypes unchanged. When\n:attr:`non_blocking` is set, it tries to convert/move asynchronously\nwith respect to the host if possible, e.g., moving CPU Tensors with\npinned memory to CUDA devices.\n\nSee below for examples.\n\n.. 
note::\n    This method modifies the module in-place.\n\n#### Args:\n\ndevice (:class:`torch.device`): the desired device of the parameters\n    and buffers in this module\ndtype (:class:`torch.dtype`): the desired floating point type of\n    the floating point parameters and buffers in this module\ntensor (torch.Tensor): Tensor whose dtype and device are the desired\n    dtype and device for all parameters and buffers in this module\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> linear = nn.Linear(2, 2)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]])\n    >>> linear.to(torch.double)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]], dtype=torch.float64)\n    >>> gpu1 = torch.device(\"cuda:1\")\n    >>> linear.to(gpu1, dtype=torch.half, non_blocking=True)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')\n    >>> cpu = torch.device(\"cpu\")\n    >>> linear.to(cpu)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16)\n    ```\n\n<h3 id=\"train\"><code><a name=\"train\">train</a></code></h3>\n\n``` python\ntrain(mode=True)\n```\n\nSets the module in training mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,\netc.\n\n#### Args:\n\nmode (bool): whether to set training mode (``True``) or evaluation\n             mode (``False``). 
Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"type\"><code><a name=\"type\">type</a></code></h3>\n\n``` python\ntype(dst_type)\n```\n\nCasts all parameters and buffers to :attr:`dst_type`.\n\n\n#### Arguments:\n\ndst_type (type or string): the desired type\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"zero_grad\"><code><a name=\"zero_grad\">zero_grad</a></code></h3>\n\n``` python\nzero_grad()\n```\n\nSets gradients of all model parameters to zero.\n\n\n\n\n"
  },
  {
    "path": "docs/pytorch/haste_pytorch/LSTM.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_pytorch.LSTM\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"add_module\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"buffers\"/>\n<meta itemprop=\"property\" content=\"children\"/>\n<meta itemprop=\"property\" content=\"cpu\"/>\n<meta itemprop=\"property\" content=\"cuda\"/>\n<meta itemprop=\"property\" content=\"double\"/>\n<meta itemprop=\"property\" content=\"eval\"/>\n<meta itemprop=\"property\" content=\"extra_repr\"/>\n<meta itemprop=\"property\" content=\"float\"/>\n<meta itemprop=\"property\" content=\"forward\"/>\n<meta itemprop=\"property\" content=\"from_native_weights\"/>\n<meta itemprop=\"property\" content=\"half\"/>\n<meta itemprop=\"property\" content=\"load_state_dict\"/>\n<meta itemprop=\"property\" content=\"modules\"/>\n<meta itemprop=\"property\" content=\"named_buffers\"/>\n<meta itemprop=\"property\" content=\"named_children\"/>\n<meta itemprop=\"property\" content=\"named_modules\"/>\n<meta itemprop=\"property\" content=\"named_parameters\"/>\n<meta itemprop=\"property\" content=\"parameters\"/>\n<meta itemprop=\"property\" content=\"register_backward_hook\"/>\n<meta itemprop=\"property\" content=\"register_buffer\"/>\n<meta itemprop=\"property\" content=\"register_forward_hook\"/>\n<meta itemprop=\"property\" content=\"register_forward_pre_hook\"/>\n<meta itemprop=\"property\" content=\"register_parameter\"/>\n<meta itemprop=\"property\" content=\"requires_grad_\"/>\n<meta itemprop=\"property\" content=\"reset_parameters\"/>\n<meta itemprop=\"property\" content=\"share_memory\"/>\n<meta itemprop=\"property\" content=\"state_dict\"/>\n<meta itemprop=\"property\" content=\"to\"/>\n<meta itemprop=\"property\" 
content=\"to_native_weights\"/>\n<meta itemprop=\"property\" content=\"train\"/>\n<meta itemprop=\"property\" content=\"type\"/>\n<meta itemprop=\"property\" content=\"zero_grad\"/>\n</div>\n\n# haste_pytorch.LSTM\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LSTM`\n\nLong Short-Term Memory layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis LSTM layer offers a fused, GPU-accelerated PyTorch op for inference\nand training. Although this implementation is comparable in performance to\ncuDNN's LSTM, it offers additional options not typically found in other\nhigh-performance implementations. DropConnect and Zoneout regularization are\nbuilt-in, and this layer allows setting a non-zero initial forget gate bias.\n\nSee [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for general usage.\nSee [from_native_weights](#from_native_weights) and\n[to_native_weights](#to_native_weights) for compatibility with PyTorch LSTMs.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    input_size,\n    hidden_size,\n    batch_first=False,\n    forget_bias=1.0,\n    dropout=0.0,\n    zoneout=0.0\n)\n```\n\nInitialize the parameters of the LSTM layer.\n\n\n#### Arguments:\n\n\n* <b>`input_size`</b>: int, the feature dimension of the input.\n* <b>`hidden_size`</b>: int, the feature dimension of the output.\n* <b>`batch_first`</b>: (optional) bool, if `True`, then the input and output\n  tensors are provided as `(batch, seq, feature)`.\n* <b>`forget_bias`</b>: (optional) float, sets the initial bias of the forget gate\n  for this LSTM cell.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization.\n\n\n#### Variables:\n\n\n* <b>`kernel`</b>: the input projection weight matrix. Dimensions\n  (input_size, hidden_size * 4) with `i,g,f,o` gate layout. 
Initialized\n  with Xavier uniform initialization.\n* <b>`recurrent_kernel`</b>: the recurrent projection weight matrix. Dimensions\n  (hidden_size, hidden_size * 4) with `i,g,f,o` gate layout. Initialized\n  with orthogonal initialization.\n* <b>`bias`</b>: the projection bias vector. Dimensions (hidden_size * 4) with\n  `i,g,f,o` gate layout. The forget gate biases are initialized to\n  `forget_bias` and the rest are zeros.\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    *input,\n    **kwargs\n)\n```\n\nCall self as a function.\n\n\n<h3 id=\"add_module\"><code><a name=\"add_module\">add_module</a></code></h3>\n\n``` python\nadd_module(\n    name,\n    module\n)\n```\n\nAdds a child module to the current module.\n\nThe module can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the child module. The child module can be\n    accessed from this module using the given name\nmodule (Module): child module to be added to the module.\n\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(fn)\n```\n\nApplies ``fn`` recursively to every submodule (as returned by ``.children()``)\nas well as self. 
Typical use includes initializing the parameters of a model\n(see also :ref:`nn-init-doc`).\n\n#### Args:\n\nfn (:class:`Module` -> None): function to be applied to each submodule\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> def init_weights(m):\n    >>>     print(m)\n    >>>     if type(m) == nn.Linear:\n    >>>         m.weight.data.fill_(1.0)\n    >>>         print(m.weight)\n    >>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))\n    >>> net.apply(init_weights)\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    ```\n\n<h3 id=\"buffers\"><code><a name=\"buffers\">buffers</a></code></h3>\n\n``` python\nbuffers(recurse=True)\n```\n\nReturns an iterator over module buffers.\n\n\n#### Args:\n\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. 
Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`torch.Tensor`</b>: module buffer\n\nExample::\n\n    ```\n    >>> for buf in model.buffers():\n    >>>     print(type(buf.data), buf.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"children\"><code><a name=\"children\">children</a></code></h3>\n\n``` python\nchildren()\n```\n\nReturns an iterator over immediate children modules.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a child module\n\n<h3 id=\"cpu\"><code><a name=\"cpu\">cpu</a></code></h3>\n\n``` python\ncpu()\n```\n\nMoves all model parameters and buffers to the CPU.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"cuda\"><code><a name=\"cuda\">cuda</a></code></h3>\n\n``` python\ncuda(device=None)\n```\n\nMoves all model parameters and buffers to the GPU.\n\nThis also makes associated parameters and buffers different objects. So\nit should be called before constructing optimizer if the module will\nlive on GPU while being optimized.\n\n#### Arguments:\n\ndevice (int, optional): if specified, all parameters will be\n    copied to that device\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"double\"><code><a name=\"double\">double</a></code></h3>\n\n``` python\ndouble()\n```\n\nCasts all floating point parameters and buffers to ``double`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"eval\"><code><a name=\"eval\">eval</a></code></h3>\n\n``` python\neval()\n```\n\nSets the module in evaluation mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. 
:class:`Dropout`, :class:`BatchNorm`,\netc.\n\nThis is equivalent with :meth:`self.train(False) <torch.nn.Module.train>`.\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"extra_repr\"><code><a name=\"extra_repr\">extra_repr</a></code></h3>\n\n``` python\nextra_repr()\n```\n\nSet the extra representation of the module\n\nTo print customized extra information, you should reimplement\nthis method in your own modules. Both single-line and multi-line\nstrings are acceptable.\n\n<h3 id=\"float\"><code><a name=\"float\">float</a></code></h3>\n\n``` python\nfloat()\n```\n\nCasts all floating point parameters and buffers to float datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"forward\"><code><a name=\"forward\">forward</a></code></h3>\n\n``` python\nforward(\n    input,\n    state=None,\n    lengths=None\n)\n```\n\nRuns a forward pass of the LSTM layer.\n\n\n#### Arguments:\n\n\n* <b>`input`</b>: Tensor, a batch of input sequences to pass through the LSTM.\n  Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n  `False`, otherwise (batch_size, seq_len, input_size).\n* <b>`state`</b>: (optional) tuple of Tensors `(h_0, c_0)`, the initial hidden\n  and cell states for each batch element in `input`. Each has dimensions\n  (1, batch_size, hidden_size). Defaults to zeros.\n* <b>`lengths`</b>: (optional) Tensor, list of sequence lengths for each batch\n  element. Dimension (batch_size). This argument may be omitted if\n  all batch elements are unpadded and have the same sequence length.\n\n\n#### Returns:\n\n\n* <b>`output`</b>: Tensor, the output of the LSTM layer. Dimensions\n  (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n  or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n  that if `lengths` was specified, the `output` tensor will not be\n  masked. It's the caller's responsibility to either not use the invalid\n  entries or to mask them out before using them.\n* <b>`(h_n, c_n)`</b>: the hidden and cell states, respectively, for the last\n  sequence item. 
Dimensions (1, batch_size, hidden_size).\n\n<h3 id=\"from_native_weights\"><code><a name=\"from_native_weights\">from_native_weights</a></code></h3>\n\n``` python\nfrom_native_weights(\n    weight_ih_l0,\n    weight_hh_l0,\n    bias_ih_l0,\n    bias_hh_l0\n)\n```\n\nCopies and converts the provided PyTorch LSTM weights into this layer.\n\n\n#### Arguments:\n\n\n* <b>`weight_ih_l0`</b>: Parameter, the input-hidden weights of the PyTorch LSTM layer.\n* <b>`weight_hh_l0`</b>: Parameter, the hidden-hidden weights of the PyTorch LSTM layer.\n* <b>`bias_ih_l0`</b>: Parameter, the input-hidden bias of the PyTorch LSTM layer.\n* <b>`bias_hh_l0`</b>: Parameter, the hidden-hidden bias of the PyTorch LSTM layer.\n\n<h3 id=\"half\"><code><a name=\"half\">half</a></code></h3>\n\n``` python\nhalf()\n```\n\nCasts all floating point parameters and buffers to ``half`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"load_state_dict\"><code><a name=\"load_state_dict\">load_state_dict</a></code></h3>\n\n``` python\nload_state_dict(\n    state_dict,\n    strict=True\n)\n```\n\nCopies parameters and buffers from :attr:`state_dict` into\nthis module and its descendants. If :attr:`strict` is ``True``, then\nthe keys of :attr:`state_dict` must exactly match the keys returned\nby this module's :meth:`~torch.nn.Module.state_dict` function.\n\n#### Arguments:\n\nstate_dict (dict): a dict containing parameters and\n    persistent buffers.\nstrict (bool, optional): whether to strictly enforce that the keys\n    in :attr:`state_dict` match the keys returned by this module's\n    :meth:`~torch.nn.Module.state_dict` function. 
Default: ``True``\n\n\n\n#### Returns:\n\n``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields:\n    * **missing_keys** is a list of str containing the missing keys\n    * **unexpected_keys** is a list of str containing the unexpected keys\n\n\n<h3 id=\"modules\"><code><a name=\"modules\">modules</a></code></h3>\n\n``` python\nmodules()\n```\n\nReturns an iterator over all modules in the network.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a module in the network\n\n\n#### Note:\n\nDuplicate modules are returned only once. In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    1 -> Linear(in_features=2, out_features=2, bias=True)\n\n<h3 id=\"named_buffers\"><code><a name=\"named_buffers\">named_buffers</a></code></h3>\n\n``` python\nnamed_buffers(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module buffers, yielding both the\nname of the buffer as well as the buffer itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all buffer names.\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. 
Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, torch.Tensor)`</b>: Tuple containing the name and buffer\n\nExample::\n\n    ```\n    >>> for name, buf in self.named_buffers():\n    >>>    if name in ['running_var']:\n    >>>        print(buf.size())\n    ```\n\n<h3 id=\"named_children\"><code><a name=\"named_children\">named_children</a></code></h3>\n\n``` python\nnamed_children()\n```\n\nReturns an iterator over immediate children modules, yielding both\nthe name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple containing a name and child module\n\nExample::\n\n    ```\n    >>> for name, module in model.named_children():\n    >>>     if name in ['conv4', 'conv5']:\n    >>>         print(module)\n    ```\n\n<h3 id=\"named_modules\"><code><a name=\"named_modules\">named_modules</a></code></h3>\n\n``` python\nnamed_modules(\n    memo=None,\n    prefix=''\n)\n```\n\nReturns an iterator over all modules in the network, yielding\nboth the name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple of name and module\n\n\n#### Note:\n\nDuplicate modules are returned only once. 
In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.named_modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> ('', Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    ))\n    1 -> ('0', Linear(in_features=2, out_features=2, bias=True))\n\n<h3 id=\"named_parameters\"><code><a name=\"named_parameters\">named_parameters</a></code></h3>\n\n``` python\nnamed_parameters(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module parameters, yielding both the\nname of the parameter as well as the parameter itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all parameter names.\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, Parameter)`</b>: Tuple containing the name and parameter\n\nExample::\n\n    ```\n    >>> for name, param in self.named_parameters():\n    >>>    if name in ['bias']:\n    >>>        print(param.size())\n    ```\n\n<h3 id=\"parameters\"><code><a name=\"parameters\">parameters</a></code></h3>\n\n``` python\nparameters(recurse=True)\n```\n\nReturns an iterator over module parameters.\n\nThis is typically passed to an optimizer.\n\n#### Args:\n\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. 
Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`Parameter`</b>: module parameter\n\nExample::\n\n    ```\n    >>> for param in model.parameters():\n    >>>     print(type(param.data), param.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"register_backward_hook\"><code><a name=\"register_backward_hook\">register_backward_hook</a></code></h3>\n\n``` python\nregister_backward_hook(hook)\n```\n\nRegisters a backward hook on the module.\n\nThe hook will be called every time the gradients with respect to module\ninputs are computed. The hook should have the following signature::\n\n    hook(module, grad_input, grad_output) -> Tensor or None\n\nThe :attr:`grad_input` and :attr:`grad_output` may be tuples if the\nmodule has multiple inputs or outputs. The hook should not modify its\narguments, but it can optionally return a new gradient with respect to\ninput that will be used in place of :attr:`grad_input` in subsequent\ncomputations.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n.. warning ::\n\n    The current implementation will not have the presented behavior\n    for complex :class:`Module` that perform many operations.\n    In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only\n    contain the gradients for a subset of the inputs and outputs.\n    For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`\n    directly on a specific input or output to get the required gradients.\n\n<h3 id=\"register_buffer\"><code><a name=\"register_buffer\">register_buffer</a></code></h3>\n\n``` python\nregister_buffer(\n    name,\n    tensor\n)\n```\n\nAdds a persistent buffer to the module.\n\nThis is typically used to register a buffer that should not to be\nconsidered a model parameter. 
For example, BatchNorm's ``running_mean``\nis not a parameter, but is part of the persistent state.\n\nBuffers can be accessed as attributes using given names.\n\n#### Args:\n\nname (string): name of the buffer. The buffer can be accessed\n    from this module using the given name\ntensor (Tensor): buffer to be registered.\n\n\nExample::\n\n    ```\n    >>> self.register_buffer('running_mean', torch.zeros(num_features))\n    ```\n\n<h3 id=\"register_forward_hook\"><code><a name=\"register_forward_hook\">register_forward_hook</a></code></h3>\n\n``` python\nregister_forward_hook(hook)\n```\n\nRegisters a forward hook on the module.\n\nThe hook will be called every time after :func:`forward` has computed an output.\nIt should have the following signature::\n\n    hook(module, input, output) -> None or modified output\n\nThe hook can modify the output. It can modify the input inplace but\nit will not have effect on forward since this is called after\n:func:`forward` is called.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_forward_pre_hook\"><code><a name=\"register_forward_pre_hook\">register_forward_pre_hook</a></code></h3>\n\n``` python\nregister_forward_pre_hook(hook)\n```\n\nRegisters a forward pre-hook on the module.\n\nThe hook will be called every time before :func:`forward` is invoked.\nIt should have the following signature::\n\n    hook(module, input) -> None or modified input\n\nThe hook can modify the input. User can either return a tuple or a\nsingle modified value in the hook. 
We will wrap the value into a tuple\nif a single value is returned(unless that value is already a tuple).\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_parameter\"><code><a name=\"register_parameter\">register_parameter</a></code></h3>\n\n``` python\nregister_parameter(\n    name,\n    param\n)\n```\n\nAdds a parameter to the module.\n\nThe parameter can be accessed as an attribute using given name.\n\n#### Args:\n\nname (string): name of the parameter. The parameter can be accessed\n    from this module using the given name\nparam (Parameter): parameter to be added to the module.\n\n\n<h3 id=\"requires_grad_\"><code><a name=\"requires_grad_\">requires_grad_</a></code></h3>\n\n``` python\nrequires_grad_(requires_grad=True)\n```\n\nChange if autograd should record operations on parameters in this\nmodule.\n\nThis method sets the parameters' :attr:`requires_grad` attributes\nin-place.\n\nThis method is helpful for freezing part of the module for finetuning\nor training parts of a model individually (e.g., GAN training).\n\n#### Args:\n\nrequires_grad (bool): whether autograd should record operations on\n                      parameters in this module. Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"reset_parameters\"><code><a name=\"reset_parameters\">reset_parameters</a></code></h3>\n\n``` python\nreset_parameters()\n```\n\nResets this layer's parameters to their initial values.\n\n\n<h3 id=\"share_memory\"><code><a name=\"share_memory\">share_memory</a></code></h3>\n\n``` python\nshare_memory()\n```\n\n\n\n\n<h3 id=\"state_dict\"><code><a name=\"state_dict\">state_dict</a></code></h3>\n\n``` python\nstate_dict(\n    destination=None,\n    prefix='',\n    keep_vars=False\n)\n```\n\nReturns a dictionary containing a whole state of the module.\n\nBoth parameters and persistent buffers (e.g. 
running averages) are\nincluded. Keys are corresponding parameter and buffer names.\n\n#### Returns:\n\n\n* <b>`dict`</b>:     a dictionary containing a whole state of the module\n\nExample::\n\n    ```\n    >>> module.state_dict().keys()\n    ['bias', 'weight']\n    ```\n\n<h3 id=\"to\"><code><a name=\"to\">to</a></code></h3>\n\n``` python\nto(\n    *args,\n    **kwargs\n)\n```\n\nMoves and/or casts the parameters and buffers.\n\nThis can be called as\n\n.. function:: to(device=None, dtype=None, non_blocking=False)\n\n.. function:: to(dtype, non_blocking=False)\n\n.. function:: to(tensor, non_blocking=False)\n\nIts signature is similar to :meth:`torch.Tensor.to`, but only accepts\nfloating point desired :attr:`dtype` s. In addition, this method will\nonly cast the floating point parameters and buffers to :attr:`dtype`\n(if given). The integral parameters and buffers will be moved\n:attr:`device`, if that is given, but with dtypes unchanged. When\n:attr:`non_blocking` is set, it tries to convert/move asynchronously\nwith respect to the host if possible, e.g., moving CPU Tensors with\npinned memory to CUDA devices.\n\nSee below for examples.\n\n.. 
note::\n    This method modifies the module in-place.\n\n#### Args:\n\ndevice (:class:`torch.device`): the desired device of the parameters\n    and buffers in this module\ndtype (:class:`torch.dtype`): the desired floating point type of\n    the floating point parameters and buffers in this module\ntensor (torch.Tensor): Tensor whose dtype and device are the desired\n    dtype and device for all parameters and buffers in this module\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> linear = nn.Linear(2, 2)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]])\n    >>> linear.to(torch.double)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]], dtype=torch.float64)\n    >>> gpu1 = torch.device(\"cuda:1\")\n    >>> linear.to(gpu1, dtype=torch.half, non_blocking=True)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')\n    >>> cpu = torch.device(\"cpu\")\n    >>> linear.to(cpu)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16)\n    ```\n\n<h3 id=\"to_native_weights\"><code><a name=\"to_native_weights\">to_native_weights</a></code></h3>\n\n``` python\nto_native_weights()\n```\n\nConverts Haste LSTM weights to native PyTorch LSTM weights.\n\n\n#### Returns:\n\n\n* <b>`weight_ih_l0`</b>: Parameter, the input-hidden weights of the LSTM layer.\n* <b>`weight_hh_l0`</b>: Parameter, the hidden-hidden weights of the LSTM layer.\n* <b>`bias_ih_l0`</b>: Parameter, the input-hidden bias of the LSTM layer.\n* <b>`bias_hh_l0`</b>: Parameter, the hidden-hidden bias of the LSTM 
layer.\n\n<h3 id=\"train\"><code><a name=\"train\">train</a></code></h3>\n\n``` python\ntrain(mode=True)\n```\n\nSets the module in training mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,\netc.\n\n#### Args:\n\nmode (bool): whether to set training mode (``True``) or evaluation\n             mode (``False``). Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"type\"><code><a name=\"type\">type</a></code></h3>\n\n``` python\ntype(dst_type)\n```\n\nCasts all parameters and buffers to :attr:`dst_type`.\n\n\n#### Arguments:\n\ndst_type (type or string): the desired type\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"zero_grad\"><code><a name=\"zero_grad\">zero_grad</a></code></h3>\n\n``` python\nzero_grad()\n```\n\nSets gradients of all model parameters to zero.\n\n\n\n\n"
  },
  {
    "path": "docs/pytorch/haste_pytorch/LayerNormGRU.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_pytorch.LayerNormGRU\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"add_module\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"buffers\"/>\n<meta itemprop=\"property\" content=\"children\"/>\n<meta itemprop=\"property\" content=\"cpu\"/>\n<meta itemprop=\"property\" content=\"cuda\"/>\n<meta itemprop=\"property\" content=\"double\"/>\n<meta itemprop=\"property\" content=\"eval\"/>\n<meta itemprop=\"property\" content=\"extra_repr\"/>\n<meta itemprop=\"property\" content=\"float\"/>\n<meta itemprop=\"property\" content=\"forward\"/>\n<meta itemprop=\"property\" content=\"half\"/>\n<meta itemprop=\"property\" content=\"load_state_dict\"/>\n<meta itemprop=\"property\" content=\"modules\"/>\n<meta itemprop=\"property\" content=\"named_buffers\"/>\n<meta itemprop=\"property\" content=\"named_children\"/>\n<meta itemprop=\"property\" content=\"named_modules\"/>\n<meta itemprop=\"property\" content=\"named_parameters\"/>\n<meta itemprop=\"property\" content=\"parameters\"/>\n<meta itemprop=\"property\" content=\"register_backward_hook\"/>\n<meta itemprop=\"property\" content=\"register_buffer\"/>\n<meta itemprop=\"property\" content=\"register_forward_hook\"/>\n<meta itemprop=\"property\" content=\"register_forward_pre_hook\"/>\n<meta itemprop=\"property\" content=\"register_parameter\"/>\n<meta itemprop=\"property\" content=\"requires_grad_\"/>\n<meta itemprop=\"property\" content=\"reset_parameters\"/>\n<meta itemprop=\"property\" content=\"share_memory\"/>\n<meta itemprop=\"property\" content=\"state_dict\"/>\n<meta itemprop=\"property\" content=\"to\"/>\n<meta itemprop=\"property\" content=\"train\"/>\n<meta itemprop=\"property\" 
content=\"type\"/>\n<meta itemprop=\"property\" content=\"zero_grad\"/>\n</div>\n\n# haste_pytorch.LayerNormGRU\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNormGRU`\n\nLayer Normalized Gated Recurrent Unit layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis GRU layer applies layer normalization to the input and recurrent output\nactivations of a standard GRU. The implementation is fused and\nGPU-accelerated. There are two commonly-used variants of GRU cells. This one\nimplements 1406.1078v1 which applies the reset gate to the hidden state\nafter matrix multiplication. The other variant, 1406.1078v3, applies the\nreset gate before matrix multiplication and is currently unsupported.\n\nThis layer has built-in support for DropConnect and Zoneout, which are\nboth techniques used to regularize RNNs.\n\nSee [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    input_size,\n    hidden_size,\n    batch_first=False,\n    dropout=0.0,\n    zoneout=0.0\n)\n```\n\nInitialize the parameters of the GRU layer.\n\n\n#### Arguments:\n\n\n* <b>`input_size`</b>: int, the feature dimension of the input.\n* <b>`hidden_size`</b>: int, the feature dimension of the output.\n* <b>`batch_first`</b>: (optional) bool, if `True`, then the input and output\n  tensors are provided as `(batch, seq, feature)`.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization.\n\n\n#### Variables:\n\n\n* <b>`kernel`</b>: the input projection weight matrix. Dimensions\n  (input_size, hidden_size * 3) with `z,r,h` gate layout. Initialized\n  with Xavier uniform initialization.\n* <b>`recurrent_kernel`</b>: the recurrent projection weight matrix. Dimensions\n  (hidden_size, hidden_size * 3) with `z,r,h` gate layout. 
Initialized\n  with orthogonal initialization.\n* <b>`bias`</b>: the input projection bias vector. Dimensions (hidden_size * 3) with\n  `z,r,h` gate layout. Initialized to zeros.\n* <b>`recurrent_bias`</b>: the recurrent projection bias vector. Dimensions\n  (hidden_size * 3) with `z,r,h` gate layout. Initialized to zeros.\n* <b>`gamma`</b>: the input and recurrent normalization gain. Dimensions\n  (2, hidden_size * 3) with `gamma[0]` specifying the input gain and\n  `gamma[1]` specifying the recurrent gain. Initialized to ones.\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    *input,\n    **kwargs\n)\n```\n\nCall self as a function.\n\n\n<h3 id=\"add_module\"><code><a name=\"add_module\">add_module</a></code></h3>\n\n``` python\nadd_module(\n    name,\n    module\n)\n```\n\nAdds a child module to the current module.\n\nThe module can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the child module. The child module can be\n    accessed from this module using the given name\nmodule (Module): child module to be added to the module.\n\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(fn)\n```\n\nApplies ``fn`` recursively to every submodule (as returned by ``.children()``)\nas well as self. 
Typical use includes initializing the parameters of a model\n(see also :ref:`nn-init-doc`).\n\n#### Args:\n\nfn (:class:`Module` -> None): function to be applied to each submodule\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> def init_weights(m):\n    >>>     print(m)\n    >>>     if type(m) == nn.Linear:\n    >>>         m.weight.data.fill_(1.0)\n    >>>         print(m.weight)\n    >>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))\n    >>> net.apply(init_weights)\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    ```\n\n<h3 id=\"buffers\"><code><a name=\"buffers\">buffers</a></code></h3>\n\n``` python\nbuffers(recurse=True)\n```\n\nReturns an iterator over module buffers.\n\n\n#### Args:\n\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. 
Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`torch.Tensor`</b>: module buffer\n\nExample::\n\n    ```\n    >>> for buf in model.buffers():\n    >>>     print(type(buf.data), buf.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"children\"><code><a name=\"children\">children</a></code></h3>\n\n``` python\nchildren()\n```\n\nReturns an iterator over immediate children modules.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a child module\n\n<h3 id=\"cpu\"><code><a name=\"cpu\">cpu</a></code></h3>\n\n``` python\ncpu()\n```\n\nMoves all model parameters and buffers to the CPU.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"cuda\"><code><a name=\"cuda\">cuda</a></code></h3>\n\n``` python\ncuda(device=None)\n```\n\nMoves all model parameters and buffers to the GPU.\n\nThis also makes associated parameters and buffers different objects. So\nit should be called before constructing optimizer if the module will\nlive on GPU while being optimized.\n\n#### Arguments:\n\ndevice (int, optional): if specified, all parameters will be\n    copied to that device\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"double\"><code><a name=\"double\">double</a></code></h3>\n\n``` python\ndouble()\n```\n\nCasts all floating point parameters and buffers to ``double`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"eval\"><code><a name=\"eval\">eval</a></code></h3>\n\n``` python\neval()\n```\n\nSets the module in evaluation mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. 
:class:`Dropout`, :class:`BatchNorm`,\netc.\n\nThis is equivalent with :meth:`self.train(False) <torch.nn.Module.train>`.\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"extra_repr\"><code><a name=\"extra_repr\">extra_repr</a></code></h3>\n\n``` python\nextra_repr()\n```\n\nSet the extra representation of the module\n\nTo print customized extra information, you should reimplement\nthis method in your own modules. Both single-line and multi-line\nstrings are acceptable.\n\n<h3 id=\"float\"><code><a name=\"float\">float</a></code></h3>\n\n``` python\nfloat()\n```\n\nCasts all floating point parameters and buffers to float datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"forward\"><code><a name=\"forward\">forward</a></code></h3>\n\n``` python\nforward(\n    input,\n    state=None,\n    lengths=None\n)\n```\n\nRuns a forward pass of the GRU layer.\n\n\n#### Arguments:\n\n\n* <b>`input`</b>: Tensor, a batch of input sequences to pass through the GRU.\n  Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n  `False`, otherwise (batch_size, seq_len, input_size).\n* <b>`state`</b>: (optional) Tensor, the initial state for each batch element in\n  `input`. Dimensions (1, batch_size, hidden_size). Defaults to zeros.\n* <b>`lengths`</b>: (optional) Tensor, list of sequence lengths for each batch\n  element. Dimension (batch_size). This argument may be omitted if\n  all batch elements are unpadded and have the same sequence length.\n\n\n#### Returns:\n\n\n* <b>`output`</b>: Tensor, the output of the GRU layer. Dimensions\n  (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n  or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n  that if `lengths` was specified, the `output` tensor will not be\n  masked. It's the caller's responsibility to either not use the invalid\n  entries or to mask them out before using them.\n* <b>`h_n`</b>: the hidden state for the last sequence item. 
Dimensions\n  (1, batch_size, hidden_size).\n\n<h3 id=\"half\"><code><a name=\"half\">half</a></code></h3>\n\n``` python\nhalf()\n```\n\nCasts all floating point parameters and buffers to ``half`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"load_state_dict\"><code><a name=\"load_state_dict\">load_state_dict</a></code></h3>\n\n``` python\nload_state_dict(\n    state_dict,\n    strict=True\n)\n```\n\nCopies parameters and buffers from :attr:`state_dict` into\nthis module and its descendants. If :attr:`strict` is ``True``, then\nthe keys of :attr:`state_dict` must exactly match the keys returned\nby this module's :meth:`~torch.nn.Module.state_dict` function.\n\n#### Arguments:\n\nstate_dict (dict): a dict containing parameters and\n    persistent buffers.\nstrict (bool, optional): whether to strictly enforce that the keys\n    in :attr:`state_dict` match the keys returned by this module's\n    :meth:`~torch.nn.Module.state_dict` function. Default: ``True``\n\n\n\n#### Returns:\n\n``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields:\n    * **missing_keys** is a list of str containing the missing keys\n    * **unexpected_keys** is a list of str containing the unexpected keys\n\n\n<h3 id=\"modules\"><code><a name=\"modules\">modules</a></code></h3>\n\n``` python\nmodules()\n```\n\nReturns an iterator over all modules in the network.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a module in the network\n\n\n#### Note:\n\nDuplicate modules are returned only once. 
In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    1 -> Linear(in_features=2, out_features=2, bias=True)\n\n<h3 id=\"named_buffers\"><code><a name=\"named_buffers\">named_buffers</a></code></h3>\n\n``` python\nnamed_buffers(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module buffers, yielding both the\nname of the buffer as well as the buffer itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all buffer names.\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, torch.Tensor)`</b>: Tuple containing the name and buffer\n\nExample::\n\n    ```\n    >>> for name, buf in self.named_buffers():\n    >>>    if name in ['running_var']:\n    >>>        print(buf.size())\n    ```\n\n<h3 id=\"named_children\"><code><a name=\"named_children\">named_children</a></code></h3>\n\n``` python\nnamed_children()\n```\n\nReturns an iterator over immediate children modules, yielding both\nthe name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple containing a name and child module\n\nExample::\n\n    ```\n    >>> for name, module in model.named_children():\n    >>>     if name in ['conv4', 'conv5']:\n    >>>         print(module)\n    ```\n\n<h3 id=\"named_modules\"><code><a name=\"named_modules\">named_modules</a></code></h3>\n\n``` python\nnamed_modules(\n    memo=None,\n    prefix=''\n)\n```\n\nReturns an iterator over all modules in the network, yielding\nboth the name of the module as well as the module 
itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple of name and module\n\n\n#### Note:\n\nDuplicate modules are returned only once. In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.named_modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> ('', Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    ))\n    1 -> ('0', Linear(in_features=2, out_features=2, bias=True))\n\n<h3 id=\"named_parameters\"><code><a name=\"named_parameters\">named_parameters</a></code></h3>\n\n``` python\nnamed_parameters(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module parameters, yielding both the\nname of the parameter as well as the parameter itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all parameter names.\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, Parameter)`</b>: Tuple containing the name and parameter\n\nExample::\n\n    ```\n    >>> for name, param in self.named_parameters():\n    >>>    if name in ['bias']:\n    >>>        print(param.size())\n    ```\n\n<h3 id=\"parameters\"><code><a name=\"parameters\">parameters</a></code></h3>\n\n``` python\nparameters(recurse=True)\n```\n\nReturns an iterator over module parameters.\n\nThis is typically passed to an optimizer.\n\n#### Args:\n\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. 
Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`Parameter`</b>: module parameter\n\nExample::\n\n    ```\n    >>> for param in model.parameters():\n    >>>     print(type(param.data), param.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"register_backward_hook\"><code><a name=\"register_backward_hook\">register_backward_hook</a></code></h3>\n\n``` python\nregister_backward_hook(hook)\n```\n\nRegisters a backward hook on the module.\n\nThe hook will be called every time the gradients with respect to module\ninputs are computed. The hook should have the following signature::\n\n    hook(module, grad_input, grad_output) -> Tensor or None\n\nThe :attr:`grad_input` and :attr:`grad_output` may be tuples if the\nmodule has multiple inputs or outputs. The hook should not modify its\narguments, but it can optionally return a new gradient with respect to\ninput that will be used in place of :attr:`grad_input` in subsequent\ncomputations.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n.. warning ::\n\n    The current implementation will not have the presented behavior\n    for complex :class:`Module` that perform many operations.\n    In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only\n    contain the gradients for a subset of the inputs and outputs.\n    For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`\n    directly on a specific input or output to get the required gradients.\n\n<h3 id=\"register_buffer\"><code><a name=\"register_buffer\">register_buffer</a></code></h3>\n\n``` python\nregister_buffer(\n    name,\n    tensor\n)\n```\n\nAdds a persistent buffer to the module.\n\nThis is typically used to register a buffer that should not be\nconsidered a model parameter. 
For example, BatchNorm's ``running_mean``\nis not a parameter, but is part of the persistent state.\n\nBuffers can be accessed as attributes using given names.\n\n#### Args:\n\nname (string): name of the buffer. The buffer can be accessed\n    from this module using the given name\ntensor (Tensor): buffer to be registered.\n\n\nExample::\n\n    ```\n    >>> self.register_buffer('running_mean', torch.zeros(num_features))\n    ```\n\n<h3 id=\"register_forward_hook\"><code><a name=\"register_forward_hook\">register_forward_hook</a></code></h3>\n\n``` python\nregister_forward_hook(hook)\n```\n\nRegisters a forward hook on the module.\n\nThe hook will be called every time after :func:`forward` has computed an output.\nIt should have the following signature::\n\n    hook(module, input, output) -> None or modified output\n\nThe hook can modify the output. It can modify the input in-place but\nit will not have an effect on forward since this is called after\n:func:`forward` is called.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_forward_pre_hook\"><code><a name=\"register_forward_pre_hook\">register_forward_pre_hook</a></code></h3>\n\n``` python\nregister_forward_pre_hook(hook)\n```\n\nRegisters a forward pre-hook on the module.\n\nThe hook will be called every time before :func:`forward` is invoked.\nIt should have the following signature::\n\n    hook(module, input) -> None or modified input\n\nThe hook can modify the input. The user can either return a tuple or a\nsingle modified value in the hook. 
We will wrap the value into a tuple\nif a single value is returned (unless that value is already a tuple).\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_parameter\"><code><a name=\"register_parameter\">register_parameter</a></code></h3>\n\n``` python\nregister_parameter(\n    name,\n    param\n)\n```\n\nAdds a parameter to the module.\n\nThe parameter can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the parameter. The parameter can be accessed\n    from this module using the given name\nparam (Parameter): parameter to be added to the module.\n\n\n<h3 id=\"requires_grad_\"><code><a name=\"requires_grad_\">requires_grad_</a></code></h3>\n\n``` python\nrequires_grad_(requires_grad=True)\n```\n\nChange whether autograd should record operations on parameters in this\nmodule.\n\nThis method sets the parameters' :attr:`requires_grad` attributes\nin-place.\n\nThis method is helpful for freezing part of the module for finetuning\nor training parts of a model individually (e.g., GAN training).\n\n#### Args:\n\nrequires_grad (bool): whether autograd should record operations on\n                      parameters in this module. Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"reset_parameters\"><code><a name=\"reset_parameters\">reset_parameters</a></code></h3>\n\n``` python\nreset_parameters()\n```\n\nResets this layer's parameters to their initial values.\n\n\n<h3 id=\"share_memory\"><code><a name=\"share_memory\">share_memory</a></code></h3>\n\n``` python\nshare_memory()\n```\n\n\n\n\n<h3 id=\"state_dict\"><code><a name=\"state_dict\">state_dict</a></code></h3>\n\n``` python\nstate_dict(\n    destination=None,\n    prefix='',\n    keep_vars=False\n)\n```\n\nReturns a dictionary containing a whole state of the module.\n\nBoth parameters and persistent buffers (e.g. 
running averages) are\nincluded. Keys are corresponding parameter and buffer names.\n\n#### Returns:\n\n\n* <b>`dict`</b>:     a dictionary containing a whole state of the module\n\nExample::\n\n    ```\n    >>> module.state_dict().keys()\n    ['bias', 'weight']\n    ```\n\n<h3 id=\"to\"><code><a name=\"to\">to</a></code></h3>\n\n``` python\nto(\n    *args,\n    **kwargs\n)\n```\n\nMoves and/or casts the parameters and buffers.\n\nThis can be called as\n\n.. function:: to(device=None, dtype=None, non_blocking=False)\n\n.. function:: to(dtype, non_blocking=False)\n\n.. function:: to(tensor, non_blocking=False)\n\nIts signature is similar to :meth:`torch.Tensor.to`, but only accepts\nfloating point desired :attr:`dtype` s. In addition, this method will\nonly cast the floating point parameters and buffers to :attr:`dtype`\n(if given). The integral parameters and buffers will be moved to\n:attr:`device`, if that is given, but with dtypes unchanged. When\n:attr:`non_blocking` is set, it tries to convert/move asynchronously\nwith respect to the host if possible, e.g., moving CPU Tensors with\npinned memory to CUDA devices.\n\nSee below for examples.\n\n.. 
note::\n    This method modifies the module in-place.\n\n#### Args:\n\ndevice (:class:`torch.device`): the desired device of the parameters\n    and buffers in this module\ndtype (:class:`torch.dtype`): the desired floating point type of\n    the floating point parameters and buffers in this module\ntensor (torch.Tensor): Tensor whose dtype and device are the desired\n    dtype and device for all parameters and buffers in this module\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> linear = nn.Linear(2, 2)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]])\n    >>> linear.to(torch.double)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]], dtype=torch.float64)\n    >>> gpu1 = torch.device(\"cuda:1\")\n    >>> linear.to(gpu1, dtype=torch.half, non_blocking=True)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')\n    >>> cpu = torch.device(\"cpu\")\n    >>> linear.to(cpu)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16)\n    ```\n\n<h3 id=\"train\"><code><a name=\"train\">train</a></code></h3>\n\n``` python\ntrain(mode=True)\n```\n\nSets the module in training mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,\netc.\n\n#### Args:\n\nmode (bool): whether to set training mode (``True``) or evaluation\n             mode (``False``). 
Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"type\"><code><a name=\"type\">type</a></code></h3>\n\n``` python\ntype(dst_type)\n```\n\nCasts all parameters and buffers to :attr:`dst_type`.\n\n\n#### Arguments:\n\ndst_type (type or string): the desired type\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"zero_grad\"><code><a name=\"zero_grad\">zero_grad</a></code></h3>\n\n``` python\nzero_grad()\n```\n\nSets gradients of all model parameters to zero.\n\n\n\n\n"
  },
  {
    "path": "docs/pytorch/haste_pytorch/LayerNormLSTM.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_pytorch.LayerNormLSTM\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"add_module\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"buffers\"/>\n<meta itemprop=\"property\" content=\"children\"/>\n<meta itemprop=\"property\" content=\"cpu\"/>\n<meta itemprop=\"property\" content=\"cuda\"/>\n<meta itemprop=\"property\" content=\"double\"/>\n<meta itemprop=\"property\" content=\"eval\"/>\n<meta itemprop=\"property\" content=\"extra_repr\"/>\n<meta itemprop=\"property\" content=\"float\"/>\n<meta itemprop=\"property\" content=\"forward\"/>\n<meta itemprop=\"property\" content=\"half\"/>\n<meta itemprop=\"property\" content=\"load_state_dict\"/>\n<meta itemprop=\"property\" content=\"modules\"/>\n<meta itemprop=\"property\" content=\"named_buffers\"/>\n<meta itemprop=\"property\" content=\"named_children\"/>\n<meta itemprop=\"property\" content=\"named_modules\"/>\n<meta itemprop=\"property\" content=\"named_parameters\"/>\n<meta itemprop=\"property\" content=\"parameters\"/>\n<meta itemprop=\"property\" content=\"register_backward_hook\"/>\n<meta itemprop=\"property\" content=\"register_buffer\"/>\n<meta itemprop=\"property\" content=\"register_forward_hook\"/>\n<meta itemprop=\"property\" content=\"register_forward_pre_hook\"/>\n<meta itemprop=\"property\" content=\"register_parameter\"/>\n<meta itemprop=\"property\" content=\"requires_grad_\"/>\n<meta itemprop=\"property\" content=\"reset_parameters\"/>\n<meta itemprop=\"property\" content=\"share_memory\"/>\n<meta itemprop=\"property\" content=\"state_dict\"/>\n<meta itemprop=\"property\" content=\"to\"/>\n<meta itemprop=\"property\" content=\"train\"/>\n<meta itemprop=\"property\" 
content=\"type\"/>\n<meta itemprop=\"property\" content=\"zero_grad\"/>\n</div>\n\n# haste_pytorch.LayerNormLSTM\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNormLSTM`\n\nLayer Normalized Long Short-Term Memory layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis LSTM layer applies layer normalization to the input, recurrent, and\noutput activations of a standard LSTM. The implementation is fused and\nGPU-accelerated. DropConnect and Zoneout regularization are built-in, and\nthis layer allows setting a non-zero initial forget gate bias.\n\nDetails about the exact function this layer implements can be found at\nhttps://github.com/lmnt-com/haste/issues/1.\n\nSee [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    input_size,\n    hidden_size,\n    batch_first=False,\n    forget_bias=1.0,\n    dropout=0.0,\n    zoneout=0.0\n)\n```\n\nInitialize the parameters of the LSTM layer.\n\n\n#### Arguments:\n\n\n* <b>`input_size`</b>: int, the feature dimension of the input.\n* <b>`hidden_size`</b>: int, the feature dimension of the output.\n* <b>`batch_first`</b>: (optional) bool, if `True`, then the input and output\n  tensors are provided as `(batch, seq, feature)`.\n* <b>`forget_bias`</b>: (optional) float, sets the initial bias of the forget gate\n  for this LSTM cell.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization.\n\n\n#### Variables:\n\n\n* <b>`kernel`</b>: the input projection weight matrix. Dimensions\n  (input_size, hidden_size * 4) with `i,g,f,o` gate layout. Initialized\n  with Xavier uniform initialization.\n* <b>`recurrent_kernel`</b>: the recurrent projection weight matrix. Dimensions\n  (hidden_size, hidden_size * 4) with `i,g,f,o` gate layout. 
Initialized\n  with orthogonal initialization.\n* <b>`bias`</b>: the projection bias vector. Dimensions (hidden_size * 4) with\n  `i,g,f,o` gate layout. The forget gate biases are initialized to\n  `forget_bias` and the rest are zeros.\n* <b>`gamma`</b>: the input and recurrent normalization gain. Dimensions\n  (2, hidden_size * 4) with `gamma[0]` specifying the input gain and\n  `gamma[1]` specifying the recurrent gain. Initialized to ones.\n* <b>`gamma_h`</b>: the output normalization gain. Dimensions (hidden_size).\n  Initialized to ones.\n* <b>`beta_h`</b>: the output normalization bias. Dimensions (hidden_size).\n  Initialized to zeros.\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    *input,\n    **kwargs\n)\n```\n\nCall self as a function.\n\n\n<h3 id=\"add_module\"><code><a name=\"add_module\">add_module</a></code></h3>\n\n``` python\nadd_module(\n    name,\n    module\n)\n```\n\nAdds a child module to the current module.\n\nThe module can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the child module. The child module can be\n    accessed from this module using the given name\nmodule (Module): child module to be added to the module.\n\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(fn)\n```\n\nApplies ``fn`` recursively to every submodule (as returned by ``.children()``)\nas well as self. 
Typical use includes initializing the parameters of a model\n(see also :ref:`nn-init-doc`).\n\n#### Args:\n\nfn (:class:`Module` -> None): function to be applied to each submodule\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> def init_weights(m):\n    >>>     print(m)\n    >>>     if type(m) == nn.Linear:\n    >>>         m.weight.data.fill_(1.0)\n    >>>         print(m.weight)\n    >>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))\n    >>> net.apply(init_weights)\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Linear(in_features=2, out_features=2, bias=True)\n    Parameter containing:\n    tensor([[ 1.,  1.],\n            [ 1.,  1.]])\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    ```\n\n<h3 id=\"buffers\"><code><a name=\"buffers\">buffers</a></code></h3>\n\n``` python\nbuffers(recurse=True)\n```\n\nReturns an iterator over module buffers.\n\n\n#### Args:\n\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. 
Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`torch.Tensor`</b>: module buffer\n\nExample::\n\n    ```\n    >>> for buf in model.buffers():\n    >>>     print(type(buf.data), buf.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"children\"><code><a name=\"children\">children</a></code></h3>\n\n``` python\nchildren()\n```\n\nReturns an iterator over immediate children modules.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a child module\n\n<h3 id=\"cpu\"><code><a name=\"cpu\">cpu</a></code></h3>\n\n``` python\ncpu()\n```\n\nMoves all model parameters and buffers to the CPU.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"cuda\"><code><a name=\"cuda\">cuda</a></code></h3>\n\n``` python\ncuda(device=None)\n```\n\nMoves all model parameters and buffers to the GPU.\n\nThis also makes associated parameters and buffers different objects. So\nit should be called before constructing optimizer if the module will\nlive on GPU while being optimized.\n\n#### Arguments:\n\ndevice (int, optional): if specified, all parameters will be\n    copied to that device\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"double\"><code><a name=\"double\">double</a></code></h3>\n\n``` python\ndouble()\n```\n\nCasts all floating point parameters and buffers to ``double`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"eval\"><code><a name=\"eval\">eval</a></code></h3>\n\n``` python\neval()\n```\n\nSets the module in evaluation mode.\n\nThis has any effect only on certain modules. See documentations of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. 
:class:`Dropout`, :class:`BatchNorm`,\netc.\n\nThis is equivalent to :meth:`self.train(False) <torch.nn.Module.train>`.\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"extra_repr\"><code><a name=\"extra_repr\">extra_repr</a></code></h3>\n\n``` python\nextra_repr()\n```\n\nSet the extra representation of the module.\n\nTo print customized extra information, you should reimplement\nthis method in your own modules. Both single-line and multi-line\nstrings are acceptable.\n\n<h3 id=\"float\"><code><a name=\"float\">float</a></code></h3>\n\n``` python\nfloat()\n```\n\nCasts all floating point parameters and buffers to float datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"forward\"><code><a name=\"forward\">forward</a></code></h3>\n\n``` python\nforward(\n    input,\n    state=None,\n    lengths=None\n)\n```\n\nRuns a forward pass of the LSTM layer.\n\n\n#### Arguments:\n\n\n* <b>`input`</b>: Tensor, a batch of input sequences to pass through the LSTM.\n  Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n  `False`, otherwise (batch_size, seq_len, input_size).\n* <b>`state`</b>: (optional) tuple of Tensors, the initial hidden and cell states\n  for each batch element in `input`. Each has dimensions\n  (1, batch_size, hidden_size). Defaults to zeros.\n* <b>`lengths`</b>: (optional) Tensor, list of sequence lengths for each batch\n  element. Dimension (batch_size). This argument may be omitted if\n  all batch elements are unpadded and have the same sequence length.\n\n\n#### Returns:\n\n\n* <b>`output`</b>: Tensor, the output of the LSTM layer. Dimensions\n  (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n  or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n  that if `lengths` was specified, the `output` tensor will not be\n  masked. It's the caller's responsibility to either not use the invalid\n  entries or to mask them out before using them.\n* <b>`(h_n, c_n)`</b>: the hidden and cell states, respectively, for the last\n  sequence item. 
Dimensions (1, batch_size, hidden_size).\n\n<h3 id=\"half\"><code><a name=\"half\">half</a></code></h3>\n\n``` python\nhalf()\n```\n\nCasts all floating point parameters and buffers to ``half`` datatype.\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"load_state_dict\"><code><a name=\"load_state_dict\">load_state_dict</a></code></h3>\n\n``` python\nload_state_dict(\n    state_dict,\n    strict=True\n)\n```\n\nCopies parameters and buffers from :attr:`state_dict` into\nthis module and its descendants. If :attr:`strict` is ``True``, then\nthe keys of :attr:`state_dict` must exactly match the keys returned\nby this module's :meth:`~torch.nn.Module.state_dict` function.\n\n#### Arguments:\n\nstate_dict (dict): a dict containing parameters and\n    persistent buffers.\nstrict (bool, optional): whether to strictly enforce that the keys\n    in :attr:`state_dict` match the keys returned by this module's\n    :meth:`~torch.nn.Module.state_dict` function. Default: ``True``\n\n\n\n#### Returns:\n\n``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields:\n    * **missing_keys** is a list of str containing the missing keys\n    * **unexpected_keys** is a list of str containing the unexpected keys\n\n\n<h3 id=\"modules\"><code><a name=\"modules\">modules</a></code></h3>\n\n``` python\nmodules()\n```\n\nReturns an iterator over all modules in the network.\n\n\n#### Yields:\n\n\n* <b>`Module`</b>: a module in the network\n\n\n#### Note:\n\nDuplicate modules are returned only once. 
In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    )\n    1 -> Linear(in_features=2, out_features=2, bias=True)\n\n<h3 id=\"named_buffers\"><code><a name=\"named_buffers\">named_buffers</a></code></h3>\n\n``` python\nnamed_buffers(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module buffers, yielding both the\nname of the buffer as well as the buffer itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all buffer names.\nrecurse (bool): if True, then yields buffers of this module\n    and all submodules. Otherwise, yields only buffers that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, torch.Tensor)`</b>: Tuple containing the name and buffer\n\nExample::\n\n    ```\n    >>> for name, buf in self.named_buffers():\n    >>>    if name in ['running_var']:\n    >>>        print(buf.size())\n    ```\n\n<h3 id=\"named_children\"><code><a name=\"named_children\">named_children</a></code></h3>\n\n``` python\nnamed_children()\n```\n\nReturns an iterator over immediate children modules, yielding both\nthe name of the module as well as the module itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple containing a name and child module\n\nExample::\n\n    ```\n    >>> for name, module in model.named_children():\n    >>>     if name in ['conv4', 'conv5']:\n    >>>         print(module)\n    ```\n\n<h3 id=\"named_modules\"><code><a name=\"named_modules\">named_modules</a></code></h3>\n\n``` python\nnamed_modules(\n    memo=None,\n    prefix=''\n)\n```\n\nReturns an iterator over all modules in the network, yielding\nboth the name of the module as well as the module 
itself.\n\n#### Yields:\n\n\n* <b>`(string, Module)`</b>: Tuple of name and module\n\n\n#### Note:\n\nDuplicate modules are returned only once. In the following\nexample, ``l`` will be returned only once.\n\n\nExample::\n\n    ```\n    >>> l = nn.Linear(2, 2)\n    >>> net = nn.Sequential(l, l)\n    >>> for idx, m in enumerate(net.named_modules()):\n            print(idx, '->', m)\n    ```\n\n    0 -> ('', Sequential(\n      (0): Linear(in_features=2, out_features=2, bias=True)\n      (1): Linear(in_features=2, out_features=2, bias=True)\n    ))\n    1 -> ('0', Linear(in_features=2, out_features=2, bias=True))\n\n<h3 id=\"named_parameters\"><code><a name=\"named_parameters\">named_parameters</a></code></h3>\n\n``` python\nnamed_parameters(\n    prefix='',\n    recurse=True\n)\n```\n\nReturns an iterator over module parameters, yielding both the\nname of the parameter as well as the parameter itself.\n\n#### Args:\n\nprefix (str): prefix to prepend to all parameter names.\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`(string, Parameter)`</b>: Tuple containing the name and parameter\n\nExample::\n\n    ```\n    >>> for name, param in self.named_parameters():\n    >>>    if name in ['bias']:\n    >>>        print(param.size())\n    ```\n\n<h3 id=\"parameters\"><code><a name=\"parameters\">parameters</a></code></h3>\n\n``` python\nparameters(recurse=True)\n```\n\nReturns an iterator over module parameters.\n\nThis is typically passed to an optimizer.\n\n#### Args:\n\nrecurse (bool): if True, then yields parameters of this module\n    and all submodules. 
Otherwise, yields only parameters that\n    are direct members of this module.\n\n\n\n#### Yields:\n\n\n* <b>`Parameter`</b>: module parameter\n\nExample::\n\n    ```\n    >>> for param in model.parameters():\n    >>>     print(type(param.data), param.size())\n    <class 'torch.FloatTensor'> (20L,)\n    <class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)\n    ```\n\n<h3 id=\"register_backward_hook\"><code><a name=\"register_backward_hook\">register_backward_hook</a></code></h3>\n\n``` python\nregister_backward_hook(hook)\n```\n\nRegisters a backward hook on the module.\n\nThe hook will be called every time the gradients with respect to module\ninputs are computed. The hook should have the following signature::\n\n    hook(module, grad_input, grad_output) -> Tensor or None\n\nThe :attr:`grad_input` and :attr:`grad_output` may be tuples if the\nmodule has multiple inputs or outputs. The hook should not modify its\narguments, but it can optionally return a new gradient with respect to\ninput that will be used in place of :attr:`grad_input` in subsequent\ncomputations.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n.. warning ::\n\n    The current implementation will not have the presented behavior\n    for complex :class:`Module` that perform many operations.\n    In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only\n    contain the gradients for a subset of the inputs and outputs.\n    For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`\n    directly on a specific input or output to get the required gradients.\n\n<h3 id=\"register_buffer\"><code><a name=\"register_buffer\">register_buffer</a></code></h3>\n\n``` python\nregister_buffer(\n    name,\n    tensor\n)\n```\n\nAdds a persistent buffer to the module.\n\nThis is typically used to register a buffer that should not be\nconsidered a model parameter. 
For example, BatchNorm's ``running_mean``\nis not a parameter, but is part of the persistent state.\n\nBuffers can be accessed as attributes using given names.\n\n#### Args:\n\nname (string): name of the buffer. The buffer can be accessed\n    from this module using the given name\ntensor (Tensor): buffer to be registered.\n\n\nExample::\n\n    ```\n    >>> self.register_buffer('running_mean', torch.zeros(num_features))\n    ```\n\n<h3 id=\"register_forward_hook\"><code><a name=\"register_forward_hook\">register_forward_hook</a></code></h3>\n\n``` python\nregister_forward_hook(hook)\n```\n\nRegisters a forward hook on the module.\n\nThe hook will be called every time after :func:`forward` has computed an output.\nIt should have the following signature::\n\n    hook(module, input, output) -> None or modified output\n\nThe hook can modify the output. It can modify the input in-place but\nit will not have an effect on forward since this is called after\n:func:`forward` is called.\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_forward_pre_hook\"><code><a name=\"register_forward_pre_hook\">register_forward_pre_hook</a></code></h3>\n\n``` python\nregister_forward_pre_hook(hook)\n```\n\nRegisters a forward pre-hook on the module.\n\nThe hook will be called every time before :func:`forward` is invoked.\nIt should have the following signature::\n\n    hook(module, input) -> None or modified input\n\nThe hook can modify the input. The user can either return a tuple or a\nsingle modified value in the hook. 
We will wrap the value into a tuple\nif a single value is returned (unless that value is already a tuple).\n\n#### Returns:\n\n:class:`torch.utils.hooks.RemovableHandle`:\n    a handle that can be used to remove the added hook by calling\n    ``handle.remove()``\n\n\n<h3 id=\"register_parameter\"><code><a name=\"register_parameter\">register_parameter</a></code></h3>\n\n``` python\nregister_parameter(\n    name,\n    param\n)\n```\n\nAdds a parameter to the module.\n\nThe parameter can be accessed as an attribute using the given name.\n\n#### Args:\n\nname (string): name of the parameter. The parameter can be accessed\n    from this module using the given name\nparam (Parameter): parameter to be added to the module.\n\n\n<h3 id=\"requires_grad_\"><code><a name=\"requires_grad_\">requires_grad_</a></code></h3>\n\n``` python\nrequires_grad_(requires_grad=True)\n```\n\nChanges whether autograd should record operations on parameters in this\nmodule.\n\nThis method sets the parameters' :attr:`requires_grad` attributes\nin-place.\n\nThis method is helpful for freezing part of the module for finetuning\nor training parts of a model individually (e.g., GAN training).\n\n#### Args:\n\nrequires_grad (bool): whether autograd should record operations on\n                      parameters in this module. Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"reset_parameters\"><code><a name=\"reset_parameters\">reset_parameters</a></code></h3>\n\n``` python\nreset_parameters()\n```\n\nResets this layer's parameters to their initial values.\n\n\n<h3 id=\"share_memory\"><code><a name=\"share_memory\">share_memory</a></code></h3>\n\n``` python\nshare_memory()\n```\n\n\n\n\n<h3 id=\"state_dict\"><code><a name=\"state_dict\">state_dict</a></code></h3>\n\n``` python\nstate_dict(\n    destination=None,\n    prefix='',\n    keep_vars=False\n)\n```\n\nReturns a dictionary containing a whole state of the module.\n\nBoth parameters and persistent buffers (e.g. 
running averages) are\nincluded. Keys are corresponding parameter and buffer names.\n\n#### Returns:\n\n\n* <b>`dict`</b>:     a dictionary containing a whole state of the module\n\nExample::\n\n    ```\n    >>> module.state_dict().keys()\n    ['bias', 'weight']\n    ```\n\n<h3 id=\"to\"><code><a name=\"to\">to</a></code></h3>\n\n``` python\nto(\n    *args,\n    **kwargs\n)\n```\n\nMoves and/or casts the parameters and buffers.\n\nThis can be called as\n\n.. function:: to(device=None, dtype=None, non_blocking=False)\n\n.. function:: to(dtype, non_blocking=False)\n\n.. function:: to(tensor, non_blocking=False)\n\nIts signature is similar to :meth:`torch.Tensor.to`, but only accepts\nfloating point desired :attr:`dtype` s. In addition, this method will\nonly cast the floating point parameters and buffers to :attr:`dtype`\n(if given). The integral parameters and buffers will be moved to\n:attr:`device`, if that is given, but with dtypes unchanged. When\n:attr:`non_blocking` is set, it tries to convert/move asynchronously\nwith respect to the host if possible, e.g., moving CPU Tensors with\npinned memory to CUDA devices.\n\nSee below for examples.\n\n.. 
note::\n    This method modifies the module in-place.\n\n#### Args:\n\ndevice (:class:`torch.device`): the desired device of the parameters\n    and buffers in this module\ndtype (:class:`torch.dtype`): the desired floating point type of\n    the floating point parameters and buffers in this module\ntensor (torch.Tensor): Tensor whose dtype and device are the desired\n    dtype and device for all parameters and buffers in this module\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\nExample::\n\n    ```\n    >>> linear = nn.Linear(2, 2)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]])\n    >>> linear.to(torch.double)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1913, -0.3420],\n            [-0.5113, -0.2325]], dtype=torch.float64)\n    >>> gpu1 = torch.device(\"cuda:1\")\n    >>> linear.to(gpu1, dtype=torch.half, non_blocking=True)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')\n    >>> cpu = torch.device(\"cpu\")\n    >>> linear.to(cpu)\n    Linear(in_features=2, out_features=2, bias=True)\n    >>> linear.weight\n    Parameter containing:\n    tensor([[ 0.1914, -0.3420],\n            [-0.5112, -0.2324]], dtype=torch.float16)\n    ```\n\n<h3 id=\"train\"><code><a name=\"train\">train</a></code></h3>\n\n``` python\ntrain(mode=True)\n```\n\nSets the module in training mode.\n\nThis has an effect only on certain modules. See the documentation of\nparticular modules for details of their behaviors in training/evaluation\nmode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,\netc.\n\n#### Args:\n\nmode (bool): whether to set training mode (``True``) or evaluation\n             mode (``False``). 
Default: ``True``.\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"type\"><code><a name=\"type\">type</a></code></h3>\n\n``` python\ntype(dst_type)\n```\n\nCasts all parameters and buffers to :attr:`dst_type`.\n\n\n#### Arguments:\n\ndst_type (type or string): the desired type\n\n\n\n#### Returns:\n\n\n* <b>`Module`</b>: self\n\n<h3 id=\"zero_grad\"><code><a name=\"zero_grad\">zero_grad</a></code></h3>\n\n``` python\nzero_grad()\n```\n\nSets gradients of all model parameters to zero.\n\n\n\n\n"
  },
  {
    "path": "docs/pytorch/haste_pytorch.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_pytorch\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n</div>\n\n# Module: haste_pytorch\n\n\n\nHaste: a fast, simple, and open RNN library.\n\n\n\n## Classes\n\n[`class GRU`](./haste_pytorch/GRU.md): Gated Recurrent Unit layer.\n\n[`class IndRNN`](./haste_pytorch/IndRNN.md): Independently Recurrent Neural Network layer.\n\n[`class LSTM`](./haste_pytorch/LSTM.md): Long Short-Term Memory layer.\n\n[`class LayerNormGRU`](./haste_pytorch/LayerNormGRU.md): Layer Normalized Gated Recurrent Unit layer.\n\n[`class LayerNormLSTM`](./haste_pytorch/LayerNormLSTM.md): Layer Normalized Long Short-Term Memory layer.\n\n"
  },
  {
    "path": "docs/tf/haste_tf/GRU.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.GRU\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"bidirectional\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n</div>\n\n# haste_tf.GRU\n\n<!-- Insert buttons and diff -->\n\n\n## Class `GRU`\n\nGated Recurrent Unit layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis GRU layer offers a fused, GPU-accelerated TensorFlow op for inference\nand training. There are two commonly-used variants of GRU cells. This one\nimplements 1406.1078v1 which applies the reset gate to the hidden state\nafter matrix multiplication. cuDNN also implements this variant. 
The other\nvariant, 1406.1078v3, applies the reset gate before matrix multiplication\nand is currently unsupported.\n\nThis layer has built-in support for DropConnect and Zoneout, which are\nboth techniques used to regularize RNNs.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    direction='unidirectional',\n    **kwargs\n)\n```\n\nInitialize the parameters of the GRU layer.\n\n\n#### Arguments:\n\n\n* <b>`num_units`</b>: int, the number of units in the GRU cell.\n* <b>`direction`</b>: string, 'unidirectional' or 'bidirectional'.\n* <b>`**kwargs`</b>: Dict, keyword arguments (see below).\n\n\n#### Keyword Arguments:\n\n\n* <b>`kernel_initializer`</b>: (optional) the initializer to use for the input\n  matrix weights. Defaults to `glorot_uniform`.\n* <b>`recurrent_initializer`</b>: (optional) the initializer to use for the\n  recurrent matrix weights. Defaults to `orthogonal`.\n* <b>`bias_initializer`</b>: (optional) the initializer to use for input bias\n  vectors. Defaults to `zeros`.\n* <b>`recurrent_bias_initializer`</b>: (optional) the initializer to use for\n  recurrent bias vectors. Defaults to `zeros`.\n* <b>`kernel_transform`</b>: (optional) a function with signature\n  `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n  used. Defaults to the identity function.\n* <b>`recurrent_transform`</b>: (optional) a function with signature\n  `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n  kernel before it is used. Defaults to the identity function.\n* <b>`bias_transform`</b>: (optional) a function with signature\n  `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n  Defaults to the identity function.\n* <b>`recurrent_bias_transform`</b>: (optional) a function with signature\n  `(recurrent_bias: Tensor) -> Tensor` that transforms the recurrent bias\n  before it is used. 
Defaults to the identity function.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix. Defaults to 0.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization. Defaults to 0.\n* <b>`dtype`</b>: (optional) the data type for this layer. Defaults to `tf.float32`.\n* <b>`name`</b>: (optional) string, the name for this layer.\n\n\n\n## Properties\n\n<h3 id=\"bidirectional\"><code>bidirectional</code></h3>\n\n`True` if this is a bidirectional RNN, `False` otherwise.\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\n\n\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    training,\n    sequence_length=None,\n    time_major=False\n)\n```\n\nRuns the RNN layer.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor, a rank 3 input tensor with shape [N,T,C] if `time_major`\n  is `False`, or with shape [T,N,C] if `time_major` is `True`.\n* <b>`training`</b>: bool, `True` if running in training mode, `False` if running\n  in inference mode.\n* <b>`sequence_length`</b>: (optional) Tensor, a rank 1 tensor with shape [N] and\n  dtype of `tf.int32` or `tf.int64`. 
This tensor specifies the unpadded\n  length of each example in the input minibatch.\n* <b>`time_major`</b>: (optional) bool, specifies whether `input` has shape [N,T,C]\n  (`time_major=False`) or shape [T,N,C] (`time_major=True`).\n\n\n#### Returns:\n\nA pair, `(output, state)` for unidirectional layers, or a pair\n`([output_fw, output_bw], [state_fw, state_bw])` for bidirectional\nlayers.\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer.\n\nCalling this method is optional for users of the RNN class. It is called\ninternally with the correct shape when `__call__` is invoked.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: instance of `TensorShape`.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/GRUCell.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.GRUCell\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"activity_regularizer\"/>\n<meta itemprop=\"property\" content=\"dtype\"/>\n<meta itemprop=\"property\" content=\"dynamic\"/>\n<meta itemprop=\"property\" content=\"graph\"/>\n<meta itemprop=\"property\" content=\"input\"/>\n<meta itemprop=\"property\" content=\"input_mask\"/>\n<meta itemprop=\"property\" content=\"input_shape\"/>\n<meta itemprop=\"property\" content=\"losses\"/>\n<meta itemprop=\"property\" content=\"metrics\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"non_trainable_variables\"/>\n<meta itemprop=\"property\" content=\"non_trainable_weights\"/>\n<meta itemprop=\"property\" content=\"output\"/>\n<meta itemprop=\"property\" content=\"output_mask\"/>\n<meta itemprop=\"property\" content=\"output_shape\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"scope_name\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"trainable_weights\"/>\n<meta itemprop=\"property\" content=\"updates\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"weights\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"compute_mask\"/>\n<meta itemprop=\"property\" content=\"compute_output_shape\"/>\n<meta itemprop=\"property\" content=\"count_params\"/>\n<meta 
itemprop=\"property\" content=\"from_config\"/>\n<meta itemprop=\"property\" content=\"get_config\"/>\n<meta itemprop=\"property\" content=\"get_initial_state\"/>\n<meta itemprop=\"property\" content=\"get_input_at\"/>\n<meta itemprop=\"property\" content=\"get_input_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_input_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_losses_for\"/>\n<meta itemprop=\"property\" content=\"get_output_at\"/>\n<meta itemprop=\"property\" content=\"get_output_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_output_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_updates_for\"/>\n<meta itemprop=\"property\" content=\"get_weights\"/>\n<meta itemprop=\"property\" content=\"set_weights\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n<meta itemprop=\"property\" content=\"zero_state\"/>\n</div>\n\n# haste_tf.GRUCell\n\n<!-- Insert buttons and diff -->\n\n\n## Class `GRUCell`\n\nA GRU cell that's compatible with the Haste GRU layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis cell can be used on hardware other than GPUs and with other TensorFlow\nclasses that operate on RNN cells (e.g. `dynamic_rnn`, `BasicDecoder`, cell\nwrappers, etc.).\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    name=None,\n    **kwargs\n)\n```\n\n\n\n\n\n\n## Properties\n\n<h3 id=\"activity_regularizer\"><code>activity_regularizer</code></h3>\n\nOptional regularizer function for the output of this layer.\n\n\n<h3 id=\"dtype\"><code>dtype</code></h3>\n\n\n\n\n<h3 id=\"dynamic\"><code>dynamic</code></h3>\n\n\n\n\n<h3 id=\"graph\"><code>graph</code></h3>\n\nDEPRECATED FUNCTION\n\nWarning: THIS FUNCTION IS DEPRECATED. 
It will be removed in a future version.\nInstructions for updating:\nStop using this property because tf.layers layers no longer track their graph.\n\n<h3 id=\"input\"><code>input</code></h3>\n\nRetrieves the input tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput tensor or list of input tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n* <b>`RuntimeError`</b>: If called in Eager mode.\n* <b>`AttributeError`</b>: If no inbound nodes are found.\n\n<h3 id=\"input_mask\"><code>input_mask</code></h3>\n\nRetrieves the input mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput mask tensor (potentially None) or list of input\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"input_shape\"><code>input_shape</code></h3>\n\nRetrieves the input shape(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. 
if it is connected to one incoming layer, or if all inputs\nhave the same shape.\n\n#### Returns:\n\nInput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per input tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined input_shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"losses\"><code>losses</code></h3>\n\nLosses which are associated with this `Layer`.\n\nVariable regularization tensors are created when this property is accessed,\nso it is eager safe: accessing `losses` under a `tf.GradientTape` will\npropagate gradients back to the corresponding variables.\n\n#### Returns:\n\nA list of tensors.\n\n\n<h3 id=\"metrics\"><code>metrics</code></h3>\n\n\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"non_trainable_variables\"><code>non_trainable_variables</code></h3>\n\n\n\n\n<h3 id=\"non_trainable_weights\"><code>non_trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"output\"><code>output</code></h3>\n\nRetrieves the output tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one output,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput tensor or list of output tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to more than one incoming\n  layer.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_mask\"><code>output_mask</code></h3>\n\nRetrieves the output mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. 
if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput mask tensor (potentially None) or list of output\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"output_shape\"><code>output_shape</code></h3>\n\nRetrieves the output shape(s) of a layer.\n\nOnly applicable if the layer has one output,\nor if all outputs have the same shape.\n\n#### Returns:\n\nOutput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per output tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined output shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\nInteger or TensorShape: size of outputs produced by this cell.\n\n\n<h3 id=\"scope_name\"><code>scope_name</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\nsize(s) of state(s) used by this cell.\n\nIt can be represented by an Integer, a TensorShape or a tuple of Integers\nor TensorShapes.\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable\"><code>trainable</code></h3>\n\n\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"trainable_weights\"><code>trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"updates\"><code>updates</code></h3>\n\n\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nReturns the list of all layer variables/weights.\n\nAlias of `self.weights`.\n\n#### Returns:\n\nA list of variables.\n\n\n<h3 id=\"weights\"><code>weights</code></h3>\n\nReturns the list of all layer variables/weights.\n\n\n#### Returns:\n\nA list of variables.\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    state,\n    scope=None\n)\n```\n\nRun this RNN cell on inputs, starting from the given state.\n\n\n#### Args:\n\n\n* <b>`inputs`</b>: `2-D` tensor with shape `[batch_size, input_size]`.\n* <b>`state`</b>: if `self.state_size` is an integer, this should be a `2-D Tensor`\n  with shape `[batch_size, self.state_size]`.  
Otherwise, if\n  `self.state_size` is a tuple of integers, this should be a tuple with\n  shapes `[batch_size, s] for s in self.state_size`.\n* <b>`scope`</b>: VariableScope for the created subgraph; defaults to class name.\n\n\n#### Returns:\n\n\n* <b>`A pair containing`</b>: \n- Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.\n- New state: Either a single `2-D` tensor, or a tuple of tensors matching\n  the arity and shapes of `state`.\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(\n    inputs,\n    *args,\n    **kwargs\n)\n```\n\nApply the layer on an input.\n\nThis is an alias of `self.__call__`.\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor(s).\n* <b>`*args`</b>: additional positional arguments to be passed to `self.call`.\n* <b>`**kwargs`</b>: additional keyword arguments to be passed to `self.call`.\n\n\n#### Returns:\n\nOutput tensor(s).\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer (optional, for subclass implementers).\n\nThis is a method that implementers of subclasses of `Layer` or `Model`\ncan override if they need a state-creation step in-between\nlayer instantiation and layer call.\n\nThis is typically used to create the weights of `Layer` subclasses.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: Instance of `TensorShape`, or list of instances of\n  `TensorShape` if the layer expects a list of inputs\n  (one instance per input).\n\n<h3 id=\"compute_mask\"><code><a name=\"compute_mask\">compute_mask</a></code></h3>\n\n``` python\ncompute_mask(\n    inputs,\n    mask=None\n)\n```\n\nComputes an output mask tensor.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor or list of tensors.\n* <b>`mask`</b>: Tensor or list of tensors.\n\n\n#### Returns:\n\nNone or a tensor (or list of tensors,\n    one per output tensor of the layer).\n\n\n<h3 id=\"compute_output_shape\"><code><a 
name=\"compute_output_shape\">compute_output_shape</a></code></h3>\n\n``` python\ncompute_output_shape(input_shape)\n```\n\nComputes the output shape of the layer.\n\nAssumes that the layer will be built\nto match that input shape provided.\n\n#### Arguments:\n\n\n* <b>`input_shape`</b>: Shape tuple (tuple of integers)\n    or list of shape tuples (one per output tensor of the layer).\n    Shape tuples can include None for free dimensions,\n    instead of an integer.\n\n\n#### Returns:\n\nAn input shape tuple.\n\n\n<h3 id=\"count_params\"><code><a name=\"count_params\">count_params</a></code></h3>\n\n``` python\ncount_params()\n```\n\nCount the total number of scalars composing the weights.\n\n\n#### Returns:\n\nAn integer count.\n\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: if the layer isn't yet built\n  (in which case its weights aren't yet defined).\n\n<h3 id=\"from_config\"><code><a name=\"from_config\">from_config</a></code></h3>\n\n``` python\n@classmethod\nfrom_config(\n    cls,\n    config\n)\n```\n\nCreates a layer from its config.\n\nThis method is the reverse of `get_config`,\ncapable of instantiating the same layer from the config\ndictionary. It does not handle layer connectivity\n(handled by Network), nor weights (handled by `set_weights`).\n\n#### Arguments:\n\n\n* <b>`config`</b>: A Python dictionary, typically the\n    output of get_config.\n\n\n#### Returns:\n\nA layer instance.\n\n\n<h3 id=\"get_config\"><code><a name=\"get_config\">get_config</a></code></h3>\n\n``` python\nget_config()\n```\n\nReturns the config of the layer.\n\nA layer config is a Python dictionary (serializable)\ncontaining the configuration of a layer.\nThe same layer can be reinstantiated later\n(without its trained weights) from this configuration.\n\nThe config of a layer does not include connectivity\ninformation, nor the layer class name. 
These are handled\nby `Network` (one layer of abstraction above).\n\n#### Returns:\n\nPython dictionary.\n\n\n<h3 id=\"get_initial_state\"><code><a name=\"get_initial_state\">get_initial_state</a></code></h3>\n\n``` python\nget_initial_state(\n    inputs=None,\n    batch_size=None,\n    dtype=None\n)\n```\n\n\n\n\n<h3 id=\"get_input_at\"><code><a name=\"get_input_at\">get_input_at</a></code></h3>\n\n``` python\nget_input_at(node_index)\n```\n\nRetrieves the input tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_input_mask_at\"><code><a name=\"get_input_mask_at\">get_input_mask_at</a></code></h3>\n\n``` python\nget_input_mask_at(node_index)\n```\n\nRetrieves the input mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple inputs).\n\n\n<h3 id=\"get_input_shape_at\"><code><a name=\"get_input_shape_at\">get_input_shape_at</a></code></h3>\n\n``` python\nget_input_shape_at(node_index)\n```\n\nRetrieves the input shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_losses_for\"><code><a name=\"get_losses_for\">get_losses_for</a></code></h3>\n\n``` python\nget_losses_for(inputs)\n```\n\nRetrieves losses relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of loss tensors of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_output_at\"><code><a name=\"get_output_at\">get_output_at</a></code></h3>\n\n``` python\nget_output_at(node_index)\n```\n\nRetrieves the output tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_output_mask_at\"><code><a name=\"get_output_mask_at\">get_output_mask_at</a></code></h3>\n\n``` python\nget_output_mask_at(node_index)\n```\n\nRetrieves the output mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple outputs).\n\n\n<h3 id=\"get_output_shape_at\"><code><a name=\"get_output_shape_at\">get_output_shape_at</a></code></h3>\n\n``` python\nget_output_shape_at(node_index)\n```\n\nRetrieves the output shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_updates_for\"><code><a name=\"get_updates_for\">get_updates_for</a></code></h3>\n\n``` python\nget_updates_for(inputs)\n```\n\nRetrieves updates relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of update ops of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_weights\"><code><a name=\"get_weights\">get_weights</a></code></h3>\n\n``` python\nget_weights()\n```\n\nReturns the current weights of the layer.\n\n\n#### Returns:\n\nWeight values as a list of numpy arrays.\n\n\n<h3 id=\"set_weights\"><code><a name=\"set_weights\">set_weights</a></code></h3>\n\n``` python\nset_weights(weights)\n```\n\nSets the weights of the layer from Numpy arrays.\n\n\n#### Arguments:\n\n\n* <b>`weights`</b>: a list of Numpy arrays. The number\n    of arrays and their shapes must match the\n    number and shapes of the weights\n    of the layer (i.e. 
it should match the\n    output of `get_weights`).\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: If the provided weights list does not match the\n    layer's specifications.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n<h3 id=\"zero_state\"><code><a name=\"zero_state\">zero_state</a></code></h3>\n\n``` python\nzero_state(\n    batch_size,\n    dtype\n)\n```\n\nReturn zero-filled state tensor(s).\n\n\n#### Args:\n\n\n* <b>`batch_size`</b>: int, float, or unit Tensor representing the batch size.\n* <b>`dtype`</b>: the data type to use for the state.\n\n\n#### Returns:\n\nIf `state_size` is an int or TensorShape, then the return value is a\n`N-D` tensor of shape `[batch_size, state_size]` filled with zeros.\n\nIf `state_size` is a nested list or tuple, then the return value is\na nested list or tuple (of the same structure) of `2-D` tensors with\nthe shapes `[batch_size, s]` for each s in `state_size`.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/IndRNN.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.IndRNN\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"bidirectional\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n</div>\n\n# haste_tf.IndRNN\n\n<!-- Insert buttons and diff -->\n\n\n## Class `IndRNN`\n\nIndependently Recurrent Neural Network layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis layer offers a fused, GPU-accelerated TensorFlow op for inference and\ntraining. It also supports Zoneout regularization.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    direction='unidirectional',\n    **kwargs\n)\n```\n\nInitialize the parameters of the IndRNN layer.\n\n\n#### Arguments:\n\n\n* <b>`num_units`</b>: int, the number of units in the IndRNN cell.\n* <b>`direction`</b>: string, 'unidirectional' or 'bidirectional'.\n* <b>`**kwargs`</b>: Dict, keyword arguments (see below).\n\n\n#### Keyword Arguments:\n\n\n* <b>`kernel_initializer`</b>: (optional) the initializer to use for the input\n  matrix weights. Defaults to `glorot_uniform`.\n* <b>`recurrent_initializer`</b>: (optional) the initializer to use for the\n  recurrent scale weights. 
Defaults to uniform random in [-0.5, 0.5].\n  Note that this initialization scheme differs from the original\n  authors' implementation. See https://github.com/lmnt-com/haste/issues/7\n  for details.\n* <b>`bias_initializer`</b>: (optional) the initializer to use for the bias vector.\n  Defaults to `zeros`.\n* <b>`kernel_transform`</b>: (optional) a function with signature\n  `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n  used. Defaults to the identity function.\n* <b>`recurrent_transform`</b>: (optional) a function with signature\n  `(recurrent_scale: Tensor) -> Tensor` that transforms the recurrent\n  scale vector before it is used. Defaults to the identity function.\n* <b>`bias_transform`</b>: (optional) a function with signature\n  `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n  Defaults to the identity function.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization. Defaults to 0.\n* <b>`dtype`</b>: (optional) the data type for this layer. 
Defaults to `tf.float32`.\n* <b>`name`</b>: (optional) string, the name for this layer.\n\n\n\n## Properties\n\n<h3 id=\"bidirectional\"><code>bidirectional</code></h3>\n\n`True` if this is a bidirectional RNN, `False` otherwise.\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\n\n\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    training,\n    sequence_length=None,\n    time_major=False\n)\n```\n\nRuns the RNN layer.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor, a rank 3 input tensor with shape [N,T,C] if `time_major`\n  is `False`, or with shape [T,N,C] if `time_major` is `True`.\n* <b>`training`</b>: bool, `True` if running in training mode, `False` if running\n  in inference mode.\n* <b>`sequence_length`</b>: (optional) Tensor, a rank 1 tensor with shape [N] and\n  dtype of `tf.int32` or `tf.int64`. This tensor specifies the unpadded\n  length of each example in the input minibatch.\n* <b>`time_major`</b>: (optional) bool, specifies whether `input` has shape [N,T,C]\n  (`time_major=False`) or shape [T,N,C] (`time_major=True`).\n\n\n#### Returns:\n\nA pair, `(output, state)` for unidirectional layers, or a pair\n`([output_fw, output_bw], [state_fw, state_bw])` for bidirectional\nlayers.\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer.\n\nCalling this method is optional for users of the RNN class. 
It is called\ninternally with the correct shape when `__call__` is invoked.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: instance of `TensorShape`.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/LSTM.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.LSTM\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"bidirectional\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n</div>\n\n# haste_tf.LSTM\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LSTM`\n\nLong Short-Term Memory layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis LSTM layer offers a fused, GPU-accelerated TensorFlow op for inference\nand training. Its weights and variables are compatible with `BasicLSTMCell`,\n`LSTMCell`, and `LSTMBlockCell` by default, and the layer can load weights\nfrom `tf.contrib.cudnn_rnn.CudnnLSTM` when `cudnn_compat=True` is specified.\n\nAlthough this implementation is comparable in performance to cuDNN's LSTM,\nit offers additional options not typically found in other high-performance\nimplementations. 
DropConnect and Zoneout regularization are built-in, and\nthis layer allows setting a non-zero initial forget gate bias.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    direction='unidirectional',\n    **kwargs\n)\n```\n\nInitialize the parameters of the LSTM layer.\n\n\n#### Arguments:\n\n\n* <b>`num_units`</b>: int, the number of units in the LSTM cell.\n* <b>`direction`</b>: string, 'unidirectional' or 'bidirectional'.\n* <b>`**kwargs`</b>: Dict, keyword arguments (see below).\n\n\n#### Keyword Arguments:\n\n\n* <b>`kernel_initializer`</b>: (optional) the initializer to use for the input\n  matrix weights. Defaults to `glorot_uniform`.\n* <b>`recurrent_initializer`</b>: (optional) the initializer to use for the\n  recurrent matrix weights. Defaults to `orthogonal`.\n* <b>`bias_initializer`</b>: (optional) the initializer to use for both input and\n  recurrent bias vectors. Defaults to `zeros` unless `forget_bias` is\n  non-zero (see below).\n* <b>`kernel_transform`</b>: (optional) a function with signature\n  `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n  used. Defaults to the identity function.\n* <b>`recurrent_transform`</b>: (optional) a function with signature\n  `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n  kernel before it is used. Defaults to the identity function.\n* <b>`bias_transform`</b>: (optional) a function with signature\n  `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n  Defaults to the identity function.\n* <b>`forget_bias`</b>: (optional) float, sets the initial weights for the forget\n  gates. Defaults to 1 and overrides the `bias_initializer` unless this\n  argument is set to 0.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix. 
Defaults to 0.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization. Defaults to 0.\n* <b>`dtype`</b>: (optional) the data type for this layer. Defaults to `tf.float32`.\n* <b>`name`</b>: (optional) string, the name for this layer.\n* <b>`cudnn_compat`</b>: (optional) bool, if `True`, the variables created by this\n  layer are compatible with `tf.contrib.cudnn_rnn.CudnnLSTM`. Note that\n  this should only be set if you're restoring variables from a cuDNN\n  model. It's currently not possible to train a model with\n  `cudnn_compat=True` and restore it with CudnnLSTM. Defaults to `False`.\n\n\n\n## Properties\n\n<h3 id=\"bidirectional\"><code>bidirectional</code></h3>\n\n`True` if this is a bidirectional RNN, `False` otherwise.\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\n\n\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    training,\n    sequence_length=None,\n    time_major=False\n)\n```\n\nRuns the RNN layer.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor, a rank 3 input tensor with shape [N,T,C] if `time_major`\n  is `False`, or with shape [T,N,C] if `time_major` is `True`.\n* <b>`training`</b>: bool, `True` if running in training mode, `False` if running\n  in inference mode.\n* <b>`sequence_length`</b>: (optional) Tensor, a rank 1 tensor with shape [N] and\n  dtype of `tf.int32` or `tf.int64`. 
This tensor specifies the unpadded\n  length of each example in the input minibatch.\n* <b>`time_major`</b>: (optional) bool, specifies whether `input` has shape [N,T,C]\n  (`time_major=False`) or shape [T,N,C] (`time_major=True`).\n\n\n#### Returns:\n\nA pair, `(output, state)` for unidirectional layers, or a pair\n`([output_fw, output_bw], [state_fw, state_bw])` for bidirectional\nlayers.\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer.\n\nCalling this method is optional for users of the RNN class. It is called\ninternally with the correct shape when `__call__` is invoked.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: instance of `TensorShape`.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/LayerNorm.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.LayerNorm\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n</div>\n\n# haste_tf.LayerNorm\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNorm`\n\nLayer normalization layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis class exposes a fused and GPU-accelerated implementation of layer\nnormalization as described by [Ba et al.](https://arxiv.org/abs/1607.06450)\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(name=None)\n```\n\nInitialize the parameters of the layer normalization layer.\n\n\n#### Arguments:\n\n\n* <b>`name`</b>: (optional) string, the name for this layer.\n\n\n\n## Properties\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert 
list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(x)\n```\n\nRuns the layer.\n\n\n#### Arguments:\n\n\n* <b>`x`</b>: Tensor, a rank R tensor.\n\n\n#### Returns:\n\n\n* <b>`y`</b>: Tensor, a rank R tensor with the last dimension normalized.\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer.\n\nCalling this method is optional for users of the LayerNorm class. 
It is\ncalled internally with the correct shape when `__call__` is invoked.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: instance of `TensorShape`.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/LayerNormGRU.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.LayerNormGRU\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"bidirectional\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n</div>\n\n# haste_tf.LayerNormGRU\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNormGRU`\n\nLayer Normalized Gated Recurrent Unit layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis GRU layer applies layer normalization to the input and recurrent output\nactivations of a standard GRU. The implementation is fused and\nGPU-accelerated. There are two commonly-used variants of GRU cells. This one\nimplements 1406.1078v1 which applies the reset gate to the hidden state\nafter matrix multiplication. 
The other variant, 1406.1078v3, applies the\nreset gate before matrix multiplication and is currently unsupported.\n\nThis layer has built-in support for DropConnect and Zoneout, which are\nboth techniques used to regularize RNNs.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    direction='unidirectional',\n    **kwargs\n)\n```\n\nInitialize the parameters of the GRU layer.\n\n\n#### Arguments:\n\n\n* <b>`num_units`</b>: int, the number of units in the GRU cell.\n* <b>`direction`</b>: string, 'unidirectional' or 'bidirectional'.\n* <b>`**kwargs`</b>: Dict, keyword arguments (see below).\n\n\n#### Keyword Arguments:\n\n\n* <b>`kernel_initializer`</b>: (optional) the initializer to use for the input\n  matrix weights. Defaults to `glorot_uniform`.\n* <b>`recurrent_initializer`</b>: (optional) the initializer to use for the\n  recurrent matrix weights. Defaults to `orthogonal`.\n* <b>`bias_initializer`</b>: (optional) the initializer to use for input bias\n  vectors. Defaults to `zeros`.\n* <b>`recurrent_bias_initializer`</b>: (optional) the initializer to use for\n  recurrent bias vectors. Defaults to `zeros`.\n* <b>`kernel_transform`</b>: (optional) a function with signature\n  `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n  used. Defaults to the identity function.\n* <b>`recurrent_transform`</b>: (optional) a function with signature\n  `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n  kernel before it is used. Defaults to the identity function.\n* <b>`bias_transform`</b>: (optional) a function with signature\n  `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n  Defaults to the identity function.\n* <b>`recurrent_bias_transform`</b>: (optional) a function with signature\n  `(recurrent_bias: Tensor) -> Tensor` that transforms the recurrent bias\n  before it is used. 
Defaults to the identity function.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix. Defaults to 0.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization. Defaults to 0.\n* <b>`dtype`</b>: (optional) the data type for this layer. Defaults to `tf.float32`.\n* <b>`name`</b>: (optional) string, the name for this layer.\n\n\n\n## Properties\n\n<h3 id=\"bidirectional\"><code>bidirectional</code></h3>\n\n`True` if this is a bidirectional RNN, `False` otherwise.\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\n\n\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    training,\n    sequence_length=None,\n    time_major=False\n)\n```\n\nRuns the RNN layer.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor, a rank 3 input tensor with shape [N,T,C] if `time_major`\n  is `False`, or with shape [T,N,C] if `time_major` is `True`.\n* <b>`training`</b>: bool, `True` if running in training mode, `False` if running\n  in inference mode.\n* <b>`sequence_length`</b>: (optional) Tensor, a rank 1 tensor with shape [N] and\n  dtype of `tf.int32` or `tf.int64`. 
This tensor specifies the unpadded\n  length of each example in the input minibatch.\n* <b>`time_major`</b>: (optional) bool, specifies whether `input` has shape [N,T,C]\n  (`time_major=False`) or shape [T,N,C] (`time_major=True`).\n\n\n#### Returns:\n\nA pair, `(output, state)` for unidirectional layers, or a pair\n`([output_fw, output_bw], [state_fw, state_bw])` for bidirectional\nlayers.\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer.\n\nCalling this method is optional for users of the RNN class. It is called\ninternally with the correct shape when `__call__` is invoked.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: instance of `TensorShape`.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/LayerNormGRUCell.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.LayerNormGRUCell\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"activity_regularizer\"/>\n<meta itemprop=\"property\" content=\"dtype\"/>\n<meta itemprop=\"property\" content=\"dynamic\"/>\n<meta itemprop=\"property\" content=\"graph\"/>\n<meta itemprop=\"property\" content=\"input\"/>\n<meta itemprop=\"property\" content=\"input_mask\"/>\n<meta itemprop=\"property\" content=\"input_shape\"/>\n<meta itemprop=\"property\" content=\"losses\"/>\n<meta itemprop=\"property\" content=\"metrics\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"non_trainable_variables\"/>\n<meta itemprop=\"property\" content=\"non_trainable_weights\"/>\n<meta itemprop=\"property\" content=\"output\"/>\n<meta itemprop=\"property\" content=\"output_mask\"/>\n<meta itemprop=\"property\" content=\"output_shape\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"scope_name\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"trainable_weights\"/>\n<meta itemprop=\"property\" content=\"updates\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"weights\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"compute_mask\"/>\n<meta itemprop=\"property\" content=\"compute_output_shape\"/>\n<meta itemprop=\"property\" 
content=\"count_params\"/>\n<meta itemprop=\"property\" content=\"from_config\"/>\n<meta itemprop=\"property\" content=\"get_config\"/>\n<meta itemprop=\"property\" content=\"get_initial_state\"/>\n<meta itemprop=\"property\" content=\"get_input_at\"/>\n<meta itemprop=\"property\" content=\"get_input_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_input_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_losses_for\"/>\n<meta itemprop=\"property\" content=\"get_output_at\"/>\n<meta itemprop=\"property\" content=\"get_output_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_output_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_updates_for\"/>\n<meta itemprop=\"property\" content=\"get_weights\"/>\n<meta itemprop=\"property\" content=\"set_weights\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n<meta itemprop=\"property\" content=\"zero_state\"/>\n</div>\n\n# haste_tf.LayerNormGRUCell\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNormGRUCell`\n\nA GRU cell that's compatible with the Haste LayerNormGRU layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis cell can be used on hardware other than GPUs and with other TensorFlow\nclasses that operate on RNN cells (e.g. `dynamic_rnn`, `BasicDecoder`, cell\nwrappers, etc.).\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    forget_bias=1.0,\n    dropout=0.0,\n    dtype=None,\n    name=None,\n    **kwargs\n)\n```\n\n\n\n\n\n\n## Properties\n\n<h3 id=\"activity_regularizer\"><code>activity_regularizer</code></h3>\n\nOptional regularizer function for the output of this layer.\n\n\n<h3 id=\"dtype\"><code>dtype</code></h3>\n\n\n\n\n<h3 id=\"dynamic\"><code>dynamic</code></h3>\n\n\n\n\n<h3 id=\"graph\"><code>graph</code></h3>\n\nDEPRECATED FUNCTION\n\nWarning: THIS FUNCTION IS DEPRECATED. 
It will be removed in a future version.\nInstructions for updating:\nStop using this property because tf.layers layers no longer track their graph.\n\n<h3 id=\"input\"><code>input</code></h3>\n\nRetrieves the input tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput tensor or list of input tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layers.\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n* <b>`AttributeError`</b>: If no inbound nodes are found.\n\n<h3 id=\"input_mask\"><code>input_mask</code></h3>\n\nRetrieves the input mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput mask tensor (potentially None) or list of input\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layers.\n\n<h3 id=\"input_shape\"><code>input_shape</code></h3>\n\nRetrieves the input shape(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. 
if it is connected to one incoming layer, or if all inputs\nhave the same shape.\n\n#### Returns:\n\nInput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per input tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined input_shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"losses\"><code>losses</code></h3>\n\nLosses which are associated with this `Layer`.\n\nVariable regularization tensors are created when this property is accessed,\nso it is eager safe: accessing `losses` under a `tf.GradientTape` will\npropagate gradients back to the corresponding variables.\n\n#### Returns:\n\nA list of tensors.\n\n\n<h3 id=\"metrics\"><code>metrics</code></h3>\n\n\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"non_trainable_variables\"><code>non_trainable_variables</code></h3>\n\n\n\n\n<h3 id=\"non_trainable_weights\"><code>non_trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"output\"><code>output</code></h3>\n\nRetrieves the output tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one output,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput tensor or list of output tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to more than one incoming\n  layers.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_mask\"><code>output_mask</code></h3>\n\nRetrieves the output mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. 
if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput mask tensor (potentially None) or list of output\nmask tensors.\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"output_shape\"><code>output_shape</code></h3>\n\nRetrieves the output shape(s) of a layer.\n\nOnly applicable if the layer has one output,\nor if all outputs have the same shape.\n\n#### Returns:\n\nOutput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per output tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined output shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\nInteger or TensorShape: size of outputs produced by this cell.\n\n\n<h3 id=\"scope_name\"><code>scope_name</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\nSize(s) of state(s) used by this cell.\n\nIt can be represented by an Integer, a TensorShape or a tuple of Integers\nor TensorShapes.\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable\"><code>trainable</code></h3>\n\n\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"trainable_weights\"><code>trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"updates\"><code>updates</code></h3>\n\n\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nReturns the list of all layer variables/weights.\n\nAlias of `self.weights`.\n\n#### Returns:\n\nA list of variables.\n\n\n<h3 id=\"weights\"><code>weights</code></h3>\n\nReturns the list of all layer variables/weights.\n\n\n#### Returns:\n\nA list of variables.\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    state,\n    scope=None\n)\n```\n\nRun this RNN cell on inputs, starting from the given state.\n\n\n#### Args:\n\n\n* <b>`inputs`</b>: `2-D` tensor with shape `[batch_size, input_size]`.\n* <b>`state`</b>: if `self.state_size` is an integer, this should be a `2-D Tensor`\n  with shape `[batch_size, self.state_size]`.  
Otherwise, if\n  `self.state_size` is a tuple of integers, this should be a tuple with\n  shapes `[batch_size, s]` for `s` in `self.state_size`.\n* <b>`scope`</b>: VariableScope for the created subgraph; defaults to class name.\n\n\n#### Returns:\n\n\n* <b>`A pair containing`</b>: \n- Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.\n- New state: Either a single `2-D` tensor, or a tuple of tensors matching\n  the arity and shapes of `state`.\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(\n    inputs,\n    *args,\n    **kwargs\n)\n```\n\nApply the layer on an input.\n\nThis is an alias of `self.__call__`.\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor(s).\n* <b>`*args`</b>: additional positional arguments to be passed to `self.call`.\n* <b>`**kwargs`</b>: additional keyword arguments to be passed to `self.call`.\n\n\n#### Returns:\n\nOutput tensor(s).\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer (optional, for subclass implementers).\n\nThis is a method that implementers of subclasses of `Layer` or `Model`\ncan override if they need a state-creation step in between\nlayer instantiation and layer call.\n\nThis is typically used to create the weights of `Layer` subclasses.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: Instance of `TensorShape`, or list of instances of\n  `TensorShape` if the layer expects a list of inputs\n  (one instance per input).\n\n<h3 id=\"compute_mask\"><code><a name=\"compute_mask\">compute_mask</a></code></h3>\n\n``` python\ncompute_mask(\n    inputs,\n    mask=None\n)\n```\n\nComputes an output mask tensor.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor or list of tensors.\n* <b>`mask`</b>: Tensor or list of tensors.\n\n\n#### Returns:\n\nNone or a tensor (or list of tensors,\n    one per output tensor of the layer).\n\n\n<h3 id=\"compute_output_shape\"><code><a 
name=\"compute_output_shape\">compute_output_shape</a></code></h3>\n\n``` python\ncompute_output_shape(input_shape)\n```\n\nComputes the output shape of the layer.\n\nAssumes that the layer will be built\nto match the input shape provided.\n\n#### Arguments:\n\n\n* <b>`input_shape`</b>: Shape tuple (tuple of integers)\n    or list of shape tuples (one per input tensor of the layer).\n    Shape tuples can include None for free dimensions,\n    instead of an integer.\n\n\n#### Returns:\n\nAn output shape tuple.\n\n\n<h3 id=\"count_params\"><code><a name=\"count_params\">count_params</a></code></h3>\n\n``` python\ncount_params()\n```\n\nCount the total number of scalars composing the weights.\n\n\n#### Returns:\n\nAn integer count.\n\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: if the layer isn't yet built\n  (in which case its weights aren't yet defined).\n\n<h3 id=\"from_config\"><code><a name=\"from_config\">from_config</a></code></h3>\n\n``` python\n@classmethod\nfrom_config(\n    cls,\n    config\n)\n```\n\nCreates a layer from its config.\n\nThis method is the reverse of `get_config`,\ncapable of instantiating the same layer from the config\ndictionary. It does not handle layer connectivity\n(handled by Network), nor weights (handled by `set_weights`).\n\n#### Arguments:\n\n\n* <b>`config`</b>: A Python dictionary, typically the\n    output of `get_config`.\n\n\n#### Returns:\n\nA layer instance.\n\n\n<h3 id=\"get_config\"><code><a name=\"get_config\">get_config</a></code></h3>\n\n``` python\nget_config()\n```\n\nReturns the config of the layer.\n\nA layer config is a Python dictionary (serializable)\ncontaining the configuration of a layer.\nThe same layer can be reinstantiated later\n(without its trained weights) from this configuration.\n\nThe config of a layer does not include connectivity\ninformation, nor the layer class name. 
These are handled\nby `Network` (one layer of abstraction above).\n\n#### Returns:\n\nPython dictionary.\n\n\n<h3 id=\"get_initial_state\"><code><a name=\"get_initial_state\">get_initial_state</a></code></h3>\n\n``` python\nget_initial_state(\n    inputs=None,\n    batch_size=None,\n    dtype=None\n)\n```\n\n\n\n\n<h3 id=\"get_input_at\"><code><a name=\"get_input_at\">get_input_at</a></code></h3>\n\n``` python\nget_input_at(node_index)\n```\n\nRetrieves the input tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_input_mask_at\"><code><a name=\"get_input_mask_at\">get_input_mask_at</a></code></h3>\n\n``` python\nget_input_mask_at(node_index)\n```\n\nRetrieves the input mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple inputs).\n\n\n<h3 id=\"get_input_shape_at\"><code><a name=\"get_input_shape_at\">get_input_shape_at</a></code></h3>\n\n``` python\nget_input_shape_at(node_index)\n```\n\nRetrieves the input shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_losses_for\"><code><a name=\"get_losses_for\">get_losses_for</a></code></h3>\n\n``` python\nget_losses_for(inputs)\n```\n\nRetrieves losses relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of loss tensors of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_output_at\"><code><a name=\"get_output_at\">get_output_at</a></code></h3>\n\n``` python\nget_output_at(node_index)\n```\n\nRetrieves the output tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_output_mask_at\"><code><a name=\"get_output_mask_at\">get_output_mask_at</a></code></h3>\n\n``` python\nget_output_mask_at(node_index)\n```\n\nRetrieves the output mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple outputs).\n\n\n<h3 id=\"get_output_shape_at\"><code><a name=\"get_output_shape_at\">get_output_shape_at</a></code></h3>\n\n``` python\nget_output_shape_at(node_index)\n```\n\nRetrieves the output shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_updates_for\"><code><a name=\"get_updates_for\">get_updates_for</a></code></h3>\n\n``` python\nget_updates_for(inputs)\n```\n\nRetrieves updates relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of update ops of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_weights\"><code><a name=\"get_weights\">get_weights</a></code></h3>\n\n``` python\nget_weights()\n```\n\nReturns the current weights of the layer.\n\n\n#### Returns:\n\nWeights values as a list of numpy arrays.\n\n\n<h3 id=\"set_weights\"><code><a name=\"set_weights\">set_weights</a></code></h3>\n\n``` python\nset_weights(weights)\n```\n\nSets the weights of the layer, from Numpy arrays.\n\n\n#### Arguments:\n\n\n* <b>`weights`</b>: a list of Numpy arrays. The number\n    of arrays and their shape must match\n    number of the dimensions of the weights\n    of the layer (i.e. 
it should match the\n    output of `get_weights`).\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: If the provided weights list does not match the\n    layer's specifications.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n<h3 id=\"zero_state\"><code><a name=\"zero_state\">zero_state</a></code></h3>\n\n``` python\nzero_state(\n    batch_size,\n    dtype\n)\n```\n\nReturn zero-filled state tensor(s).\n\n\n#### Args:\n\n\n* <b>`batch_size`</b>: int, float, or unit Tensor representing the batch size.\n* <b>`dtype`</b>: the data type to use for the state.\n\n\n#### Returns:\n\nIf `state_size` is an int or TensorShape, then the return value is a\n`N-D` tensor of shape `[batch_size, state_size]` filled with zeros.\n\nIf `state_size` is a nested list or tuple, then the return value is\na nested list or tuple (of the same structure) of `2-D` tensors with\nthe shapes `[batch_size, s]` for each s in `state_size`.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/LayerNormLSTM.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.LayerNormLSTM\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"bidirectional\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n</div>\n\n# haste_tf.LayerNormLSTM\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNormLSTM`\n\nLayer Normalized Long Short-Term Memory layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis LSTM layer applies layer normalization to the input, recurrent, and\noutput activations of a standard LSTM. The implementation is fused and\nGPU-accelerated. 
DropConnect and Zoneout regularization are built-in, and\nthis layer allows setting a non-zero initial forget gate bias.\n\nDetails about the exact function this layer implements can be found at\nhttps://github.com/lmnt-com/haste/issues/1.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    direction='unidirectional',\n    **kwargs\n)\n```\n\nInitialize the parameters of the LSTM layer.\n\n\n#### Arguments:\n\n\n* <b>`num_units`</b>: int, the number of units in the LSTM cell.\n* <b>`direction`</b>: string, 'unidirectional' or 'bidirectional'.\n* <b>`**kwargs`</b>: Dict, keyword arguments (see below).\n\n\n#### Keyword Arguments:\n\n\n* <b>`kernel_initializer`</b>: (optional) the initializer to use for the input\n  matrix weights. Defaults to `glorot_uniform`.\n* <b>`recurrent_initializer`</b>: (optional) the initializer to use for the\n  recurrent matrix weights. Defaults to `orthogonal`.\n* <b>`bias_initializer`</b>: (optional) the initializer to use for both input and\n  recurrent bias vectors. Defaults to `zeros` unless `forget_bias` is\n  non-zero (see below).\n* <b>`kernel_transform`</b>: (optional) a function with signature\n  `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n  used. Defaults to the identity function.\n* <b>`recurrent_transform`</b>: (optional) a function with signature\n  `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n  kernel before it is used. Defaults to the identity function.\n* <b>`bias_transform`</b>: (optional) a function with signature\n  `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n  Defaults to the identity function.\n* <b>`forget_bias`</b>: (optional) float, sets the initial weights for the forget\n  gates. 
Defaults to 1 and overrides the `bias_initializer` unless this\n  argument is set to 0.\n* <b>`dropout`</b>: (optional) float, sets the dropout rate for DropConnect\n  regularization on the recurrent matrix. Defaults to 0.\n* <b>`zoneout`</b>: (optional) float, sets the zoneout rate for Zoneout\n  regularization. Defaults to 0.\n* <b>`dtype`</b>: (optional) the data type for this layer. Defaults to `tf.float32`.\n* <b>`name`</b>: (optional) string, the name for this layer.\n\n\n\n## Properties\n\n<h3 id=\"bidirectional\"><code>bidirectional</code></h3>\n\n`True` if this is a bidirectional RNN, `False` otherwise.\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\n\n\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and it's submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nSequence of variables owned by this module and it's submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    training,\n    sequence_length=None,\n    time_major=False\n)\n```\n\nRuns the RNN layer.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor, a rank 3 input tensor with shape [N,T,C] if `time_major`\n  is `False`, or with shape [T,N,C] if `time_major` is `True`.\n* <b>`training`</b>: bool, `True` if running in training mode, `False` if running\n  in inference mode.\n* <b>`sequence_length`</b>: (optional) Tensor, a rank 1 tensor with shape [N] and\n  dtype of `tf.int32` or `tf.int64`. 
This tensor specifies the unpadded\n  length of each example in the input minibatch.\n* <b>`time_major`</b>: (optional) bool, specifies whether `input` has shape [N,T,C]\n  (`time_major=False`) or shape [T,N,C] (`time_major=True`).\n\n\n#### Returns:\n\nA pair, `(output, state)` for unidirectional layers, or a pair\n`([output_fw, output_bw], [state_fw, state_bw])` for bidirectional\nlayers.\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(shape)\n```\n\nCreates the variables of the layer.\n\nCalling this method is optional for users of the RNN class. It is called\ninternally with the correct shape when `__call__` is invoked.\n\n#### Arguments:\n\n\n* <b>`shape`</b>: instance of `TensorShape`.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/LayerNormLSTMCell.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.LayerNormLSTMCell\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"activity_regularizer\"/>\n<meta itemprop=\"property\" content=\"dtype\"/>\n<meta itemprop=\"property\" content=\"dynamic\"/>\n<meta itemprop=\"property\" content=\"graph\"/>\n<meta itemprop=\"property\" content=\"input\"/>\n<meta itemprop=\"property\" content=\"input_mask\"/>\n<meta itemprop=\"property\" content=\"input_shape\"/>\n<meta itemprop=\"property\" content=\"losses\"/>\n<meta itemprop=\"property\" content=\"metrics\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"non_trainable_variables\"/>\n<meta itemprop=\"property\" content=\"non_trainable_weights\"/>\n<meta itemprop=\"property\" content=\"output\"/>\n<meta itemprop=\"property\" content=\"output_mask\"/>\n<meta itemprop=\"property\" content=\"output_shape\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"scope_name\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"trainable_weights\"/>\n<meta itemprop=\"property\" content=\"updates\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"weights\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"compute_mask\"/>\n<meta itemprop=\"property\" content=\"compute_output_shape\"/>\n<meta itemprop=\"property\" 
content=\"count_params\"/>\n<meta itemprop=\"property\" content=\"from_config\"/>\n<meta itemprop=\"property\" content=\"get_config\"/>\n<meta itemprop=\"property\" content=\"get_initial_state\"/>\n<meta itemprop=\"property\" content=\"get_input_at\"/>\n<meta itemprop=\"property\" content=\"get_input_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_input_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_losses_for\"/>\n<meta itemprop=\"property\" content=\"get_output_at\"/>\n<meta itemprop=\"property\" content=\"get_output_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_output_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_updates_for\"/>\n<meta itemprop=\"property\" content=\"get_weights\"/>\n<meta itemprop=\"property\" content=\"set_weights\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n<meta itemprop=\"property\" content=\"zero_state\"/>\n</div>\n\n# haste_tf.LayerNormLSTMCell\n\n<!-- Insert buttons and diff -->\n\n\n## Class `LayerNormLSTMCell`\n\nAn LSTM cell that's compatible with the Haste LayerNormLSTM layer.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThis cell can be used on hardware other than GPUs and with other TensorFlow\nclasses that operate on RNN cells (e.g. `dynamic_rnn`, `BasicDecoder`, cell\nwrappers, etc.).\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    num_units,\n    forget_bias=1.0,\n    dropout=0.0,\n    dtype=None,\n    name=None,\n    **kwargs\n)\n```\n\n\n\n\n\n\n## Properties\n\n<h3 id=\"activity_regularizer\"><code>activity_regularizer</code></h3>\n\nOptional regularizer function for the output of this layer.\n\n\n<h3 id=\"dtype\"><code>dtype</code></h3>\n\n\n\n\n<h3 id=\"dynamic\"><code>dynamic</code></h3>\n\n\n\n\n<h3 id=\"graph\"><code>graph</code></h3>\n\nDEPRECATED FUNCTION\n\nWarning: THIS FUNCTION IS DEPRECATED. 
It will be removed in a future version.\nInstructions for updating:\nStop using this property because tf.layers layers no longer track their graph.\n\n<h3 id=\"input\"><code>input</code></h3>\n\nRetrieves the input tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput tensor or list of input tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n* <b>`RuntimeError`</b>: If called in Eager mode.\n* <b>`AttributeError`</b>: If no inbound nodes are found.\n\n<h3 id=\"input_mask\"><code>input_mask</code></h3>\n\nRetrieves the input mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput mask tensor (potentially None) or list of input\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"input_shape\"><code>input_shape</code></h3>\n\nRetrieves the input shape(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. 
if it is connected to one incoming layer, or if all inputs\nhave the same shape.\n\n#### Returns:\n\nInput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per input tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined input_shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"losses\"><code>losses</code></h3>\n\nLosses which are associated with this `Layer`.\n\nVariable regularization tensors are created when this property is accessed,\nso it is eager safe: accessing `losses` under a `tf.GradientTape` will\npropagate gradients back to the corresponding variables.\n\n#### Returns:\n\nA list of tensors.\n\n\n<h3 id=\"metrics\"><code>metrics</code></h3>\n\n\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"non_trainable_variables\"><code>non_trainable_variables</code></h3>\n\n\n\n\n<h3 id=\"non_trainable_weights\"><code>non_trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"output\"><code>output</code></h3>\n\nRetrieves the output tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one output,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput tensor or list of output tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to more than one incoming\n  layer.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_mask\"><code>output_mask</code></h3>\n\nRetrieves the output mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. 
if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput mask tensor (potentially None) or list of output\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"output_shape\"><code>output_shape</code></h3>\n\nRetrieves the output shape(s) of a layer.\n\nOnly applicable if the layer has one output,\nor if all outputs have the same shape.\n\n#### Returns:\n\nOutput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per output tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined output shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\nInteger or TensorShape: size of outputs produced by this cell.\n\n\n<h3 id=\"scope_name\"><code>scope_name</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\nSize(s) of state(s) used by this cell.\n\nIt can be represented by an Integer, a TensorShape or a tuple of Integers\nor TensorShapes.\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 id=\"trainable\"><code>trainable</code></h3>\n\n\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. 
For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"trainable_weights\"><code>trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"updates\"><code>updates</code></h3>\n\n\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nReturns the list of all layer variables/weights.\n\nAlias of `self.weights`.\n\n#### Returns:\n\nA list of variables.\n\n\n<h3 id=\"weights\"><code>weights</code></h3>\n\nReturns the list of all layer variables/weights.\n\n\n#### Returns:\n\nA list of variables.\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    state,\n    scope=None\n)\n```\n\nRun this RNN cell on inputs, starting from the given state.\n\n\n#### Args:\n\n\n* <b>`inputs`</b>: `2-D` tensor with shape `[batch_size, input_size]`.\n* <b>`state`</b>: if `self.state_size` is an integer, this should be a `2-D Tensor`\n  with shape `[batch_size, self.state_size]`.  
Otherwise, if\n  `self.state_size` is a tuple of integers, this should be a tuple with\n  shapes `[batch_size, s]` for each `s` in `self.state_size`.\n* <b>`scope`</b>: VariableScope for the created subgraph; defaults to class name.\n\n\n#### Returns:\n\nA pair containing:\n\n* Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.\n* New state: Either a single `2-D` tensor, or a tuple of tensors matching\n  the arity and shapes of `state`.\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(\n    inputs,\n    *args,\n    **kwargs\n)\n```\n\nApply the layer on an input.\n\nThis is an alias of `self.__call__`.\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor(s).\n* <b>`*args`</b>: additional positional arguments to be passed to `self.call`.\n* <b>`**kwargs`</b>: additional keyword arguments to be passed to `self.call`.\n\n\n#### Returns:\n\nOutput tensor(s).\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(input_shape)\n```\n\nCreates the variables of the layer (optional, for subclass implementers).\n\nThis is a method that implementers of subclasses of `Layer` or `Model`\ncan override if they need a state-creation step in-between\nlayer instantiation and layer call.\n\nThis is typically used to create the weights of `Layer` subclasses.\n\n#### Arguments:\n\n\n* <b>`input_shape`</b>: Instance of `TensorShape`, or list of instances of\n  `TensorShape` if the layer expects a list of inputs\n  (one instance per input).\n\n<h3 id=\"compute_mask\"><code><a name=\"compute_mask\">compute_mask</a></code></h3>\n\n``` python\ncompute_mask(\n    inputs,\n    mask=None\n)\n```\n\nComputes an output mask tensor.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor or list of tensors.\n* <b>`mask`</b>: Tensor or list of tensors.\n\n\n#### Returns:\n\nNone or a tensor (or list of tensors,\n    one per output tensor of the layer).\n\n\n<h3 id=\"compute_output_shape\"><code><a 
name=\"compute_output_shape\">compute_output_shape</a></code></h3>\n\n``` python\ncompute_output_shape(input_shape)\n```\n\nComputes the output shape of the layer.\n\nAssumes that the layer will be built\nto match the input shape provided.\n\n#### Arguments:\n\n\n* <b>`input_shape`</b>: Shape tuple (tuple of integers)\n    or list of shape tuples (one per output tensor of the layer).\n    Shape tuples can include None for free dimensions,\n    instead of an integer.\n\n\n#### Returns:\n\nAn output shape tuple.\n\n\n<h3 id=\"count_params\"><code><a name=\"count_params\">count_params</a></code></h3>\n\n``` python\ncount_params()\n```\n\nCount the total number of scalars composing the weights.\n\n\n#### Returns:\n\nAn integer count.\n\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: if the layer isn't yet built\n  (in which case its weights aren't yet defined).\n\n<h3 id=\"from_config\"><code><a name=\"from_config\">from_config</a></code></h3>\n\n``` python\n@classmethod\nfrom_config(\n    cls,\n    config\n)\n```\n\nCreates a layer from its config.\n\nThis method is the reverse of `get_config`,\ncapable of instantiating the same layer from the config\ndictionary. It does not handle layer connectivity\n(handled by Network), nor weights (handled by `set_weights`).\n\n#### Arguments:\n\n\n* <b>`config`</b>: A Python dictionary, typically the\n    output of get_config.\n\n\n#### Returns:\n\nA layer instance.\n\n\n<h3 id=\"get_config\"><code><a name=\"get_config\">get_config</a></code></h3>\n\n``` python\nget_config()\n```\n\nReturns the config of the layer.\n\nA layer config is a Python dictionary (serializable)\ncontaining the configuration of a layer.\nThe same layer can be reinstantiated later\n(without its trained weights) from this configuration.\n\nThe config of a layer does not include connectivity\ninformation, nor the layer class name. 
These are handled\nby `Network` (one layer of abstraction above).\n\n#### Returns:\n\nPython dictionary.\n\n\n<h3 id=\"get_initial_state\"><code><a name=\"get_initial_state\">get_initial_state</a></code></h3>\n\n``` python\nget_initial_state(\n    inputs=None,\n    batch_size=None,\n    dtype=None\n)\n```\n\n\n\n\n<h3 id=\"get_input_at\"><code><a name=\"get_input_at\">get_input_at</a></code></h3>\n\n``` python\nget_input_at(node_index)\n```\n\nRetrieves the input tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_input_mask_at\"><code><a name=\"get_input_mask_at\">get_input_mask_at</a></code></h3>\n\n``` python\nget_input_mask_at(node_index)\n```\n\nRetrieves the input mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple inputs).\n\n\n<h3 id=\"get_input_shape_at\"><code><a name=\"get_input_shape_at\">get_input_shape_at</a></code></h3>\n\n``` python\nget_input_shape_at(node_index)\n```\n\nRetrieves the input shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_losses_for\"><code><a name=\"get_losses_for\">get_losses_for</a></code></h3>\n\n``` python\nget_losses_for(inputs)\n```\n\nRetrieves losses relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of loss tensors of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_output_at\"><code><a name=\"get_output_at\">get_output_at</a></code></h3>\n\n``` python\nget_output_at(node_index)\n```\n\nRetrieves the output tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_output_mask_at\"><code><a name=\"get_output_mask_at\">get_output_mask_at</a></code></h3>\n\n``` python\nget_output_mask_at(node_index)\n```\n\nRetrieves the output mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple outputs).\n\n\n<h3 id=\"get_output_shape_at\"><code><a name=\"get_output_shape_at\">get_output_shape_at</a></code></h3>\n\n``` python\nget_output_shape_at(node_index)\n```\n\nRetrieves the output shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_updates_for\"><code><a name=\"get_updates_for\">get_updates_for</a></code></h3>\n\n``` python\nget_updates_for(inputs)\n```\n\nRetrieves updates relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of update ops of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_weights\"><code><a name=\"get_weights\">get_weights</a></code></h3>\n\n``` python\nget_weights()\n```\n\nReturns the current weights of the layer.\n\n\n#### Returns:\n\nWeights values as a list of numpy arrays.\n\n\n<h3 id=\"set_weights\"><code><a name=\"set_weights\">set_weights</a></code></h3>\n\n``` python\nset_weights(weights)\n```\n\nSets the weights of the layer, from Numpy arrays.\n\n\n#### Arguments:\n\n\n* <b>`weights`</b>: a list of Numpy arrays. The number\n    of arrays and their shape must match\n    number of the dimensions of the weights\n    of the layer (i.e. 
it should match the\n    output of `get_weights`).\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: If the provided weights list does not match the\n    layer's specifications.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n<h3 id=\"zero_state\"><code><a name=\"zero_state\">zero_state</a></code></h3>\n\n``` python\nzero_state(\n    batch_size,\n    dtype\n)\n```\n\nReturn zero-filled state tensor(s).\n\n\n#### Args:\n\n\n* <b>`batch_size`</b>: int, float, or unit Tensor representing the batch size.\n* <b>`dtype`</b>: the data type to use for the state.\n\n\n#### Returns:\n\nIf `state_size` is an int or TensorShape, then the return value is a\n`N-D` tensor of shape `[batch_size, state_size]` filled with zeros.\n\nIf `state_size` is a nested list or tuple, then the return value is\na nested list or tuple (of the same structure) of `2-D` tensors with\nthe shapes `[batch_size, s]` for each s in `state_size`.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf/ZoneoutWrapper.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf.ZoneoutWrapper\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n<meta itemprop=\"property\" content=\"activity_regularizer\"/>\n<meta itemprop=\"property\" content=\"dtype\"/>\n<meta itemprop=\"property\" content=\"dynamic\"/>\n<meta itemprop=\"property\" content=\"graph\"/>\n<meta itemprop=\"property\" content=\"input\"/>\n<meta itemprop=\"property\" content=\"input_mask\"/>\n<meta itemprop=\"property\" content=\"input_shape\"/>\n<meta itemprop=\"property\" content=\"losses\"/>\n<meta itemprop=\"property\" content=\"metrics\"/>\n<meta itemprop=\"property\" content=\"name\"/>\n<meta itemprop=\"property\" content=\"name_scope\"/>\n<meta itemprop=\"property\" content=\"non_trainable_variables\"/>\n<meta itemprop=\"property\" content=\"non_trainable_weights\"/>\n<meta itemprop=\"property\" content=\"output\"/>\n<meta itemprop=\"property\" content=\"output_mask\"/>\n<meta itemprop=\"property\" content=\"output_shape\"/>\n<meta itemprop=\"property\" content=\"output_size\"/>\n<meta itemprop=\"property\" content=\"scope_name\"/>\n<meta itemprop=\"property\" content=\"state_size\"/>\n<meta itemprop=\"property\" content=\"submodules\"/>\n<meta itemprop=\"property\" content=\"trainable\"/>\n<meta itemprop=\"property\" content=\"trainable_variables\"/>\n<meta itemprop=\"property\" content=\"trainable_weights\"/>\n<meta itemprop=\"property\" content=\"updates\"/>\n<meta itemprop=\"property\" content=\"variables\"/>\n<meta itemprop=\"property\" content=\"weights\"/>\n<meta itemprop=\"property\" content=\"__call__\"/>\n<meta itemprop=\"property\" content=\"__init__\"/>\n<meta itemprop=\"property\" content=\"apply\"/>\n<meta itemprop=\"property\" content=\"build\"/>\n<meta itemprop=\"property\" content=\"compute_mask\"/>\n<meta itemprop=\"property\" content=\"compute_output_shape\"/>\n<meta itemprop=\"property\" 
content=\"count_params\"/>\n<meta itemprop=\"property\" content=\"from_config\"/>\n<meta itemprop=\"property\" content=\"get_config\"/>\n<meta itemprop=\"property\" content=\"get_initial_state\"/>\n<meta itemprop=\"property\" content=\"get_input_at\"/>\n<meta itemprop=\"property\" content=\"get_input_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_input_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_losses_for\"/>\n<meta itemprop=\"property\" content=\"get_output_at\"/>\n<meta itemprop=\"property\" content=\"get_output_mask_at\"/>\n<meta itemprop=\"property\" content=\"get_output_shape_at\"/>\n<meta itemprop=\"property\" content=\"get_updates_for\"/>\n<meta itemprop=\"property\" content=\"get_weights\"/>\n<meta itemprop=\"property\" content=\"set_weights\"/>\n<meta itemprop=\"property\" content=\"with_name_scope\"/>\n<meta itemprop=\"property\" content=\"zero_state\"/>\n</div>\n\n# haste_tf.ZoneoutWrapper\n\n<!-- Insert buttons and diff -->\n\n\n## Class `ZoneoutWrapper`\n\nAn LSTM/GRU cell wrapper that applies zoneout to the inner cell's hidden state.\n\n\n\n<!-- Placeholder for \"Used in\" -->\n\nThe zoneout paper applies zoneout to both the cell state and hidden state,\neach with its own zoneout rate. 
This class (and the `LSTM` implementation in Haste)\napplies zoneout to the hidden state and not the cell state.\n\n<h2 id=\"__init__\"><code><a name=\"__init__\">__init__</a></code></h2>\n\n``` python\n__init__(\n    cell,\n    rate,\n    training\n)\n```\n\nInitialize the parameters of the zoneout wrapper.\n\n\n#### Arguments:\n\n\n* <b>`cell`</b>: RNNCell, an instance of {`BasicLSTMCell`, `LSTMCell`,\n  `LSTMBlockCell`, <a href=\"../haste_tf/GRUCell.md\"><code>haste_tf.GRUCell</code></a>} on which to apply zoneout.\n* <b>`rate`</b>: float, 0 <= rate <= 1, the fraction of hidden units to zone out per\n  time step.\n* <b>`training`</b>: bool, `True` if used during training, `False` if used during\n  inference.\n\n\n\n## Properties\n\n<h3 id=\"activity_regularizer\"><code>activity_regularizer</code></h3>\n\nOptional regularizer function for the output of this layer.\n\n\n<h3 id=\"dtype\"><code>dtype</code></h3>\n\n\n\n\n<h3 id=\"dynamic\"><code>dynamic</code></h3>\n\n\n\n\n<h3 id=\"graph\"><code>graph</code></h3>\n\nDEPRECATED FUNCTION\n\nWarning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version.\nInstructions for updating:\nStop using this property because tf.layers layers no longer track their graph.\n\n<h3 id=\"input\"><code>input</code></h3>\n\nRetrieves the input tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nInput tensor or list of input tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n* <b>`RuntimeError`</b>: If called in Eager mode.\n* <b>`AttributeError`</b>: If no inbound nodes are found.\n\n<h3 id=\"input_mask\"><code>input_mask</code></h3>\n\nRetrieves the input mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. 
if it is connected to one incoming layer.\n\n#### Returns:\n\nInput mask tensor (potentially None) or list of input\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"input_shape\"><code>input_shape</code></h3>\n\nRetrieves the input shape(s) of a layer.\n\nOnly applicable if the layer has exactly one input,\ni.e. if it is connected to one incoming layer, or if all inputs\nhave the same shape.\n\n#### Returns:\n\nInput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per input tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined input_shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"losses\"><code>losses</code></h3>\n\nLosses which are associated with this `Layer`.\n\nVariable regularization tensors are created when this property is accessed,\nso it is eager safe: accessing `losses` under a `tf.GradientTape` will\npropagate gradients back to the corresponding variables.\n\n#### Returns:\n\nA list of tensors.\n\n\n<h3 id=\"metrics\"><code>metrics</code></h3>\n\n\n\n\n<h3 id=\"name\"><code>name</code></h3>\n\nReturns the name of this module as passed or determined in the ctor.\n\nNOTE: This is not the same as the `self.name_scope.name` which includes\nparent module names.\n\n<h3 id=\"name_scope\"><code>name_scope</code></h3>\n\nReturns a `tf.name_scope` instance for this class.\n\n\n<h3 id=\"non_trainable_variables\"><code>non_trainable_variables</code></h3>\n\n\n\n\n<h3 id=\"non_trainable_weights\"><code>non_trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"output\"><code>output</code></h3>\n\nRetrieves the output tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one output,\ni.e. 
if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput tensor or list of output tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to more than one incoming\n  layer.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_mask\"><code>output_mask</code></h3>\n\nRetrieves the output mask tensor(s) of a layer.\n\nOnly applicable if the layer has exactly one inbound node,\ni.e. if it is connected to one incoming layer.\n\n#### Returns:\n\nOutput mask tensor (potentially None) or list of output\nmask tensors.\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer is connected to\nmore than one incoming layer.\n\n<h3 id=\"output_shape\"><code>output_shape</code></h3>\n\nRetrieves the output shape(s) of a layer.\n\nOnly applicable if the layer has one output,\nor if all outputs have the same shape.\n\n#### Returns:\n\nOutput shape, as an integer shape tuple\n(or list of shape tuples, one tuple per output tensor).\n\n\n\n#### Raises:\n\n\n* <b>`AttributeError`</b>: if the layer has no defined output shape.\n* <b>`RuntimeError`</b>: if called in Eager mode.\n\n<h3 id=\"output_size\"><code>output_size</code></h3>\n\nInteger or TensorShape: size of outputs produced by this cell.\n\n\n<h3 id=\"scope_name\"><code>scope_name</code></h3>\n\n\n\n\n<h3 id=\"state_size\"><code>state_size</code></h3>\n\nSize(s) of state(s) used by this cell.\n\nIt can be represented by an Integer, a TensorShape or a tuple of Integers\nor TensorShapes.\n\n<h3 id=\"submodules\"><code>submodules</code></h3>\n\nSequence of all sub-modules.\n\nSubmodules are modules which are properties of this module, or found as\nproperties of modules which are properties of this module (and so on).\n\n```\na = tf.Module()\nb = tf.Module()\nc = tf.Module()\na.b = b\nb.c = c\nassert list(a.submodules) == [b, c]\nassert list(b.submodules) == [c]\nassert list(c.submodules) == []\n```\n\n#### Returns:\n\nA sequence of all submodules.\n\n\n<h3 
id=\"trainable\"><code>trainable</code></h3>\n\n\n\n\n<h3 id=\"trainable_variables\"><code>trainable_variables</code></h3>\n\nSequence of variables owned by this module and its submodules.\n\nNote: this method uses reflection to find variables on the current instance\nand submodules. For performance reasons you may wish to cache the result\nof calling this method if you don't expect the return value to change.\n\n#### Returns:\n\nA sequence of variables for the current module (sorted by attribute\nname) followed by variables from all submodules recursively (breadth\nfirst).\n\n\n<h3 id=\"trainable_weights\"><code>trainable_weights</code></h3>\n\n\n\n\n<h3 id=\"updates\"><code>updates</code></h3>\n\n\n\n\n<h3 id=\"variables\"><code>variables</code></h3>\n\nReturns the list of all layer variables/weights.\n\nAlias of `self.weights`.\n\n#### Returns:\n\nA list of variables.\n\n\n<h3 id=\"weights\"><code>weights</code></h3>\n\nReturns the list of all layer variables/weights.\n\n\n#### Returns:\n\nA list of variables.\n\n\n\n\n## Methods\n\n<h3 id=\"__call__\"><code><a name=\"__call__\">__call__</a></code></h3>\n\n``` python\n__call__(\n    inputs,\n    state,\n    scope=None\n)\n```\n\nRuns one step of the RNN cell with zoneout applied.\n\n\n#### Arguments:\n\nSee the documentation for the inner cell.\n\n\n<h3 id=\"apply\"><code><a name=\"apply\">apply</a></code></h3>\n\n``` python\napply(\n    inputs,\n    *args,\n    **kwargs\n)\n```\n\nApply the layer on an input.\n\nThis is an alias of `self.__call__`.\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor(s).\n* <b>`*args`</b>: additional positional arguments to be passed to `self.call`.\n* <b>`**kwargs`</b>: additional keyword arguments to be passed to `self.call`.\n\n\n#### Returns:\n\nOutput tensor(s).\n\n\n<h3 id=\"build\"><code><a name=\"build\">build</a></code></h3>\n\n``` python\nbuild(_)\n```\n\nCreates the variables of the layer (optional, for subclass implementers).\n\nThis is a method that implementers of 
subclasses of `Layer` or `Model`\ncan override if they need a state-creation step in-between\nlayer instantiation and layer call.\n\nThis is typically used to create the weights of `Layer` subclasses.\n\n#### Arguments:\n\n\n* <b>`input_shape`</b>: Instance of `TensorShape`, or list of instances of\n  `TensorShape` if the layer expects a list of inputs\n  (one instance per input).\n\n<h3 id=\"compute_mask\"><code><a name=\"compute_mask\">compute_mask</a></code></h3>\n\n``` python\ncompute_mask(\n    inputs,\n    mask=None\n)\n```\n\nComputes an output mask tensor.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Tensor or list of tensors.\n* <b>`mask`</b>: Tensor or list of tensors.\n\n\n#### Returns:\n\nNone or a tensor (or list of tensors,\n    one per output tensor of the layer).\n\n\n<h3 id=\"compute_output_shape\"><code><a name=\"compute_output_shape\">compute_output_shape</a></code></h3>\n\n``` python\ncompute_output_shape(input_shape)\n```\n\nComputes the output shape of the layer.\n\nAssumes that the layer will be built\nto match the input shape provided.\n\n#### Arguments:\n\n\n* <b>`input_shape`</b>: Shape tuple (tuple of integers)\n    or list of shape tuples (one per output tensor of the layer).\n    Shape tuples can include None for free dimensions,\n    instead of an integer.\n\n\n#### Returns:\n\nAn output shape tuple.\n\n\n<h3 id=\"count_params\"><code><a name=\"count_params\">count_params</a></code></h3>\n\n``` python\ncount_params()\n```\n\nCount the total number of scalars composing the weights.\n\n\n#### Returns:\n\nAn integer count.\n\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: if the layer isn't yet built\n  (in which case its weights aren't yet defined).\n\n<h3 id=\"from_config\"><code><a name=\"from_config\">from_config</a></code></h3>\n\n``` python\n@classmethod\nfrom_config(\n    cls,\n    config\n)\n```\n\nCreates a layer from its config.\n\nThis method is the reverse of `get_config`,\ncapable of instantiating the same layer from the 
config\ndictionary. It does not handle layer connectivity\n(handled by Network), nor weights (handled by `set_weights`).\n\n#### Arguments:\n\n\n* <b>`config`</b>: A Python dictionary, typically the\n    output of get_config.\n\n\n#### Returns:\n\nA layer instance.\n\n\n<h3 id=\"get_config\"><code><a name=\"get_config\">get_config</a></code></h3>\n\n``` python\nget_config()\n```\n\nReturns the config of the layer.\n\nA layer config is a Python dictionary (serializable)\ncontaining the configuration of a layer.\nThe same layer can be reinstantiated later\n(without its trained weights) from this configuration.\n\nThe config of a layer does not include connectivity\ninformation, nor the layer class name. These are handled\nby `Network` (one layer of abstraction above).\n\n#### Returns:\n\nPython dictionary.\n\n\n<h3 id=\"get_initial_state\"><code><a name=\"get_initial_state\">get_initial_state</a></code></h3>\n\n``` python\nget_initial_state(\n    inputs=None,\n    batch_size=None,\n    dtype=None\n)\n```\n\n\n\n\n<h3 id=\"get_input_at\"><code><a name=\"get_input_at\">get_input_at</a></code></h3>\n\n``` python\nget_input_at(node_index)\n```\n\nRetrieves the input tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_input_mask_at\"><code><a name=\"get_input_mask_at\">get_input_mask_at</a></code></h3>\n\n``` python\nget_input_mask_at(node_index)\n```\n\nRetrieves the input mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple inputs).\n\n\n<h3 id=\"get_input_shape_at\"><code><a name=\"get_input_shape_at\">get_input_shape_at</a></code></h3>\n\n``` python\nget_input_shape_at(node_index)\n```\n\nRetrieves the input shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple inputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_losses_for\"><code><a name=\"get_losses_for\">get_losses_for</a></code></h3>\n\n``` python\nget_losses_for(inputs)\n```\n\nRetrieves losses relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of loss tensors of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_output_at\"><code><a name=\"get_output_at\">get_output_at</a></code></h3>\n\n``` python\nget_output_at(node_index)\n```\n\nRetrieves the output tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA tensor (or list of tensors if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_output_mask_at\"><code><a name=\"get_output_mask_at\">get_output_mask_at</a></code></h3>\n\n``` python\nget_output_mask_at(node_index)\n```\n\nRetrieves the output mask tensor(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. `node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA mask tensor\n(or list of tensors if the layer has multiple outputs).\n\n\n<h3 id=\"get_output_shape_at\"><code><a name=\"get_output_shape_at\">get_output_shape_at</a></code></h3>\n\n``` python\nget_output_shape_at(node_index)\n```\n\nRetrieves the output shape(s) of a layer at a given node.\n\n\n#### Arguments:\n\n\n* <b>`node_index`</b>: Integer, index of the node\n    from which to retrieve the attribute.\n    E.g. 
`node_index=0` will correspond to the\n    first time the layer was called.\n\n\n#### Returns:\n\nA shape tuple\n(or list of shape tuples if the layer has multiple outputs).\n\n\n\n#### Raises:\n\n\n* <b>`RuntimeError`</b>: If called in Eager mode.\n\n<h3 id=\"get_updates_for\"><code><a name=\"get_updates_for\">get_updates_for</a></code></h3>\n\n``` python\nget_updates_for(inputs)\n```\n\nRetrieves updates relevant to a specific set of inputs.\n\n\n#### Arguments:\n\n\n* <b>`inputs`</b>: Input tensor or list/tuple of input tensors.\n\n\n#### Returns:\n\nList of update ops of the layer that depend on `inputs`.\n\n\n<h3 id=\"get_weights\"><code><a name=\"get_weights\">get_weights</a></code></h3>\n\n``` python\nget_weights()\n```\n\nReturns the current weights of the layer.\n\n\n#### Returns:\n\nWeights values as a list of numpy arrays.\n\n\n<h3 id=\"set_weights\"><code><a name=\"set_weights\">set_weights</a></code></h3>\n\n``` python\nset_weights(weights)\n```\n\nSets the weights of the layer, from Numpy arrays.\n\n\n#### Arguments:\n\n\n* <b>`weights`</b>: a list of Numpy arrays. The number\n    of arrays and their shape must match\n    number of the dimensions of the weights\n    of the layer (i.e. 
it should match the\n    output of `get_weights`).\n\n\n#### Raises:\n\n\n* <b>`ValueError`</b>: If the provided weights list does not match the\n    layer's specifications.\n\n<h3 id=\"with_name_scope\"><code><a name=\"with_name_scope\">with_name_scope</a></code></h3>\n\n``` python\n@classmethod\nwith_name_scope(\n    cls,\n    method\n)\n```\n\nDecorator to automatically enter the module name scope.\n\n```\nclass MyModule(tf.Module):\n  @tf.Module.with_name_scope\n  def __call__(self, x):\n    if not hasattr(self, 'w'):\n      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))\n    return tf.matmul(x, self.w)\n```\n\nUsing the above module would produce `tf.Variable`s and `tf.Tensor`s whose\nnames included the module name:\n\n```\nmod = MyModule()\nmod(tf.ones([8, 32]))\n# ==> <tf.Tensor: ...>\nmod.w\n# ==> <tf.Variable ...'my_module/w:0'>\n```\n\n#### Args:\n\n\n* <b>`method`</b>: The method to wrap.\n\n\n#### Returns:\n\nThe original method wrapped such that it enters the module's name scope.\n\n\n<h3 id=\"zero_state\"><code><a name=\"zero_state\">zero_state</a></code></h3>\n\n``` python\nzero_state(\n    batch_size,\n    dtype\n)\n```\n\nReturn zero-filled state tensor(s).\n\n\n#### Args:\n\n\n* <b>`batch_size`</b>: int, float, or unit Tensor representing the batch size.\n* <b>`dtype`</b>: the data type to use for the state.\n\n\n#### Returns:\n\nIf `state_size` is an int or TensorShape, then the return value is a\n`N-D` tensor of shape `[batch_size, state_size]` filled with zeros.\n\nIf `state_size` is a nested list or tuple, then the return value is\na nested list or tuple (of the same structure) of `2-D` tensors with\nthe shapes `[batch_size, s]` for each s in `state_size`.\n\n\n\n\n"
  },
  {
    "path": "docs/tf/haste_tf.md",
    "content": "<div itemscope itemtype=\"http://developers.google.com/ReferenceObject\">\n<meta itemprop=\"name\" content=\"haste_tf\" />\n<meta itemprop=\"path\" content=\"Stable\" />\n</div>\n\n# Module: haste_tf\n\n\n\nHaste: a fast, simple, and open RNN library.\n\n\n\n## Classes\n\n[`class GRU`](./haste_tf/GRU.md): Gated Recurrent Unit layer.\n\n[`class GRUCell`](./haste_tf/GRUCell.md): A GRU cell that's compatible with the Haste GRU layer.\n\n[`class IndRNN`](./haste_tf/IndRNN.md): Independently Recurrent Neural Network layer.\n\n[`class LSTM`](./haste_tf/LSTM.md): Long Short-Term Memory layer.\n\n[`class LayerNorm`](./haste_tf/LayerNorm.md): Layer normalization layer.\n\n[`class LayerNormGRU`](./haste_tf/LayerNormGRU.md): Layer Normalized Gated Recurrent Unit layer.\n\n[`class LayerNormGRUCell`](./haste_tf/LayerNormGRUCell.md): A GRU cell that's compatible with the Haste LayerNormGRU layer.\n\n[`class LayerNormLSTM`](./haste_tf/LayerNormLSTM.md): Layer Normalized Long Short-Term Memory layer.\n\n[`class LayerNormLSTMCell`](./haste_tf/LayerNormLSTMCell.md): An LSTM cell that's compatible with the Haste LayerNormLSTM layer.\n\n[`class ZoneoutWrapper`](./haste_tf/ZoneoutWrapper.md): An LSTM/GRU cell wrapper that applies zoneout to the inner cell's hidden state.\n\n"
  },
  {
    "path": "examples/device_ptr.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cassert>\n#include <cuda.h>\n#include <cuda_runtime_api.h>\n\ntemplate<typename T>\nstruct device_ptr {\n  static constexpr size_t ElemSize = sizeof(typename T::Scalar);\n\n  static device_ptr<T> NewByteSized(size_t bytes) {\n    return device_ptr<T>((bytes + ElemSize - 1) / ElemSize);\n  }\n\n  explicit device_ptr(size_t size_)\n      : data(nullptr), size(size_) {\n    void* tmp;\n    cudaMalloc(&tmp, size * ElemSize);\n    data = static_cast<typename T::Scalar*>(tmp);\n  }\n\n  explicit device_ptr(const T& elem)\n      : data(nullptr), size(elem.size()) {\n    void* tmp;\n    cudaMalloc(&tmp, size * ElemSize);\n    data = static_cast<typename T::Scalar*>(tmp);\n    ToDevice(elem);\n  }\n\n  device_ptr(device_ptr<T>&& other) : data(other.data), size(other.size) {\n    other.data = nullptr;\n    other.size = 0;\n  }\n\n  device_ptr& operator=(const device_ptr<T>&& other) {\n    if (&other != this) {\n      data = other.data;\n      size = other.size;\n      other.data = nullptr;\n      other.size = 0;\n    }\n    return *this;\n  }\n\n  device_ptr(const device_ptr<T>& other) = delete;\n  device_ptr& operator=(const device_ptr<T>& other) = delete;\n\n  void ToDevice(const T& src) {\n    assert(size == src.size());\n    
cudaMemcpy(data, src.data(), src.size() * ElemSize, cudaMemcpyHostToDevice);\n  }\n\n  void ToHost(T& target) const {\n    assert(size == target.size());\n    cudaMemcpy(target.data(), data, target.size() * ElemSize, cudaMemcpyDeviceToHost);\n  }\n\n  size_t Size() const {\n    return size;\n  }\n\n  void zero() {\n    cudaMemset(data, 0, size * ElemSize);\n  }\n\n  ~device_ptr() {\n    cudaFree(data);\n  }\n\n  typename T::Scalar* data;\n  size_t size;\n};\n"
  },
  {
    "path": "examples/gru.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <Eigen/Dense>\n#include <cassert>\n#include <cmath>\n#include <cstdio>\n#include <cstdlib>\n#include <ctime>\n#include <cuda.h>\n#include <cuda_runtime_api.h>\n#include <iostream>\n#include <string>\n#include <unsupported/Eigen/CXX11/Tensor>\n#include <vector>\n\n#include \"device_ptr.h\"\n#include \"haste.h\"\n\nusing haste::v0::gru::ForwardPass;\nusing haste::v0::gru::BackwardPass;\nusing std::string;\n\nusing Tensor1 = Eigen::Tensor<float, 1>;\nusing Tensor2 = Eigen::Tensor<float, 2>;\nusing Tensor3 = Eigen::Tensor<float, 3>;\n\nconstexpr int BATCH_SIZE = 64;\nconstexpr int SEQUENCE_LEN = 1000;\nconstexpr int HIDDEN_DIMS = 512;\nconstexpr int INPUT_DIMS = 512;\n\nstatic cublasHandle_t g_blas_handle;\n\nclass ScopeTimer {\n  public:\n    ScopeTimer(const string& msg) : msg_(msg) {\n      cudaEventCreate(&start_);\n      cudaEventCreate(&stop_);\n      cudaDeviceSynchronize();\n      cudaEventRecord(start_);\n    }\n\n    ~ScopeTimer() {\n      float elapsed_ms;\n      cudaEventRecord(stop_);\n      cudaEventSynchronize(stop_);\n      cudaEventElapsedTime(&elapsed_ms, start_, stop_);\n      printf(\"%s %fms\\n\", msg_.c_str(), elapsed_ms);\n      cudaEventDestroy(start_);\n      cudaEventDestroy(stop_);\n    }\n\n  private:\n    string msg_;\n  
  cudaEvent_t start_, stop_;\n};\n\nvoid GruInference(\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& bx,\n    const Tensor1& br,\n    const Tensor3& x) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> bx_dev(bx);\n  device_ptr<Tensor1> br_dev(br);\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor2> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> tmp_Wx_dev(time_steps * batch_size * hidden_size * 3);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 3);\n\n  h_dev.zero();\n\n  ScopeTimer t(\"Inference:\");\n\n  ForwardPass<float> forward = ForwardPass<float>(\n      false,  // training\n      batch_size,\n      input_size,\n      hidden_size,\n      g_blas_handle);\n\n  forward.Run(\n      time_steps,\n      W_dev.data,\n      R_dev.data,\n      bx_dev.data,\n      br_dev.data,\n      x_dev.data,\n      h_dev.data,\n      nullptr,\n      tmp_Wx_dev.data,\n      tmp_Rh_dev.data,\n      0.0f,\n      nullptr);\n}\n\nvoid GruTrain(\n    const Tensor2& W,\n    const Tensor2& R,\n    const Tensor1& bx,\n    const Tensor1& br,\n    const Tensor3& x,\n    const Tensor3& dh_new) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> bx_dev(bx);\n  device_ptr<Tensor1> br_dev(br);\n  device_ptr<Tensor3> x_dev(x);\n  device_ptr<Tensor3> dh_new_dev(dh_new);\n\n  device_ptr<Tensor2> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> tmp_Wx_dev(time_steps * batch_size * hidden_size * 3);\n  device_ptr<Tensor2> 
tmp_Rh_dev(batch_size * hidden_size * 3);\n  device_ptr<Tensor3> v_dev(time_steps * batch_size * hidden_size * 4);\n\n  h_dev.zero();\n\n  {\n    ScopeTimer t(\"Train forward:\");\n    ForwardPass<float> forward = ForwardPass<float>(\n        true,  // training\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    forward.Run(\n        time_steps,\n        W_dev.data,\n        R_dev.data,\n        bx_dev.data,\n        br_dev.data,\n        x_dev.data,\n        h_dev.data,\n        v_dev.data,\n        tmp_Wx_dev.data,\n        tmp_Rh_dev.data,\n        0.0f,\n        nullptr);\n  }\n\n  device_ptr<Tensor3> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dW_dev(input_size * hidden_size * 3);\n  device_ptr<Tensor2> dR_dev(hidden_size * hidden_size * 3);\n  device_ptr<Tensor1> dbx_dev(hidden_size * 3);\n  device_ptr<Tensor1> dbr_dev(hidden_size * 3);\n  device_ptr<Tensor2> dh_dev(batch_size * hidden_size);\n  device_ptr<Tensor3> dp_dev(time_steps * batch_size * hidden_size * 3);\n  device_ptr<Tensor3> dq_dev(time_steps * batch_size * hidden_size * 3);\n\n  {\n    ScopeTimer t(\"Train backward:\");\n    BackwardPass<float> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    backward.Run(\n        time_steps,\n        W_dev.data,\n        R_dev.data,\n        bx_dev.data,\n        br_dev.data,\n        x_dev.data,\n        h_dev.data,\n        v_dev.data,\n        dh_new_dev.data,\n        dx_dev.data,\n        dW_dev.data,\n        dR_dev.data,\n        dbx_dev.data,\n        dbr_dev.data,\n        dh_dev.data,\n        dp_dev.data,\n        dq_dev.data,\n        nullptr);\n  }\n}\n\nint main() {\n  srand(time(0));\n\n  cublasCreate(&g_blas_handle);\n\n  // Weights.\n  Tensor2 W(HIDDEN_DIMS * 3, INPUT_DIMS);\n  Tensor2 R(HIDDEN_DIMS * 3, HIDDEN_DIMS);\n  Tensor1 bx(HIDDEN_DIMS * 3);\n  Tensor1 br(HIDDEN_DIMS * 3);\n\n  // Input.\n  Tensor3 x(INPUT_DIMS, 
BATCH_SIZE, SEQUENCE_LEN);\n\n  // Gradients from upstream layers.\n  Tensor3 dh(HIDDEN_DIMS, BATCH_SIZE, SEQUENCE_LEN + 1);\n\n  W.setRandom();\n  R.setRandom();\n  bx.setRandom();\n  br.setRandom();\n  x.setRandom();\n  dh.setRandom();\n\n  GruInference(W, R, bx, br, x);\n  GruTrain(W, R, bx, br, x, dh);\n\n  cublasDestroy(g_blas_handle);\n\n  return 0;\n}\n"
  },
  {
    "path": "examples/lstm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <Eigen/Dense>\n#include <cassert>\n#include <cmath>\n#include <cstdio>\n#include <cstdlib>\n#include <ctime>\n#include <cuda.h>\n#include <cuda_runtime_api.h>\n#include <iostream>\n#include <string>\n#include <unsupported/Eigen/CXX11/Tensor>\n#include <vector>\n\n#include \"device_ptr.h\"\n#include \"haste.h\"\n\nusing haste::v0::lstm::BackwardPass;\nusing haste::v0::lstm::ForwardPass;\nusing std::string;\n\nusing Tensor1 = Eigen::Tensor<float, 1>;\nusing Tensor2 = Eigen::Tensor<float, 2>;\nusing Tensor3 = Eigen::Tensor<float, 3>;\n\nconstexpr int BATCH_SIZE = 64;\nconstexpr int SEQUENCE_LEN = 1000;\nconstexpr int HIDDEN_DIMS = 512;\nconstexpr int INPUT_DIMS = 512;\n\nstatic cublasHandle_t g_blas_handle;\n\nclass ScopeTimer {\n  public:\n    ScopeTimer(const string& msg) : msg_(msg) {\n      cudaEventCreate(&start_);\n      cudaEventCreate(&stop_);\n      cudaDeviceSynchronize();\n      cudaEventRecord(start_);\n    }\n\n    ~ScopeTimer() {\n      float elapsed_ms;\n      cudaEventRecord(stop_);\n      cudaEventSynchronize(stop_);\n      cudaEventElapsedTime(&elapsed_ms, start_, stop_);\n      printf(\"%s %.1fms\\n\", msg_.c_str(), elapsed_ms);\n      cudaEventDestroy(start_);\n      cudaEventDestroy(stop_);\n    }\n\n  private:\n    string 
msg_;\n    cudaEvent_t start_, stop_;\n};\n\nvoid LstmInference(const Tensor2& W, const Tensor2& R, const Tensor1& b, const Tensor3& x) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> b_dev(b);\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor2> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor2> c_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> v_dev(time_steps * batch_size * hidden_size * 4);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 4);\n\n  h_dev.zero();\n  c_dev.zero();\n\n  ScopeTimer t(\"Inference:\");\n\n  ForwardPass<float> forward(\n      false,  // training\n      batch_size,\n      input_size,\n      hidden_size,\n      g_blas_handle);\n\n  forward.Run(\n      time_steps,\n      W_dev.data,\n      R_dev.data,\n      b_dev.data,\n      x_dev.data,\n      h_dev.data,\n      c_dev.data,\n      v_dev.data,\n      tmp_Rh_dev.data,\n      0.0f,      // zoneout prob\n      nullptr);  // zoneout mask\n}\n\nvoid LstmTrain(const Tensor2& W, const Tensor2& R, const Tensor1& b, const Tensor3& x,\n               const Tensor3& dh, const Tensor3& dc) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> b_dev(b);\n  device_ptr<Tensor3> x_dev(x);\n\n  // This is nearly the same as the inference code except we have an extra dimension\n  // for h and c. 
 We'll store those outputs of the cell for all time steps and use\n  // them during the backward pass below.\n  device_ptr<Tensor3> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> c_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> v_dev(batch_size * hidden_size * 4 * time_steps);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 4);\n\n  h_dev.zero();\n  c_dev.zero();\n\n  {\n    ScopeTimer t(\"Train forward:\");\n    ForwardPass<float> forward(\n        true,  // training\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    forward.Run(\n        time_steps,\n        W_dev.data,\n        R_dev.data,\n        b_dev.data,\n        x_dev.data,\n        h_dev.data,\n        c_dev.data,\n        v_dev.data,\n        tmp_Rh_dev.data,\n        0.0f,      // zoneout prob\n        nullptr);  // zoneout mask\n  }\n\n  Eigen::array<int, 3> transpose_x({ 1, 2, 0 });\n  Tensor3 x_t = x.shuffle(transpose_x);\n\n  Eigen::array<int, 2> transpose({ 1, 0 });\n  Tensor2 W_t = W.shuffle(transpose);\n  Tensor2 R_t = R.shuffle(transpose);\n\n  device_ptr<Tensor3> x_t_dev(x_t);\n  device_ptr<Tensor2> W_t_dev(W_t);\n  device_ptr<Tensor2> R_t_dev(R_t);\n\n  // These gradients would normally come \"from above\" in a larger network;\n  // here we just copy in the random-valued dh/dc tensors from the caller.\n  device_ptr<Tensor3> dh_new_dev(dh);\n  device_ptr<Tensor3> dc_new_dev(dc);\n\n  device_ptr<Tensor3> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dW_dev(input_size * hidden_size * 4);\n  device_ptr<Tensor2> dR_dev(hidden_size * hidden_size * 4);\n  device_ptr<Tensor2> db_dev(hidden_size * 4);\n  device_ptr<Tensor2> dh_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dc_dev(batch_size * hidden_size);\n\n  dW_dev.zero();\n  dR_dev.zero();\n  db_dev.zero();\n  dh_dev.zero();\n  dc_dev.zero();\n\n  {\n    ScopeTimer t(\"Train backward:\");\n    BackwardPass<float> 
backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    backward.Run(\n        time_steps,\n        W_t_dev.data,\n        R_t_dev.data,\n        b_dev.data,\n        x_t_dev.data,\n        h_dev.data,\n        c_dev.data,\n        dh_new_dev.data,\n        dc_new_dev.data,\n        dx_dev.data,\n        dW_dev.data,\n        dR_dev.data,\n        db_dev.data,\n        dh_dev.data,\n        dc_dev.data,\n        v_dev.data,\n        nullptr);\n  }\n}\n\nvoid LstmTrainIterative(const Tensor2& W, const Tensor2& R, const Tensor1& b, const Tensor3& x,\n                        const Tensor3& dh, const Tensor3& dc) {\n  const int time_steps = x.dimension(2);\n  const int batch_size = x.dimension(1);\n  const int input_size = x.dimension(0);\n  const int hidden_size = R.dimension(1);\n\n  // Copy weights over to GPU.\n  device_ptr<Tensor2> W_dev(W);\n  device_ptr<Tensor2> R_dev(R);\n  device_ptr<Tensor1> b_dev(b);\n  device_ptr<Tensor3> x_dev(x);\n\n  device_ptr<Tensor3> h_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> c_dev((time_steps + 1) * batch_size * hidden_size);\n  device_ptr<Tensor3> v_dev(time_steps * batch_size * hidden_size * 4);\n  device_ptr<Tensor2> tmp_Rh_dev(batch_size * hidden_size * 4);\n\n  h_dev.zero();\n  c_dev.zero();\n\n  {\n    ScopeTimer t(\"Train forward (iterative):\");\n    ForwardPass<float> forward(\n        true,  // training\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    const int NC = batch_size * input_size;\n    const int NH = batch_size * hidden_size;\n    for (int t = 0; t < time_steps; ++t) {\n      forward.Iterate(\n          0,\n          W_dev.data,\n          R_dev.data,\n          b_dev.data,\n          x_dev.data + t * NC,\n          h_dev.data + t * NH,\n          c_dev.data + t * NH,\n          h_dev.data + (t + 1) * NH,\n          c_dev.data + (t + 1) * NH,\n          v_dev.data + t * NH * 4,\n         
 tmp_Rh_dev.data,\n          0.0f,      // zoneout prob\n          nullptr);  // zoneout mask\n    }\n  }\n\n  Eigen::array<int, 3> transpose_x({ 1, 2, 0 });\n  Tensor3 x_t = x.shuffle(transpose_x);\n\n  Eigen::array<int, 2> transpose({ 1, 0 });\n  Tensor2 W_t = W.shuffle(transpose);\n  Tensor2 R_t = R.shuffle(transpose);\n\n  device_ptr<Tensor3> x_t_dev(x_t);\n  device_ptr<Tensor2> W_t_dev(W_t);\n  device_ptr<Tensor2> R_t_dev(R_t);\n\n  // These gradients would normally come \"from above\" in a larger network;\n  // here we just copy in the random-valued dh/dc tensors from the caller.\n  device_ptr<Tensor3> dh_new_dev(dh);\n  device_ptr<Tensor3> dc_new_dev(dc);\n\n  device_ptr<Tensor3> dx_dev(time_steps * batch_size * input_size);\n  device_ptr<Tensor2> dW_dev(input_size * hidden_size * 4);\n  device_ptr<Tensor2> dR_dev(hidden_size * hidden_size * 4);\n  device_ptr<Tensor2> db_dev(hidden_size * 4);\n  device_ptr<Tensor2> dh_dev(batch_size * hidden_size);\n  device_ptr<Tensor2> dc_dev(batch_size * hidden_size);\n\n  dW_dev.zero();\n  dR_dev.zero();\n  db_dev.zero();\n  dh_dev.zero();\n  dc_dev.zero();\n\n  {\n    ScopeTimer t(\"Train backward (iterative):\");\n    BackwardPass<float> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        g_blas_handle);\n\n    const int NC = batch_size * input_size;\n    const int NH = batch_size * hidden_size;\n    for (int t = time_steps - 1; t >= 0; --t) {\n      backward.Iterate(\n          0,\n          W_t_dev.data,\n          R_t_dev.data,\n          b_dev.data,\n          x_t_dev.data + t * NC,\n          h_dev.data + t * NH,\n          c_dev.data + t * NH,\n          c_dev.data + (t + 1) * NH,\n          dh_new_dev.data + t * NH,\n          dc_new_dev.data + t * NH,\n          dx_dev.data + t * NC,\n          dW_dev.data,\n          dR_dev.data,\n          db_dev.data,\n          dh_dev.data,\n          dc_dev.data,\n          v_dev.data + t * NH * 4,\n          nullptr);\n    }\n  }\n}\n\nint main() {\n  
srand(time(0));\n\n  cublasCreate(&g_blas_handle);\n\n  // Weights.\n  // W: input weight matrix\n  // R: recurrent weight matrix\n  // b: bias\n  Tensor2 W(HIDDEN_DIMS * 4, INPUT_DIMS);\n  Tensor2 R(HIDDEN_DIMS * 4, HIDDEN_DIMS);\n  Tensor1 b(HIDDEN_DIMS * 4);\n\n  // Input.\n  Tensor3 x(INPUT_DIMS, BATCH_SIZE, SEQUENCE_LEN);\n\n  // Gradients from upstream layers.\n  Tensor3 dh(HIDDEN_DIMS, BATCH_SIZE, SEQUENCE_LEN + 1);\n  Tensor3 dc(HIDDEN_DIMS, BATCH_SIZE, SEQUENCE_LEN + 1);\n\n  W.setRandom();\n  R.setRandom();\n  b.setRandom();\n  x.setRandom();\n  dh.setRandom();\n  dc.setRandom();\n\n  LstmInference(W, R, b, x);\n  LstmTrain(W, R, b, x, dh, dc);\n  LstmTrainIterative(W, R, b, x, dh, dc);\n\n  cublasDestroy(g_blas_handle);\n\n  return 0;\n}\n"
  },
  {
    "path": "frameworks/pytorch/__init__.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"\nHaste: a fast, simple, and open RNN library.\n\"\"\"\n\nimport torch as _\n\nfrom ._version import __version__  # generated in setup.py\nfrom .gru import GRU\nfrom .indrnn import IndRNN\nfrom .lstm import LSTM\nfrom .layer_norm_gru import LayerNormGRU\nfrom .layer_norm_indrnn import LayerNormIndRNN\nfrom .layer_norm_lstm import LayerNormLSTM\n\n__all__ = [\n    'GRU',\n    'IndRNN',\n    'LSTM',\n    'LayerNormGRU',\n    'LayerNormIndRNN',\n    'LayerNormLSTM'\n]\n"
  },
  {
    "path": "frameworks/pytorch/base_rnn.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport torch\nimport torch.nn as nn\n\n\n__all__ = [\n    'BaseRNN'\n]\n\n\nclass BaseRNN(nn.Module):\n  def __init__(\n      self,\n      input_size,\n      hidden_size,\n      batch_first,\n      zoneout,\n      return_state_sequence):\n    super().__init__()\n    self.input_size = input_size\n    self.hidden_size = hidden_size\n    self.batch_first = batch_first\n    self.zoneout = zoneout\n    self.return_state_sequence = return_state_sequence\n\n  def _permute(self, x):\n    if self.batch_first:\n      return x.permute(1, 0, 2)\n    return x\n\n  def _get_state(self, input, state, state_shape):\n    if state is None:\n      state = _zero_state(input, state_shape)\n    else:\n      _validate_state(state, state_shape)\n    return state\n\n  def _get_final_state(self, state, lengths):\n    if isinstance(state, tuple):\n      return tuple(self._get_final_state(s, lengths) for s in state)\n    if isinstance(state, list):\n      return [self._get_final_state(s, lengths) for s in state]\n    if self.return_state_sequence:\n      return self._permute(state[1:]).unsqueeze(0)\n    if lengths is not None:\n      cols = range(state.size(1))\n      return state[[lengths, cols]].unsqueeze(0)\n    return state[-1].unsqueeze(0)\n\n  def _get_zoneout_mask(self, input):\n    if 
self.zoneout:\n      zoneout_mask = input.new_empty(input.shape[0], input.shape[1], self.hidden_size)\n      zoneout_mask.bernoulli_(1.0 - self.zoneout)\n    else:\n      zoneout_mask = input.new_empty(0, 0, 0)\n    return zoneout_mask\n\n  def _is_cuda(self):\n    is_cuda = [tensor.is_cuda for tensor in list(self.parameters())]\n    if any(is_cuda) and not all(is_cuda):\n      raise ValueError('RNN tensors should all be CUDA tensors or none should be CUDA tensors')\n    return any(is_cuda)\n\n\ndef _validate_state(state, state_shape):\n  \"\"\"\n  Checks to make sure that `state` has the same nested structure and dimensions\n  as `state_shape`. `None` values in `state_shape` are a wildcard and are not\n  checked.\n\n  Arguments:\n    state: a nested structure of Tensors.\n    state_shape: a nested structure of integers or None.\n\n  Raises:\n    ValueError: if the structure and/or shapes don't match.\n  \"\"\"\n  if isinstance(state, (tuple, list)):\n    # Handle nested structure.\n    if not isinstance(state_shape, (tuple, list)):\n      raise ValueError('RNN state has invalid structure; expected {}'.format(state_shape))\n    for s, ss in zip(state, state_shape):\n      _validate_state(s, ss)\n  else:\n    shape = list(state.size())\n    if len(shape) != len(state_shape):\n      raise ValueError('RNN state dimension mismatch; expected {} got {}'.format(len(state_shape), len(shape)))\n\n    for i, (d1, d2) in enumerate(zip(list(state.size()), state_shape)):\n      if d2 is not None and d1 != d2:\n        raise ValueError('RNN state size mismatch on dim {}; expected {} got {}'.format(i, d2, d1))\n\n\ndef _zero_state(input, state_shape):\n  \"\"\"\n  Returns a nested structure of zero Tensors with the same structure and shape\n  as `state_shape`. 
 The returned Tensors will have the same dtype and be on the\n  same device as `input`.\n\n  Arguments:\n    input: Tensor, to specify the device and dtype of the returned tensors.\n    state_shape: nested structure of integers.\n\n  Returns:\n    zero_state: a nested structure of zero Tensors.\n\n  Raises:\n    ValueError: if `state_shape` has non-integer values.\n  \"\"\"\n  if isinstance(state_shape, (tuple, list)) and isinstance(state_shape[0], int):\n    state = input.new_zeros(*state_shape)\n  elif isinstance(state_shape, tuple):\n    state = tuple(_zero_state(input, s) for s in state_shape)\n  elif isinstance(state_shape, list):\n    state = [_zero_state(input, s) for s in state_shape]\n  else:\n    raise ValueError('RNN state_shape is invalid')\n  return state\n"
  },
  {
    "path": "frameworks/pytorch/gru.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <ATen/cuda/CUDAContext.h>\n#include <c10/cuda/CUDAGuard.h>\n#include <torch/extension.h>\n#include <vector>\n\n#include \"haste.h\"\n#include \"support.h\"\n\nnamespace {\n\nusing haste::v0::gru::ForwardPass;\nusing haste::v0::gru::BackwardPass;\n\nusing torch::Tensor;\n\nstd::vector<Tensor> gru_forward(\n    bool training,\n    float zoneout_prob,\n    Tensor x,\n    Tensor h0,\n    Tensor kernel,\n    Tensor recurrent_kernel,\n    Tensor bias,\n    Tensor recurrent_bias,\n    Tensor zoneout_mask) {\n  const auto time_steps = x.size(0);\n  const auto batch_size = x.size(1);\n  const auto input_size = x.size(2);\n  const auto hidden_size = recurrent_kernel.size(0);\n  const bool has_zoneout = zoneout_prob && zoneout_mask.size(0);\n\n  CHECK_INPUT(x);\n  CHECK_INPUT(h0);\n  CHECK_INPUT(kernel);\n  CHECK_INPUT(recurrent_kernel);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(recurrent_bias);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor output = torch::empty({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor cache = torch::empty({ time_steps, batch_size, hidden_size * 4 }, options);\n  Tensor tmp_Wx = torch::empty({ time_steps, batch_size, hidden_size 
* 3 }, options);\n  Tensor tmp_Rh = torch::empty({ batch_size, hidden_size * 3 }, options);\n\n  output[0] = h0;\n\n  AT_DISPATCH_FLOATING_TYPES_AND_HALF(x.scalar_type(), \"gru_forward\", ([&] {\n    ForwardPass<typename native_type<scalar_t>::T> forward(\n        training,\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    forward.Run(\n        time_steps,\n        ptr<scalar_t>(kernel),\n        ptr<scalar_t>(recurrent_kernel),\n        ptr<scalar_t>(bias),\n        ptr<scalar_t>(recurrent_bias),\n        ptr<scalar_t>(x),\n        ptr<scalar_t>(output),\n        ptr<scalar_t>(cache),\n        ptr<scalar_t>(tmp_Wx),\n        ptr<scalar_t>(tmp_Rh),\n        has_zoneout ? zoneout_prob : 0.0f,\n        has_zoneout ? ptr<scalar_t>(zoneout_mask) : nullptr);\n  }));\n\n  return { output, cache };\n}\n\nstd::vector<Tensor> gru_backward(\n    Tensor x_t,\n    Tensor kernel_t,\n    Tensor recurrent_kernel_t,\n    Tensor bias,\n    Tensor recurrent_bias,\n    Tensor zoneout_mask,\n    Tensor h,\n    Tensor cache,\n    Tensor dh_new) {\n  const auto input_size = x_t.size(0);\n  const auto time_steps = x_t.size(1);\n  const auto batch_size = x_t.size(2);\n  const auto hidden_size = recurrent_kernel_t.size(1);\n  const bool has_zoneout = !!zoneout_mask.size(0);\n\n  CHECK_INPUT(x_t);\n  CHECK_INPUT(kernel_t);\n  CHECK_INPUT(recurrent_kernel_t);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(recurrent_bias);\n  CHECK_INPUT(h);\n  CHECK_INPUT(cache);\n  CHECK_INPUT(dh_new);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x_t.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor dx = torch::empty({ time_steps, batch_size, input_size }, options);\n  Tensor dW = torch::zeros({ input_size, hidden_size * 3 }, options);\n  Tensor dR = torch::zeros({ hidden_size, hidden_size * 3 }, options);\n  Tensor dbx = torch::zeros({ hidden_size * 3 }, 
options);\n  Tensor dbr = torch::zeros({ hidden_size * 3 }, options);\n  Tensor dh = torch::zeros({ batch_size, hidden_size }, options);\n  Tensor dp = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n  Tensor dq = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n\n  AT_DISPATCH_FLOATING_TYPES_AND_HALF(x_t.scalar_type(), \"gru_backward\", ([&] {\n    BackwardPass<typename native_type<scalar_t>::T> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    backward.Run(\n        time_steps,\n        ptr<scalar_t>(kernel_t),\n        ptr<scalar_t>(recurrent_kernel_t),\n        ptr<scalar_t>(bias),\n        ptr<scalar_t>(recurrent_bias),\n        ptr<scalar_t>(x_t),\n        ptr<scalar_t>(h),\n        ptr<scalar_t>(cache),\n        ptr<scalar_t>(dh_new),\n        ptr<scalar_t>(dx),\n        ptr<scalar_t>(dW),\n        ptr<scalar_t>(dR),\n        ptr<scalar_t>(dbx),\n        ptr<scalar_t>(dbr),\n        ptr<scalar_t>(dh),\n        ptr<scalar_t>(dp),\n        ptr<scalar_t>(dq),\n        has_zoneout ? ptr<scalar_t>(zoneout_mask) : nullptr);\n  }));\n\n  return { dx, dh, dW, dR, dbx, dbr };\n}\n\n}  // anonymous namespace\n\nvoid gru_init(py::module& m) {\n  m.def(\"gru_forward\", &gru_forward, \"GRU forward\", py::call_guard<py::gil_scoped_release>());\n  m.def(\"gru_backward\", &gru_backward, \"GRU backward\", py::call_guard<py::gil_scoped_release>());\n}\n"
  },
  {
    "path": "frameworks/pytorch/gru.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Gated Recurrent Unit\"\"\"\n\n\nimport haste_pytorch_lib as LIB\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_rnn import BaseRNN\n\n\n__all__ = [\n    'GRU'\n]\n\n\n#@torch.jit.script\ndef GRUScript(\n    training: bool,\n    zoneout_prob: float,\n    input,\n    h0,\n    kernel,\n    recurrent_kernel,\n    bias,\n    recurrent_bias,\n    zoneout_mask):\n  time_steps = input.shape[0]\n  batch_size = input.shape[1]\n  hidden_size = recurrent_kernel.shape[0]\n\n  h = [h0]\n  Wx = input @ kernel + bias\n  for t in range(time_steps):\n    Rh = h[t] @ recurrent_kernel + recurrent_bias\n    vx = torch.chunk(Wx[t], 3, 1)\n    vh = torch.chunk(Rh, 3, 1)\n\n    z = torch.sigmoid(vx[0] + vh[0])\n    r = torch.sigmoid(vx[1] + vh[1])\n    g = torch.tanh(vx[2] + r * vh[2])\n\n    h.append(z * h[t] + (1 - z) * g)\n    if zoneout_prob:\n      if training:\n        h[-1] = (h[-1] - h[-2]) * zoneout_mask[t] + h[-2]\n      else:\n        h[-1] = zoneout_prob * h[-2] + (1 - zoneout_prob) * h[-1]\n\n  h = torch.stack(h)\n  return h\n\n\nclass GRUFunction(torch.autograd.Function):\n  @staticmethod\n  def forward(ctx, training, zoneout_prob, *inputs):\n    h, cache = LIB.gru_forward(training, zoneout_prob, *inputs)\n    
ctx.save_for_backward(inputs[0], *inputs[2:], h, cache)\n    ctx.mark_non_differentiable(inputs[-1])\n    ctx.training = training\n    return h\n\n  @staticmethod\n  def backward(ctx, grad_h):\n    if not ctx.training:\n      raise RuntimeError('GRU backward can only be called in training mode')\n\n    saved = [*ctx.saved_tensors]\n    saved[0] = saved[0].permute(2, 0, 1).contiguous()\n    saved[1] = saved[1].permute(1, 0).contiguous()\n    saved[2] = saved[2].permute(1, 0).contiguous()\n    grads = LIB.gru_backward(*saved, grad_h.contiguous())\n    return (None, None, *grads, None)\n\n\nclass GRU(BaseRNN):\n  \"\"\"\n  Gated Recurrent Unit layer.\n\n  This GRU layer offers a fused, GPU-accelerated PyTorch op for inference\n  and training. There are two commonly-used variants of GRU cells. This one\n  implements 1406.1078v1 which applies the reset gate to the hidden state\n  after matrix multiplication. cuDNN also implements this variant. The other\n  variant, 1406.1078v3, applies the reset gate before matrix multiplication\n  and is currently unsupported.\n\n  This layer has built-in support for DropConnect and Zoneout, which are\n  both techniques used to regularize RNNs.\n\n  See [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n  See [from_native_weights](#from_native_weights) and\n  [to_native_weights](#to_native_weights) for compatibility with PyTorch GRUs.\n  \"\"\"\n\n  def __init__(self,\n      input_size,\n      hidden_size,\n      batch_first=False,\n      dropout=0.0,\n      zoneout=0.0,\n      return_state_sequence=False):\n    \"\"\"\n    Initialize the parameters of the GRU layer.\n\n    Arguments:\n      input_size: int, the feature dimension of the input.\n      hidden_size: int, the feature dimension of the output.\n      batch_first: (optional) bool, if `True`, then the input and output\n        tensors are provided as `(batch, seq, feature)`.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        
regularization on the recurrent matrix.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization.\n      return_state_sequence: (optional) bool, if `True`, the forward pass will\n        return the entire state sequence instead of just the final state. Note\n        that if the input is a padded sequence, the returned state will also\n        be a padded sequence.\n\n    Variables:\n      kernel: the input projection weight matrix. Dimensions\n        (input_size, hidden_size * 3) with `z,r,h` gate layout. Initialized\n        with Xavier uniform initialization.\n      recurrent_kernel: the recurrent projection weight matrix. Dimensions\n        (hidden_size, hidden_size * 3) with `z,r,h` gate layout. Initialized\n        with orthogonal initialization.\n      bias: the input projection bias vector. Dimensions (hidden_size * 3) with\n        `z,r,h` gate layout. Initialized to zeros.\n      recurrent_bias: the recurrent projection bias vector. Dimensions\n        (hidden_size * 3) with `z,r,h` gate layout. 
Initialized to zeros.\n    \"\"\"\n    super().__init__(input_size, hidden_size, batch_first, zoneout, return_state_sequence)\n\n    if dropout < 0 or dropout > 1:\n      raise ValueError('GRU: dropout must be in [0.0, 1.0]')\n    if zoneout < 0 or zoneout > 1:\n      raise ValueError('GRU: zoneout must be in [0.0, 1.0]')\n\n    self.dropout = dropout\n\n    self.kernel = nn.Parameter(torch.empty(input_size, hidden_size * 3))\n    self.recurrent_kernel = nn.Parameter(torch.empty(hidden_size, hidden_size * 3))\n    self.bias = nn.Parameter(torch.empty(hidden_size * 3))\n    self.recurrent_bias = nn.Parameter(torch.empty(hidden_size * 3))\n    self.reset_parameters()\n\n  def to_native_weights(self):\n    \"\"\"\n    Converts Haste GRU weights to native PyTorch GRU weights.\n\n    Returns:\n      weight_ih_l0: Parameter, the input-hidden weights of the GRU layer.\n      weight_hh_l0: Parameter, the hidden-hidden weights of the GRU layer.\n      bias_ih_l0: Parameter, the input-hidden bias of the GRU layer.\n      bias_hh_l0: Parameter, the hidden-hidden bias of the GRU layer.\n    \"\"\"\n    def reorder_weights(w):\n      z, r, n = torch.chunk(w, 3, dim=-1)\n      return torch.cat([r, z, n], dim=-1)\n\n    kernel = reorder_weights(self.kernel).permute(1, 0).contiguous()\n    recurrent_kernel = reorder_weights(self.recurrent_kernel).permute(1, 0).contiguous()\n    bias1 = reorder_weights(self.bias).contiguous()\n    bias2 = reorder_weights(self.recurrent_bias).contiguous()\n\n    kernel = torch.nn.Parameter(kernel)\n    recurrent_kernel = torch.nn.Parameter(recurrent_kernel)\n    bias1 = torch.nn.Parameter(bias1)\n    bias2 = torch.nn.Parameter(bias2)\n    return kernel, recurrent_kernel, bias1, bias2\n\n  def from_native_weights(self, weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0):\n    \"\"\"\n    Copies and converts the provided PyTorch GRU weights into this layer.\n\n    Arguments:\n      weight_ih_l0: Parameter, the input-hidden weights of the PyTorch GRU 
layer.\n      weight_hh_l0: Parameter, the hidden-hidden weights of the PyTorch GRU layer.\n      bias_ih_l0: Parameter, the input-hidden bias of the PyTorch GRU layer.\n      bias_hh_l0: Parameter, the hidden-hidden bias of the PyTorch GRU layer.\n    \"\"\"\n    def reorder_weights(w):\n      r, z, n = torch.chunk(w, 3, dim=-1)\n      return torch.cat([z, r, n], dim=-1)\n\n    kernel = reorder_weights(weight_ih_l0.permute(1, 0)).contiguous()\n    recurrent_kernel = reorder_weights(weight_hh_l0.permute(1, 0)).contiguous()\n    bias = reorder_weights(bias_ih_l0).contiguous()\n    recurrent_bias = reorder_weights(bias_hh_l0).contiguous()\n\n    self.kernel = nn.Parameter(kernel)\n    self.recurrent_kernel = nn.Parameter(recurrent_kernel)\n    self.bias = nn.Parameter(bias)\n    self.recurrent_bias = nn.Parameter(recurrent_bias)\n\n  def reset_parameters(self):\n    \"\"\"Resets this layer's parameters to their initial values.\"\"\"\n    hidden_size = self.hidden_size\n    for i in range(3):\n      nn.init.xavier_uniform_(self.kernel[:, i*hidden_size:(i+1)*hidden_size])\n      nn.init.orthogonal_(self.recurrent_kernel[:, i*hidden_size:(i+1)*hidden_size])\n    nn.init.zeros_(self.bias)\n    nn.init.zeros_(self.recurrent_bias)\n\n  def forward(self, input, state=None, lengths=None):\n    \"\"\"\n    Runs a forward pass of the GRU layer.\n\n    Arguments:\n      input: Tensor, a batch of input sequences to pass through the GRU.\n        Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n        `False`, otherwise (batch_size, seq_len, input_size).\n      state: (optional) Tensor, the initial state for each batch element in\n        `input`. Dimensions (1, batch_size, hidden_size). Defaults to zeros.\n      lengths: (optional) Tensor, list of sequence lengths for each batch\n        element. Dimension (batch_size). This argument may be omitted if\n        all batch elements are unpadded and have the same sequence length.\n\n    Returns:\n      output: Tensor, the output of the GRU layer. 
Dimensions\n        (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n        or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n        that if `lengths` was specified, the `output` tensor will not be\n        masked. It's the caller's responsibility to either not use the invalid\n        entries or to mask them out before using them.\n      h_n: the hidden state for the last sequence item. Dimensions\n        (1, batch_size, hidden_size).\n    \"\"\"\n    input = self._permute(input)\n    state_shape = [1, input.shape[1], self.hidden_size]\n    h0 = self._get_state(input, state, state_shape)\n    h = self._impl(input, h0[0], self._get_zoneout_mask(input))\n    state = self._get_final_state(h, lengths)\n    output = self._permute(h[1:])\n    return output, state\n\n  def _impl(self, input, state, zoneout_mask):\n    if self._is_cuda():\n      return GRUFunction.apply(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state.contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          self.recurrent_bias.contiguous(),\n          zoneout_mask.contiguous())\n    else:\n      return GRUScript(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state.contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          self.recurrent_bias.contiguous(),\n          zoneout_mask.contiguous())\n"
  },
  {
    "path": "frameworks/pytorch/indrnn.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <ATen/cuda/CUDAContext.h>\n#include <c10/cuda/CUDAGuard.h>\n#include <torch/extension.h>\n#include <vector>\n\n#include \"haste.h\"\n#include \"support.h\"\n\nnamespace {\n\nusing haste::v0::indrnn::ForwardPass;\nusing haste::v0::indrnn::BackwardPass;\n\nusing torch::Tensor;\n\nTensor indrnn_forward(\n    bool training,\n    float zoneout_prob,\n    Tensor x,\n    Tensor h0,\n    Tensor kernel,\n    Tensor recurrent_scale,\n    Tensor bias,\n    Tensor zoneout_mask) {\n  const auto time_steps = x.size(0);\n  const auto batch_size = x.size(1);\n  const auto input_size = x.size(2);\n  const auto hidden_size = recurrent_scale.size(0);\n  const bool has_zoneout = zoneout_prob && zoneout_mask.size(0);\n\n  CHECK_INPUT(x);\n  CHECK_INPUT(h0);\n  CHECK_INPUT(kernel);\n  CHECK_INPUT(recurrent_scale);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor output = torch::empty({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor workspace = torch::empty({ time_steps, batch_size, hidden_size }, options);\n\n  output[0] = h0;\n\n  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), \"indrnn_forward\", ([&] {\n    ForwardPass<scalar_t> forward(\n    
    training,\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    forward.Run(\n        time_steps,\n        kernel.data_ptr<scalar_t>(),\n        recurrent_scale.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x.data_ptr<scalar_t>(),\n        output.data_ptr<scalar_t>(),\n        workspace.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_prob : 0.0f,\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return output;\n}\n\nstd::vector<Tensor> indrnn_backward(\n    Tensor x_t,\n    Tensor kernel_t,\n    Tensor recurrent_scale,\n    Tensor bias,\n    Tensor zoneout_mask,\n    Tensor h,\n    Tensor dh_new) {\n  const auto input_size = x_t.size(0);\n  const auto time_steps = x_t.size(1);\n  const auto batch_size = x_t.size(2);\n  const auto hidden_size = recurrent_scale.size(0);\n  const bool has_zoneout = !!zoneout_mask.size(0);\n\n  CHECK_INPUT(x_t);\n  CHECK_INPUT(kernel_t);\n  CHECK_INPUT(recurrent_scale);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(zoneout_mask);\n  CHECK_INPUT(h);\n  CHECK_INPUT(dh_new);\n\n  const auto options = x_t.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor dx = torch::empty({ time_steps, batch_size, input_size }, options);\n  Tensor dW = torch::zeros({ input_size, hidden_size }, options);\n  Tensor du = torch::zeros({ hidden_size }, options);\n  Tensor db = torch::zeros_like(bias);\n  Tensor dh = torch::zeros({ batch_size, hidden_size }, options);\n  Tensor workspace = torch::empty({ time_steps, batch_size, hidden_size }, options);\n\n  AT_DISPATCH_FLOATING_TYPES(x_t.scalar_type(), \"indrnn_backward\", ([&] {\n    BackwardPass<scalar_t> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    backward.Run(\n        time_steps,\n        
kernel_t.data_ptr<scalar_t>(),\n        recurrent_scale.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x_t.data_ptr<scalar_t>(),\n        h.data_ptr<scalar_t>(),\n        dh_new.data_ptr<scalar_t>(),\n        dx.data_ptr<scalar_t>(),\n        dW.data_ptr<scalar_t>(),\n        du.data_ptr<scalar_t>(),\n        db.data_ptr<scalar_t>(),\n        dh.data_ptr<scalar_t>(),\n        workspace.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { dx, dh, dW, du, db };\n}\n\n}  // anonymous namespace\n\nvoid indrnn_init(py::module& m) {\n  m.def(\"indrnn_forward\", &indrnn_forward, \"IndRNN forward\", py::call_guard<py::gil_scoped_release>());\n  m.def(\"indrnn_backward\", &indrnn_backward, \"IndRNN backward\", py::call_guard<py::gil_scoped_release>());\n}\n"
  },
  {
    "path": "frameworks/pytorch/indrnn.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Independently Recurrent Neural Network\"\"\"\n\n\nimport haste_pytorch_lib as LIB\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_rnn import BaseRNN\n\n\n__all__ = [\n    'IndRNN'\n]\n\n\n#@torch.jit.script\ndef IndRNNScript(\n    training: bool,\n    zoneout_prob: float,\n    input,\n    h0,\n    kernel,\n    recurrent_scale,\n    bias,\n    zoneout_mask):\n  time_steps = input.shape[0]\n\n  h = [h0]\n  Wx = input @ kernel + bias\n  for t in range(time_steps):\n    h.append(torch.tanh(Wx[t] + h[-1] * recurrent_scale))\n    if zoneout_prob:\n      if training:\n        h[-1] = (h[-1] - h[-2]) * zoneout_mask[t] + h[-2]\n      else:\n        h[-1] = zoneout_prob * h[-2] + (1 - zoneout_prob) * h[-1]\n  h = torch.stack(h)\n  return h\n\n\nclass IndRNNFunction(torch.autograd.Function):\n  @staticmethod\n  def forward(ctx, training, zoneout_prob, *inputs):\n    h = LIB.indrnn_forward(training, zoneout_prob, *inputs)\n    ctx.save_for_backward(inputs[0], *inputs[2:], h)\n    ctx.training = training\n    return h\n\n  @staticmethod\n  def backward(ctx, grad_h):\n    if not ctx.training:\n      raise RuntimeError('IndRNN backward can only be called in training mode')\n    saved = [*ctx.saved_tensors]\n    saved[0] = 
saved[0].permute(2, 0, 1).contiguous()\n    saved[1] = saved[1].permute(1, 0).contiguous()\n    grads = LIB.indrnn_backward(*saved, grad_h.contiguous())\n    return (None, None, *grads, None)\n\n\nclass IndRNN(BaseRNN):\n  \"\"\"\n  Independently Recurrent Neural Network layer.\n\n  This layer offers a fused, GPU-accelerated PyTorch op for inference and\n  training. It also supports Zoneout regularization.\n\n  See [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n  \"\"\"\n\n  def __init__(\n      self,\n      input_size,\n      hidden_size,\n      batch_first=False,\n      zoneout=0.0,\n      return_state_sequence=False):\n    \"\"\"\n    Initialize the parameters of the IndRNN layer.\n\n    Arguments:\n      input_size: int, the feature dimension of the input.\n      hidden_size: int, the feature dimension of the output.\n      batch_first: (optional) bool, if `True`, then the input and output\n        tensors are provided as `(batch, seq, feature)`.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization.\n      return_state_sequence: (optional) bool, if `True`, the forward pass will\n        return the entire state sequence instead of just the final state. Note\n        that if the input is a padded sequence, the returned state will also\n        be a padded sequence.\n\n    Variables:\n      kernel: the input projection weight matrix. Dimensions\n        (input_size, hidden_size). Initialized with Xavier uniform\n        initialization.\n      recurrent_scale: the recurrent scale weight vector. Dimensions\n        (hidden_size). Initialized uniformly in [-0.5, 0.5]. Note that this\n        initialization scheme is different than in the original authors'\n        implementation. See https://github.com/lmnt-com/haste/issues/7 for\n        details.\n      bias: the RNN bias vector. Dimensions (hidden_size). 
Initialized to zeros.\n    \"\"\"\n    super().__init__(input_size, hidden_size, batch_first, zoneout, return_state_sequence)\n\n    if zoneout < 0 or zoneout > 1:\n      raise ValueError('IndRNN: zoneout must be in [0.0, 1.0]')\n\n    self.input_size = input_size\n    self.hidden_size = hidden_size\n    self.batch_first = batch_first\n    self.zoneout = zoneout\n\n    self.kernel = nn.Parameter(torch.empty(input_size, hidden_size))\n    self.recurrent_scale = nn.Parameter(torch.empty(hidden_size))\n    self.bias = nn.Parameter(torch.empty(hidden_size))\n    self.reset_parameters()\n\n  def reset_parameters(self):\n    \"\"\"Resets this layer's parameters to their initial values.\"\"\"\n    nn.init.xavier_uniform_(self.kernel)\n    nn.init.uniform_(self.recurrent_scale, -0.5, 0.5)\n    nn.init.zeros_(self.bias)\n\n  def forward(self, input, state=None, lengths=None):\n    \"\"\"\n    Runs a forward pass of the IndRNN layer.\n\n    Arguments:\n      input: Tensor, a batch of input sequences to pass through the IndRNN.\n        Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n        `False`, otherwise (batch_size, seq_len, input_size).\n      state: (optional) Tensor, the initial state for each batch element in\n        `input`. Dimensions (1, batch_size, hidden_size). Defaults to zeros.\n      lengths: (optional) Tensor, list of sequence lengths for each batch\n        element. Dimension (batch_size). This argument may be omitted if\n        all batch elements are unpadded and have the same sequence length.\n\n    Returns:\n      output: Tensor, the output of the IndRNN layer. Dimensions\n        (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n        or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n        that if `lengths` was specified, the `output` tensor will not be\n        masked. It's the caller's responsibility to either not use the invalid\n        entries or to mask them out before using them.\n      state: the hidden state for the last sequence item. 
Dimensions\n        (1, batch_size, hidden_size).\n    \"\"\"\n    input = self._permute(input)\n    state_shape = [1, input.shape[1], self.hidden_size]\n    h0 = self._get_state(input, state, state_shape)\n    h = self._impl(input, h0[0], self._get_zoneout_mask(input))\n    state = self._get_final_state(h, lengths)\n    output = self._permute(h[1:])\n    return output, state\n\n  def _impl(self, input, state, zoneout_mask):\n    if self._is_cuda():\n      return IndRNNFunction.apply(\n        self.training,\n        self.zoneout,\n        input.contiguous(),\n        state.contiguous(),\n        self.kernel.contiguous(),\n        self.recurrent_scale.contiguous(),\n        self.bias.contiguous(),\n        zoneout_mask.contiguous())\n    else:\n      return IndRNNScript(\n        self.training,\n        self.zoneout,\n        input.contiguous(),\n        state.contiguous(),\n        self.kernel.contiguous(),\n        self.recurrent_scale.contiguous(),\n        self.bias.contiguous(),\n        zoneout_mask.contiguous())\n"
  },
  {
    "path": "frameworks/pytorch/layer_norm_gru.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <ATen/cuda/CUDAContext.h>\n#include <c10/cuda/CUDAGuard.h>\n#include <torch/extension.h>\n#include <vector>\n\n#include \"haste.h\"\n#include \"support.h\"\n\nnamespace {\n\nnamespace layer_norm = haste::v0::layer_norm;\nnamespace layer_norm_gru = haste::v0::layer_norm_gru;\n\nusing torch::Tensor;\n\nstd::vector<Tensor> layer_norm_gru_forward(\n    bool training,\n    float zoneout_prob,\n    Tensor x,\n    Tensor h0,\n    Tensor kernel,\n    Tensor recurrent_kernel,\n    Tensor bias,\n    Tensor recurrent_bias,\n    Tensor gamma,\n    Tensor zoneout_mask) {\n  const auto time_steps = x.size(0);\n  const auto batch_size = x.size(1);\n  const auto input_size = x.size(2);\n  const auto hidden_size = recurrent_kernel.size(0);\n  const bool has_zoneout = zoneout_prob && zoneout_mask.size(0);\n\n  CHECK_INPUT(x);\n  CHECK_INPUT(h0);\n  CHECK_INPUT(kernel);\n  CHECK_INPUT(recurrent_kernel);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(recurrent_bias);\n  CHECK_INPUT(gamma);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor output = torch::empty({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor cache = torch::empty({ time_steps, batch_size, hidden_size * 4 
}, options);\n  Tensor act_Wx = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n  Tensor tmp_Wx_norm = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n  Tensor act_Wx_norm_cache = torch::empty({ time_steps, batch_size, 2 }, options);\n  Tensor act_Rh = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n  Tensor tmp_Rh_norm = torch::empty({ batch_size, hidden_size * 3 }, options);\n  Tensor act_Rh_norm_cache = torch::empty({ time_steps, batch_size, 2 }, options);\n\n  output[0] = h0;\n\n  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), \"layer_norm_gru_forward\", ([&] {\n    auto gamma_a = gamma.packed_accessor32<scalar_t, 2>();\n\n    layer_norm::ForwardPass<scalar_t> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma_a[0].data(),\n        nullptr,\n        act_Wx_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm::ForwardPass<scalar_t> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma_a[1].data(),\n        nullptr,\n        act_Rh_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm_gru::ForwardPass<scalar_t> gru(\n        training,\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    gru.Run(\n        time_steps,\n        kernel.data_ptr<scalar_t>(),\n        recurrent_kernel.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        recurrent_bias.data_ptr<scalar_t>(),\n        x.data_ptr<scalar_t>(),\n        output.data_ptr<scalar_t>(),\n        cache.data_ptr<scalar_t>(),\n        act_Wx.data_ptr<scalar_t>(),\n        layer_norm1,\n        tmp_Wx_norm.data_ptr<scalar_t>(),\n        act_Rh.data_ptr<scalar_t>(),\n        layer_norm2,\n        tmp_Rh_norm.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_prob : 0.0f,\n        has_zoneout ? 
zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { output, cache, act_Wx, act_Wx_norm_cache, act_Rh, act_Rh_norm_cache };\n}\n\nstd::vector<Tensor> layer_norm_gru_backward(\n    Tensor x_t,\n    Tensor kernel_t,\n    Tensor recurrent_kernel_t,\n    Tensor bias,\n    Tensor recurrent_bias,\n    Tensor gamma,\n    Tensor zoneout_mask,\n    Tensor h,\n    Tensor cache,\n    Tensor act_Wx,\n    Tensor act_Wx_norm_cache,\n    Tensor act_Rh,\n    Tensor act_Rh_norm_cache,\n    Tensor dh_new) {\n  const auto input_size = x_t.size(0);\n  const auto time_steps = x_t.size(1);\n  const auto batch_size = x_t.size(2);\n  const auto hidden_size = recurrent_kernel_t.size(1);\n  const bool has_zoneout = !!zoneout_mask.size(0);\n\n  CHECK_INPUT(x_t);\n  CHECK_INPUT(kernel_t);\n  CHECK_INPUT(recurrent_kernel_t);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(recurrent_bias);\n  CHECK_INPUT(gamma);\n  CHECK_INPUT(h);\n  CHECK_INPUT(cache);\n  CHECK_INPUT(act_Wx);\n  CHECK_INPUT(act_Wx_norm_cache);\n  CHECK_INPUT(act_Rh);\n  CHECK_INPUT(act_Rh_norm_cache);\n  CHECK_INPUT(dh_new);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x_t.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor dx = torch::empty({ time_steps, batch_size, input_size }, options);\n  Tensor dW = torch::zeros({ input_size, hidden_size * 3 }, options);\n  Tensor dR = torch::zeros({ hidden_size, hidden_size * 3 }, options);\n  Tensor dbx = torch::zeros({ hidden_size * 3 }, options);\n  Tensor dbr = torch::zeros({ hidden_size * 3 }, options);\n  Tensor dh = torch::zeros({ batch_size, hidden_size }, options);\n  Tensor dp = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n  Tensor dq = torch::empty({ time_steps, batch_size, hidden_size * 3 }, options);\n  Tensor dgamma = torch::zeros_like(gamma);\n\n  AT_DISPATCH_FLOATING_TYPES(x_t.scalar_type(), \"layer_norm_gru_backward\", ([&] {\n    auto gamma_a = gamma.packed_accessor32<scalar_t, 2>();\n    auto dgamma_a 
= dgamma.packed_accessor32<scalar_t, 2>();\n\n    layer_norm::BackwardPass<scalar_t> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma_a[0].data(),\n        nullptr,\n        act_Wx.data_ptr<scalar_t>(),\n        dgamma_a[0].data(),\n        nullptr,\n        act_Wx_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm::BackwardPass<scalar_t> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma_a[1].data(),\n        nullptr,\n        act_Rh.data_ptr<scalar_t>(),\n        dgamma_a[1].data(),\n        nullptr,\n        act_Rh_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm_gru::BackwardPass<scalar_t> gru(\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    gru.Run(\n        time_steps,\n        kernel_t.data_ptr<scalar_t>(),\n        recurrent_kernel_t.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        recurrent_bias.data_ptr<scalar_t>(),\n        x_t.data_ptr<scalar_t>(),\n        h.data_ptr<scalar_t>(),\n        cache.data_ptr<scalar_t>(),\n        dh_new.data_ptr<scalar_t>(),\n        dx.data_ptr<scalar_t>(),\n        dW.data_ptr<scalar_t>(),\n        dR.data_ptr<scalar_t>(),\n        dbx.data_ptr<scalar_t>(),\n        dbr.data_ptr<scalar_t>(),\n        dh.data_ptr<scalar_t>(),\n        dp.data_ptr<scalar_t>(),\n        dq.data_ptr<scalar_t>(),\n        layer_norm1,\n        layer_norm2,\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { dx, dh, dW, dR, dbx, dbr, dgamma };\n}\n\n}  // anonymous namespace\n\nvoid layer_norm_gru_init(py::module& m) {\n  m.def(\"layer_norm_gru_forward\", &layer_norm_gru_forward, \"LayerNormGRU forward\", py::call_guard<py::gil_scoped_release>());\n  m.def(\"layer_norm_gru_backward\", &layer_norm_gru_backward, \"LayerNormGRU backward\", py::call_guard<py::gil_scoped_release>());\n}\n"
  },
  {
    "path": "frameworks/pytorch/layer_norm_gru.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalized Gated Recurrent Unit\"\"\"\n\n\nimport haste_pytorch_lib as LIB\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_rnn import BaseRNN\n\n\n__all__ = [\n    'LayerNormGRU'\n]\n\n\n#@torch.jit.script\ndef LayerNormGRUScript(\n    training: bool,\n    zoneout_prob: float,\n    input,\n    h0,\n    kernel,\n    recurrent_kernel,\n    bias,\n    recurrent_bias,\n    gamma,\n    zoneout_mask):\n  time_steps = input.shape[0]\n  batch_size = input.shape[1]\n  hidden_size = recurrent_kernel.shape[0]\n\n  h = [h0]\n  Wx = F.layer_norm(input @ kernel, (hidden_size * 3,), weight=gamma[0]) + bias\n  for t in range(time_steps):\n    Rh = F.layer_norm(h[t] @ recurrent_kernel, (hidden_size * 3,), weight=gamma[1]) + recurrent_bias\n    vx = torch.chunk(Wx[t], 3, 1)\n    vh = torch.chunk(Rh, 3, 1)\n\n    z = torch.sigmoid(vx[0] + vh[0])\n    r = torch.sigmoid(vx[1] + vh[1])\n    g = torch.tanh(vx[2] + r * vh[2])\n\n    h.append(z * h[t] + (1 - z) * g)\n    if zoneout_prob:\n      if training:\n        h[-1] = (h[-1] - h[-2]) * zoneout_mask[t] + h[-2]\n      else:\n        h[-1] = zoneout_prob * h[-2] + (1 - zoneout_prob) * h[-1]\n\n  h = torch.stack(h)\n  return h\n\n\nclass LayerNormGRUFunction(torch.autograd.Function):\n  
@staticmethod\n  def forward(ctx, training, zoneout_prob, *inputs):\n    output = LIB.layer_norm_gru_forward(training, zoneout_prob, *inputs)\n    ctx.save_for_backward(inputs[0], *inputs[2:], *output)\n    ctx.mark_non_differentiable(inputs[-1])\n    ctx.training = training\n    return output[0]\n\n  @staticmethod\n  def backward(ctx, grad_h):\n    if not ctx.training:\n      raise RuntimeError('LayerNormGRU backward can only be called in training mode')\n\n    saved = [*ctx.saved_tensors]\n    saved[0] = saved[0].permute(2, 0, 1).contiguous()\n    saved[1] = saved[1].permute(1, 0).contiguous()\n    saved[2] = saved[2].permute(1, 0).contiguous()\n    grads = LIB.layer_norm_gru_backward(*saved, grad_h.contiguous())\n    return (None, None, *grads, None)\n\n\nclass LayerNormGRU(BaseRNN):\n  \"\"\"\n  Layer Normalized Gated Recurrent Unit layer.\n\n  This GRU layer applies layer normalization to the input and recurrent output\n  activations of a standard GRU. The implementation is fused and\n  GPU-accelerated. There are two commonly-used variants of GRU cells. This one\n  implements 1406.1078v1 which applies the reset gate to the hidden state\n  after matrix multiplication. 
The other variant, 1406.1078v3, applies the\n  reset gate before matrix multiplication and is currently unsupported.\n\n  This layer has built-in support for DropConnect and Zoneout, which are\n  both techniques used to regularize RNNs.\n\n  See [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n  \"\"\"\n\n  def __init__(self,\n      input_size,\n      hidden_size,\n      batch_first=False,\n      dropout=0.0,\n      zoneout=0.0,\n      return_state_sequence=False):\n    \"\"\"\n    Initialize the parameters of the GRU layer.\n\n    Arguments:\n      input_size: int, the feature dimension of the input.\n      hidden_size: int, the feature dimension of the output.\n      batch_first: (optional) bool, if `True`, then the input and output\n        tensors are provided as `(batch, seq, feature)`.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        regularization on the recurrent matrix.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization.\n      return_state_sequence: (optional) bool, if `True`, the forward pass will\n        return the entire state sequence instead of just the final state. Note\n        that if the input is a padded sequence, the returned state will also\n        be a padded sequence.\n\n    Variables:\n      kernel: the input projection weight matrix. Dimensions\n        (input_size, hidden_size * 3) with `z,r,h` gate layout. Initialized\n        with Xavier uniform initialization.\n      recurrent_kernel: the recurrent projection weight matrix. Dimensions\n        (hidden_size, hidden_size * 3) with `z,r,h` gate layout. Initialized\n        with orthogonal initialization.\n      bias: the input projection bias vector. Dimensions (hidden_size * 3) with\n        `z,r,h` gate layout. Initialized to zeros.\n      recurrent_bias: the recurrent projection bias vector. Dimensions\n        (hidden_size * 3) with `z,r,h` gate layout. 
Initialized to zeros.\n      gamma: the input and recurrent normalization gain. Dimensions\n        (2, hidden_size * 3) with `gamma[0]` specifying the input gain and\n        `gamma[1]` specifying the recurrent gain. Initialized to ones.\n    \"\"\"\n    super().__init__(input_size, hidden_size, batch_first, zoneout, return_state_sequence)\n\n    if dropout < 0 or dropout > 1:\n      raise ValueError('LayerNormGRU: dropout must be in [0.0, 1.0]')\n    if zoneout < 0 or zoneout > 1:\n      raise ValueError('LayerNormGRU: zoneout must be in [0.0, 1.0]')\n\n    self.dropout = dropout\n\n    self.kernel = nn.Parameter(torch.empty(input_size, hidden_size * 3))\n    self.recurrent_kernel = nn.Parameter(torch.empty(hidden_size, hidden_size * 3))\n    self.bias = nn.Parameter(torch.empty(hidden_size * 3))\n    self.recurrent_bias = nn.Parameter(torch.empty(hidden_size * 3))\n    self.gamma = nn.Parameter(torch.empty(2, hidden_size * 3))\n    self.reset_parameters()\n\n  def reset_parameters(self):\n    \"\"\"Resets this layer's parameters to their initial values.\"\"\"\n    hidden_size = self.hidden_size\n    for i in range(3):\n      nn.init.xavier_uniform_(self.kernel[:, i*hidden_size:(i+1)*hidden_size])\n      nn.init.orthogonal_(self.recurrent_kernel[:, i*hidden_size:(i+1)*hidden_size])\n    nn.init.zeros_(self.bias)\n    nn.init.zeros_(self.recurrent_bias)\n    nn.init.ones_(self.gamma)\n\n  def forward(self, input, state=None, lengths=None):\n    \"\"\"\n    Runs a forward pass of the GRU layer.\n\n    Arguments:\n      input: Tensor, a batch of input sequences to pass through the GRU.\n        Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n        `False`, otherwise (batch_size, seq_len, input_size).\n      state: (optional) Tensor, the initial state for each batch element in\n        `input`. Dimensions (1, batch_size, hidden_size). Defaults to zeros.\n      lengths: (optional) Tensor, list of sequence lengths for each batch\n
        element. Dimension (batch_size). This argument may be omitted if\n        all batch elements are unpadded and have the same sequence length.\n\n    Returns:\n      output: Tensor, the output of the GRU layer. Dimensions\n        (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n        or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n        that if `lengths` was specified, the `output` tensor will not be\n        masked. It's the caller's responsibility to either not use the invalid\n        entries or to mask them out before using them.\n      h_n: the hidden state for the last sequence item. Dimensions\n        (1, batch_size, hidden_size).\n    \"\"\"\n    input = self._permute(input)\n    state_shape = [1, input.shape[1], self.hidden_size]\n    h0 = self._get_state(input, state, state_shape)\n    h = self._impl(input, h0[0], self._get_zoneout_mask(input))\n    state = self._get_final_state(h, lengths)\n    output = self._permute(h[1:])\n    return output, state\n\n  def _impl(self, input, state, zoneout_mask):\n    if self._is_cuda():\n      return LayerNormGRUFunction.apply(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state.contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          self.recurrent_bias.contiguous(),\n          self.gamma.contiguous(),\n          zoneout_mask.contiguous())\n    else:\n      return LayerNormGRUScript(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state.contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          self.recurrent_bias.contiguous(),\n          self.gamma.contiguous(),\n          zoneout_mask.contiguous())\n"
  },
  {
    "path": "frameworks/pytorch/layer_norm_indrnn.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <ATen/cuda/CUDAContext.h>\n#include <c10/cuda/CUDAGuard.h>\n#include <torch/extension.h>\n#include <vector>\n\n#include \"haste.h\"\n#include \"support.h\"\n\nnamespace {\n\nnamespace layer_norm = haste::v0::layer_norm;\nnamespace layer_norm_indrnn = haste::v0::layer_norm_indrnn;\n\nusing torch::Tensor;\n\nstd::vector<Tensor> layer_norm_indrnn_forward(\n    bool training,\n    float zoneout_prob,\n    Tensor x,\n    Tensor h0,\n    Tensor kernel,\n    Tensor recurrent_scale,\n    Tensor bias,\n    Tensor gamma,\n    Tensor zoneout_mask) {\n  const auto time_steps = x.size(0);\n  const auto batch_size = x.size(1);\n  const auto input_size = x.size(2);\n  const auto hidden_size = recurrent_scale.size(0);\n  const bool has_zoneout = zoneout_prob && zoneout_mask.size(0);\n\n  CHECK_INPUT(x);\n  CHECK_INPUT(h0);\n  CHECK_INPUT(kernel);\n  CHECK_INPUT(recurrent_scale);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(gamma);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor output = torch::empty({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor workspace = torch::empty({ time_steps, batch_size, hidden_size }, options);\n  Tensor act_Wx = torch::empty({ 
time_steps, batch_size, hidden_size }, options);\n  Tensor act_Wx_norm_cache = torch::empty({ time_steps, batch_size, 2 }, options);\n\n  output[0] = h0;\n\n  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), \"layer_norm_indrnn_forward\", ([&] {\n    auto gamma_a = gamma.packed_accessor32<scalar_t, 2>();\n\n    layer_norm::ForwardPass<scalar_t> layer_norm1(\n        time_steps * batch_size,\n        hidden_size,\n        gamma_a[0].data(),\n        nullptr,\n        act_Wx_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm_indrnn::ForwardPass<scalar_t> forward(\n        training,\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    forward.Run(\n        time_steps,\n        kernel.data_ptr<scalar_t>(),\n        recurrent_scale.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x.data_ptr<scalar_t>(),\n        output.data_ptr<scalar_t>(),\n        workspace.data_ptr<scalar_t>(),\n        act_Wx.data_ptr<scalar_t>(),\n        layer_norm1,\n        has_zoneout ? zoneout_prob : 0.0f,\n        has_zoneout ? 
zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { output, act_Wx, act_Wx_norm_cache };\n}\n\nstd::vector<Tensor> layer_norm_indrnn_backward(\n    Tensor x_t,\n    Tensor kernel_t,\n    Tensor recurrent_scale,\n    Tensor bias,\n    Tensor gamma,\n    Tensor zoneout_mask,\n    Tensor h,\n    Tensor act_Wx,\n    Tensor act_Wx_norm_cache,\n    Tensor dh_new) {\n  const auto input_size = x_t.size(0);\n  const auto time_steps = x_t.size(1);\n  const auto batch_size = x_t.size(2);\n  const auto hidden_size = recurrent_scale.size(0);\n  const bool has_zoneout = !!zoneout_mask.size(0);\n\n  CHECK_INPUT(x_t);\n  CHECK_INPUT(kernel_t);\n  CHECK_INPUT(recurrent_scale);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(gamma);\n  CHECK_INPUT(zoneout_mask);\n  CHECK_INPUT(h);\n  CHECK_INPUT(dh_new);\n\n  const auto options = x_t.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor dx = torch::empty({ time_steps, batch_size, input_size }, options);\n  Tensor dW = torch::zeros({ input_size, hidden_size }, options);\n  Tensor du = torch::zeros({ hidden_size }, options);\n  Tensor db = torch::zeros_like(bias);\n  Tensor dh = torch::zeros({ batch_size, hidden_size }, options);\n  Tensor workspace = torch::empty({ time_steps, batch_size, hidden_size }, options);\n  Tensor dgamma = torch::zeros_like(gamma);\n\n  AT_DISPATCH_FLOATING_TYPES(x_t.scalar_type(), \"layer_norm_indrnn_backward\", ([&] {\n    auto gamma_a = gamma.packed_accessor32<scalar_t, 2>();\n    auto dgamma_a = dgamma.packed_accessor32<scalar_t, 2>();\n\n    layer_norm::BackwardPass<scalar_t> layer_norm1(\n        time_steps * batch_size,\n        hidden_size,\n        gamma_a[0].data(),\n        nullptr,\n        act_Wx.data_ptr<scalar_t>(),\n        dgamma_a[0].data(),\n        nullptr,\n        act_Wx_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm_indrnn::BackwardPass<scalar_t> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        
at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    backward.Run(\n        time_steps,\n        kernel_t.data_ptr<scalar_t>(),\n        recurrent_scale.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x_t.data_ptr<scalar_t>(),\n        h.data_ptr<scalar_t>(),\n        dh_new.data_ptr<scalar_t>(),\n        dx.data_ptr<scalar_t>(),\n        dW.data_ptr<scalar_t>(),\n        du.data_ptr<scalar_t>(),\n        db.data_ptr<scalar_t>(),\n        dh.data_ptr<scalar_t>(),\n        workspace.data_ptr<scalar_t>(),\n        layer_norm1,\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { dx, dh, dW, du, db, dgamma };\n}\n\n}  // anonymous namespace\n\nvoid layer_norm_indrnn_init(py::module& m) {\n  m.def(\"layer_norm_indrnn_forward\", &layer_norm_indrnn_forward, \"LayerNormIndRNN forward\", py::call_guard<py::gil_scoped_release>());\n  m.def(\"layer_norm_indrnn_backward\", &layer_norm_indrnn_backward, \"LayerNormIndRNN backward\", py::call_guard<py::gil_scoped_release>());\n}\n"
  },
  {
    "path": "frameworks/pytorch/layer_norm_indrnn.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalized Independently Recurrent Neural Network\"\"\"\n\n\nimport haste_pytorch_lib as LIB\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_rnn import BaseRNN\n\n\n__all__ = [\n    'LayerNormIndRNN'\n]\n\n\n#@torch.jit.script\ndef LayerNormIndRNNScript(\n    training: bool,\n    zoneout_prob: float,\n    input,\n    h0,\n    kernel,\n    recurrent_scale,\n    bias,\n    gamma,\n    zoneout_mask):\n  time_steps = input.shape[0]\n  hidden_size = kernel.shape[1]\n\n  h = [h0]\n  Wx = F.layer_norm(input @ kernel, (hidden_size,), weight=gamma[0]) + bias\n  for t in range(time_steps):\n    h.append(torch.tanh(Wx[t] + h[-1] * recurrent_scale))\n    if zoneout_prob:\n      if training:\n        h[-1] = (h[-1] - h[-2]) * zoneout_mask[t] + h[-2]\n      else:\n        h[-1] = zoneout_prob * h[-2] + (1 - zoneout_prob) * h[-1]\n  h = torch.stack(h)\n  return h\n\n\nclass LayerNormIndRNNFunction(torch.autograd.Function):\n  @staticmethod\n  def forward(ctx, training, zoneout_prob, *inputs):\n    output = LIB.layer_norm_indrnn_forward(training, zoneout_prob, *inputs)\n    ctx.save_for_backward(inputs[0], *inputs[2:], *output)\n    ctx.training = training\n    return output[0]\n\n  @staticmethod\n  def backward(ctx, grad_h):\n    if 
not ctx.training:\n      raise RuntimeError('LayerNormIndRNN backward can only be called in training mode')\n\n    saved = [*ctx.saved_tensors]\n    saved[0] = saved[0].permute(2, 0, 1).contiguous()\n    saved[1] = saved[1].permute(1, 0).contiguous()\n    grads = LIB.layer_norm_indrnn_backward(*saved, grad_h.contiguous())\n    return (None, None, *grads, None)\n\n\nclass LayerNormIndRNN(BaseRNN):\n  \"\"\"\n  Layer Normalized Independently Recurrent Neural Network layer.\n\n  This IndRNN layer applies layer normalization to the input activations of a\n  standard IndRNN. The implementation is fused and GPU-accelerated.\n\n  This layer has built-in support for Zoneout regularization.\n\n  See [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n  \"\"\"\n\n  def __init__(\n      self,\n      input_size,\n      hidden_size,\n      batch_first=False,\n      zoneout=0.0,\n      return_state_sequence=False):\n    \"\"\"\n    Initialize the parameters of the IndRNN layer.\n\n    Arguments:\n      input_size: int, the feature dimension of the input.\n      hidden_size: int, the feature dimension of the output.\n      batch_first: (optional) bool, if `True`, then the input and output\n        tensors are provided as `(batch, seq, feature)`.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization.\n      return_state_sequence: (optional) bool, if `True`, the forward pass will\n        return the entire state sequence instead of just the final state. Note\n        that if the input is a padded sequence, the returned state will also\n        be a padded sequence.\n\n    Variables:\n      kernel: the input projection weight matrix. Dimensions\n        (input_size, hidden_size). Initialized with Xavier uniform\n        initialization.\n      recurrent_scale: the recurrent scale weight vector. Dimensions\n        (hidden_size). Initialized uniformly in [-0.5, 0.5]. 
Note that this\n        initialization scheme is different than in the original authors'\n        implementation. See https://github.com/lmnt-com/haste/issues/7 for\n        details.\n      bias: the RNN bias vector. Dimensions (hidden_size). Initialized to zeros.\n      gamma: the input and recurrent normalization gain. Dimensions\n        (2, hidden_size) with `gamma[0]` specifying the input gain and\n        `gamma[1]` specifying the recurrent gain. Initialized to ones.\n    \"\"\"\n    super().__init__(input_size, hidden_size, batch_first, zoneout, return_state_sequence)\n\n    if zoneout < 0 or zoneout > 1:\n      raise ValueError('LayerNormIndRNN: zoneout must be in [0.0, 1.0]')\n\n    self.input_size = input_size\n    self.hidden_size = hidden_size\n    self.batch_first = batch_first\n    self.zoneout = zoneout\n\n    self.kernel = nn.Parameter(torch.empty(input_size, hidden_size))\n    self.recurrent_scale = nn.Parameter(torch.empty(hidden_size))\n    self.bias = nn.Parameter(torch.empty(hidden_size))\n    self.gamma = nn.Parameter(torch.empty(2, hidden_size))\n    self.reset_parameters()\n\n  def reset_parameters(self):\n    \"\"\"Resets this layer's parameters to their initial values.\"\"\"\n    nn.init.xavier_uniform_(self.kernel)\n    nn.init.uniform_(self.recurrent_scale, -0.5, 0.5)\n    nn.init.zeros_(self.bias)\n    nn.init.ones_(self.gamma)\n\n  def forward(self, input, state=None, lengths=None):\n    \"\"\"\n    Runs a forward pass of the IndRNN layer.\n\n    Arguments:\n      input: Tensor, a batch of input sequences to pass through the IndRNN.\n        Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n        `False`, otherwise (batch_size, seq_len, input_size).\n      state: (optional) Tensor, the initial state for each batch element in\n        `input`. Dimensions (1, batch_size, hidden_size). Defaults to zeros.\n      lengths: (optional) Tensor, list of sequence lengths for each batch\n
        element. Dimension (batch_size). This argument may be omitted if\n        all batch elements are unpadded and have the same sequence length.\n\n    Returns:\n      output: Tensor, the output of the IndRNN layer. Dimensions\n        (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n        or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n        that if `lengths` was specified, the `output` tensor will not be\n        masked. It's the caller's responsibility to either not use the invalid\n        entries or to mask them out before using them.\n      state: the hidden state for the last sequence item. Dimensions\n        (1, batch_size, hidden_size).\n    \"\"\"\n    input = self._permute(input)\n    state_shape = [1, input.shape[1], self.hidden_size]\n    h0 = self._get_state(input, state, state_shape)\n    h = self._impl(input, h0[0], self._get_zoneout_mask(input))\n    state = self._get_final_state(h, lengths)\n    output = self._permute(h[1:])\n    return output, state\n\n  def _impl(self, input, state, zoneout_mask):\n    if self._is_cuda():\n      return LayerNormIndRNNFunction.apply(\n        self.training,\n        self.zoneout,\n        input.contiguous(),\n        state.contiguous(),\n        self.kernel.contiguous(),\n        self.recurrent_scale.contiguous(),\n        self.bias.contiguous(),\n        self.gamma.contiguous(),\n        zoneout_mask.contiguous())\n    else:\n      return LayerNormIndRNNScript(\n        self.training,\n        self.zoneout,\n        input.contiguous(),\n        state.contiguous(),\n        self.kernel.contiguous(),\n        self.recurrent_scale.contiguous(),\n        self.bias.contiguous(),\n        self.gamma.contiguous(),\n        zoneout_mask.contiguous())\n"
  },
  {
    "path": "frameworks/pytorch/layer_norm_lstm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <ATen/cuda/CUDAContext.h>\n#include <c10/cuda/CUDAGuard.h>\n#include <torch/extension.h>\n#include <vector>\n\n#include \"haste.h\"\n#include \"support.h\"\n\nnamespace {\n\nnamespace layer_norm = haste::v0::layer_norm;\nnamespace layer_norm_lstm = haste::v0::layer_norm_lstm;\n\nusing torch::Tensor;\n\nstd::vector<Tensor> layer_norm_lstm_forward(\n    bool training,\n    float zoneout_prob,\n    Tensor x,\n    Tensor h0,\n    Tensor c0,\n    Tensor kernel,\n    Tensor recurrent_kernel,\n    Tensor bias,\n    Tensor gamma,\n    Tensor gamma_h,\n    Tensor beta_h,\n    Tensor zoneout_mask) {\n  const auto time_steps = x.size(0);\n  const auto batch_size = x.size(1);\n  const auto input_size = x.size(2);\n  const auto hidden_size = recurrent_kernel.size(0);\n  const bool has_zoneout = zoneout_prob && zoneout_mask.size(0);\n\n  CHECK_INPUT(x);\n  CHECK_INPUT(h0);\n  CHECK_INPUT(c0);\n  CHECK_INPUT(kernel);\n  CHECK_INPUT(recurrent_kernel);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(gamma);\n  CHECK_INPUT(gamma_h);\n  CHECK_INPUT(beta_h);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor output = torch::zeros({ time_steps + 1, batch_size, hidden_size }, options);\n  
Tensor output_state = torch::zeros({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor act_Wx = torch::empty({ time_steps, batch_size, hidden_size * 4 }, options);\n  Tensor act_Wx_norm = torch::empty({ time_steps, batch_size, hidden_size * 4 }, options);\n  Tensor act_Wx_norm_cache = torch::empty({ time_steps, batch_size, 2 }, options);\n  Tensor act_Rh = torch::empty({ time_steps, batch_size, hidden_size * 4 }, options);\n  Tensor act_Rh_norm_cache = torch::empty({ time_steps, batch_size, 2 }, options);\n  Tensor act_c_norm = torch::empty({ time_steps, batch_size, hidden_size }, options);\n  Tensor act_c_norm_cache = torch::empty({ time_steps, batch_size, 2 }, options);\n  Tensor tmp_Rh = torch::empty({ batch_size, hidden_size * 4 }, options);\n\n  output[0] = h0;\n  output_state[0] = c0;\n\n  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), \"layer_norm_lstm_forward\", ([&] {\n    auto gamma_a = gamma.packed_accessor32<scalar_t, 2>();\n\n    layer_norm::ForwardPass<scalar_t> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma_a[0].data(),\n        nullptr,\n        act_Wx_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm::ForwardPass<scalar_t> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma_a[1].data(),\n        nullptr,\n        act_Rh_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm::ForwardPass<scalar_t> layer_norm3(\n        time_steps * batch_size,\n        hidden_size,\n        gamma_h.data_ptr<scalar_t>(),\n        beta_h.data_ptr<scalar_t>(),\n        act_c_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm_lstm::ForwardPass<scalar_t> lstm(\n        training,\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    lstm.Run(\n        time_steps,\n        kernel.data_ptr<scalar_t>(),\n        recurrent_kernel.data_ptr<scalar_t>(),\n        
bias.data_ptr<scalar_t>(),\n        x.data_ptr<scalar_t>(),\n        output.data_ptr<scalar_t>(),\n        output_state.data_ptr<scalar_t>(),\n        act_Wx.data_ptr<scalar_t>(),\n        tmp_Rh.data_ptr<scalar_t>(),\n        layer_norm1,\n        act_Wx_norm.data_ptr<scalar_t>(),\n        act_Rh.data_ptr<scalar_t>(),\n        layer_norm2,\n        layer_norm3,\n        act_c_norm.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_prob : 0.0f,\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return {\n      output,\n      output_state,\n      act_Wx,\n      act_Wx_norm,\n      act_Wx_norm_cache,\n      act_Rh,\n      act_Rh_norm_cache,\n      act_c_norm,\n      act_c_norm_cache };\n}\n\nstd::vector<Tensor> layer_norm_lstm_backward(\n    Tensor x_t,\n    Tensor kernel_t,\n    Tensor recurrent_kernel_t,\n    Tensor bias,\n    Tensor gamma,\n    Tensor gamma_h,\n    Tensor beta_h,\n    Tensor zoneout_mask,\n    Tensor h,\n    Tensor c,\n    Tensor act_Wx,\n    Tensor act_Wx_norm,\n    Tensor act_Wx_norm_cache,\n    Tensor act_Rh,\n    Tensor act_Rh_norm_cache,\n    Tensor act_c_norm,\n    Tensor act_c_norm_cache,\n    Tensor dh_new,\n    Tensor dc_new) {\n  const auto input_size = x_t.size(0);\n  const auto time_steps = x_t.size(1);\n  const auto batch_size = x_t.size(2);\n  const auto hidden_size = recurrent_kernel_t.size(1);\n  const bool has_zoneout = !!zoneout_mask.size(0);\n\n  CHECK_INPUT(x_t);\n  CHECK_INPUT(kernel_t);\n  CHECK_INPUT(recurrent_kernel_t);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(gamma);\n  CHECK_INPUT(gamma_h);\n  CHECK_INPUT(beta_h);\n  CHECK_INPUT(zoneout_mask);\n  CHECK_INPUT(h);\n  CHECK_INPUT(c);\n  CHECK_INPUT(act_Wx);\n  CHECK_INPUT(act_Wx_norm);\n  CHECK_INPUT(act_Wx_norm_cache);\n  CHECK_INPUT(act_Rh);\n  CHECK_INPUT(act_Rh_norm_cache);\n  CHECK_INPUT(act_c_norm);\n  CHECK_INPUT(act_c_norm_cache);\n  CHECK_INPUT(dh_new);\n  CHECK_INPUT(dc_new);\n\n  const auto options = x_t.options();\n  const 
at::cuda::CUDAGuard guard(options.device_index());\n  Tensor dx = torch::empty({ time_steps, batch_size, input_size }, options);\n  Tensor dW = torch::zeros({ input_size, hidden_size * 4 }, options);\n  Tensor dR = torch::zeros({ hidden_size, hidden_size * 4 }, options);\n  Tensor db = torch::zeros({ hidden_size * 4 }, options);\n  Tensor dgamma = torch::zeros_like(gamma);\n  Tensor dgamma_h = torch::zeros_like(gamma_h);\n  Tensor dbeta_h = torch::zeros_like(beta_h);\n  Tensor dh = torch::zeros({ batch_size, hidden_size }, options);\n  Tensor dc = torch::zeros({ batch_size, hidden_size }, options);\n\n  AT_DISPATCH_FLOATING_TYPES(x_t.scalar_type(), \"layer_norm_lstm_backward\", ([&] {\n    auto gamma_a = gamma.packed_accessor32<scalar_t, 2>();\n    auto dgamma_a = dgamma.packed_accessor32<scalar_t, 2>();\n    auto c_a = c.packed_accessor32<scalar_t, 3>();\n\n    layer_norm::BackwardPass<scalar_t> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma_a[0].data(),\n        nullptr,\n        act_Wx.data_ptr<scalar_t>(),\n        dgamma_a[0].data(),\n        nullptr,\n        act_Wx_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm::BackwardPass<scalar_t> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma_a[1].data(),\n        nullptr,\n        act_Rh.data_ptr<scalar_t>(),\n        dgamma_a[1].data(),\n        nullptr,\n        act_Rh_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm::BackwardPass<scalar_t> layer_norm3(\n        time_steps * batch_size,\n        hidden_size,\n        gamma_h.data_ptr<scalar_t>(),\n        beta_h.data_ptr<scalar_t>(),\n        c_a[1].data(),\n        dgamma_h.data_ptr<scalar_t>(),\n        dbeta_h.data_ptr<scalar_t>(),\n        act_c_norm_cache.data_ptr<scalar_t>());\n\n    layer_norm_lstm::BackwardPass<scalar_t> lstm(\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        
at::cuda::getCurrentCUDAStream());\n\n    lstm.Run(\n        time_steps,\n        kernel_t.data_ptr<scalar_t>(),\n        recurrent_kernel_t.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x_t.data_ptr<scalar_t>(),\n        h.data_ptr<scalar_t>(),\n        c.data_ptr<scalar_t>(),\n        dh_new.data_ptr<scalar_t>(),\n        dc_new.data_ptr<scalar_t>(),\n        dx.data_ptr<scalar_t>(),\n        dW.data_ptr<scalar_t>(),\n        dR.data_ptr<scalar_t>(),\n        db.data_ptr<scalar_t>(),\n        dh.data_ptr<scalar_t>(),\n        dc.data_ptr<scalar_t>(),\n        act_Wx.data_ptr<scalar_t>(),\n        layer_norm1,\n        act_Wx_norm.data_ptr<scalar_t>(),\n        act_Rh.data_ptr<scalar_t>(),\n        layer_norm2,\n        layer_norm3,\n        act_c_norm.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { dx, dh, dc, dW, dR, db, dgamma, dgamma_h, dbeta_h };\n}\n\n}  // anonymous namespace\n\nvoid layer_norm_lstm_init(py::module& m) {\n  m.def(\"layer_norm_lstm_forward\", &layer_norm_lstm_forward, \"LayerNormLSTM forward\", py::call_guard<py::gil_scoped_release>());\n  m.def(\"layer_norm_lstm_backward\", &layer_norm_lstm_backward, \"LayerNormLSTM backward\", py::call_guard<py::gil_scoped_release>());\n}\n"
  },
  {
    "path": "frameworks/pytorch/layer_norm_lstm.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalized Long Short-Term Memory\"\"\"\n\n\nimport haste_pytorch_lib as LIB\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_rnn import BaseRNN\n\n\n__all__ = [\n    'LayerNormLSTM'\n]\n\n\n#@torch.jit.script\ndef LayerNormLSTMScript(\n    training: bool,\n    zoneout_prob: float,\n    input,\n    h0,\n    c0,\n    kernel,\n    recurrent_kernel,\n    bias,\n    gamma,\n    gamma_h,\n    beta_h,\n    zoneout_mask):\n  time_steps = input.shape[0]\n  batch_size = input.shape[1]\n  hidden_size = recurrent_kernel.shape[0]\n\n  h = [h0]\n  c = [c0]\n  Wx = F.layer_norm(input @ kernel, (hidden_size * 4,), weight=gamma[0])\n  for t in range(time_steps):\n    v = F.layer_norm(h[t] @ recurrent_kernel, (hidden_size * 4,), weight=gamma[1]) + Wx[t] + bias\n    i, g, f, o = torch.chunk(v, 4, 1)\n    i = torch.sigmoid(i)\n    g = torch.tanh(g)\n    f = torch.sigmoid(f)\n    o = torch.sigmoid(o)\n    c.append(f * c[t] + i * g)\n    h.append(o * torch.tanh(F.layer_norm(c[-1], (hidden_size,), weight=gamma_h, bias=beta_h)))\n    if zoneout_prob:\n      if training:\n        h[-1] = (h[-1] - h[-2]) * zoneout_mask[t] + h[-2]\n      else:\n        h[-1] = zoneout_prob * h[-2] + (1 - zoneout_prob) * h[-1]\n  h = torch.stack(h)\n  c = 
torch.stack(c)\n  return h, c\n\n\nclass LayerNormLSTMFunction(torch.autograd.Function):\n  @staticmethod\n  def forward(ctx, training, zoneout_prob, *inputs):\n    outputs = LIB.layer_norm_lstm_forward(training, zoneout_prob, *inputs)\n    ctx.save_for_backward(inputs[0], *inputs[3:], *outputs)\n    ctx.mark_non_differentiable(inputs[-1])  # zoneout mask is non-differentiable\n    ctx.training = training\n    return outputs[0], outputs[1]\n\n  @staticmethod\n  def backward(ctx, grad_h, grad_c):\n    if not ctx.training:\n      raise RuntimeError('LayerNormLSTM backward can only be called in training mode')\n\n    saved = [*ctx.saved_tensors]\n    saved[0] = saved[0].permute(2, 0, 1).contiguous()  # x -> x_t\n    saved[1] = saved[1].permute(1, 0).contiguous()     # kernel -> kernel_t\n    saved[2] = saved[2].permute(1, 0).contiguous()     # recurrent_kernel -> recurrent_kernel_t\n    grads = LIB.layer_norm_lstm_backward(*saved, grad_h.contiguous(), grad_c.contiguous())\n    return (None, None, *grads, None)\n\n\nclass LayerNormLSTM(BaseRNN):\n  \"\"\"\n  Layer Normalized Long Short-Term Memory layer.\n\n  This LSTM layer applies layer normalization to the input, recurrent, and\n  output activations of a standard LSTM. The implementation is fused and\n  GPU-accelerated. 
DropConnect and Zoneout regularization are built-in, and\n  this layer allows setting a non-zero initial forget gate bias.\n\n  Details about the exact function this layer implements can be found at\n  https://github.com/lmnt-com/haste/issues/1.\n\n  See [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for usage.\n  \"\"\"\n\n  def __init__(self,\n      input_size,\n      hidden_size,\n      batch_first=False,\n      forget_bias=1.0,\n      dropout=0.0,\n      zoneout=0.0,\n      return_state_sequence=False):\n    \"\"\"\n    Initialize the parameters of the LSTM layer.\n\n    Arguments:\n      input_size: int, the feature dimension of the input.\n      hidden_size: int, the feature dimension of the output.\n      batch_first: (optional) bool, if `True`, then the input and output\n        tensors are provided as `(batch, seq, feature)`.\n      forget_bias: (optional) float, sets the initial bias of the forget gate\n        for this LSTM cell.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        regularization on the recurrent matrix.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization.\n      return_state_sequence: (optional) bool, if `True`, the forward pass will\n        return the entire state sequence instead of just the final state. Note\n        that if the input is a padded sequence, the returned state will also\n        be a padded sequence.\n\n    Variables:\n      kernel: the input projection weight matrix. Dimensions\n        (input_size, hidden_size * 4) with `i,g,f,o` gate layout. Initialized\n        with Xavier uniform initialization.\n      recurrent_kernel: the recurrent projection weight matrix. Dimensions\n        (hidden_size, hidden_size * 4) with `i,g,f,o` gate layout. Initialized\n        with orthogonal initialization.\n      bias: the projection bias vector. Dimensions (hidden_size * 4) with\n        `i,g,f,o` gate layout. 
The forget gate biases are initialized to\n        `forget_bias` and the rest are zeros.\n      gamma: the input and recurrent normalization gain. Dimensions\n        (2, hidden_size * 4) with `gamma[0]` specifying the input gain and\n        `gamma[1]` specifying the recurrent gain. Initialized to ones.\n      gamma_h: the output normalization gain. Dimensions (hidden_size).\n        Initialized to ones.\n      beta_h: the output normalization bias. Dimensions (hidden_size).\n        Initialized to zeros.\n    \"\"\"\n    super().__init__(input_size, hidden_size, batch_first, zoneout, return_state_sequence)\n\n    if dropout < 0 or dropout > 1:\n      raise ValueError('LayerNormLSTM: dropout must be in [0.0, 1.0]')\n    if zoneout < 0 or zoneout > 1:\n      raise ValueError('LayerNormLSTM: zoneout must be in [0.0, 1.0]')\n\n    self.forget_bias = forget_bias\n    self.dropout = dropout\n\n    self.kernel = nn.Parameter(torch.empty(input_size, hidden_size * 4))\n    self.recurrent_kernel = nn.Parameter(torch.empty(hidden_size, hidden_size * 4))\n    self.bias = nn.Parameter(torch.empty(hidden_size * 4))\n    self.gamma = nn.Parameter(torch.empty(2, hidden_size * 4))\n    self.gamma_h = nn.Parameter(torch.empty(hidden_size))\n    self.beta_h = nn.Parameter(torch.empty(hidden_size))\n    self.reset_parameters()\n\n  def reset_parameters(self):\n    \"\"\"Resets this layer's parameters to their initial values.\"\"\"\n    hidden_size = self.hidden_size\n    for i in range(4):\n      nn.init.xavier_uniform_(self.kernel[:, i*hidden_size:(i+1)*hidden_size])\n      nn.init.orthogonal_(self.recurrent_kernel[:, i*hidden_size:(i+1)*hidden_size])\n    nn.init.zeros_(self.bias)\n    nn.init.constant_(self.bias[hidden_size*2:hidden_size*3], self.forget_bias)\n    nn.init.ones_(self.gamma)\n    nn.init.ones_(self.gamma_h)\n    nn.init.zeros_(self.beta_h)\n\n  def forward(self, input, state=None, lengths=None):\n    \"\"\"\n    Runs a forward pass of the LSTM layer.\n\n    
Arguments:\n      input: Tensor, a batch of input sequences to pass through the LSTM.\n        Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n        `False`, otherwise (batch_size, seq_len, input_size).\n      state: (optional) tuple of Tensors (h_0, c_0), the initial state of the\n        LSTM, each with dimensions (1, batch_size, hidden_size). Defaults to\n        zeros if not given.\n      lengths: (optional) Tensor, list of sequence lengths for each batch\n        element. Dimension (batch_size). This argument may be omitted if\n        all batch elements are unpadded and have the same sequence length.\n\n    Returns:\n      output: Tensor, the output of the LSTM layer. Dimensions\n        (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n        or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n        that if `lengths` was specified, the `output` tensor will not be\n        masked. It's the caller's responsibility to either not use the invalid\n        entries or to mask them out before using them.\n      (h_n, c_n): the hidden and cell states, respectively, for the last\n        sequence item. Dimensions (1, batch_size, hidden_size).\n    \"\"\"\n    input = self._permute(input)\n    state_shape = [1, input.shape[1], self.hidden_size]\n    state_shape = (state_shape, state_shape)\n    h0, c0 = self._get_state(input, state, state_shape)\n    h, c = self._impl(input, (h0[0], c0[0]), self._get_zoneout_mask(input))\n    state = self._get_final_state((h, c), lengths)\n    output = self._permute(h[1:])\n    return output, state\n\n  def _impl(self, input, state, zoneout_mask):\n    if self._is_cuda():\n      return LayerNormLSTMFunction.apply(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state[0].contiguous(),\n          state[1].contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          self.gamma.contiguous(),\n          self.gamma_h.contiguous(),\n          self.beta_h.contiguous(),\n          zoneout_mask.contiguous())\n    else:\n      return LayerNormLSTMScript(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state[0].contiguous(),\n          state[1].contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          self.gamma.contiguous(),\n          self.gamma_h.contiguous(),\n          self.beta_h.contiguous(),\n          zoneout_mask.contiguous())\n"
  },
  {
    "path": "frameworks/pytorch/lstm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <ATen/cuda/CUDAContext.h>\n#include <c10/cuda/CUDAGuard.h>\n#include <torch/extension.h>\n#include <vector>\n\n#include \"haste.h\"\n#include \"support.h\"\n\nnamespace {\n\nusing haste::v0::lstm::ForwardPass;\nusing haste::v0::lstm::BackwardPass;\n\nusing torch::Tensor;\n\nstd::vector<Tensor> lstm_forward(\n    bool training,\n    float zoneout_prob,\n    Tensor x,\n    Tensor h0,\n    Tensor c0,\n    Tensor kernel,\n    Tensor recurrent_kernel,\n    Tensor bias,\n    Tensor zoneout_mask) {\n  const auto time_steps = x.size(0);\n  const auto batch_size = x.size(1);\n  const auto input_size = x.size(2);\n  const auto hidden_size = recurrent_kernel.size(0);\n  const bool has_zoneout = zoneout_prob && zoneout_mask.size(0);\n\n  CHECK_INPUT(x);\n  CHECK_INPUT(h0);\n  CHECK_INPUT(c0);\n  CHECK_INPUT(kernel);\n  CHECK_INPUT(recurrent_kernel);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor output = torch::empty({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor output_state = torch::empty({ time_steps + 1, batch_size, hidden_size }, options);\n  Tensor cache = torch::empty({ time_steps, batch_size, hidden_size * 4 }, 
options);\n  Tensor tmp_Rh = torch::empty({ batch_size, hidden_size * 4 }, options);\n\n  output[0] = h0;\n  output_state[0] = c0;\n\n  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), \"lstm_forward\", ([&] {\n    ForwardPass<scalar_t> forward(\n        training,\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    forward.Run(\n        time_steps,\n        kernel.data_ptr<scalar_t>(),\n        recurrent_kernel.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x.data_ptr<scalar_t>(),\n        output.data_ptr<scalar_t>(),\n        output_state.data_ptr<scalar_t>(),\n        cache.data_ptr<scalar_t>(),\n        tmp_Rh.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_prob : 0.0f,\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { output, output_state, cache };\n}\n\nstd::vector<Tensor> lstm_backward(\n    Tensor x_t,\n    Tensor kernel_t,\n    Tensor recurrent_kernel_t,\n    Tensor bias,\n    Tensor zoneout_mask,\n    Tensor h,\n    Tensor c,\n    Tensor cache,\n    Tensor dh_new,\n    Tensor dc_new) {\n  const auto input_size = x_t.size(0);\n  const auto time_steps = x_t.size(1);\n  const auto batch_size = x_t.size(2);\n  const auto hidden_size = recurrent_kernel_t.size(1);\n  const bool has_zoneout = !!zoneout_mask.size(0);\n\n  CHECK_INPUT(x_t);\n  CHECK_INPUT(kernel_t);\n  CHECK_INPUT(recurrent_kernel_t);\n  CHECK_INPUT(bias);\n  CHECK_INPUT(h);\n  CHECK_INPUT(c);\n  CHECK_INPUT(cache);\n  CHECK_INPUT(dh_new);\n  CHECK_INPUT(dc_new);\n  CHECK_INPUT(zoneout_mask);\n\n  const auto options = x_t.options();\n  const at::cuda::CUDAGuard guard(options.device_index());\n  Tensor dx = torch::empty({ time_steps, batch_size, input_size }, options);\n  Tensor dW = torch::zeros({ input_size, hidden_size * 4 }, options);\n  Tensor dR = torch::zeros({ hidden_size, hidden_size * 4 }, options);\n  Tensor db = 
torch::zeros_like(bias);\n  Tensor dh = torch::zeros({ batch_size, hidden_size }, options);\n  Tensor dc = torch::zeros({ batch_size, hidden_size }, options);\n\n  AT_DISPATCH_FLOATING_TYPES(x_t.scalar_type(), \"lstm_backward\", ([&] {\n    BackwardPass<scalar_t> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        at::cuda::getCurrentCUDABlasHandle(),\n        at::cuda::getCurrentCUDAStream());\n\n    backward.Run(\n        time_steps,\n        kernel_t.data_ptr<scalar_t>(),\n        recurrent_kernel_t.data_ptr<scalar_t>(),\n        bias.data_ptr<scalar_t>(),\n        x_t.data_ptr<scalar_t>(),\n        h.data_ptr<scalar_t>(),\n        c.data_ptr<scalar_t>(),\n        dh_new.data_ptr<scalar_t>(),\n        dc_new.data_ptr<scalar_t>(),\n        dx.data_ptr<scalar_t>(),\n        dW.data_ptr<scalar_t>(),\n        dR.data_ptr<scalar_t>(),\n        db.data_ptr<scalar_t>(),\n        dh.data_ptr<scalar_t>(),\n        dc.data_ptr<scalar_t>(),\n        cache.data_ptr<scalar_t>(),\n        has_zoneout ? zoneout_mask.data_ptr<scalar_t>() : nullptr);\n  }));\n\n  return { dx, dh, dc, dW, dR, db };\n}\n\n}  // anonymous namespace\n\nvoid lstm_init(py::module& m) {\n  m.def(\"lstm_forward\", &lstm_forward, \"LSTM forward\", py::call_guard<py::gil_scoped_release>());\n  m.def(\"lstm_backward\", &lstm_backward, \"LSTM backward\", py::call_guard<py::gil_scoped_release>());\n}\n"
  },
  {
    "path": "frameworks/pytorch/lstm.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Long Short-Term Memory\"\"\"\n\n\nimport haste_pytorch_lib as LIB\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .base_rnn import BaseRNN\n\n\n__all__ = [\n    'LSTM'\n]\n\n\n#@torch.jit.script\ndef LSTMScript(\n    training: bool,\n    zoneout_prob: float,\n    input,\n    h0,\n    c0,\n    kernel,\n    recurrent_kernel,\n    bias,\n    zoneout_mask):\n  time_steps = input.shape[0]\n  batch_size = input.shape[1]\n  hidden_size = recurrent_kernel.shape[0]\n\n  h = [h0]\n  c = [c0]\n  Wx = input @ kernel\n  for t in range(time_steps):\n    v = h[t] @ recurrent_kernel + Wx[t] + bias\n    i, g, f, o = torch.chunk(v, 4, 1)\n    i = torch.sigmoid(i)\n    g = torch.tanh(g)\n    f = torch.sigmoid(f)\n    o = torch.sigmoid(o)\n    c.append(f * c[t] + i * g)\n    h.append(o * torch.tanh(c[-1]))\n    if zoneout_prob:\n      if training:\n        h[-1] = (h[-1] - h[-2]) * zoneout_mask[t] + h[-2]\n      else:\n        h[-1] = zoneout_prob * h[-2] + (1 - zoneout_prob) * h[-1]\n  h = torch.stack(h)\n  c = torch.stack(c)\n  return h, c\n\n\nclass LSTMFunction(torch.autograd.Function):\n  @staticmethod\n  def forward(ctx, training, zoneout_prob, *inputs):\n    h, c, cache = LIB.lstm_forward(training, zoneout_prob, *inputs)\n    
ctx.save_for_backward(inputs[0], *inputs[3:], h, c, cache)\n    ctx.mark_non_differentiable(inputs[-1])\n    ctx.training = training\n    return h, c\n\n  @staticmethod\n  def backward(ctx, grad_h, grad_c):\n    if not ctx.training:\n      raise RuntimeError('LSTM backward can only be called in training mode')\n\n    saved = [*ctx.saved_tensors]\n    saved[0] = saved[0].permute(2, 0, 1).contiguous()\n    saved[1] = saved[1].permute(1, 0).contiguous()\n    saved[2] = saved[2].permute(1, 0).contiguous()\n    grads = LIB.lstm_backward(*saved, grad_h.contiguous(), grad_c.contiguous())\n    return (None, None, *grads, None)\n\n\nclass LSTM(BaseRNN):\n  \"\"\"\n  Long Short-Term Memory layer.\n\n  This LSTM layer offers a fused, GPU-accelerated PyTorch op for inference\n  and training. Although this implementation is comparable in performance to\n  cuDNN's LSTM, it offers additional options not typically found in other\n  high-performance implementations. DropConnect and Zoneout regularization are\n  built-in, and this layer allows setting a non-zero initial forget gate bias.\n\n  See [\\_\\_init\\_\\_](#__init__) and [forward](#forward) for general usage.\n  See [from_native_weights](#from_native_weights) and\n  [to_native_weights](#to_native_weights) for compatibility with PyTorch LSTMs.\n  \"\"\"\n\n  def __init__(self,\n      input_size,\n      hidden_size,\n      batch_first=False,\n      forget_bias=1.0,\n      dropout=0.0,\n      zoneout=0.0,\n      return_state_sequence=False):\n    \"\"\"\n    Initialize the parameters of the LSTM layer.\n\n    Arguments:\n      input_size: int, the feature dimension of the input.\n      hidden_size: int, the feature dimension of the output.\n      batch_first: (optional) bool, if `True`, then the input and output\n        tensors are provided as `(batch, seq, feature)`.\n      forget_bias: (optional) float, sets the initial bias of the forget gate\n        for this LSTM cell.\n      dropout: (optional) float, sets the dropout 
rate for DropConnect\n        regularization on the recurrent matrix.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization.\n      return_state_sequence: (optional) bool, if `True`, the forward pass will\n        return the entire state sequence instead of just the final state. Note\n        that if the input is a padded sequence, the returned state will also\n        be a padded sequence.\n\n    Variables:\n      kernel: the input projection weight matrix. Dimensions\n        (input_size, hidden_size * 4) with `i,g,f,o` gate layout. Initialized\n        with Xavier uniform initialization.\n      recurrent_kernel: the recurrent projection weight matrix. Dimensions\n        (hidden_size, hidden_size * 4) with `i,g,f,o` gate layout. Initialized\n        with orthogonal initialization.\n      bias: the projection bias vector. Dimensions (hidden_size * 4) with\n        `i,g,f,o` gate layout. The forget gate biases are initialized to\n        `forget_bias` and the rest are zeros.\n    \"\"\"\n    super().__init__(input_size, hidden_size, batch_first, zoneout, return_state_sequence)\n\n    if dropout < 0 or dropout > 1:\n      raise ValueError('LSTM: dropout must be in [0.0, 1.0]')\n    if zoneout < 0 or zoneout > 1:\n      raise ValueError('LSTM: zoneout must be in [0.0, 1.0]')\n\n    self.forget_bias = forget_bias\n    self.dropout = dropout\n\n    self.kernel = nn.Parameter(torch.empty(input_size, hidden_size * 4))\n    self.recurrent_kernel = nn.Parameter(torch.empty(hidden_size, hidden_size * 4))\n    self.bias = nn.Parameter(torch.empty(hidden_size * 4))\n    self.reset_parameters()\n\n  def to_native_weights(self):\n    \"\"\"\n    Converts Haste LSTM weights to native PyTorch LSTM weights.\n\n    Returns:\n      weight_ih_l0: Parameter, the input-hidden weights of the LSTM layer.\n      weight_hh_l0: Parameter, the hidden-hidden weights of the LSTM layer.\n      bias_ih_l0: Parameter, the input-hidden bias of the LSTM 
layer.\n      bias_hh_l0: Parameter, the hidden-hidden bias of the LSTM layer.\n    \"\"\"\n    def reorder_weights(w):\n      i, g, f, o = torch.chunk(w, 4, dim=-1)\n      return torch.cat([i, f, g, o], dim=-1)\n    kernel = reorder_weights(self.kernel).permute(1, 0).contiguous()\n    recurrent_kernel = reorder_weights(self.recurrent_kernel).permute(1, 0).contiguous()\n    half_bias = reorder_weights(self.bias) / 2.0\n\n    kernel = torch.nn.Parameter(kernel)\n    recurrent_kernel = torch.nn.Parameter(recurrent_kernel)\n    bias1 = torch.nn.Parameter(half_bias)\n    bias2 = torch.nn.Parameter(half_bias.clone())\n    return kernel, recurrent_kernel, bias1, bias2\n\n  def from_native_weights(self, weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0):\n    \"\"\"\n    Copies and converts the provided PyTorch LSTM weights into this layer.\n\n    Arguments:\n      weight_ih_l0: Parameter, the input-hidden weights of the PyTorch LSTM layer.\n      weight_hh_l0: Parameter, the hidden-hidden weights of the PyTorch LSTM layer.\n      bias_ih_l0: Parameter, the input-hidden bias of the PyTorch LSTM layer.\n      bias_hh_l0: Parameter, the hidden-hidden bias of the PyTorch LSTM layer.\n    \"\"\"\n    def reorder_weights(w):\n      i, f, g, o = torch.chunk(w, 4, dim=-1)\n      return torch.cat([i, g, f, o], dim=-1)\n    kernel = reorder_weights(weight_ih_l0.permute(1, 0)).contiguous()\n    recurrent_kernel = reorder_weights(weight_hh_l0.permute(1, 0)).contiguous()\n    bias = reorder_weights(bias_ih_l0 + bias_hh_l0).contiguous()\n\n    self.kernel = nn.Parameter(kernel)\n    self.recurrent_kernel = nn.Parameter(recurrent_kernel)\n    self.bias = nn.Parameter(bias)\n\n  def reset_parameters(self):\n    \"\"\"Resets this layer's parameters to their initial values.\"\"\"\n    hidden_size = self.hidden_size\n    for i in range(4):\n      nn.init.xavier_uniform_(self.kernel[:, i*hidden_size:(i+1)*hidden_size])\n      nn.init.orthogonal_(self.recurrent_kernel[:, 
i*hidden_size:(i+1)*hidden_size])\n    nn.init.zeros_(self.bias)\n    nn.init.constant_(self.bias[hidden_size*2:hidden_size*3], self.forget_bias)\n\n  def forward(self, input, state=None, lengths=None):\n    \"\"\"\n    Runs a forward pass of the LSTM layer.\n\n    Arguments:\n      input: Tensor, a batch of input sequences to pass through the LSTM.\n        Dimensions (seq_len, batch_size, input_size) if `batch_first` is\n        `False`, otherwise (batch_size, seq_len, input_size).\n      state: (optional) tuple of Tensors (h_0, c_0), the initial state of the\n        LSTM, each with dimensions (1, batch_size, hidden_size). Defaults to\n        zeros if not given.\n      lengths: (optional) Tensor, list of sequence lengths for each batch\n        element. Dimension (batch_size). This argument may be omitted if\n        all batch elements are unpadded and have the same sequence length.\n\n    Returns:\n      output: Tensor, the output of the LSTM layer. Dimensions\n        (seq_len, batch_size, hidden_size) if `batch_first` is `False` (default)\n        or (batch_size, seq_len, hidden_size) if `batch_first` is `True`. Note\n        that if `lengths` was specified, the `output` tensor will not be\n        masked. It's the caller's responsibility to either not use the invalid\n        entries or to mask them out before using them.\n      (h_n, c_n): the hidden and cell states, respectively, for the last\n        sequence item. Dimensions (1, batch_size, hidden_size).\n    \"\"\"\n    input = self._permute(input)\n    state_shape = [1, input.shape[1], self.hidden_size]\n    state_shape = (state_shape, state_shape)\n    h0, c0 = self._get_state(input, state, state_shape)\n    h, c = self._impl(input, (h0[0], c0[0]), self._get_zoneout_mask(input))\n    state = self._get_final_state((h, c), lengths)\n    output = self._permute(h[1:])\n    return output, state\n\n  def _impl(self, input, state, zoneout_mask):\n    if self._is_cuda():\n      return LSTMFunction.apply(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state[0].contiguous(),\n          state[1].contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          zoneout_mask.contiguous())\n    else:\n      return LSTMScript(\n          self.training,\n          self.zoneout,\n          input.contiguous(),\n          state[0].contiguous(),\n          state[1].contiguous(),\n          self.kernel.contiguous(),\n          F.dropout(self.recurrent_kernel, self.dropout, self.training).contiguous(),\n          self.bias.contiguous(),\n          zoneout_mask.contiguous())\n"
  },
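  {
    "path": "examples/lstm_usage.py",
    "content": "# NOTE: illustrative sketch only -- this file and its path are hypothetical and\n# not part of the upstream Haste distribution. It demonstrates the\n# `haste_pytorch.LSTM` API documented in frameworks/pytorch/lstm.py: time-major\n# input of shape (seq_len, batch_size, input_size) and a returned\n# (output, (h_n, c_n)) pair.\n\nimport torch\nimport haste_pytorch as haste\n\nlstm = haste.LSTM(input_size=128, hidden_size=256, zoneout=0.1, forget_bias=1.0)\nlstm.cuda()  # the CUDA path uses the fused kernels; CPU falls back to LSTMScript\n\nx = torch.rand([25, 5, 128], device='cuda')  # (seq_len, batch, features)\noutput, (h_n, c_n) = lstm(x)\nassert output.shape == (25, 5, 256)\nassert h_n.shape == (1, 5, 256)\nassert c_n.shape == (1, 5, 256)\n\n# Convert to native PyTorch LSTM weights for interoperability (see\n# LSTM.to_native_weights for the i,g,f,o -> i,f,g,o gate reordering).\nweight_ih, weight_hh, bias_ih, bias_hh = lstm.to_native_weights()\n"
  },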
  {
    "path": "frameworks/pytorch/support.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <torch/extension.h>\n\nvoid gru_init(py::module&);\nvoid indrnn_init(py::module&);\nvoid lstm_init(py::module&);\nvoid layer_norm_gru_init(py::module&);\nvoid layer_norm_indrnn_init(py::module&);\nvoid layer_norm_lstm_init(py::module&);\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  gru_init(m);\n  indrnn_init(m);\n  lstm_init(m);\n  layer_norm_gru_init(m);\n  layer_norm_indrnn_init(m);\n  layer_norm_lstm_init(m);\n}\n"
  },
  {
    "path": "frameworks/pytorch/support.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <torch/extension.h>\n\n#define CHECK_CUDA(x) TORCH_CHECK(x.options().device().is_cuda(), #x \" must be a CUDA tensor\")\n#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x \" must be contiguous\")\n#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)\n\ntemplate<typename U>\nstruct native_type {\n  using T = U;\n};\n\ntemplate<>\nstruct native_type<c10::Half> {\n  using T = __half;\n};\n\ntemplate<typename U>\ntypename native_type<U>::T* ptr(torch::Tensor t) {\n  return reinterpret_cast<typename native_type<U>::T*>(t.data_ptr<U>());\n}\n"
  },
  {
    "path": "frameworks/tf/__init__.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"\nHaste: a fast, simple, and open RNN library.\n\"\"\"\n\n\nfrom ._version import __version__  # generated in setup.py\nfrom .gru import GRU\nfrom .gru_cell import GRUCell\nfrom .indrnn import IndRNN\nfrom .layer_norm import LayerNorm\nfrom .layer_norm_gru import LayerNormGRU\nfrom .layer_norm_gru_cell import LayerNormGRUCell\nfrom .layer_norm_indrnn import LayerNormIndRNN\nfrom .layer_norm_lstm import LayerNormLSTM\nfrom .layer_norm_lstm_cell import LayerNormLSTMCell\nfrom .lstm import LSTM\nfrom .zoneout_wrapper import ZoneoutWrapper\n\n\n__all__ = [\n    'GRU',\n    'GRUCell',\n    'IndRNN',\n    'LayerNorm',\n    'LayerNormGRU',\n    'LayerNormGRUCell',\n    'LayerNormIndRNN',\n    'LayerNormLSTM',\n    'LayerNormLSTMCell',\n    'LSTM',\n    'ZoneoutWrapper'\n]\n"
  },
  {
    "path": "frameworks/tf/arena.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cassert>\n#include <string>\n#include <utility>\n#include <vector>\n\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n\ntemplate<typename T>\nclass TensorView {\n  public:\n    TensorView(T* ptr, const tensorflow::TensorShape& shape) {\n      ptr_ = ptr;\n      leading_dim_ = shape.dim_size(0);\n      stride_ = 1LL;\n      for (int i = 1; i < shape.dims(); ++i)\n        stride_ *= shape.dim_size(i);\n    }\n\n    tensorflow::int64 AllocatedBytes() const {\n      return leading_dim_ * stride_ * sizeof(T);\n    }\n\n    T* data() {\n      return ptr_;\n    }\n\n    T* SubSlicePtr(tensorflow::int64 i) {\n      assert(i >= 0);\n      assert(i < leading_dim_);\n\n      return ptr_ + (i * stride_);\n    }\n\n    const tensorflow::int64 NumElements() const {\n      return leading_dim_ * stride_;\n    }\n\n  private:\n    T* ptr_;\n    tensorflow::int64 leading_dim_;\n    tensorflow::int64 stride_;\n};\n\ntemplate<typename T>\nclass Arena {\n  public:\n    struct Entry {\n      std::string name;\n      TensorView<T> view;\n    };\n\n    Arena(std::initializer_list<Entry> map) : map_(map) {}\n    Arena(const std::vector<Entry>& map) 
: map_(map) {}\n\n    TensorView<T> operator[](const std::string& name) {\n      for (auto& unit : map_)\n        if (name == unit.name)\n          return unit.view;\n      assert(false && \"Invalid tensor name.\");\n    }\n\n  private:\n    std::vector<Entry> map_;\n};\n\ntemplate<typename T>\nclass ArenaLayout {\n  public:\n    struct Entry {\n      std::string name;\n      tensorflow::TensorShape shape;\n    };\n\n    ArenaLayout(std::initializer_list<Entry> layout) : layout_(layout) {}\n\n    tensorflow::int64 NumElements() const {\n      tensorflow::int64 total_elements = 0;\n      for (const auto& entry : layout_)\n        total_elements += entry.shape.num_elements();\n      return total_elements;\n    }\n\n    Arena<T> Realize(T* ptr) const {\n      std::vector<typename Arena<T>::Entry> map;\n      for (const auto& entry : layout_) {\n        map.push_back({ entry.name, TensorView<T>(ptr, entry.shape) });\n        ptr += entry.shape.num_elements();\n      }\n      return Arena<T>(map);\n    }\n\n  private:\n    std::vector<Entry> layout_;\n};\n"
  },
  {
    "path": "frameworks/tf/base_rnn.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Base RNN layer class.\"\"\"\n\n\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\n\n\n__all__ = [\n    'BaseRNN'\n]\n\n\ndef reverse_sequence(sequence, sequence_length):\n  \"\"\"\n  Reverses a batched sequence in time-major order [T,N,...]. The input sequence\n  may be padded, in which case sequence_length specifies the unpadded length of\n  each sequence.\n  \"\"\"\n  if sequence_length is None:\n    return tf.reverse(sequence, axis=[0])\n  return tf.reverse_sequence(sequence, sequence_length, seq_axis=0, batch_axis=1)\n\n\ndef transpose(tensor_or_tuple, perm):\n  \"\"\"Transposes the given tensor or tuple of tensors by the same permutation.\"\"\"\n  if isinstance(tensor_or_tuple, tuple):\n    return tuple([tf.transpose(tensor, perm) for tensor in tensor_or_tuple])\n  return tf.transpose(tensor_or_tuple, perm)\n\n\nclass BaseRNN(tf.Module):\n  def __init__(self, rnn_class, num_units, direction, default_name, **kwargs):\n    assert direction in ['unidirectional', 'bidirectional']\n\n    self.default_name = default_name\n    if direction == 'bidirectional':\n      name = kwargs.pop('name', None)\n      super().__init__(name)\n      self.realname = name\n      self.fw_layer = rnn_class(num_units, name='fw', **kwargs)\n      self.bw_layer = 
rnn_class(num_units, name='bw', **kwargs)\n    else:\n      super().__init__()\n      self.fw_layer = rnn_class(num_units, **kwargs)\n      self.bw_layer = None\n\n  def build(self, shape):\n    \"\"\"\n    Creates the variables of the layer.\n\n    Calling this method is optional for users of the RNN class. It is called\n    internally with the correct shape when `__call__` is invoked.\n\n    Arguments:\n      shape: instance of `TensorShape`.\n    \"\"\"\n    if self.bidirectional:\n      with self.name_scope, v1.variable_scope(self.realname, self.default_name):\n        self.fw_layer.build(shape)\n        self.bw_layer.build(shape)\n    else:\n      self.fw_layer.build(shape)\n\n  @property\n  def output_size(self):\n    if self.bidirectional:\n      return self.fw_layer.output_size, self.bw_layer.output_size\n    return self.fw_layer.output_size\n\n  @property\n  def state_size(self):\n    if self.bidirectional:\n      return self.fw_layer.state_size, self.bw_layer.state_size\n    return self.fw_layer.state_size\n\n  def __call__(self, inputs, training, sequence_length=None, time_major=False):\n    \"\"\"\n    Runs the RNN layer.\n\n    Arguments:\n      inputs: Tensor, a rank 3 input tensor with shape [N,T,C] if `time_major`\n        is `False`, or with shape [T,N,C] if `time_major` is `True`.\n      training: bool, `True` if running in training mode, `False` if running\n        in inference mode.\n      sequence_length: (optional) Tensor, a rank 1 tensor with shape [N] and\n        dtype of `tf.int32` or `tf.int64`. 
This tensor specifies the unpadded\n        length of each example in the input minibatch.\n      time_major: (optional) bool, specifies whether `input` has shape [N,T,C]\n        (`time_major=False`) or shape [T,N,C] (`time_major=True`).\n\n    Returns:\n      A pair, `(output, state)` for unidirectional layers, or a pair\n      `([output_fw, output_bw], [state_fw, state_bw])` for bidirectional\n      layers.\n    \"\"\"\n    if not time_major:\n      inputs = transpose(inputs, [1, 0, 2])\n\n    result, state = self.fw_layer(inputs, sequence_length, training)\n\n    if self.bidirectional:\n      inputs = reverse_sequence(inputs, sequence_length)\n      bw_result, bw_state = self.bw_layer(inputs, sequence_length, training)\n      result = result, reverse_sequence(bw_result, sequence_length)\n      state = state, bw_state\n\n    if not time_major:\n      result = transpose(result, [1, 0, 2])\n\n    return result, state\n\n  @property\n  def bidirectional(self):\n    \"\"\"`True` if this is a bidirectional RNN, `False` otherwise.\"\"\"\n    return self.bw_layer is not None\n"
  },
  {
    "path": "frameworks/tf/gru.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cuda_runtime_api.h>\n\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing haste::v0::gru::BackwardPass;\nusing haste::v0::gru::ForwardPass;\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\n// Define the interface and shape function for the op.\nREGISTER_OP(\"HasteGru\")\n    .Attr(\"R: {float, double}\")         // Some real number type.\n    .Attr(\"training: bool\")\n    .Attr(\"zoneout_prob: float\")\n    .Input(\"x: R\")                      // [T,N,C]\n    .Input(\"kernel: R\")                 // [C,H*3]\n    .Input(\"recurrent_kernel: R\")       // [H,H*3]\n    .Input(\"bias: R\")                   // [H*3]\n    .Input(\"recurrent_bias: R\")         // [H*3]\n    .Input(\"zoneout_mask: R\")           // [T,N,H]\n    .Output(\"h: R\")                     // [T,N,H]\n    .Output(\"v: R\")                     // 
[T,N,H*4]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle recurrent_bias_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 1, &recurrent_bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 3, &zoneout_mask_shape));\n\n      const DimensionHandle time_steps = c->Dim(input_shape, 0);\n      const DimensionHandle batch_size = c->Dim(input_shape, 1);\n      const DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n      DimensionHandle time_steps_plus_1;\n      DimensionHandle hidden_size_4;\n\n      TF_RETURN_IF_ERROR(c->Add(time_steps, 1, &time_steps_plus_1));\n      TF_RETURN_IF_ERROR(c->Multiply(hidden_size, 4, &hidden_size_4));\n\n      c->set_output(0, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(1, c->MakeShape({ time_steps, batch_size, hidden_size_4 }));\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteGruOp : public OpKernel {\n  explicit HasteGruOp(OpKernelConstruction* context) : OpKernel(context) {\n    OP_REQUIRES_OK(context, context->GetAttr(\"training\", &training_));\n    OP_REQUIRES_OK(context, context->GetAttr(\"zoneout_prob\", &zoneout_prob_));\n  }\n\n  // When running on GPU, TF backs all inputs and outputs with device memory\n  // and not host memory. 
We don't need to do explicit memory copies or allocations\n  // for the inputs and outputs.\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& recurrent_bias = context->input(4);\n    const Tensor& zoneout_mask = context->input(5);\n\n    const auto time_steps = input.shape().dim_size(0);\n    const auto batch_size = input.shape().dim_size(1);\n    const auto input_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(0);\n    const bool has_zoneout = zoneout_prob_ && zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    OP_REQUIRES(context, input_size == kernel.shape().dim_size(0),\n        errors::InvalidArgument(\"input[2] and kernel[0] dimensions must match. Found \",\n            input_size, \" and \", kernel.shape().dim_size(0)));\n\n    const TensorShape output_shape = { time_steps + 1, batch_size, hidden_size };\n    const TensorShape v_out_shape = { time_steps, batch_size, training_ ? 
hidden_size * 4 : 0 };\n\n    Tensor* output = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));\n\n    Tensor* v_out = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, v_out_shape, &v_out));\n\n    Tensor tmp_Wx;\n    const TensorShape tmp_Wx_shape = { time_steps, batch_size, hidden_size * 3};\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, tmp_Wx_shape, &tmp_Wx));\n\n    Tensor tmp_Rh;\n    const TensorShape tmp_Rh_shape = { batch_size, hidden_size * 3 };\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, tmp_Rh_shape, &tmp_Rh));\n\n    cudaMemset(output->flat<T>().data(), 0, output->AllocatedBytes());\n\n    ForwardPass<T> forward(\n        training_,\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    forward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        recurrent_bias.flat<T>().data(),\n        input.flat<T>().data(),\n        output->flat<T>().data(),\n        v_out->flat<T>().data(),\n        tmp_Wx.flat<T>().data(),\n        tmp_Rh.flat<T>().data(),\n        has_zoneout ? zoneout_prob_ : 0.0f,\n        has_zoneout ? 
zoneout_mask.flat<T>().data() : nullptr);\n  }\n\n  private:\n    bool training_;\n    float zoneout_prob_;\n};\n\nREGISTER_GPU_KERNEL(HasteGru, float);\nREGISTER_GPU_KERNEL(HasteGru, double);\n\nREGISTER_OP(\"HasteGruGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x_t: R\")                   // [C,T,N]\n    .Input(\"kernel_t: R\")              // [H*3,C]\n    .Input(\"recurrent_kernel_t: R\")    // [H*3,H]\n    .Input(\"bias: R\")                  // [H*3]\n    .Input(\"recurrent_bias: R\")        // [H*3]\n    .Input(\"h: R\")                     // [T+1,N,H]\n    .Input(\"v: R\")                     // [T,N,H*4]\n    .Input(\"dh_new: R\")                // [T+1,N,H]\n    .Input(\"zoneout_mask: R\")          // [T,N,H]\n    .Output(\"dx: R\")                   // [T,N,C]\n    .Output(\"dw: R\")                   // [C,H*3]\n    .Output(\"dr: R\")                   // [H,H*3]\n    .Output(\"dbx: R\")                  // [H*3]\n    .Output(\"dbr: R\")                  // [H*3]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_kernel_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle recurrent_bias_shape;\n      ShapeHandle h_shape;\n      ShapeHandle v_shape;\n      ShapeHandle dh_new_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 1, &recurrent_bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 3, &h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 3, &v_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 3, &dh_new_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(8), 3, &zoneout_mask_shape));\n\n     
 DimensionHandle input_size = c->Dim(x_shape, 0);\n      DimensionHandle time_steps = c->Dim(x_shape, 1);\n      DimensionHandle batch_size = c->Dim(x_shape, 2);\n      DimensionHandle hidden_size = c->Dim(recurrent_kernel_shape, 1);\n      DimensionHandle hidden_size_3;\n\n      // Use Multiply (as the forward op does) so unknown dimensions are\n      // propagated correctly instead of multiplying a sentinel value.\n      TF_RETURN_IF_ERROR(c->Multiply(hidden_size, 3, &hidden_size_3));\n\n      c->set_output(0, c->MakeShape({ time_steps, batch_size, input_size }));\n      c->set_output(1, c->MakeShape({ input_size, hidden_size_3 }));\n      c->set_output(2, c->MakeShape({ hidden_size, hidden_size_3 }));\n      c->set_output(3, bias_shape);\n      c->set_output(4, recurrent_bias_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteGruGradOp : public OpKernel {\n  explicit HasteGruGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& recurrent_bias = context->input(4);\n    const Tensor& h_vector = context->input(5);\n    const Tensor& v_vector = context->input(6);\n    const Tensor& dh_new = context->input(7);\n    const Tensor& zoneout_mask = context->input(8);\n\n    const auto input_size = input.shape().dim_size(0);\n    const auto time_steps = input.shape().dim_size(1);\n    const auto batch_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(1);\n    const bool has_zoneout = !!zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    // Can be uninitialized. 
Output only, no accumulation.\n    const TensorShape dx_shape = { time_steps, batch_size, input_size };\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, dx_shape, &dx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dW_shape = { input_size, hidden_size * 3 };\n    Tensor* dW = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, dW_shape, &dW));\n\n    // Needs to be initialized to 0.\n    const TensorShape dR_shape = { hidden_size, hidden_size * 3 };\n    Tensor* dR = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, dR_shape, &dR));\n\n    // Needs to be initialized to 0.\n    const TensorShape dbx_shape = { hidden_size * 3 };\n    Tensor* dbx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(3, dbx_shape, &dbx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dbr_shape = { hidden_size * 3 };\n    Tensor* dbr = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(4, dbr_shape, &dbr));\n\n    // Needs to be initialized to 0.\n    const TensorShape dh_shape = { batch_size, hidden_size };\n    Tensor dh;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dh_shape, &dh));\n\n    // Can be uninitialized. Output only, no accumulation.\n    const TensorShape dp_shape = { time_steps, batch_size, hidden_size * 3 };\n    Tensor dp;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dp_shape, &dp));\n\n    // Can be uninitialized. 
Output only, no accumulation.\n    const TensorShape dq_shape = { time_steps, batch_size, hidden_size * 3 };\n    Tensor dq;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dq_shape, &dq));\n\n    cudaMemset(dW->flat<T>().data(), 0, dW->AllocatedBytes());\n    cudaMemset(dR->flat<T>().data(), 0, dR->AllocatedBytes());\n    cudaMemset(dbx->flat<T>().data(), 0, dbx->AllocatedBytes());\n    cudaMemset(dbr->flat<T>().data(), 0, dbr->AllocatedBytes());\n    cudaMemset(dh.flat<T>().data(), 0, dh.AllocatedBytes());\n\n    BackwardPass<T> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    backward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        recurrent_bias.flat<T>().data(),\n        input.flat<T>().data(),\n        h_vector.flat<T>().data(),\n        v_vector.flat<T>().data(),\n        dh_new.flat<T>().data(),\n        dx->flat<T>().data(),\n        dW->flat<T>().data(),\n        dR->flat<T>().data(),\n        dbx->flat<T>().data(),\n        dbr->flat<T>().data(),\n        dh.flat<T>().data(),\n        dp.flat<T>().data(),\n        dq.flat<T>().data(),\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteGruGrad, float);\nREGISTER_GPU_KERNEL(HasteGruGrad, double);\n"
  },
  {
    "path": "frameworks/tf/gru.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Gated Recurrent Unit\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom .base_rnn import BaseRNN\nfrom .weight_config import WeightConfig\n\n\n__all__ = [\n    'GRU'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteGru\")\ndef gru_gradient(op, *grads):\n  training = op.get_attr('training')\n  if not training:\n    raise ValueError(('GRU can only compute gradients if `training=True` was specified during the '\n                      'forward pass.\\nFailed op: {}').format(op.name))\n\n  # Extract inputs and outputs from the op.\n  x = op.inputs[0]\n  W = op.inputs[1]\n  R = op.inputs[2]\n  bx = op.inputs[3]\n  br = op.inputs[4]\n  zoneout_mask = op.inputs[5]\n  h = op.outputs[0]\n  v = op.outputs[1]\n\n  # Pre-transpose matrices for better performance.\n  x = tf.transpose(x, [2, 0, 1])\n  W = tf.transpose(W, [1, 0])\n  R = tf.transpose(R, [1, 0])\n\n  dx, dW, dR, dbx, dbr = LIB.haste_gru_grad(x, W, R, bx, br, h, v, grads[0], zoneout_mask)\n\n  return [dx, dW, dR, dbx, dbr, None]\n\n\nclass GRULayer(tf.Module):\n  def __init__(self,\n        num_units,\n        kernel_initializer=None,\n        recurrent_initializer=None,\n        
bias_initializer=None,\n        recurrent_bias_initializer=None,\n        kernel_transform=None,\n        recurrent_transform=None,\n        bias_transform=None,\n        recurrent_bias_transform=None,\n        dropout=0.0,\n        zoneout=0.0,\n        dtype=None,\n        name=None):\n    super(GRULayer, self).__init__(name)\n    self.realname = name\n    self.num_units = num_units\n\n    identity = lambda x: x\n    self.kernel_config = WeightConfig(v1.initializers.glorot_uniform(), None, identity)\n    self.recurrent_config = WeightConfig(v1.initializers.orthogonal(), None, identity)\n    self.bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n    self.recurrent_bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n\n    self.kernel_config.override(kernel_initializer, None, kernel_transform)\n    self.recurrent_config.override(recurrent_initializer, None, recurrent_transform)\n    self.bias_config.override(bias_initializer, None, bias_transform)\n    self.recurrent_bias_config.override(recurrent_bias_initializer, None, recurrent_bias_transform)\n\n    self.dropout = dropout\n    self.zoneout = zoneout\n    self.dtype = dtype or tf.float32\n    self.built = False\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    def build_weights(initializer, shape):\n      weights = [initializer(shape, dtype=self.dtype) for _ in range(3)]\n      weights = tf.concat(weights, axis=-1)\n      return weights\n\n    kernel_weights = build_weights(self.kernel_config.initializer, [input_size, num_units])\n    recurrent_weights = build_weights(self.recurrent_config.initializer, [num_units, num_units])\n    biases = build_weights(self.bias_config.initializer, [num_units])\n    recurrent_biases = build_weights(self.recurrent_bias_config.initializer, [num_units])\n\n    weights = tf.concat([kernel_weights, recurrent_weights], axis=0)\n    biases = tf.concat([biases, 
recurrent_biases], axis=0)\n\n    with self.name_scope, v1.variable_scope(self.realname, 'gru_cell'):\n      self._kernel = v1.get_variable('kernel', initializer=weights)\n      self._bias = v1.get_variable('bias', initializer=biases)\n    self.built = True\n\n  def get_weights(self):\n    input_size = self._kernel.shape.as_list()[0] - self.num_units\n    kernel, recurrent_kernel = tf.split(self._kernel, [input_size, self.num_units], axis=0)\n    bias, recurrent_bias = tf.split(self._bias, 2, axis=0)\n    return {\n        'kernel': self.kernel_config.transform(kernel),\n        'recurrent_kernel': self.recurrent_config.transform(recurrent_kernel),\n        'bias': self.bias_config.transform(bias),\n        'recurrent_bias': self.recurrent_bias_config.transform(recurrent_bias)\n    }\n\n  def __call__(self, inputs, sequence_length, training):\n    self.build(inputs.shape)\n\n    shape = tf.shape(inputs)\n    time_steps = shape[0]\n    batch_size = shape[1]\n\n    # Use an empty zoneout mask if no zoneout is going to be applied.\n    # Sadly, we can't pass `None` to the op but at least we won't be wasting\n    # memory or bandwidth on this tensor.\n    zoneout_mask = tf.zeros([0, 0, 0], dtype=self.dtype)\n    if self.zoneout:\n      zoneout_mask = 1.0 - self.zoneout\n      zoneout_mask += tf.random.uniform([time_steps, batch_size, self.num_units], dtype=self.dtype)\n      zoneout_mask = tf.floor(zoneout_mask)\n\n    weights = self.get_weights()\n    if training and self.dropout > 0:\n      recurrent_kernel = tf.nn.dropout(weights['recurrent_kernel'], rate=self.dropout)\n    else:\n      recurrent_kernel = weights['recurrent_kernel']\n    result, _ = LIB.haste_gru(\n        inputs,\n        weights['kernel'],\n        recurrent_kernel,\n        weights['bias'],\n        weights['recurrent_bias'],\n        zoneout_mask,\n        training=training,\n        zoneout_prob=self.zoneout)\n\n    if sequence_length is not None:\n      # 0-indexed tensors, so length-1.\n      
indices = sequence_length\n      indices = tf.stack([indices, tf.range(batch_size, dtype=sequence_length.dtype)], axis=-1)\n      state = tf.gather_nd(result, indices)\n    else:\n      state = result[-1]\n\n    return result[1:], state\n\n\nclass GRU(BaseRNN):\n  \"\"\"\n  Gated Recurrent Unit layer.\n\n  This GRU layer offers a fused, GPU-accelerated TensorFlow op for inference\n  and training. There are two commonly-used variants of GRU cells. This one\n  implements 1406.1078v1 which applies the reset gate to the hidden state\n  after matrix multiplication. cuDNN also implements this variant. The other\n  variant, 1406.1078v3, applies the reset gate before matrix multiplication\n  and is currently unsupported.\n\n  This layer has built-in support for DropConnect and Zoneout, which are\n  both techniques used to regularize RNNs.\n  \"\"\"\n\n  def __init__(self, num_units, direction='unidirectional', **kwargs):\n    \"\"\"\n    Initialize the parameters of the GRU layer.\n\n    Arguments:\n      num_units: int, the number of units in the GRU cell.\n      direction: string, 'unidirectional' or 'bidirectional'.\n      **kwargs: Dict, keyword arguments (see below).\n\n    Keyword Arguments:\n      kernel_initializer: (optional) the initializer to use for the input\n        matrix weights. Defaults to `glorot_uniform`.\n      recurrent_initializer: (optional) the initializer to use for the\n        recurrent matrix weights. Defaults to `orthogonal`.\n      bias_initializer: (optional) the initializer to use for input bias\n        vectors. Defaults to `zeros`.\n      recurrent_bias_initializer: (optional) the initializer to use for\n        recurrent bias vectors. Defaults to `zeros`.\n      kernel_transform: (optional) a function with signature\n        `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n        used. 
Defaults to the identity function.\n      recurrent_transform: (optional) a function with signature\n        `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n        kernel before it is used. Defaults to the identity function.\n      bias_transform: (optional) a function with signature\n        `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n        Defaults to the identity function.\n      recurrent_bias_transform: (optional) a function with signature\n        `(recurrent_bias: Tensor) -> Tensor` that transforms the recurrent bias\n        before it is used. Defaults to the identity function.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        regularization on the recurrent matrix. Defaults to 0.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization. Defaults to 0.\n      dtype: (optional) the data type for this layer. Defaults to `tf.float32`.\n      name: (optional) string, the name for this layer.\n    \"\"\"\n    super().__init__(GRULayer, num_units, direction, 'gru_cell', **kwargs)\n"
  },
  {
    "path": "frameworks/tf/gru_cell.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"A GRU cell compatible with the Haste GRU layer.\"\"\"\n\n\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\n\n\nclass GRUCell(rnn_cell.RNNCell):\n  \"\"\"\n  A GRU cell that's compatible with the Haste GRU layer.\n\n  This cell can be used on hardware other than GPUs and with other TensorFlow\n  classes that operate on RNN cells (e.g. 
`dynamic_rnn`, `BasicDecoder`, cell\n  wrappers, etc.).\n  \"\"\"\n  def __init__(self, num_units, name=None, **kwargs):\n    super(GRUCell, self).__init__(name=name, **kwargs)\n\n    self.realname = name\n    self.num_units = num_units\n    self.built = False\n\n  @property\n  def state_size(self):\n    return self.num_units\n\n  @property\n  def output_size(self):\n    return self.num_units\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n    dtype = self.dtype or tf.float32\n\n    kernel_initializer = v1.initializers.glorot_uniform(dtype=dtype)\n    bias_initializer = v1.initializers.zeros(dtype=dtype)\n\n    with tf.name_scope(self.name), v1.variable_scope(self.realname, 'gru_cell'):\n      self._kernel = v1.get_variable('kernel', initializer=lambda: kernel_initializer([input_size + num_units, num_units * 3]))\n      self._bias = v1.get_variable('bias', initializer=lambda: bias_initializer([num_units * 6]))\n\n    self.kernel, self.recurrent_kernel = tf.split(self._kernel, [input_size, num_units], axis=0)\n    self.bias, self.recurrent_bias = tf.split(self._bias, 2, axis=0)\n\n    self.built = True\n\n  def __call__(self, inputs, state, scope=None):\n    self.build(inputs.shape)\n\n    h_proj = tf.nn.xw_plus_b(state, self.recurrent_kernel, self.recurrent_bias)\n    x = tf.nn.xw_plus_b(inputs, self.kernel, self.bias)\n    h_z, h_r, h_g = tf.split(h_proj, 3, axis=-1)\n    x_z, x_r, x_g = tf.split(x, 3, axis=-1)\n    z = tf.nn.sigmoid(h_z + x_z)\n    r = tf.nn.sigmoid(h_r + x_r)\n    g = tf.nn.tanh(r * h_g + x_g)\n    h = z * state + (1 - z) * g\n    return h, h\n"
  },
  {
    "path": "frameworks/tf/indrnn.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cuda_runtime_api.h>\n\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing haste::v0::indrnn::ForwardPass;\nusing haste::v0::indrnn::BackwardPass;\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\n// Define the interface and shape function for the op.\nREGISTER_OP(\"HasteIndrnn\")\n    .Attr(\"R: {float, double}\")         // Some real number type.\n    .Attr(\"training: bool\")\n    .Attr(\"zoneout_prob: float\")\n    .Input(\"x: R\")                      // [T,N,C]\n    .Input(\"kernel: R\")                 // [C,H]\n    .Input(\"recurrent_scale: R\")        // [H]\n    .Input(\"bias: R\")                   // [H]\n    .Input(\"zoneout_mask: R\")           // [T,N,H]\n    .Output(\"h: R\")                     // [T+1,N,H]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle 
kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 1, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 3, &zoneout_mask_shape));\n\n      const DimensionHandle time_steps = c->Dim(input_shape, 0);\n      const DimensionHandle batch_size = c->Dim(input_shape, 1);\n      const DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n      DimensionHandle time_steps_plus_1;\n\n      TF_RETURN_IF_ERROR(c->Add(time_steps, 1, &time_steps_plus_1));\n\n      c->set_output(0, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteIndrnnOp : public OpKernel {\n  explicit HasteIndrnnOp(OpKernelConstruction* context) : OpKernel(context) {\n    OP_REQUIRES_OK(context, context->GetAttr(\"training\", &training_));\n    OP_REQUIRES_OK(context, context->GetAttr(\"zoneout_prob\", &zoneout_prob_));\n  }\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_scale = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& zoneout_mask = context->input(4);\n\n    const auto time_steps = input.shape().dim_size(0);\n    const auto batch_size = input.shape().dim_size(1);\n    const auto input_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_scale.shape().dim_size(0);\n    const bool has_zoneout = zoneout_prob_ && zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    const TensorShape output_shape = { time_steps + 1, batch_size, hidden_size };\n    Tensor* 
output = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));\n\n    Tensor workspace;\n    const TensorShape workspace_shape = { time_steps, batch_size, hidden_size };\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, workspace_shape, &workspace));\n\n    cudaMemset(output->flat<T>().data(), 0, output->AllocatedBytes());\n\n    ForwardPass<T> forward(\n        training_,\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    forward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_scale.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        output->flat<T>().data(),\n        workspace.flat<T>().data(),\n        has_zoneout ? zoneout_prob_ : 0.0f,\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n\n  private:\n    bool training_;\n    float zoneout_prob_;\n};\n\nREGISTER_GPU_KERNEL(HasteIndrnn, float);\nREGISTER_GPU_KERNEL(HasteIndrnn, double);\n\nREGISTER_OP(\"HasteIndrnnGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x_t: R\")                   // [C,T,N]\n    .Input(\"kernel_t: R\")              // [H,C]\n    .Input(\"recurrent_scale: R\")       // [H]\n    .Input(\"bias: R\")                  // [H]\n    .Input(\"zoneout_mask: R\")          // [T,N,H]\n    .Input(\"h: R\")                     // [T+1,N,H]\n    .Input(\"dh_new: R\")                // [T+1,N,H]\n    .Output(\"dx: R\")                   // [T,N,C]\n    .Output(\"dw: R\")                   // [C,H]\n    .Output(\"dr: R\")                   // [H]\n    .Output(\"db: R\")                   // [H]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle zoneout_mask_shape;\n      ShapeHandle h_shape;\n      ShapeHandle dh_new_shape;\n\n      
TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 1, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 3, &zoneout_mask_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 3, &h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 3, &dh_new_shape));\n\n      DimensionHandle input_size = c->Dim(x_shape, 0);\n      DimensionHandle time_steps = c->Dim(x_shape, 1);\n      DimensionHandle batch_size = c->Dim(x_shape, 2);\n      DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n\n      c->set_output(0, c->MakeShape({ time_steps, batch_size, input_size }));\n      c->set_output(1, c->MakeShape({ input_size, hidden_size }));\n      c->set_output(2, recurrent_shape);\n      c->set_output(3, bias_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteIndrnnGradOp : public OpKernel {\n  explicit HasteIndrnnGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_scale = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& zoneout_mask = context->input(4);\n    const Tensor& h_vector = context->input(5);\n    const Tensor& dh_new = context->input(6);\n\n    const auto input_size = input.shape().dim_size(0);\n    const auto time_steps = input.shape().dim_size(1);\n    const auto batch_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_scale.shape().dim_size(0);\n    const bool has_zoneout = !!zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    // Can be uninitialized. 
Output only, no accumulation.\n    const TensorShape dx_shape = { time_steps, batch_size, input_size };\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, dx_shape, &dx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dW_shape = { input_size, hidden_size };\n    Tensor* dW = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, dW_shape, &dW));\n\n    // Needs to be initialized to 0.\n    const TensorShape du_shape = { hidden_size };\n    Tensor* du = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, du_shape, &du));\n\n    // Needs to be initialized to 0.\n    const TensorShape db_shape = { hidden_size };\n    Tensor* db = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(3, db_shape, &db));\n\n    // Needs to be initialized to 0.\n    const TensorShape dh_shape = { batch_size, hidden_size };\n    Tensor dh;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dh_shape, &dh));\n\n    const TensorShape workspace_shape = { time_steps, batch_size, hidden_size };\n    Tensor workspace;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, workspace_shape, &workspace));\n\n    cudaMemset(dW->flat<T>().data(), 0, dW->AllocatedBytes());\n    cudaMemset(du->flat<T>().data(), 0, du->AllocatedBytes());\n    cudaMemset(db->flat<T>().data(), 0, db->AllocatedBytes());\n    cudaMemset(dh.flat<T>().data(), 0, dh.AllocatedBytes());\n\n    BackwardPass<T> backward = BackwardPass<T>(\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    backward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_scale.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        h_vector.flat<T>().data(),\n        dh_new.flat<T>().data(),\n        dx->flat<T>().data(),\n        dW->flat<T>().data(),\n        du->flat<T>().data(),\n        db->flat<T>().data(),\n        
dh.flat<T>().data(),\n        workspace.flat<T>().data(),\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteIndrnnGrad, float);\nREGISTER_GPU_KERNEL(HasteIndrnnGrad, double);\n"
  },
  {
    "path": "frameworks/tf/indrnn.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Independently Recurrent Neural Network\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\n\nfrom .base_rnn import BaseRNN\nfrom .weight_config import WeightConfig\n\n\n__all__ = [\n    'IndRNN'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteIndrnn\")\ndef indrnn_gradient(op, *grads):\n  training = op.get_attr('training')\n  if not training:\n    raise ValueError(('IndRNN can only compute gradients if `training=True` was specified during '\n                      'the forward pass.\\nFailed op: {}').format(op.name))\n\n  # Extract inputs and outputs from the op.\n  x = op.inputs[0]\n  W = op.inputs[1]\n  u = op.inputs[2]\n  b = op.inputs[3]\n  zoneout_mask = op.inputs[4]\n  h = op.outputs[0]\n\n  # Pre-transpose matrices for better performance.\n  x = tf.transpose(x, [2, 0, 1])\n  W = tf.transpose(W, [1, 0])\n\n  dx, dW, du, db = LIB.haste_indrnn_grad(x, W, u, b, zoneout_mask, h, grads[0])\n  return [dx, dW, du, db, None]\n\n\ndef _get_initializer(initializer):\n  if not isinstance(initializer, dict):\n    return initializer\n  if 'uniform' in initializer:\n    value = initializer['uniform']\n    
return v1.initializers.random_uniform(-value, value)\n  if 'normal' in initializer:\n    value = initializer['normal']\n    return v1.initializers.truncated_normal(stddev=value)\n  raise ValueError(f'Unknown initializer {initializer}')\n\n\nclass IndRNNLayer(tf.Module):\n  def __init__(self,\n        num_units,\n        kernel_initializer=None,\n        recurrent_initializer=None,\n        bias_initializer=None,\n        kernel_transform=None,\n        recurrent_transform=None,\n        bias_transform=None,\n        zoneout=0.0,\n        dtype=None,\n        name=None):\n    super().__init__(name)\n    self.realname = name\n    self.num_units = num_units\n\n    identity = lambda x: x\n    self.kernel_config = WeightConfig(v1.initializers.glorot_uniform(), None, identity)\n    self.recurrent_config = WeightConfig(v1.initializers.random_uniform(-0.5, 0.5), None, identity)\n    self.bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n\n    self.kernel_config.override(_get_initializer(kernel_initializer), None, kernel_transform)\n    self.recurrent_config.override(_get_initializer(recurrent_initializer), None, recurrent_transform)\n    self.bias_config.override(_get_initializer(bias_initializer), None, bias_transform)\n\n    self.zoneout = zoneout\n    self.dtype = dtype or tf.float32\n    self.kernel = None\n    self.recurrent_scale = None\n    self.bias = None\n    self.recurrent_bias = None\n    self.built = False\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    kernel_shape = tf.TensorShape([input_size, num_units])\n    recurrent_shape = tf.TensorShape([num_units])\n    bias_shape = tf.TensorShape([num_units])\n\n    kernel_weights = self.kernel_config.initializer(kernel_shape, dtype=self.dtype)\n    recurrent_weights = self.recurrent_config.initializer(recurrent_shape, dtype=self.dtype)\n    biases = self.bias_config.initializer(bias_shape, dtype=self.dtype)\n\n    with 
self.name_scope, v1.variable_scope(self.realname, 'indrnn_cell'):\n      self.kernel = v1.get_variable('kernel', initializer=kernel_weights)\n      self.recurrent_scale = v1.get_variable('recurrent_scale', initializer=recurrent_weights)\n      self.bias = v1.get_variable('bias', initializer=biases)\n\n    self.built = True\n\n  def get_weights(self):\n    return {\n        'kernel': self.kernel_config.transform(self.kernel),\n        'recurrent_scale': self.recurrent_config.transform(self.recurrent_scale),\n        'bias': self.bias_config.transform(self.bias)\n    }\n\n  def __call__(self, inputs, sequence_length, training):\n    self.build(inputs.shape)\n\n    shape = tf.shape(inputs)\n    time_steps = shape[0]\n    batch_size = shape[1]\n\n    # Use an empty zoneout mask if no zoneout is going to be applied.\n    # Sadly, we can't pass `None` to the op but at least we won't be wasting\n    # memory or bandwidth on this tensor.\n    zoneout_mask = tf.zeros([0, 0, 0], dtype=self.dtype)\n    if self.zoneout:\n      zoneout_mask = 1.0 - self.zoneout\n      zoneout_mask += tf.random.uniform([time_steps, batch_size, self.num_units], dtype=self.dtype)\n      zoneout_mask = tf.floor(zoneout_mask)\n\n    weights = self.get_weights()\n    result = LIB.haste_indrnn(\n        inputs,\n        weights['kernel'],\n        weights['recurrent_scale'],\n        weights['bias'],\n        zoneout_mask,\n        training=training,\n        zoneout_prob=self.zoneout)\n\n    if sequence_length is not None:\n      # 0-indexed tensors, so length-1.\n      indices = sequence_length\n      indices = tf.stack([indices, tf.range(batch_size, dtype=sequence_length.dtype)], axis=-1)\n      state = tf.gather_nd(result, indices)\n    else:\n      state = result[-1]\n\n    return result[1:], state\n\n\nclass IndRNN(BaseRNN):\n  \"\"\"\n  Independently Recurrent Neural Network layer.\n\n  This layer offers a fused, GPU-accelerated TensorFlow op for inference and\n  training. 
It also supports Zoneout regularization.\n  \"\"\"\n\n  def __init__(self, num_units, direction='unidirectional', **kwargs):\n    \"\"\"\n    Initialize the parameters of the IndRNN layer.\n\n    Arguments:\n      num_units: int, the number of units in the IndRNN cell.\n      direction: string, 'unidirectional' or 'bidirectional'.\n      **kwargs: Dict, keyword arguments (see below).\n\n    Keyword Arguments:\n      kernel_initializer: (optional) the initializer to use for the input\n        matrix weights. Defaults to `glorot_uniform`.\n      recurrent_initializer: (optional) the initializer to use for the\n        recurrent scale weights. Defaults to uniform random in [-0.5, 0.5].\n        Note that this initialization scheme is different than in the original\n        authors' implementation. See https://github.com/lmnt-com/haste/issues/7\n        for details.\n      bias_initializer: (optional) the initializer to use for the bias vector.\n        Defaults to `zeros`.\n      kernel_transform: (optional) a function with signature\n        `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n        used. Defaults to the identity function.\n      recurrent_transform: (optional) a function with signature\n        `(recurrent_scale: Tensor) -> Tensor` that transforms the recurrent\n        scale vector before it is used. Defaults to the identity function.\n      bias_transform: (optional) a function with signature\n        `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n        Defaults to the identity function.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization. Defaults to 0.\n      dtype: (optional) the data type for this layer. Defaults to `tf.float32`.\n      name: (optional) string, the name for this layer.\n    \"\"\"\n    super().__init__(IndRNNLayer, num_units, direction, 'indrnn_cell', **kwargs)\n"
  },
  {
    "path": "frameworks/tf/layer_norm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing haste::v0::layer_norm::ForwardPass;\nusing haste::v0::layer_norm::BackwardPass;\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\nREGISTER_OP(\"HasteLayerNorm\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x: R\")\n    .Input(\"gamma: R\")\n    .Input(\"beta: R\")\n    .Output(\"y: R\")\n    .Output(\"cache: R\")\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle beta_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 2, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 1, &beta_shape));\n\n      c->set_output(0, input_shape);\n      c->set_output(1, c->UnknownShapeOfRank(2));\n      return 
Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormOp : public OpKernel {\n  explicit HasteLayerNormOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& x = context->input(0);\n    const Tensor& gamma = context->input(1);\n    const Tensor& beta = context->input(2);\n\n    const auto batch_size = x.shape().dim_size(0);\n    const auto hidden_size = x.shape().dim_size(1);\n\n    Tensor* y = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, x.shape(), &y));\n\n    Tensor* cache = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, { batch_size, 2 }, &cache));\n\n    ForwardPass<T> forward(\n        batch_size,\n        hidden_size,\n        gamma.flat<T>().data(),\n        beta.shape().dim_size(0) ? beta.flat<T>().data() : nullptr,\n        cache->flat<T>().data());\n\n    forward.Run(GetCudaStream(context), x.flat<T>().data(), y->flat<T>().data());\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNorm, float);\nREGISTER_GPU_KERNEL(HasteLayerNorm, double);\n\nREGISTER_OP(\"HasteLayerNormGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x: R\")\n    .Input(\"gamma: R\")\n    .Input(\"beta: R\")\n    .Input(\"dy: R\")\n    .Input(\"cache: R\")\n    .Output(\"dx: R\")\n    .Output(\"dgamma: R\")\n    .Output(\"dbeta: R\")\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle beta_shape;\n      ShapeHandle dy_shape;\n      ShapeHandle cache_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 2, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 1, &beta_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 2, &dy_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 2, &cache_shape));\n\n      c->set_output(0, x_shape);\n      c->set_output(1, gamma_shape);\n      c->set_output(2, 
beta_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormGradOp : public OpKernel {\n  explicit HasteLayerNormGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& x = context->input(0);\n    const Tensor& gamma = context->input(1);\n    const Tensor& beta = context->input(2);\n    const Tensor& dy = context->input(3);\n    const Tensor& cache = context->input(4);\n\n    const auto batch_size = x.shape().dim_size(0);\n    const auto hidden_size = x.shape().dim_size(1);\n    const auto cache_shape = context->input(4).shape();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, x.shape(), &dx));\n\n    Tensor* dgamma = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, gamma.shape(), &dgamma));\n\n    Tensor* dbeta = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, beta.shape(), &dbeta));\n\n    cudaMemset(dgamma->flat<T>().data(), 0, dgamma->AllocatedBytes());\n    cudaMemset(dbeta->flat<T>().data(), 0, dbeta->AllocatedBytes());\n\n    BackwardPass<T> backward(\n        batch_size,\n        hidden_size,\n        gamma.flat<T>().data(),\n        beta.shape().dim_size(0) ? beta.flat<T>().data() : nullptr,\n        x.flat<T>().data(),\n        dgamma->flat<T>().data(),\n        beta.shape().dim_size(0) ? dbeta->flat<T>().data() : nullptr,\n        const_cast<T*>(cache.flat<T>().data()));\n\n    backward.Run(GetCudaStream(context), dy.flat<T>().data(), dx->flat<T>().data());\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormGrad, float);\nREGISTER_GPU_KERNEL(HasteLayerNormGrad, double);\n"
  },
  {
    "path": "frameworks/tf/layer_norm.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalization\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\n\n\n__all__ = [\n    'LayerNorm'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteLayerNorm\")\ndef layer_norm_gradient(op, *grads):\n  x = op.inputs[0]\n  gamma = op.inputs[1]\n  beta = op.inputs[2]\n  cache = op.outputs[1]\n\n  return LIB.haste_layer_norm_grad(x, gamma, beta, grads[0], cache)\n\n\nclass LayerNorm(tf.Module):\n  \"\"\"\n  Layer normalization layer.\n\n  This class exposes a fused and GPU-accelerated implementation of layer\n  normalization as described by [Ba et al.](https://arxiv.org/abs/1607.06450)\n  \"\"\"\n\n  def __init__(self, name=None):\n    \"\"\"\n    Initialize the parameters of the layer normalization layer.\n\n    Arguments:\n      name: (optional) string, the name for this layer.\n    \"\"\"\n    super(LayerNorm, self).__init__(name)\n    self.realname = name\n    self.gamma = None\n    self.beta = None\n    self.built = False\n\n  def build(self, shape):\n    \"\"\"\n    Creates the variables of the layer.\n\n    Calling this method is optional for users of the LayerNorm class. 
It is\n    called internally with the correct shape when `__call__` is invoked.\n\n    Arguments:\n      shape: instance of `TensorShape`.\n    \"\"\"\n    if self.built:\n      return\n    hidden_size = int(shape[-1])\n    with self.name_scope, v1.variable_scope(self.realname, 'layer_norm'):\n      self.gamma = v1.get_variable('gamma', shape=[hidden_size], initializer=v1.initializers.ones())\n      self.beta = v1.get_variable('beta', shape=[hidden_size], initializer=v1.initializers.zeros())\n    self.built = True\n\n  def __call__(self, x):\n    \"\"\"\n    Runs the layer.\n\n    Arguments:\n      x: Tensor, a rank R tensor.\n\n    Returns:\n      y: Tensor, a rank R tensor with the last dimension normalized.\n    \"\"\"\n    self.build(x.shape)\n    y, _ = LIB.haste_layer_norm(x, self.gamma, self.beta)\n    return y\n"
  },
  {
    "path": "frameworks/tf/layer_norm_gru.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cuda_runtime_api.h>\n\n#include \"arena.h\"\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\nnamespace layer_norm = haste::v0::layer_norm;\nnamespace layer_norm_gru = haste::v0::layer_norm_gru;\n\n// Define the interface and shape function for the op.\nREGISTER_OP(\"HasteLayerNormGru\")\n    .Attr(\"R: {float, double}\")         // Some real number type.\n    .Attr(\"training: bool\")\n    .Attr(\"zoneout_prob: float\")\n    .Input(\"x: R\")                      // [T,N,C]\n    .Input(\"kernel: R\")                 // [C,H*3]\n    .Input(\"recurrent_kernel: R\")       // [H,H*3]\n    .Input(\"bias: R\")                   // [H*3]\n    .Input(\"recurrent_bias: R\")         // [H*3]\n    .Input(\"gamma: R\")                  // [2,H*3]\n    .Input(\"zoneout_mask: R\")      
     // [T,N,H]\n    .Output(\"h: R\")                     // [T+1,N,H]\n    .Output(\"v: R\")                     // [T,N,H*4]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle recurrent_bias_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 1, &recurrent_bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 2, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 3, &zoneout_mask_shape));\n\n      const DimensionHandle time_steps = c->Dim(input_shape, 0);\n      const DimensionHandle batch_size = c->Dim(input_shape, 1);\n      const DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n      DimensionHandle time_steps_plus_1;\n\n      TF_RETURN_IF_ERROR(c->Add(time_steps, 1, &time_steps_plus_1));\n\n      c->set_output(0, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(1, c->UnknownShapeOfRank(1));\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormGruOp : public OpKernel {\n  explicit HasteLayerNormGruOp(OpKernelConstruction* context) : OpKernel(context) {\n    OP_REQUIRES_OK(context, context->GetAttr(\"training\", &training_));\n    OP_REQUIRES_OK(context, context->GetAttr(\"zoneout_prob\", &zoneout_prob_));\n  }\n\n  // When running on GPU, TF backs all inputs and outputs with device memory\n  // and not host memory. 
We don't need to do explicit memory copies or allocations\n  // for the inputs and outputs.\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& recurrent_bias = context->input(4);\n    const Tensor& gamma = context->input(5);\n    const Tensor& zoneout_mask = context->input(6);\n\n    const auto time_steps = input.shape().dim_size(0);\n    const auto batch_size = input.shape().dim_size(1);\n    const auto input_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(0);\n    const bool has_zoneout = zoneout_prob_ && zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    OP_REQUIRES(context, input_size == kernel.shape().dim_size(0),\n        errors::InvalidArgument(\"input[2] and kernel[0] dimensions must match. 
Found \",\n            input_size, \" and \", kernel.shape().dim_size(0)));\n\n    const TensorShape output_shape = { time_steps + 1, batch_size, hidden_size };\n    Tensor* output = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));\n\n    const ArenaLayout<T> memory_layout = {\n      { \"cache\", { time_steps, batch_size, hidden_size * 4 } },\n      { \"act_Wx\", { time_steps, batch_size, hidden_size * 3 } },\n      { \"act_Wx_norm_cache\", { time_steps, batch_size, 2 } },\n      { \"act_Rh\", { time_steps, batch_size, hidden_size * 3 } },\n      { \"act_Rh_norm_cache\", { time_steps, batch_size, 2 } },\n    };\n\n    Tensor* output_cache = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, { memory_layout.NumElements() }, &output_cache));\n\n    Arena<T> memory = memory_layout.Realize(output_cache->flat<T>().data());\n    TensorView<T> cache = memory[\"cache\"];\n    TensorView<T> act_Wx = memory[\"act_Wx\"];\n    TensorView<T> act_Wx_norm_cache = memory[\"act_Wx_norm_cache\"];\n    TensorView<T> act_Rh = memory[\"act_Rh\"];\n    TensorView<T> act_Rh_norm_cache = memory[\"act_Rh_norm_cache\"];\n\n    const TensorShape tmp_Wx_norm_shape = { time_steps, batch_size, hidden_size * 3 };\n    Tensor tmp_Wx_norm;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, tmp_Wx_norm_shape, &tmp_Wx_norm));\n\n    const TensorShape tmp_Rh_norm_shape = { batch_size, hidden_size * 3 };\n    Tensor tmp_Rh_norm;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, tmp_Rh_norm_shape, &tmp_Rh_norm));\n\n    cudaMemset(output->flat<T>().data(), 0, output->AllocatedBytes());\n\n    layer_norm::ForwardPass<T> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma.SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx_norm_cache.data());\n\n    layer_norm::ForwardPass<T> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 3,\n        
gamma.SubSlice(1).unaligned_flat<T>().data(),\n        nullptr,\n        act_Rh_norm_cache.data());\n\n    layer_norm_gru::ForwardPass<T> gru(\n        training_,\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    gru.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        recurrent_bias.flat<T>().data(),\n        input.flat<T>().data(),\n        output->flat<T>().data(),\n        cache.data(),\n        act_Wx.data(),\n        layer_norm1,\n        tmp_Wx_norm.flat<T>().data(),\n        act_Rh.data(),\n        layer_norm2,\n        tmp_Rh_norm.flat<T>().data(),\n        has_zoneout ? zoneout_prob_ : 0.0f,\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n\n  private:\n    bool training_;\n    float zoneout_prob_;\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormGru, float);\nREGISTER_GPU_KERNEL(HasteLayerNormGru, double);\n\nREGISTER_OP(\"HasteLayerNormGruGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x_t: R\")                   // [C,T,N]\n    .Input(\"kernel_t: R\")              // [H*3,C]\n    .Input(\"recurrent_kernel_t: R\")    // [H*3,H]\n    .Input(\"bias: R\")                  // [H*3]\n    .Input(\"recurrent_bias: R\")        // [H*3]\n    .Input(\"gamma: R\")                 // [2,H*3]\n    .Input(\"h: R\")                     // [T+1,N,H]\n    .Input(\"cache: R\")                 // [?]\n    .Input(\"dh_new: R\")                // [T+1,N,H]\n    .Input(\"zoneout_mask: R\")          // [T,N,H]\n    .Output(\"dx: R\")                   // [T,N,C]\n    .Output(\"dw: R\")                   // [C,H*3]\n    .Output(\"dr: R\")                   // [H,H*3]\n    .Output(\"dbx: R\")                  // [H*3]\n    .Output(\"dbr: R\")                  // [H*3]\n    .Output(\"dgamma: R\")\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle kernel_shape;\n      
ShapeHandle recurrent_kernel_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle recurrent_bias_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle h_shape;\n      ShapeHandle cache_shape;\n      ShapeHandle dh_new_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 1, &recurrent_bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 2, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 3, &h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 1, &cache_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(8), 3, &dh_new_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(9), 3, &zoneout_mask_shape));\n\n      DimensionHandle input_size = c->Dim(x_shape, 0);\n      DimensionHandle time_steps = c->Dim(x_shape, 1);\n      DimensionHandle batch_size = c->Dim(x_shape, 2);\n      DimensionHandle hidden_size = c->Dim(recurrent_kernel_shape, 1);\n\n      c->set_output(0, c->MakeShape({ time_steps, batch_size, input_size }));\n      c->set_output(1, c->MakeShape({ input_size, c->Value(hidden_size) * 3 }));\n      c->set_output(2, c->MakeShape({ hidden_size, c->Value(hidden_size) * 3 }));\n      c->set_output(3, bias_shape);\n      c->set_output(4, recurrent_bias_shape);\n      c->set_output(5, gamma_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormGruGradOp : public OpKernel {\n  explicit HasteLayerNormGruGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = 
context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& recurrent_bias = context->input(4);\n    const Tensor& gamma = context->input(5);\n    const Tensor& h_vector = context->input(6);\n    const Tensor& cache_input = context->input(7);\n    const Tensor& dh_new = context->input(8);\n    const Tensor& zoneout_mask = context->input(9);\n\n    const auto input_size = input.shape().dim_size(0);\n    const auto time_steps = input.shape().dim_size(1);\n    const auto batch_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(1);\n    const bool has_zoneout = !!zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    const ArenaLayout<T> memory_layout = {\n      { \"cache\", { time_steps, batch_size, hidden_size * 4 } },\n      { \"act_Wx\", { time_steps, batch_size, hidden_size * 3 } },\n      { \"act_Wx_norm_cache\", { time_steps, batch_size, 2 } },\n      { \"act_Rh\", { time_steps, batch_size, hidden_size * 3 } },\n      { \"act_Rh_norm_cache\", { time_steps, batch_size, 2 } },\n    };\n\n    assert(cache_input.shape().num_elements() == memory_layout.NumElements());\n\n    Arena<T> memory = memory_layout.Realize(const_cast<T*>(cache_input.flat<T>().data()));\n    TensorView<T> cache = memory[\"cache\"];\n    TensorView<T> act_Wx = memory[\"act_Wx\"];\n    TensorView<T> act_Wx_norm_cache = memory[\"act_Wx_norm_cache\"];\n    TensorView<T> act_Rh = memory[\"act_Rh\"];\n    TensorView<T> act_Rh_norm_cache = memory[\"act_Rh_norm_cache\"];\n\n    // Can be uninitialized. 
Output only, no accumulation.\n    const TensorShape dx_shape = { time_steps, batch_size, input_size };\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, dx_shape, &dx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dW_shape = { input_size, hidden_size * 3 };\n    Tensor* dW = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, dW_shape, &dW));\n\n    // Needs to be initialized to 0.\n    const TensorShape dR_shape = { hidden_size, hidden_size * 3 };\n    Tensor* dR = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, dR_shape, &dR));\n\n    // Needs to be initialized to 0.\n    const TensorShape dbx_shape = { hidden_size * 3 };\n    Tensor* dbx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(3, dbx_shape, &dbx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dbr_shape = { hidden_size * 3 };\n    Tensor* dbr = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(4, dbr_shape, &dbr));\n\n    // Needs to be initialized to 0.\n    Tensor* dgamma = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(5, gamma.shape(), &dgamma));\n\n    // Needs to be initialized to 0.\n    const TensorShape dh_shape = { batch_size, hidden_size };\n    Tensor dh;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dh_shape, &dh));\n\n    // Can be uninitialized. Output only, no accumulation.\n    const TensorShape dp_shape = { time_steps, batch_size, hidden_size * 3 };\n    Tensor dp;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dp_shape, &dp));\n\n    // Can be uninitialized. 
Output only, no accumulation.\n    const TensorShape dq_shape = { time_steps, batch_size, hidden_size * 3 };\n    Tensor dq;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dq_shape, &dq));\n\n    cudaMemset(dW->flat<T>().data(), 0, dW->AllocatedBytes());\n    cudaMemset(dR->flat<T>().data(), 0, dR->AllocatedBytes());\n    cudaMemset(dbx->flat<T>().data(), 0, dbx->AllocatedBytes());\n    cudaMemset(dbr->flat<T>().data(), 0, dbr->AllocatedBytes());\n    cudaMemset(dh.flat<T>().data(), 0, dh.AllocatedBytes());\n    cudaMemset(dgamma->flat<T>().data(), 0, dgamma->AllocatedBytes());\n\n    layer_norm::BackwardPass<T> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma.SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx.data(),\n        dgamma->SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx_norm_cache.data());\n\n    layer_norm::BackwardPass<T> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 3,\n        gamma.SubSlice(1).unaligned_flat<T>().data(),\n        nullptr,\n        act_Rh.data(),\n        dgamma->SubSlice(1).unaligned_flat<T>().data(),\n        nullptr,\n        act_Rh_norm_cache.data());\n\n    layer_norm_gru::BackwardPass<T> gru(\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    gru.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        recurrent_bias.flat<T>().data(),\n        input.flat<T>().data(),\n        h_vector.flat<T>().data(),\n        cache.data(),\n        dh_new.flat<T>().data(),\n        dx->flat<T>().data(),\n        dW->flat<T>().data(),\n        dR->flat<T>().data(),\n        dbx->flat<T>().data(),\n        dbr->flat<T>().data(),\n        dh.flat<T>().data(),\n        dp.flat<T>().data(),\n        dq.flat<T>().data(),\n        layer_norm1,\n        layer_norm2,\n        
has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormGruGrad, float);\nREGISTER_GPU_KERNEL(HasteLayerNormGruGrad, double);\n"
  },
  {
    "path": "frameworks/tf/layer_norm_gru.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalized Gated Recurrent Unit\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom .base_rnn import BaseRNN\nfrom .weight_config import WeightConfig\n\n\n__all__ = [\n    'LayerNormGRU'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteLayerNormGru\")\ndef layer_norm_gru_gradient(op, *grads):\n  training = op.get_attr('training')\n  if not training:\n    raise ValueError(('LayerNormGRU can only compute gradients if `training=True` was specified '\n                      'during the forward pass.\\nFailed op: {}').format(op.name))\n\n  # Extract inputs and outputs from the op.\n  x = op.inputs[0]\n  W = op.inputs[1]\n  R = op.inputs[2]\n  bx = op.inputs[3]\n  br = op.inputs[4]\n  gamma = op.inputs[5]\n  zoneout_mask = op.inputs[6]\n  h = op.outputs[0]\n  cache = op.outputs[1]\n\n  # Pre-transpose matrices for better performance.\n  x = tf.transpose(x, [2, 0, 1])\n  W = tf.transpose(W, [1, 0])\n  R = tf.transpose(R, [1, 0])\n\n  grads = LIB.haste_layer_norm_gru_grad(x, W, R, bx, br, gamma, h, cache, grads[0], zoneout_mask)\n\n  return [*grads, None]\n\n\nclass LayerNormGRULayer(tf.Module):\n  def __init__(self,\n        num_units,\n    
    kernel_initializer=None,\n        recurrent_initializer=None,\n        bias_initializer=None,\n        recurrent_bias_initializer=None,\n        kernel_transform=None,\n        recurrent_transform=None,\n        bias_transform=None,\n        recurrent_bias_transform=None,\n        dropout=0.0,\n        zoneout=0.0,\n        dtype=None,\n        name=None):\n    super(LayerNormGRULayer, self).__init__(name)\n    self.realname = name\n    self.num_units = num_units\n\n    identity = lambda x: x\n    self.kernel_config = WeightConfig(v1.initializers.glorot_uniform(), None, identity)\n    self.recurrent_config = WeightConfig(v1.initializers.orthogonal(), None, identity)\n    self.bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n    self.recurrent_bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n\n    self.kernel_config.override(kernel_initializer, None, kernel_transform)\n    self.recurrent_config.override(recurrent_initializer, None, recurrent_transform)\n    self.bias_config.override(bias_initializer, None, bias_transform)\n    self.recurrent_bias_config.override(recurrent_bias_initializer, None, recurrent_bias_transform)\n\n    self.dropout = dropout\n    self.zoneout = zoneout\n    self.dtype = dtype or tf.float32\n    self.built = False\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    def build_weights(initializer, shape):\n      weights = [initializer(shape, dtype=self.dtype) for _ in range(3)]\n      weights = tf.concat(weights, axis=-1)\n      return weights\n\n    kernel_weights = build_weights(self.kernel_config.initializer, [input_size, num_units])\n    recurrent_weights = build_weights(self.recurrent_config.initializer, [num_units, num_units])\n    biases = build_weights(self.bias_config.initializer, [num_units])\n    recurrent_biases = build_weights(self.recurrent_bias_config.initializer, [num_units])\n\n    weights = 
tf.concat([kernel_weights, recurrent_weights], axis=0)\n    biases = tf.concat([biases, recurrent_biases], axis=0)\n\n    with self.name_scope, v1.variable_scope(self.realname, 'gru_cell'):\n      self._kernel = v1.get_variable('kernel', initializer=weights)\n      self._bias = v1.get_variable('bias', initializer=biases)\n      self.gamma = v1.get_variable('gamma', shape=[2, self.num_units * 3], initializer=v1.initializers.ones())\n    self.built = True\n\n  def get_weights(self):\n    input_size = self._kernel.shape.as_list()[0] - self.num_units\n    kernel, recurrent_kernel = tf.split(self._kernel, [input_size, self.num_units], axis=0)\n    bias, recurrent_bias = tf.split(self._bias, 2, axis=0)\n    return {\n        'kernel': self.kernel_config.transform(kernel),\n        'recurrent_kernel': self.recurrent_config.transform(recurrent_kernel),\n        'bias': self.bias_config.transform(bias),\n        'recurrent_bias': self.recurrent_bias_config.transform(recurrent_bias),\n        'gamma': self.gamma,\n    }\n\n  def __call__(self, inputs, sequence_length, training):\n    self.build(inputs.shape)\n\n    shape = tf.shape(inputs)\n    time_steps = shape[0]\n    batch_size = shape[1]\n\n    # Use an empty zoneout mask if no zoneout is going to be applied.\n    # Sadly, we can't pass `None` to the op but at least we won't be wasting\n    # memory or bandwidth on this tensor.\n    zoneout_mask = tf.zeros([0, 0, 0], dtype=self.dtype)\n    if self.zoneout:\n      zoneout_mask = 1.0 - self.zoneout\n      zoneout_mask += tf.random.uniform([time_steps, batch_size, self.num_units], dtype=self.dtype)\n      zoneout_mask = tf.floor(zoneout_mask)\n\n    weights = self.get_weights()\n    if training and self.dropout > 0:\n      recurrent_kernel = tf.nn.dropout(weights['recurrent_kernel'], rate=self.dropout)\n    else:\n      recurrent_kernel = weights['recurrent_kernel']\n    result, _ = LIB.haste_layer_norm_gru(\n        inputs,\n        weights['kernel'],\n        
recurrent_kernel,\n        weights['bias'],\n        weights['recurrent_bias'],\n        weights['gamma'],\n        zoneout_mask,\n        training=training,\n        zoneout_prob=self.zoneout)\n\n    if sequence_length is not None:\n      # 0-indexed tensors, so length-1.\n      indices = sequence_length\n      indices = tf.stack([indices, tf.range(batch_size, dtype=sequence_length.dtype)], axis=-1)\n      state = tf.gather_nd(result, indices)\n    else:\n      state = result[-1]\n\n    return result[1:], state\n\n\nclass LayerNormGRU(BaseRNN):\n  \"\"\"\n  Layer Normalized Gated Recurrent Unit layer.\n\n  This GRU layer applies layer normalization to the input and recurrent output\n  activations of a standard GRU. The implementation is fused and\n  GPU-accelerated. There are two commonly-used variants of GRU cells. This one\n  implements 1406.1078v1 which applies the reset gate to the hidden state\n  after matrix multiplication. The other variant, 1406.1078v3, applies the\n  reset gate before matrix multiplication and is currently unsupported.\n\n  This layer has built-in support for DropConnect and Zoneout, which are\n  both techniques used to regularize RNNs.\n  \"\"\"\n\n  def __init__(self, num_units, direction='unidirectional', **kwargs):\n    \"\"\"\n    Initialize the parameters of the GRU layer.\n\n    Arguments:\n      num_units: int, the number of units in the GRU cell.\n      direction: string, 'unidirectional' or 'bidirectional'.\n      **kwargs: Dict, keyword arguments (see below).\n\n    Keyword Arguments:\n      kernel_initializer: (optional) the initializer to use for the input\n        matrix weights. Defaults to `glorot_uniform`.\n      recurrent_initializer: (optional) the initializer to use for the\n        recurrent matrix weights. Defaults to `orthogonal`.\n      bias_initializer: (optional) the initializer to use for input bias\n        vectors. 
Defaults to `zeros`.\n      recurrent_bias_initializer: (optional) the initializer to use for\n        recurrent bias vectors. Defaults to `zeros`.\n      kernel_transform: (optional) a function with signature\n        `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n        used. Defaults to the identity function.\n      recurrent_transform: (optional) a function with signature\n        `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n        kernel before it is used. Defaults to the identity function.\n      bias_transform: (optional) a function with signature\n        `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n        Defaults to the identity function.\n      recurrent_bias_transform: (optional) a function with signature\n        `(recurrent_bias: Tensor) -> Tensor` that transforms the recurrent bias\n        before it is used. Defaults to the identity function.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        regularization on the recurrent matrix. Defaults to 0.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization. Defaults to 0.\n      dtype: (optional) the data type for this layer. Defaults to `tf.float32`.\n      name: (optional) string, the name for this layer.\n    \"\"\"\n    super().__init__(LayerNormGRULayer, num_units, direction, 'gru_cell', **kwargs)\n"
  },
  {
    "path": "frameworks/tf/layer_norm_gru_cell.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"A GRU cell compatible with the Haste LayerNormGRU layer.\"\"\"\n\n\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\n\n\n__all__ = [\n    'LayerNormGRUCell'\n]\n\n\nclass LayerNormGRUCell(rnn_cell.RNNCell):\n  \"\"\"\n  A GRU cell that's compatible with the Haste LayerNormGRU layer.\n\n  This cell can be used on hardware other than GPUs and with other TensorFlow\n  classes that operate on RNN cells (e.g. 
`dynamic_rnn`, `BasicDecoder`, cell\n  wrappers, etc.).\n  \"\"\"\n\n  def __init__(self,\n        num_units,\n        forget_bias=1.0,\n        dropout=0.0,\n        dtype=None,\n        name=None,\n        **kwargs):\n    super(LayerNormGRUCell, self).__init__(dtype=dtype, name=name, **kwargs)\n    self.realname = name\n    self.num_units = num_units\n\n    self.forget_bias = forget_bias\n    self.dropout = dropout\n    self.kernel = None\n    self.recurrent_kernel = None\n    self.bias = None\n    self.recurrent_bias = None\n    self.gamma = None\n    self.built = False\n\n  @property\n  def state_size(self):\n    return self.num_units\n\n  @property\n  def output_size(self):\n    return self.num_units\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n    dtype = self.dtype or tf.float32\n\n    kernel_initializer = v1.initializers.glorot_uniform(dtype=dtype)\n    bias_initializer = v1.initializers.zeros(dtype=dtype)\n\n    with tf.name_scope(self.name), v1.variable_scope(self.realname, 'gru_cell'):\n      self._kernel = v1.get_variable('kernel', initializer=lambda: kernel_initializer([input_size + num_units, num_units * 3]))\n      self._bias = v1.get_variable('bias', initializer=lambda: bias_initializer([num_units * 6]))\n      self.gamma = v1.get_variable('gamma', shape=[2, num_units * 3], initializer=v1.initializers.ones())\n\n    self.kernel, self.recurrent_kernel = tf.split(self._kernel, [input_size, num_units], axis=0)\n    self.bias, self.recurrent_bias = tf.split(self._bias, 2, axis=0)\n\n    self.built = True\n\n  def __call__(self, inputs, state, training=False, scope=None):\n    self.build(inputs.shape)\n\n    if training and self.dropout > 0:\n      R = tf.nn.dropout(self.recurrent_kernel, rate=self.dropout)\n    else:\n      R = self.recurrent_kernel\n\n    x = self._layer_norm(inputs @ self.kernel, self.gamma[0]) + self.bias\n    h_proj = self._layer_norm(state @ R, self.gamma[1]) + 
self.recurrent_bias\n    h_z, h_r, h_g = tf.split(h_proj, 3, axis=-1)\n    x_z, x_r, x_g = tf.split(x, 3, axis=-1)\n    z = tf.nn.sigmoid(h_z + x_z)\n    r = tf.nn.sigmoid(h_r + x_r)\n    g = tf.nn.tanh(r * h_g + x_g)\n    h = z * state + (1 - z) * g\n    return h, h\n\n  def _layer_norm(self, x, gamma):\n    mean, variance = tf.nn.moments(x, axes=[-1], keepdims=True)\n    return tf.nn.batch_normalization(x, mean, variance, None, gamma, 1e-5)\n"
  },
  {
    "path": "frameworks/tf/layer_norm_indrnn.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cuda_runtime_api.h>\n\n#include \"arena.h\"\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\nnamespace layer_norm = haste::v0::layer_norm;\nnamespace layer_norm_indrnn = haste::v0::layer_norm_indrnn;\n\n// Define the interface and shape function for the op.\nREGISTER_OP(\"HasteLayerNormIndrnn\")\n    .Attr(\"R: {float, double}\")         // Some real number type.\n    .Attr(\"training: bool\")\n    .Attr(\"zoneout_prob: float\")\n    .Input(\"x: R\")                      // [T,N,C]\n    .Input(\"kernel: R\")                 // [C,H]\n    .Input(\"recurrent_scale: R\")        // [H]\n    .Input(\"bias: R\")                   // [H]\n    .Input(\"gamma: R\")                  // [2,H]\n    .Input(\"zoneout_mask: R\")           // [T,N,H]\n    .Output(\"h: R\")               
      // [T+1,N,H]\n    .Output(\"cache: R\")\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 1, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 2, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 3, &zoneout_mask_shape));\n\n      const DimensionHandle time_steps = c->Dim(input_shape, 0);\n      const DimensionHandle batch_size = c->Dim(input_shape, 1);\n      const DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n      DimensionHandle time_steps_plus_1;\n\n      TF_RETURN_IF_ERROR(c->Add(time_steps, 1, &time_steps_plus_1));\n\n      c->set_output(0, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(1, c->UnknownShapeOfRank(1));\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormIndrnnOp : public OpKernel {\n  explicit HasteLayerNormIndrnnOp(OpKernelConstruction* context) : OpKernel(context) {\n    OP_REQUIRES_OK(context, context->GetAttr(\"training\", &training_));\n    OP_REQUIRES_OK(context, context->GetAttr(\"zoneout_prob\", &zoneout_prob_));\n  }\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_scale = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& gamma = context->input(4);\n    const Tensor& zoneout_mask = context->input(5);\n\n    const auto time_steps = input.shape().dim_size(0);\n    const auto batch_size = 
input.shape().dim_size(1);\n    const auto input_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_scale.shape().dim_size(0);\n    const bool has_zoneout = zoneout_prob_ && zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    const TensorShape output_shape = { time_steps + 1, batch_size, hidden_size };\n    Tensor* output = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));\n\n    Tensor workspace;\n    const TensorShape workspace_shape = { time_steps, batch_size, hidden_size };\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, workspace_shape, &workspace));\n\n    const ArenaLayout<T> memory_layout = {\n      { \"act_Wx\", { time_steps, batch_size, hidden_size } },\n      { \"act_Wx_norm_cache\", { time_steps, batch_size, 2 } },\n    };\n\n    Tensor* output_cache = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, { memory_layout.NumElements() }, &output_cache));\n\n    Arena<T> memory = memory_layout.Realize(output_cache->flat<T>().data());\n    TensorView<T> act_Wx = memory[\"act_Wx\"];\n    TensorView<T> act_Wx_norm_cache = memory[\"act_Wx_norm_cache\"];\n\n    cudaMemset(output->flat<T>().data(), 0, output->AllocatedBytes());\n\n    layer_norm::ForwardPass<T> layer_norm1(\n        time_steps * batch_size,\n        hidden_size,\n        gamma.SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx_norm_cache.data());\n\n    layer_norm_indrnn::ForwardPass<T> forward(\n        training_,\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    forward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_scale.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        output->flat<T>().data(),\n        workspace.flat<T>().data(),\n        act_Wx.data(),\n        layer_norm1,\n        has_zoneout ? 
zoneout_prob_ : 0.0f,\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n\n  private:\n    bool training_;\n    float zoneout_prob_;\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormIndrnn, float);\nREGISTER_GPU_KERNEL(HasteLayerNormIndrnn, double);\n\nREGISTER_OP(\"HasteLayerNormIndrnnGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x_t: R\")                   // [C,T,N]\n    .Input(\"kernel_t: R\")              // [H,C]\n    .Input(\"recurrent_scale: R\")       // [H]\n    .Input(\"bias: R\")                  // [H]\n    .Input(\"gamma: R\")                 // [2,H]\n    .Input(\"zoneout_mask: R\")          // [T,N,H]\n    .Input(\"h: R\")                     // [T+1,N,H]\n    .Input(\"cache: R\")                 // [?]\n    .Input(\"dh_new: R\")                // [T+1,N,H]\n    .Output(\"dx: R\")                   // [T,N,C]\n    .Output(\"dw: R\")                   // [C,H]\n    .Output(\"dr: R\")                   // [H]\n    .Output(\"db: R\")                   // [H]\n    .Output(\"dgamma: R\")               // [2,H]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle zoneout_mask_shape;\n      ShapeHandle h_shape;\n      ShapeHandle cache_shape;\n      ShapeHandle dh_new_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 1, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 2, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 3, &zoneout_mask_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 3, &h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 1, &cache_shape));\n      
TF_RETURN_IF_ERROR(c->WithRank(c->input(8), 3, &dh_new_shape));\n\n      DimensionHandle input_size = c->Dim(x_shape, 0);\n      DimensionHandle time_steps = c->Dim(x_shape, 1);\n      DimensionHandle batch_size = c->Dim(x_shape, 2);\n      DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n\n      c->set_output(0, c->MakeShape({ time_steps, batch_size, input_size }));\n      c->set_output(1, c->MakeShape({ input_size, hidden_size }));\n      c->set_output(2, recurrent_shape);\n      c->set_output(3, bias_shape);\n      c->set_output(4, gamma_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormIndrnnGradOp : public OpKernel {\n  explicit HasteLayerNormIndrnnGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_scale = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& gamma = context->input(4);\n    const Tensor& zoneout_mask = context->input(5);\n    const Tensor& h_vector = context->input(6);\n    const Tensor& cache_input = context->input(7);\n    const Tensor& dh_new = context->input(8);\n\n    const auto input_size = input.shape().dim_size(0);\n    const auto time_steps = input.shape().dim_size(1);\n    const auto batch_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_scale.shape().dim_size(0);\n    const bool has_zoneout = !!zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    const ArenaLayout<T> memory_layout = {\n      { \"act_Wx\", { time_steps, batch_size, hidden_size } },\n      { \"act_Wx_norm_cache\", { time_steps, batch_size, 2 } },\n    };\n\n    assert(cache_input.shape().num_elements() == memory_layout.NumElements());\n\n    Arena<T> memory = memory_layout.Realize(const_cast<T*>(cache_input.flat<T>().data()));\n    TensorView<T> 
act_Wx = memory[\"act_Wx\"];\n    TensorView<T> act_Wx_norm_cache = memory[\"act_Wx_norm_cache\"];\n\n    // Can be uninitialized. Output only, no accumulation.\n    const TensorShape dx_shape = { time_steps, batch_size, input_size };\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, dx_shape, &dx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dW_shape = { input_size, hidden_size };\n    Tensor* dW = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, dW_shape, &dW));\n\n    // Needs to be initialized to 0.\n    const TensorShape du_shape = { hidden_size };\n    Tensor* du = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, du_shape, &du));\n\n    // Needs to be initialized to 0.\n    Tensor* db = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(3, bias.shape(), &db));\n\n    // Needs to be initialized to 0.\n    Tensor* dgamma = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(4, gamma.shape(), &dgamma));\n\n    // Needs to be initialized to 0.\n    const TensorShape dh_shape = { batch_size, hidden_size };\n    Tensor dh;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dh_shape, &dh));\n\n    const TensorShape workspace_shape = { time_steps, batch_size, hidden_size };\n    Tensor workspace;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, workspace_shape, &workspace));\n\n    cudaMemset(dW->flat<T>().data(), 0, dW->AllocatedBytes());\n    cudaMemset(du->flat<T>().data(), 0, du->AllocatedBytes());\n    cudaMemset(db->flat<T>().data(), 0, db->AllocatedBytes());\n    cudaMemset(dgamma->flat<T>().data(), 0, dgamma->AllocatedBytes());\n    cudaMemset(dh.flat<T>().data(), 0, dh.AllocatedBytes());\n\n    layer_norm::BackwardPass<T> layer_norm1(\n        time_steps * batch_size,\n        hidden_size,\n        gamma.SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx.data(),\n        
dgamma->SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx_norm_cache.data());\n\n    layer_norm_indrnn::BackwardPass<T> backward(\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    backward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_scale.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        h_vector.flat<T>().data(),\n        dh_new.flat<T>().data(),\n        dx->flat<T>().data(),\n        dW->flat<T>().data(),\n        du->flat<T>().data(),\n        db->flat<T>().data(),\n        dh.flat<T>().data(),\n        workspace.flat<T>().data(),\n        layer_norm1,\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormIndrnnGrad, float);\nREGISTER_GPU_KERNEL(HasteLayerNormIndrnnGrad, double);\n"
  },
  {
    "path": "frameworks/tf/layer_norm_indrnn.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalized Independently Recurrent Neural Network\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\n\nfrom .base_rnn import BaseRNN\nfrom .weight_config import WeightConfig\n\n\n__all__ = [\n    'LayerNormIndRNN'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteLayerNormIndrnn\")\ndef layer_norm_indrnn_gradient(op, *grads):\n  training = op.get_attr('training')\n  if not training:\n    raise ValueError(('LayerNormIndRNN can only compute gradients if `training=True` was specified '\n                      'during the forward pass.\\nFailed op: {}').format(op.name))\n\n  # Extract inputs and outputs from the op.\n  x = op.inputs[0]\n  W = op.inputs[1]\n  u = op.inputs[2]\n  b = op.inputs[3]\n  gamma = op.inputs[4]\n  zoneout_mask = op.inputs[5]\n  h = op.outputs[0]\n  cache = op.outputs[1]\n\n  # Pre-transpose matrices for better performance.\n  x = tf.transpose(x, [2, 0, 1])\n  W = tf.transpose(W, [1, 0])\n\n  grads = LIB.haste_layer_norm_indrnn_grad(x, W, u, b, gamma, zoneout_mask, h, cache, grads[0])\n  return [*grads, None]\n\n\ndef _get_initializer(initializer):\n  if not 
isinstance(initializer, dict):\n    return initializer\n  if 'uniform' in initializer:\n    value = initializer['uniform']\n    return v1.initializers.random_uniform(-value, value)\n  if 'normal' in initializer:\n    value = initializer['normal']\n    return v1.initializers.truncated_normal(stddev=value)\n  raise ValueError(f'Unknown initializer {initializer}')\n\n\nclass LayerNormIndRNNLayer(tf.Module):\n  def __init__(self,\n        num_units,\n        kernel_initializer=None,\n        recurrent_initializer=None,\n        bias_initializer=None,\n        kernel_transform=None,\n        recurrent_transform=None,\n        bias_transform=None,\n        zoneout=0.0,\n        dtype=None,\n        name=None):\n    super().__init__(name)\n    self.realname = name\n    self.num_units = num_units\n\n    identity = lambda x: x\n    self.kernel_config = WeightConfig(v1.initializers.glorot_uniform(), None, identity)\n    self.recurrent_config = WeightConfig(v1.initializers.random_uniform(-0.5, 0.5), None, identity)\n    self.bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n\n    self.kernel_config.override(_get_initializer(kernel_initializer), None, kernel_transform)\n    self.recurrent_config.override(_get_initializer(recurrent_initializer), None, recurrent_transform)\n    self.bias_config.override(_get_initializer(bias_initializer), None, bias_transform)\n\n    self.zoneout = zoneout\n    self.dtype = dtype or tf.float32\n    self.kernel = None\n    self.recurrent_scale = None\n    self.bias = None\n    self.gamma = None\n    self.recurrent_bias = None\n    self.built = False\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    kernel_shape = tf.TensorShape([input_size, num_units])\n    recurrent_shape = tf.TensorShape([num_units])\n    bias_shape = tf.TensorShape([num_units])\n\n    kernel_weights = self.kernel_config.initializer(kernel_shape, dtype=self.dtype)\n    
recurrent_weights = self.recurrent_config.initializer(recurrent_shape, dtype=self.dtype)\n    biases = self.bias_config.initializer(bias_shape, dtype=self.dtype)\n\n    with self.name_scope, v1.variable_scope(self.realname, 'indrnn_cell'):\n      self.kernel = v1.get_variable('kernel', initializer=kernel_weights)\n      self.recurrent_scale = v1.get_variable('recurrent_scale', initializer=recurrent_weights)\n      self.bias = v1.get_variable('bias', initializer=biases)\n      self.gamma = v1.get_variable('gamma', shape=[2, self.num_units], initializer=v1.initializers.ones())\n    self.built = True\n\n  def get_weights(self):\n    return {\n        'kernel': self.kernel_config.transform(self.kernel),\n        'recurrent_scale': self.recurrent_config.transform(self.recurrent_scale),\n        'bias': self.bias_config.transform(self.bias),\n        'gamma': self.gamma,\n    }\n\n  def __call__(self, inputs, sequence_length, training):\n    self.build(inputs.shape)\n\n    shape = tf.shape(inputs)\n    time_steps = shape[0]\n    batch_size = shape[1]\n\n    # Use an empty zoneout mask if no zoneout is going to be applied.\n    # Sadly, we can't pass `None` to the op but at least we won't be wasting\n    # memory or bandwidth on this tensor.\n    zoneout_mask = tf.zeros([0, 0, 0], dtype=self.dtype)\n    if self.zoneout:\n      zoneout_mask = 1.0 - self.zoneout\n      zoneout_mask += tf.random.uniform([time_steps, batch_size, self.num_units], dtype=self.dtype)\n      zoneout_mask = tf.floor(zoneout_mask)\n\n    weights = self.get_weights()\n    result, _ = LIB.haste_layer_norm_indrnn(\n        inputs,\n        weights['kernel'],\n        weights['recurrent_scale'],\n        weights['bias'],\n        weights['gamma'],\n        zoneout_mask,\n        training=training,\n        zoneout_prob=self.zoneout)\n\n    if sequence_length is not None:\n      # `result` includes the initial state at index 0, so the final state for\n      # a sequence of length `n` is at index `n` (not `n-1`).\n      indices = sequence_length\n      indices = tf.stack([indices, tf.range(batch_size, 
dtype=sequence_length.dtype)], axis=-1)\n      state = tf.gather_nd(result, indices)\n    else:\n      state = result[-1]\n\n    return result[1:], state\n\n\nclass LayerNormIndRNN(BaseRNN):\n  \"\"\"\n  Layer Normalized Independently Recurrent Neural Network layer.\n\n  This IndRNN layer applies layer normalization to the input activations of a\n  standard IndRNN. The implementation is fused and GPU-accelerated.\n\n  This layer has built-in support for Zoneout regularization.\n  \"\"\"\n\n  def __init__(self, num_units, direction='unidirectional', **kwargs):\n    \"\"\"\n    Initialize the parameters of the IndRNN layer.\n\n    Arguments:\n      num_units: int, the number of units in the IndRNN cell.\n      direction: string, 'unidirectional' or 'bidirectional'.\n      **kwargs: Dict, keyword arguments (see below).\n\n    Keyword Arguments:\n      kernel_initializer: (optional) the initializer to use for the input\n        matrix weights. Defaults to `glorot_uniform`.\n      recurrent_initializer: (optional) the initializer to use for the\n        recurrent scale weights. Defaults to uniform random in [-0.5, 0.5].\n        Note that this initialization scheme is different than in the original\n        authors' implementation. See https://github.com/lmnt-com/haste/issues/7\n        for details.\n      bias_initializer: (optional) the initializer to use for the bias vector.\n        Defaults to `zeros`.\n      kernel_transform: (optional) a function with signature\n        `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n        used. Defaults to the identity function.\n      recurrent_transform: (optional) a function with signature\n        `(recurrent_scale: Tensor) -> Tensor` that transforms the recurrent\n        scale vector before it is used. 
Defaults to the identity function.\n      bias_transform: (optional) a function with signature\n        `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n        Defaults to the identity function.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization. Defaults to 0.\n      dtype: (optional) the data type for this layer. Defaults to `tf.float32`.\n      name: (optional) string, the name for this layer.\n    \"\"\"\n    super().__init__(LayerNormIndRNNLayer, num_units, direction, 'indrnn_cell', **kwargs)\n"
  },
  {
    "path": "frameworks/tf/layer_norm_lstm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cuda_runtime_api.h>\n\n#include \"arena.h\"\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\nnamespace layer_norm = haste::v0::layer_norm;\nnamespace layer_norm_lstm = haste::v0::layer_norm_lstm;\n\n// Define the interface and shape function for the op.\nREGISTER_OP(\"HasteLayerNormLstm\")\n    .Attr(\"R: {float, double}\")         // Some real number type.\n    .Attr(\"training: bool\")\n    .Attr(\"zoneout_prob: float\")\n    .Input(\"x: R\")                      // [T,N,C]\n    .Input(\"kernel: R\")                 // [C,H*4]\n    .Input(\"recurrent_kernel: R\")       // [H,H*4]\n    .Input(\"bias: R\")                   // [H*4]\n    .Input(\"gamma: R\")\n    .Input(\"gamma_h: R\")\n    .Input(\"beta_h: R\")\n    .Input(\"zoneout_mask: R\")           // [T,N,H]\n    
.Output(\"h: R\")                     // [T,N,H]\n    .Output(\"c: R\")                     // [T,N,H]\n    .Output(\"cache: R\")                 // [?] (activations cache)\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle gamma_h_shape;\n      ShapeHandle beta_h_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 2, &gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 1, &gamma_h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 1, &beta_h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 3, &zoneout_mask_shape));\n\n      const DimensionHandle time_steps = c->Dim(input_shape, 0);\n      const DimensionHandle batch_size = c->Dim(input_shape, 1);\n      const DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n      DimensionHandle time_steps_plus_1;\n\n      TF_RETURN_IF_ERROR(c->Add(time_steps, 1, &time_steps_plus_1));\n\n      c->set_output(0, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(1, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(2, c->UnknownShapeOfRank(1));\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormLstmOp : public OpKernel {\n  explicit HasteLayerNormLstmOp(OpKernelConstruction* context) : OpKernel(context) {\n    OP_REQUIRES_OK(context, context->GetAttr(\"training\", &training_));\n    OP_REQUIRES_OK(context, context->GetAttr(\"zoneout_prob\", &zoneout_prob_));\n  }\n\n  // When running on GPU, TF 
backs all inputs and outputs with device memory\n  // and not host memory. We don't need to do explicit memory copies or allocations\n  // for the inputs and outputs.\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& gamma = context->input(4);\n    const Tensor& gamma_h = context->input(5);\n    const Tensor& beta_h = context->input(6);\n    const Tensor& zoneout_mask = context->input(7);\n\n    const auto time_steps = input.shape().dim_size(0);\n    const auto batch_size = input.shape().dim_size(1);\n    const auto input_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(0);\n    const bool has_zoneout = zoneout_prob_ && zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    OP_REQUIRES(context, input_size == kernel.shape().dim_size(0),\n        errors::InvalidArgument(\"input[2] and kernel[0] dimensions must match. 
Found \",\n            input_size, \" and \", kernel.shape().dim_size(0)));\n\n    const TensorShape output_shape = { time_steps + 1, batch_size, hidden_size };\n    const TensorShape activations_shape = { time_steps, batch_size, hidden_size * 4 };\n    const TensorShape norm_cache_shape = { time_steps, batch_size, 2 };\n\n    Tensor* output = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));\n\n    Tensor* output_cell_state = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, output_shape, &output_cell_state));\n\n    const ArenaLayout<T> memory_layout = {\n      { \"act_Wx\", activations_shape },\n      { \"act_Wx_norm\", activations_shape },\n      { \"act_Wx_norm_cache\", norm_cache_shape },\n      { \"act_Rh\", activations_shape },\n      { \"act_Rh_norm_cache\", norm_cache_shape },\n      { \"act_c_norm\", { time_steps, batch_size, hidden_size } },\n      { \"act_c_norm_cache\", norm_cache_shape },\n    };\n\n    Tensor* output_cache = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, { memory_layout.NumElements() }, &output_cache));\n\n    Arena<T> memory = memory_layout.Realize(output_cache->flat<T>().data());\n    TensorView<T> act_Wx = memory[\"act_Wx\"];\n    TensorView<T> act_Wx_norm = memory[\"act_Wx_norm\"];\n    TensorView<T> act_Wx_norm_cache = memory[\"act_Wx_norm_cache\"];\n    TensorView<T> act_Rh = memory[\"act_Rh\"];\n    TensorView<T> act_Rh_norm_cache = memory[\"act_Rh_norm_cache\"];\n    TensorView<T> act_c_norm = memory[\"act_c_norm\"];\n    TensorView<T> act_c_norm_cache = memory[\"act_c_norm_cache\"];\n\n    Tensor tmp_Rh;\n    const TensorShape tmp_Rh_shape = { batch_size, 4 * hidden_size };\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, tmp_Rh_shape, &tmp_Rh));\n\n    cudaMemset(output->flat<T>().data(), 0, output->AllocatedBytes());\n    cudaMemset(output_cell_state->flat<T>().data(), 0, output_cell_state->AllocatedBytes());\n\n    
layer_norm::ForwardPass<T> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma.SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx_norm_cache.data());\n\n    layer_norm::ForwardPass<T> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma.SubSlice(1).unaligned_flat<T>().data(),\n        nullptr,\n        act_Rh_norm_cache.data());\n\n    layer_norm::ForwardPass<T> layer_norm3(\n        time_steps * batch_size,\n        hidden_size,\n        gamma_h.flat<T>().data(),\n        beta_h.flat<T>().data(),\n        act_c_norm_cache.data());\n\n    layer_norm_lstm::ForwardPass<T> lstm(\n        training_,\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    lstm.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        output->flat<T>().data(),\n        output_cell_state->flat<T>().data(),\n        act_Wx.data(),\n        tmp_Rh.flat<T>().data(),\n        layer_norm1,\n        act_Wx_norm.data(),\n        act_Rh.data(),\n        layer_norm2,\n        layer_norm3,\n        act_c_norm.data(),\n        has_zoneout ? zoneout_prob_ : 0.0f,\n        has_zoneout ? 
zoneout_mask.flat<T>().data() : nullptr);\n  }\n\n  private:\n    bool training_;\n    float zoneout_prob_;\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormLstm, float);\nREGISTER_GPU_KERNEL(HasteLayerNormLstm, double);\n\nREGISTER_OP(\"HasteLayerNormLstmGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x_t: R\")                   // [C,T,N]\n    .Input(\"kernel_t: R\")              // [H*4,C]\n    .Input(\"recurrent_kernel_t: R\")    // [H*4,H]\n    .Input(\"bias: R\")                  // [H*4]\n    .Input(\"gamma: R\")\n    .Input(\"gamma_h: R\")\n    .Input(\"beta_h: R\")\n    .Input(\"h: R\")                     // [T,N,H]\n    .Input(\"c: R\")                     // [T,N,H]\n    .Input(\"cache: R\")\n    .Input(\"dh_new: R\")                // [T,N,H]\n    .Input(\"dc_new: R\")                // [T,N,H]\n    .Input(\"zoneout_mask: R\")          // [T,N,H]\n    .Output(\"dx: R\")                   // [T,N,C]\n    .Output(\"dw: R\")                   // [C,H*4]\n    .Output(\"dr: R\")                   // [H,H*4]\n    .Output(\"db: R\")                   // [H*4]\n    .Output(\"dgamma: R\")\n    .Output(\"dgamma_h: R\")\n    .Output(\"dbeta_h: R\")\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_kernel_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle gamma_shape;\n      ShapeHandle gamma_h_shape;\n      ShapeHandle beta_h_shape;\n      ShapeHandle h_shape;\n      ShapeHandle c_shape;\n      ShapeHandle cache_shape;\n      ShapeHandle dh_new_shape;\n      ShapeHandle dc_new_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 2, 
&gamma_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 1, &gamma_h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 1, &beta_h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 3, &h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(8), 3, &c_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(9), 1, &cache_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(10), 3, &dh_new_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(11), 3, &dc_new_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(12), 3, &zoneout_mask_shape));\n\n      DimensionHandle input_size = c->Dim(x_shape, 0);\n      DimensionHandle time_steps = c->Dim(x_shape, 1);\n      DimensionHandle batch_size = c->Dim(x_shape, 2);\n      DimensionHandle hidden_size = c->Dim(recurrent_kernel_shape, 1);\n      DimensionHandle hidden_size_4;\n\n      TF_RETURN_IF_ERROR(c->Multiply(hidden_size, 4, &hidden_size_4));\n\n      c->set_output(0, c->MakeShape({ time_steps, batch_size, input_size }));\n      c->set_output(1, c->MakeShape({ input_size, hidden_size_4 }));\n      c->set_output(2, c->MakeShape({ hidden_size, hidden_size_4 }));\n      c->set_output(3, bias_shape);\n      c->set_output(4, gamma_shape);\n      c->set_output(5, gamma_h_shape);\n      c->set_output(6, beta_h_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLayerNormLstmGradOp : public OpKernel {\n  explicit HasteLayerNormLstmGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& gamma = context->input(4);\n    const Tensor& gamma_h = context->input(5);\n    const Tensor& beta_h = context->input(6);\n    const Tensor& h_vector = context->input(7);\n    const Tensor& c_vector = 
context->input(8);\n    const Tensor& cache_input = context->input(9);\n    const Tensor& dh_new = context->input(10);\n    const Tensor& dc_new = context->input(11);\n    const Tensor& zoneout_mask = context->input(12);\n\n    const auto input_size = input.shape().dim_size(0);\n    const auto time_steps = input.shape().dim_size(1);\n    const auto batch_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(1);\n    const bool has_zoneout = !!zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    // Can be uninitialized. Output only, no accumulation.\n    const TensorShape dx_shape = { time_steps, batch_size, input_size };\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, dx_shape, &dx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dW_shape = { input_size, hidden_size * 4 };\n    Tensor* dW = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, dW_shape, &dW));\n\n    // Needs to be initialized to 0.\n    const TensorShape dR_shape = { hidden_size, hidden_size * 4 };\n    Tensor* dR = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, dR_shape, &dR));\n\n    // Needs to be initialized to 0.\n    const TensorShape db_shape = { hidden_size * 4 };\n    Tensor* db = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(3, db_shape, &db));\n\n    // Needs to be initialized to 0.\n    Tensor* dgamma = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(4, gamma.shape(), &dgamma));\n\n    // Needs to be initialized to 0.\n    Tensor* dgamma_h = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(5, gamma_h.shape(), &dgamma_h));\n\n    // Needs to be initialized to 0.\n    Tensor* dbeta_h = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(6, beta_h.shape(), &dbeta_h));\n\n    // Needs to be initialized to 0.\n    const TensorShape dh_shape = { batch_size, hidden_size };\n  
  Tensor dh;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dh_shape, &dh));\n\n    // Needs to be initialized to 0.\n    const TensorShape dc_shape = { batch_size, hidden_size };\n    Tensor dc;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dc_shape, &dc));\n\n    const TensorShape activations_shape = { time_steps, batch_size, hidden_size * 4 };\n    const TensorShape norm_cache_shape = { time_steps, batch_size, 2 };\n    const ArenaLayout<T> memory_layout = {\n      { \"act_Wx\", activations_shape },\n      { \"act_Wx_norm\", activations_shape },\n      { \"act_Wx_norm_cache\", norm_cache_shape },\n      { \"act_Rh\", activations_shape },\n      { \"act_Rh_norm_cache\", norm_cache_shape },\n      { \"act_c_norm\", { time_steps, batch_size, hidden_size } },\n      { \"act_c_norm_cache\", norm_cache_shape },\n    };\n\n    assert(cache_input.shape().num_elements() == memory_layout.NumElements());\n\n    Arena<T> memory = memory_layout.Realize(const_cast<T*>(cache_input.flat<T>().data()));\n    TensorView<T> act_Wx = memory[\"act_Wx\"];\n    TensorView<T> act_Wx_norm = memory[\"act_Wx_norm\"];\n    TensorView<T> act_Wx_norm_cache = memory[\"act_Wx_norm_cache\"];\n    TensorView<T> act_Rh = memory[\"act_Rh\"];\n    TensorView<T> act_Rh_norm_cache = memory[\"act_Rh_norm_cache\"];\n    TensorView<T> act_c_norm = memory[\"act_c_norm\"];\n    TensorView<T> act_c_norm_cache = memory[\"act_c_norm_cache\"];\n\n    cudaMemset(dW->flat<T>().data(), 0, dW->AllocatedBytes());\n    cudaMemset(dR->flat<T>().data(), 0, dR->AllocatedBytes());\n    cudaMemset(db->flat<T>().data(), 0, db->AllocatedBytes());\n    cudaMemset(dgamma->flat<T>().data(), 0, dgamma->AllocatedBytes());\n    cudaMemset(dgamma_h->flat<T>().data(), 0, dgamma_h->AllocatedBytes());\n    cudaMemset(dbeta_h->flat<T>().data(), 0, dbeta_h->AllocatedBytes());\n    cudaMemset(dh.flat<T>().data(), 0, dh.AllocatedBytes());\n    cudaMemset(dc.flat<T>().data(), 0, dc.AllocatedBytes());\n\n  
  layer_norm::BackwardPass<T> layer_norm1(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma.SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx.data(),\n        dgamma->SubSlice(0).unaligned_flat<T>().data(),\n        nullptr,\n        act_Wx_norm_cache.data());\n\n    layer_norm::BackwardPass<T> layer_norm2(\n        time_steps * batch_size,\n        hidden_size * 4,\n        gamma.SubSlice(1).unaligned_flat<T>().data(),\n        nullptr,\n        act_Rh.data(),\n        dgamma->SubSlice(1).unaligned_flat<T>().data(),\n        nullptr,\n        act_Rh_norm_cache.data());\n\n    layer_norm::BackwardPass<T> layer_norm3(\n        time_steps * batch_size,\n        hidden_size,\n        gamma_h.flat<T>().data(),\n        beta_h.flat<T>().data(),\n        c_vector.SubSlice(1).unaligned_flat<T>().data(),\n        dgamma_h->flat<T>().data(),\n        dbeta_h->flat<T>().data(),\n        act_c_norm_cache.data());\n\n    layer_norm_lstm::BackwardPass<T> lstm(\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    lstm.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        h_vector.flat<T>().data(),\n        c_vector.flat<T>().data(),\n        dh_new.flat<T>().data(),\n        dc_new.flat<T>().data(),\n        dx->flat<T>().data(),\n        dW->flat<T>().data(),\n        dR->flat<T>().data(),\n        db->flat<T>().data(),\n        dh.flat<T>().data(),\n        dc.flat<T>().data(),\n        act_Wx.data(),\n        layer_norm1,\n        act_Wx_norm.data(),\n        act_Rh.data(),\n        layer_norm2,\n        layer_norm3,\n        act_c_norm.data(),\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteLayerNormLstmGrad, float);\nREGISTER_GPU_KERNEL(HasteLayerNormLstmGrad, double);\n"
  },
  {
    "path": "frameworks/tf/layer_norm_lstm.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Layer Normalized Long Short-Term Memory\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\nfrom .base_rnn import BaseRNN\nfrom .weight_config import WeightConfig\n\n\n__all__ = [\n    'LayerNormLSTM'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteLayerNormLstm\")\ndef lstm_gradient(op, *grads):\n  training = op.get_attr('training')\n  if not training:\n    raise ValueError(('LayerNormLSTM can only compute gradients if `training=True` was specified during '\n                      'the forward pass.\\nFailed op: {}').format(op.name))\n\n  # Extract inputs and outputs from the op.\n  x = op.inputs[0]\n  W = op.inputs[1]\n  R = op.inputs[2]\n  b = op.inputs[3]\n  gamma = op.inputs[4]\n  gamma_h = op.inputs[5]\n  beta_h = op.inputs[6]\n  zoneout_mask = op.inputs[7]\n  h = op.outputs[0]\n  c = op.outputs[1]\n  cache = op.outputs[2]\n\n  # Pre-transpose matrices for better performance.\n  x = tf.transpose(x, [2, 0, 1])\n  W = tf.transpose(W, [1, 0])\n  R = tf.transpose(R, [1, 0])\n\n  dx, dW, dR, db, dgamma, dgamma_h, dbeta_h = LIB.haste_layer_norm_lstm_grad(\n      x,\n      W,\n      R,\n      b,\n      
gamma,\n      gamma_h,\n      beta_h,\n      h,\n      c,\n      cache,\n      grads[0],\n      grads[1],\n      zoneout_mask)\n  return [dx, dW, dR, db, dgamma, dgamma_h, dbeta_h, None]\n\n\nclass LayerNormLSTMLayer(tf.Module):\n  def __init__(self,\n        num_units,\n        kernel_initializer=None,\n        recurrent_initializer=None,\n        bias_initializer=None,\n        kernel_transform=None,\n        recurrent_transform=None,\n        bias_transform=None,\n        forget_bias=1.0,\n        dropout=0.0,\n        zoneout=0.0,\n        dtype=None,\n        name=None):\n    super(LayerNormLSTMLayer, self).__init__(name)\n    self.realname = name\n    self.num_units = num_units\n\n    identity = lambda x: x\n    self.kernel_config = WeightConfig(v1.initializers.glorot_uniform(), None, identity)\n    self.recurrent_config = WeightConfig(v1.initializers.orthogonal(), None, identity)\n    self.bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n\n    self.kernel_config.override(kernel_initializer, None, kernel_transform)\n    self.recurrent_config.override(recurrent_initializer, None, recurrent_transform)\n    self.bias_config.override(bias_initializer, None, bias_transform)\n\n    self.forget_bias = forget_bias\n    self.dropout = dropout\n    self.zoneout = zoneout\n    self.dtype = dtype or tf.float32\n    self.kernel = None\n    self.bias = None\n    self.gamma = None\n    self.gamma_h = None\n    self.beta_h = None\n    self.built = False\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    kernel_shape = tf.TensorShape([input_size, num_units])\n    recurrent_shape = tf.TensorShape([num_units, num_units])\n    bias_shape = tf.TensorShape([num_units])\n\n    kernel_weights = [self.kernel_config.initializer(kernel_shape, dtype=self.dtype) for _ in range(4)]\n    recurrent_weights = [self.recurrent_config.initializer(recurrent_shape, dtype=self.dtype) for _ in 
range(4)]\n    if self.forget_bias:\n      biases = [tf.zeros(bias_shape, dtype=self.dtype) for _ in range(4)]\n      biases[2] = tf.constant(self.forget_bias, shape=bias_shape, dtype=self.dtype)\n    else:\n      biases = [self.bias_config.initializer(bias_shape, dtype=self.dtype) for _ in range(4)]\n\n    kernel_weights = tf.concat(kernel_weights, axis=-1)\n    recurrent_weights = tf.concat(recurrent_weights, axis=-1)\n    biases = tf.concat(biases, axis=-1)\n\n    # Use the same format as LSTMBlockCell.\n    with self.name_scope, v1.variable_scope(self.realname, 'lstm_cell'):\n      weights = tf.concat([kernel_weights, recurrent_weights], axis=0)\n      self.kernel = v1.get_variable('kernel', initializer=weights)\n      self.bias = v1.get_variable('bias', initializer=biases)\n      self.gamma = v1.get_variable('gamma', shape=[2, self.num_units * 4], initializer=v1.initializers.ones())\n      self.gamma_h = v1.get_variable('gamma_h', shape=[self.num_units], initializer=v1.initializers.ones())\n      self.beta_h = v1.get_variable('beta_h', shape=[self.num_units], initializer=v1.initializers.zeros())\n    self.built = True\n\n  def get_weights(self):\n    kernel = self.kernel[:-self.num_units]\n    recurrent_kernel = self.kernel[-self.num_units:]\n    return {\n        'kernel': self.kernel_config.transform(kernel),\n        'recurrent_kernel': self.recurrent_config.transform(recurrent_kernel),\n        'bias': self.bias_config.transform(self.bias),\n        'gamma': self.gamma,\n        'gamma_h': self.gamma_h,\n        'beta_h': self.beta_h,\n    }\n\n  @property\n  def state_size(self):\n    return rnn_cell.LSTMStateTuple(self.num_units, self.num_units)\n\n  @property\n  def output_size(self):\n    return self.num_units\n\n  def __call__(self, x, sequence_length, training):\n    self.build(x.shape)\n\n    shape = tf.shape(x)\n    time_steps = shape[0]\n    batch_size = shape[1]\n\n    # Use an empty zoneout mask if no zoneout is going to be applied.\n    # 
Sadly, we can't pass `None` to the op but at least we won't be wasting\n    # memory or bandwidth on this tensor.\n    zoneout_mask = tf.zeros([0, 0, 0], dtype=self.dtype)\n    if self.zoneout:\n      zoneout_mask = 1.0 - self.zoneout\n      zoneout_mask += tf.random.uniform([time_steps, batch_size, self.num_units], dtype=self.dtype)\n      zoneout_mask = tf.floor(zoneout_mask)\n\n    weights = self.get_weights()\n    if training and self.dropout > 0:\n      recurrent_kernel = tf.nn.dropout(weights['recurrent_kernel'], rate=self.dropout)\n    else:\n      recurrent_kernel = weights['recurrent_kernel']\n    h, c, _ = LIB.haste_layer_norm_lstm(\n        x,\n        weights['kernel'],\n        recurrent_kernel,\n        weights['bias'],\n        weights['gamma'],\n        weights['gamma_h'],\n        weights['beta_h'],\n        zoneout_mask,\n        training=training,\n        zoneout_prob=self.zoneout)\n\n    if sequence_length is not None:\n      indices = sequence_length\n      indices = tf.stack([indices, tf.range(batch_size, dtype=sequence_length.dtype)], axis=-1)\n      state = rnn_cell.LSTMStateTuple(tf.gather_nd(c, indices), tf.gather_nd(h, indices))\n    else:\n      state = rnn_cell.LSTMStateTuple(c[-1], h[-1])\n\n    return h[1:], state\n\n\nclass LayerNormLSTM(BaseRNN):\n  \"\"\"\n  Layer Normalized Long Short-Term Memory layer.\n\n  This LSTM layer applies layer normalization to the input, recurrent, and\n  output activations of a standard LSTM. The implementation is fused and\n  GPU-accelerated. 
DropConnect and Zoneout regularization are built-in, and\n  this layer allows setting a non-zero initial forget gate bias.\n\n  Details about the exact function this layer implements can be found at\n  https://github.com/lmnt-com/haste/issues/1.\n  \"\"\"\n\n  def __init__(self, num_units, direction='unidirectional', **kwargs):\n    \"\"\"\n    Initialize the parameters of the LSTM layer.\n\n    Arguments:\n      num_units: int, the number of units in the LSTM cell.\n      direction: string, 'unidirectional' or 'bidirectional'.\n      **kwargs: Dict, keyword arguments (see below).\n\n    Keyword Arguments:\n      kernel_initializer: (optional) the initializer to use for the input\n        matrix weights. Defaults to `glorot_uniform`.\n      recurrent_initializer: (optional) the initializer to use for the\n        recurrent matrix weights. Defaults to `orthogonal`.\n      bias_initializer: (optional) the initializer to use for both input and\n        recurrent bias vectors. Defaults to `zeros` unless `forget_bias` is\n        non-zero (see below).\n      kernel_transform: (optional) a function with signature\n        `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n        used. Defaults to the identity function.\n      recurrent_transform: (optional) a function with signature\n        `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n        kernel before it is used. Defaults to the identity function.\n      bias_transform: (optional) a function with signature\n        `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n        Defaults to the identity function.\n      forget_bias: (optional) float, sets the initial weights for the forget\n        gates. Defaults to 1 and overrides the `bias_initializer` unless this\n        argument is set to 0.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        regularization on the recurrent matrix. 
Defaults to 0.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization. Defaults to 0.\n      dtype: (optional) the data type for this layer. Defaults to `tf.float32`.\n      name: (optional) string, the name for this layer.\n    \"\"\"\n    super().__init__(LayerNormLSTMLayer, num_units, direction, 'lstm_cell', **kwargs)\n"
  },
  {
    "path": "frameworks/tf/layer_norm_lstm_cell.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"An LSTM cell compatible with the Haste LayerNormLSTM layer.\"\"\"\n\n\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\n\n\n__all__ = [\n    'LayerNormLSTMCell'\n]\n\n\nclass LayerNormLSTMCell(rnn_cell.RNNCell):\n  \"\"\"\n  An LSTM cell that's compatible with the Haste LayerNormLSTM layer.\n\n  This cell can be used on hardware other than GPUs and with other TensorFlow\n  classes that operate on RNN cells (e.g. 
`dynamic_rnn`, `BasicDecoder`, cell\n  wrappers, etc.).\n  \"\"\"\n\n  def __init__(self,\n        num_units,\n        forget_bias=1.0,\n        dropout=0.0,\n        dtype=None,\n        name=None,\n        **kwargs):\n    super(LayerNormLSTMCell, self).__init__(dtype=dtype, name=name, **kwargs)\n    self.realname = name\n    self.num_units = num_units\n\n    self.forget_bias = forget_bias\n    self.dropout = dropout\n    self.kernel = None\n    self.recurrent_kernel = None\n    self.bias = None\n    self.gamma = None\n    self.gamma_h = None\n    self.beta_h = None\n    self.built = False\n\n  @property\n  def state_size(self):\n    return rnn_cell.LSTMStateTuple(self.num_units, self.num_units)\n\n  @property\n  def output_size(self):\n    return self.num_units\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    # No user-supplied initializers here since this class should only really\n    # be used for inference on a pre-trained model.\n    with tf.name_scope(self.name), v1.variable_scope(self.realname, 'lstm_cell'):\n      self._kernel = v1.get_variable('kernel', shape=[input_size + num_units, num_units * 4])\n      self.kernel, self.recurrent_kernel = tf.split(self._kernel, [input_size, num_units], axis=0)\n      self.bias = v1.get_variable('bias', shape=[num_units * 4], initializer=v1.initializers.zeros())\n      self.gamma = v1.get_variable('gamma', shape=[2, num_units * 4], initializer=v1.initializers.ones())\n      self.gamma_h = v1.get_variable('gamma_h', shape=[num_units], initializer=v1.initializers.ones())\n      self.beta_h = v1.get_variable('beta_h', shape=[num_units], initializer=v1.initializers.zeros())\n      self.null = tf.zeros_like(self.gamma[0])\n    self.built = True\n\n  def __call__(self, inputs, state, training=False, scope=None):\n    self.build(inputs.shape)\n\n    if training and self.dropout > 0:\n      R = tf.nn.dropout(self.recurrent_kernel, rate=self.dropout)\n    else:\n      R = self.recurrent_kernel\n\n    Wx = self._layer_norm(tf.matmul(inputs, self.kernel), self.gamma[0], self.null)\n    Rh = self._layer_norm(tf.matmul(state.h, R), self.gamma[1], self.null)\n    v = Wx + Rh + self.bias\n    v_i, v_g, v_f, v_o = tf.split(v, 4, axis=-1)\n    i = tf.nn.sigmoid(v_i)\n    g = tf.nn.tanh   (v_g)\n    f = tf.nn.sigmoid(v_f)\n    o = tf.nn.sigmoid(v_o)\n    c_new = f * state.c + i * g\n    c_tanh = tf.nn.tanh(self._layer_norm(c_new, self.gamma_h, self.beta_h))\n    h_new = o * c_tanh\n\n    return h_new, rnn_cell.LSTMStateTuple(c_new, h_new)\n\n  def _layer_norm(self, x, gamma, beta):\n    mean, variance = tf.nn.moments(x, axes=[-1], keepdims=True)\n    return tf.nn.batch_normalization(x, mean, variance, beta, gamma, 1e-5)\n"
  },
  {
    "path": "frameworks/tf/lstm.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cuda_runtime_api.h>\n\n#include \"haste.h\"\n#include \"support.h\"\n#include \"tensorflow/core/framework/op.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/framework/shape_inference.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing namespace tensorflow;\n\nusing haste::v0::lstm::ForwardPass;\nusing haste::v0::lstm::BackwardPass;\nusing tensorflow::se::Stream;\nusing tensorflow::shape_inference::DimensionHandle;\nusing tensorflow::shape_inference::InferenceContext;\nusing tensorflow::shape_inference::ShapeHandle;\n\n// Define the interface and shape function for the op.\nREGISTER_OP(\"HasteLstm\")\n    .Attr(\"R: {float, double}\")         // Some real number type.\n    .Attr(\"training: bool\")\n    .Attr(\"zoneout_prob: float\")\n    .Input(\"x: R\")                      // [T,N,C]\n    .Input(\"kernel: R\")                 // [C,H*4]\n    .Input(\"recurrent_kernel: R\")       // [H,H*4]\n    .Input(\"bias: R\")                   // [H*4]\n    .Input(\"zoneout_mask: R\")           // [T,N,H]\n    .Output(\"h: R\")                     // [T,N,H]\n    .Output(\"c: R\")                     // [T,N,H]\n    .Output(\"v: R\")                   
  // [T,N,H*4]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle input_shape;\n      ShapeHandle kernel_shape;\n      ShapeHandle recurrent_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &input_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 3, &zoneout_mask_shape));\n\n      const DimensionHandle time_steps = c->Dim(input_shape, 0);\n      const DimensionHandle batch_size = c->Dim(input_shape, 1);\n      const DimensionHandle hidden_size = c->Dim(recurrent_shape, 0);\n      DimensionHandle time_steps_plus_1;\n      DimensionHandle hidden_size_4;\n\n      TF_RETURN_IF_ERROR(c->Add(time_steps, 1, &time_steps_plus_1));\n      TF_RETURN_IF_ERROR(c->Multiply(hidden_size, 4, &hidden_size_4));\n\n      c->set_output(0, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(1, c->MakeShape({ time_steps_plus_1, batch_size, hidden_size }));\n      c->set_output(2, c->MakeShape({ time_steps, batch_size, hidden_size_4 }));\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLstmOp : public OpKernel {\n  explicit HasteLstmOp(OpKernelConstruction* context) : OpKernel(context) {\n    OP_REQUIRES_OK(context, context->GetAttr(\"training\", &training_));\n    OP_REQUIRES_OK(context, context->GetAttr(\"zoneout_prob\", &zoneout_prob_));\n  }\n\n  // When running on GPU, TF backs all inputs and outputs with device memory\n  // and not host memory. 
We don't need to do explicit memory copies or allocations\n  // for the inputs and outputs.\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = context->input(3);\n    const Tensor& zoneout_mask = context->input(4);\n\n    const auto time_steps = input.shape().dim_size(0);\n    const auto batch_size = input.shape().dim_size(1);\n    const auto input_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(0);\n    const bool has_zoneout = zoneout_prob_ && zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    OP_REQUIRES(context, input_size == kernel.shape().dim_size(0),\n        errors::InvalidArgument(\"input[2] and kernel[0] dimensions must match. Found \",\n            input_size, \" and \", kernel.shape().dim_size(0)));\n\n    const TensorShape output_shape = { time_steps + 1, batch_size, hidden_size };\n    const TensorShape activations_shape = { time_steps, batch_size, hidden_size * 4 };\n\n    Tensor* output = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));\n\n    Tensor* output_cell_state = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, output_shape, &output_cell_state));\n\n    Tensor output_v_temp;\n    Tensor* output_v = nullptr;\n    if (training_) {\n      OP_REQUIRES_OK(context, context->allocate_output(2, activations_shape, &output_v));\n    } else {\n      // Return an empty tensor in inference mode and provide temp memory\n      // to the forward pass instead.\n      OP_REQUIRES_OK(context, context->allocate_output(2, TensorShape({ 0 }), &output_v));\n      OP_REQUIRES_OK(context, context->allocate_temp(data_type, activations_shape, &output_v_temp));\n      output_v = &output_v_temp;\n    }\n\n    Tensor tmp_Rh;\n    const TensorShape 
tmp_Rh_shape = { batch_size, 4 * hidden_size };\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, tmp_Rh_shape, &tmp_Rh));\n    cudaMemset(output->flat<T>().data(), 0, output->AllocatedBytes());\n    cudaMemset(output_cell_state->flat<T>().data(), 0, output_cell_state->AllocatedBytes());\n\n    ForwardPass<T> forward = ForwardPass<T>(\n        training_,\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    forward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        output->flat<T>().data(),\n        output_cell_state->flat<T>().data(),\n        output_v->flat<T>().data(),\n        tmp_Rh.flat<T>().data(),\n        has_zoneout ? zoneout_prob_ : 0.0f,\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n\n  private:\n    bool training_;\n    float zoneout_prob_;\n};\n\nREGISTER_GPU_KERNEL(HasteLstm, float);\nREGISTER_GPU_KERNEL(HasteLstm, double);\n\nREGISTER_OP(\"HasteLstmGrad\")\n    .Attr(\"R: {float, double}\")\n    .Input(\"x_t: R\")                   // [C,T,N]\n    .Input(\"kernel_t: R\")              // [H*4,C]\n    .Input(\"recurrent_kernel_t: R\")    // [H*4,H]\n    .Input(\"bias: R\")                  // [H*4]\n    .Input(\"h: R\")                     // [T,N,H]\n    .Input(\"c: R\")                     // [T,N,H]\n    .Input(\"v: R\")                     // [T,N,H*4]\n    .Input(\"dh_new: R\")                // [T,N,H]\n    .Input(\"dc_new: R\")                // [T,N,H]\n    .Input(\"zoneout_mask: R\")          // [T,N,H]\n    .Output(\"dx: R\")                   // [T,N,C]\n    .Output(\"dw: R\")                   // [C,H*4]\n    .Output(\"dr: R\")                   // [H,H*4]\n    .Output(\"db: R\")                   // [H*4]\n    .SetShapeFn([](InferenceContext* c) {\n      ShapeHandle x_shape;\n      ShapeHandle kernel_shape;\n      
ShapeHandle recurrent_kernel_shape;\n      ShapeHandle bias_shape;\n      ShapeHandle h_shape;\n      ShapeHandle c_shape;\n      ShapeHandle v_shape;\n      ShapeHandle dh_new_shape;\n      ShapeHandle dc_new_shape;\n      ShapeHandle zoneout_mask_shape;\n\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &x_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &recurrent_kernel_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 1, &bias_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 3, &h_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 3, &c_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 3, &v_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 3, &dh_new_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(8), 3, &dc_new_shape));\n      TF_RETURN_IF_ERROR(c->WithRank(c->input(9), 3, &zoneout_mask_shape));\n\n      DimensionHandle input_size = c->Dim(x_shape, 0);\n      DimensionHandle time_steps = c->Dim(x_shape, 1);\n      DimensionHandle batch_size = c->Dim(x_shape, 2);\n      DimensionHandle hidden_size = c->Dim(recurrent_kernel_shape, 1);\n      DimensionHandle hidden_size_4;\n\n      TF_RETURN_IF_ERROR(c->Multiply(hidden_size, 4, &hidden_size_4));\n\n      c->set_output(0, c->MakeShape({ time_steps, batch_size, input_size }));\n      c->set_output(1, c->MakeShape({ input_size, hidden_size_4 }));\n      c->set_output(2, c->MakeShape({ hidden_size, hidden_size_4 }));\n      c->set_output(3, bias_shape);\n      return Status::OK();\n    });\n\ntemplate<typename T>\nstruct HasteLstmGradOp : public OpKernel {\n  explicit HasteLstmGradOp(OpKernelConstruction* context) : OpKernel(context) {}\n\n  void Compute(OpKernelContext* context) override {\n    const Tensor& input = context->input(0);\n    const Tensor& kernel = context->input(1);\n    const Tensor& recurrent_kernel = context->input(2);\n    const Tensor& bias = 
context->input(3);\n    const Tensor& h_vector = context->input(4);\n    const Tensor& c_vector = context->input(5);\n    const Tensor& dv = context->input(6);\n    const Tensor& dh_new = context->input(7);\n    const Tensor& dc_new = context->input(8);\n    const Tensor& zoneout_mask = context->input(9);\n\n    const auto input_size = input.shape().dim_size(0);\n    const auto time_steps = input.shape().dim_size(1);\n    const auto batch_size = input.shape().dim_size(2);\n    const auto hidden_size = recurrent_kernel.shape().dim_size(1);\n    const bool has_zoneout = !!zoneout_mask.NumElements();\n    const auto data_type = DataTypeToEnum<T>::value;\n\n    // Can be uninitialized. Output only, no accumulation.\n    const TensorShape dx_shape = { time_steps, batch_size, input_size };\n    Tensor* dx = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(0, dx_shape, &dx));\n\n    // Needs to be initialized to 0.\n    const TensorShape dW_shape = { input_size, hidden_size * 4 };\n    Tensor* dW = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(1, dW_shape, &dW));\n\n    // Needs to be initialized to 0.\n    const TensorShape dR_shape = { hidden_size, hidden_size * 4 };\n    Tensor* dR = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(2, dR_shape, &dR));\n\n    // Needs to be initialized to 0.\n    const TensorShape db_shape = { hidden_size * 4 };\n    Tensor* db = nullptr;\n    OP_REQUIRES_OK(context, context->allocate_output(3, db_shape, &db));\n\n    // Needs to be initialized to 0.\n    const TensorShape dh_shape = { batch_size, hidden_size };\n    Tensor dh;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dh_shape, &dh));\n\n    // Needs to be initialized to 0.\n    const TensorShape dc_shape = { batch_size, hidden_size };\n    Tensor dc;\n    OP_REQUIRES_OK(context, context->allocate_temp(data_type, dc_shape, &dc));\n\n    cudaMemset(dW->flat<T>().data(), 0, dW->AllocatedBytes());\n    
cudaMemset(dR->flat<T>().data(), 0, dR->AllocatedBytes());\n    cudaMemset(db->flat<T>().data(), 0, db->AllocatedBytes());\n    cudaMemset(dh.flat<T>().data(), 0, dh.AllocatedBytes());\n    cudaMemset(dc.flat<T>().data(), 0, dc.AllocatedBytes());\n\n    BackwardPass<T> backward = BackwardPass<T>(\n        batch_size,\n        input_size,\n        hidden_size,\n        GetCublasHandle(context));\n\n    backward.Run(\n        time_steps,\n        kernel.flat<T>().data(),\n        recurrent_kernel.flat<T>().data(),\n        bias.flat<T>().data(),\n        input.flat<T>().data(),\n        h_vector.flat<T>().data(),\n        c_vector.flat<T>().data(),\n        dh_new.flat<T>().data(),\n        dc_new.flat<T>().data(),\n        dx->flat<T>().data(),\n        dW->flat<T>().data(),\n        dR->flat<T>().data(),\n        db->flat<T>().data(),\n        dh.flat<T>().data(),\n        dc.flat<T>().data(),\n        const_cast<T*>(dv.flat<T>().data()),\n        has_zoneout ? zoneout_mask.flat<T>().data() : nullptr);\n  }\n};\n\nREGISTER_GPU_KERNEL(HasteLstmGrad, float);\nREGISTER_GPU_KERNEL(HasteLstmGrad, double);\n"
  },
  {
    "path": "frameworks/tf/lstm.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Long Short-Term Memory\"\"\"\n\n\nimport pkg_resources\nimport tensorflow as tf\n\nfrom tensorflow.compat import v1\nfrom tensorflow.compat.v1.nn import rnn_cell\n\nfrom .base_rnn import BaseRNN\nfrom .weight_config import WeightConfig\n\n\n__all__ = [\n    'LSTM'\n]\n\n\nLIB = tf.load_op_library(pkg_resources.resource_filename(__name__, 'libhaste_tf.so'))\n\n\n@tf.RegisterGradient(\"HasteLstm\")\ndef lstm_gradient(op, *grads):\n  training = op.get_attr('training')\n  if not training:\n    raise ValueError(('LSTM can only compute gradients if `training=True` was specified during the '\n                      'forward pass.\\nFailed op: {}').format(op.name))\n\n  # Extract inputs and outputs from the op.\n  x = op.inputs[0]\n  W = op.inputs[1]\n  R = op.inputs[2]\n  b = op.inputs[3]\n  zoneout_mask = op.inputs[4]\n  h = op.outputs[0]\n  c = op.outputs[1]\n  v = op.outputs[2]\n\n  # Pre-transpose matrices for better performance.\n  x = tf.transpose(x, [2, 0, 1])\n  W = tf.transpose(W, [1, 0])\n  R = tf.transpose(R, [1, 0])\n\n  dx, dW, dR, db = LIB.haste_lstm_grad(x, W, R, b, h, c, v, grads[0], grads[1], zoneout_mask)\n  return [dx, dW, dR, db, None]\n\n\nclass LSTMLayer(tf.Module):\n  def __init__(self,\n        num_units,\n        kernel_initializer=None,\n      
  recurrent_initializer=None,\n        bias_initializer=None,\n        kernel_transform=None,\n        recurrent_transform=None,\n        bias_transform=None,\n        forget_bias=1.0,\n        dropout=0.0,\n        zoneout=0.0,\n        dtype=None,\n        name=None,\n        cudnn_compat=False):\n    super(LSTMLayer, self).__init__(name)\n    self.realname = name\n    self.input_size = None\n    self.num_units = num_units\n\n    identity = lambda x: x\n    self.kernel_config = WeightConfig(v1.initializers.glorot_uniform(), None, identity)\n    self.recurrent_config = WeightConfig(v1.initializers.orthogonal(), None, identity)\n    self.bias_config = WeightConfig(v1.initializers.zeros(), None, identity)\n\n    self.kernel_config.override(kernel_initializer, None, kernel_transform)\n    self.recurrent_config.override(recurrent_initializer, None, recurrent_transform)\n    self.bias_config.override(bias_initializer, None, bias_transform)\n\n    self.forget_bias = forget_bias\n    self.dropout = dropout\n    self.zoneout = zoneout\n    self.dtype = dtype or tf.float32\n    self.cudnn_compat = cudnn_compat\n    self.opaque = None\n    self.kernel = None\n    self.bias = None\n    self.built = False\n\n  def build(self, shape):\n    if self.built:\n      return\n\n    num_units = self.num_units\n    input_size = int(shape[-1])\n\n    kernel_shape = tf.TensorShape([input_size, num_units])\n    recurrent_shape = tf.TensorShape([num_units, num_units])\n    bias_shape = tf.TensorShape([num_units])\n\n    kernel_weights = [self.kernel_config.initializer(kernel_shape, dtype=self.dtype) for _ in range(4)]\n    recurrent_weights = [self.recurrent_config.initializer(recurrent_shape, dtype=self.dtype) for _ in range(4)]\n    if self.forget_bias:\n      biases = [tf.zeros(bias_shape, dtype=self.dtype) for _ in range(4)]\n      biases[2] = tf.constant(self.forget_bias, shape=bias_shape, dtype=self.dtype)\n    else:\n      biases = [self.bias_config.initializer(bias_shape, 
dtype=self.dtype) for _ in range(4)]\n\n    kernel_weights = tf.concat(kernel_weights, axis=-1)\n    recurrent_weights = tf.concat(recurrent_weights, axis=-1)\n    biases = tf.concat(biases, axis=-1)\n\n    if not self.cudnn_compat:\n      # Use the same format as LSTMBlockCell.\n      with self.name_scope, v1.variable_scope(self.realname, 'lstm_cell'):\n        weights = tf.concat([kernel_weights, recurrent_weights], axis=0)\n        self.kernel = v1.get_variable('kernel', initializer=weights)\n        self.bias = v1.get_variable('bias', initializer=biases)\n    else:\n      # Use the same format as CudnnLSTM.\n      with self.name_scope, v1.variable_scope(self.realname, 'lstm_cell'):\n        with v1.variable_scope('cudnn_lstm'):\n          # Sigh, cuDNN uses two bias vectors instead of just one.\n          extra_biases = [self.bias_config.initializer(tf.TensorShape([num_units]), dtype=self.dtype) for _ in range(4)]\n          extra_biases = tf.concat(extra_biases, axis=-1)\n          kernel_weights = tf.reshape(kernel_weights, [-1])\n          recurrent_weights = tf.reshape(recurrent_weights, [-1])\n          opaque_initial_value = tf.concat([kernel_weights, recurrent_weights, biases, extra_biases], axis=-1)\n          self.opaque = v1.get_variable('opaque_kernel', initializer=opaque_initial_value)\n\n    self.input_size = input_size\n    self.built = True\n\n  def get_weights(self):\n    if self.cudnn_compat:\n      # Split into 3 variables.\n      W_size = 4 * self.input_size * self.num_units\n      R_size = 4 * self.num_units * self.num_units\n      b_size = 8 * self.num_units\n      kernel, recurrent_kernel, bias = tf.split(self.opaque, [W_size, R_size, b_size])\n\n      # Convert from cuDNN [i, f, g, o] format to TF and LMNT [i, g, f, o] format.\n      # Note that we only use a single bias vector so we sum the two separate ones\n      # and then reorder formats.\n      Wi, Wf, Wg, Wo = tf.split(kernel, 4)\n      Ri, Rf, Rg, Ro = tf.split(recurrent_kernel, 4)\n      bi, 
bf, bg, bo = tf.split(tf.reduce_sum(tf.split(bias, 2), axis=0), 4)\n      kernel = tf.concat([Wi, Wg, Wf, Wo], axis=0)\n      recurrent_kernel = tf.concat([Ri, Rg, Rf, Ro], axis=0)\n      bias = tf.concat([bi, bg, bf, bo], axis=0)\n\n      # Shape them correctly.\n      kernel = tf.reshape(kernel, [4 * self.num_units, self.input_size])\n      recurrent_kernel = tf.reshape(recurrent_kernel, [4 * self.num_units, self.num_units])\n      bias = tf.reshape(bias, [4 * self.num_units])\n\n      # Pre-transpose the kernels.\n      kernel = tf.transpose(kernel, [1, 0])\n      recurrent_kernel = tf.transpose(recurrent_kernel, [1, 0])\n    else:\n      kernel = self.kernel[:-self.num_units]\n      recurrent_kernel = self.kernel[-self.num_units:]\n      bias = self.bias\n    return {\n        'kernel': self.kernel_config.transform(kernel),\n        'recurrent_kernel': self.recurrent_config.transform(recurrent_kernel),\n        'bias': self.bias_config.transform(bias)\n    }\n\n  @property\n  def state_size(self):\n    return rnn_cell.LSTMStateTuple(self.num_units, self.num_units)\n\n  @property\n  def output_size(self):\n    return self.num_units\n\n  def __call__(self, x, sequence_length, training):\n    self.build(x.shape)\n\n    shape = tf.shape(x)\n    time_steps = shape[0]\n    batch_size = shape[1]\n\n    # Use an empty zoneout mask if no zoneout is going to be applied.\n    # Sadly, we can't pass `None` to the op but at least we won't be wasting\n    # memory or bandwidth on this tensor.\n    zoneout_mask = tf.zeros([0, 0, 0], dtype=self.dtype)\n    if self.zoneout:\n      zoneout_mask = 1.0 - self.zoneout\n      zoneout_mask += tf.random.uniform([time_steps, batch_size, self.num_units], dtype=self.dtype)\n      zoneout_mask = tf.floor(zoneout_mask)\n\n    weights = self.get_weights()\n    if training and self.dropout > 0:\n      recurrent_kernel = tf.nn.dropout(weights['recurrent_kernel'], rate=self.dropout)\n    else:\n      recurrent_kernel = 
weights['recurrent_kernel']\n    h, c, _ = LIB.haste_lstm(\n        x,\n        weights['kernel'],\n        recurrent_kernel,\n        weights['bias'],\n        zoneout_mask,\n        training=training,\n        zoneout_prob=self.zoneout)\n\n    if sequence_length is not None:\n      indices = sequence_length\n      indices = tf.stack([indices, tf.range(batch_size, dtype=sequence_length.dtype)], axis=-1)\n      state = rnn_cell.LSTMStateTuple(tf.gather_nd(c, indices), tf.gather_nd(h, indices))\n    else:\n      state = rnn_cell.LSTMStateTuple(c[-1], h[-1])\n\n    return h[1:], state\n\n\nclass LSTM(BaseRNN):\n  \"\"\"\n  Long Short-Term Memory layer.\n\n  This LSTM layer offers a fused, GPU-accelerated TensorFlow op for inference\n  and training. Its weights and variables are compatible with `BasicLSTMCell`,\n  `LSTMCell`, and `LSTMBlockCell` by default, and the layer can load weights\n  from `tf.contrib.cudnn_rnn.CudnnLSTM` when `cudnn_compat=True` is specified.\n\n  Although this implementation is comparable in performance to cuDNN's LSTM,\n  it offers additional options not typically found in other high-performance\n  implementations. DropConnect and Zoneout regularization are built-in, and\n  this layer allows setting a non-zero initial forget gate bias.\n  \"\"\"\n\n  def __init__(self, num_units, direction='unidirectional', **kwargs):\n    \"\"\"\n    Initialize the parameters of the LSTM layer.\n\n    Arguments:\n      num_units: int, the number of units in the LSTM cell.\n      direction: string, 'unidirectional' or 'bidirectional'.\n      **kwargs: Dict, keyword arguments (see below).\n\n    Keyword Arguments:\n      kernel_initializer: (optional) the initializer to use for the input\n        matrix weights. Defaults to `glorot_uniform`.\n      recurrent_initializer: (optional) the initializer to use for the\n        recurrent matrix weights. 
Defaults to `orthogonal`.\n      bias_initializer: (optional) the initializer to use for both input and\n        recurrent bias vectors. Defaults to `zeros` unless `forget_bias` is\n        non-zero (see below).\n      kernel_transform: (optional) a function with signature\n        `(kernel: Tensor) -> Tensor` that transforms the kernel before it is\n        used. Defaults to the identity function.\n      recurrent_transform: (optional) a function with signature\n        `(recurrent_kernel: Tensor) -> Tensor` that transforms the recurrent\n        kernel before it is used. Defaults to the identity function.\n      bias_transform: (optional) a function with signature\n        `(bias: Tensor) -> Tensor` that transforms the bias before it is used.\n        Defaults to the identity function.\n      forget_bias: (optional) float, sets the initial weights for the forget\n        gates. Defaults to 1 and overrides the `bias_initializer` unless this\n        argument is set to 0.\n      dropout: (optional) float, sets the dropout rate for DropConnect\n        regularization on the recurrent matrix. Defaults to 0.\n      zoneout: (optional) float, sets the zoneout rate for Zoneout\n        regularization. Defaults to 0.\n      dtype: (optional) the data type for this layer. Defaults to `tf.float32`.\n      name: (optional) string, the name for this layer.\n      cudnn_compat: (optional) bool, if `True`, the variables created by this\n        layer are compatible with `tf.contrib.cudnn_rnn.CudnnLSTM`. Note that\n        this should only be set if you're restoring variables from a cuDNN\n        model. It's currently not possible to train a model with\n        `cudnn_compat=True` and restore it with CudnnLSTM. Defaults to `False`.\n    \"\"\"\n    super().__init__(LSTMLayer, num_units, direction, 'lstm_cell', **kwargs)\n"
  },
  {
    "path": "frameworks/tf/support.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <iterator>\n#include <mutex>\n#include <thread>\n#include <unordered_map>\n#include <utility>\n\n#include \"support.h\"\n#include \"tensorflow/core/framework/op_kernel.h\"\n#include \"tensorflow/core/util/stream_executor_util.h\"\n#include \"tensorflow/stream_executor/stream.h\"\n\nusing Key = std::pair<int, std::thread::id>;\n\nnamespace std {\n\ntemplate<>\nstruct hash<Key> {\n  size_t operator()(const Key& key) const noexcept {\n    const auto h1 = std::hash<Key::first_type>{}(key.first);\n    const auto h2 = std::hash<Key::second_type>{}(key.second);\n    return h1 ^ (h2 << 1);\n  }\n};\n\n}\n\n// Lazily creates and caches one cuBLAS handle per (device, thread) pair,\n// bound to the op's CUDA stream. Sharing a single handle across devices\n// breaks multi-GPU training, and cuBLAS handles are not safe for concurrent\n// use from multiple threads.\ncublasHandle_t GetCublasHandle(tensorflow::OpKernelContext* context) {\n  static std::unordered_map<Key, cublasHandle_t> handle_map;\n  static std::mutex mutex;\n\n  int device;\n  std::lock_guard<std::mutex> lock(mutex);\n  std::thread::id tid = std::this_thread::get_id();\n  cudaGetDevice(&device);\n  cudaStream_t stream = GetCudaStream(context);\n  auto item = std::make_pair(device, tid);\n  auto i = handle_map.find(item);\n  if (i == std::end(handle_map)) {\n    cublasHandle_t handle;\n    cublasCreate(&handle);\n    cublasSetStream(handle, stream);\n    i = handle_map.insert(std::make_pair(item, handle)).first;\n  }\n\n  return 
i->second;\n}\n\nconst cudaStream_t& GetCudaStream(tensorflow::OpKernelContext* context) {\n  const auto ptr =\n      context->op_device_context()->stream()->implementation()->GpuStreamMemberHack();\n  return *reinterpret_cast<const cudaStream_t*>(ptr);\n}\n"
  },
  {
    "path": "frameworks/tf/support.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\nnamespace tensorflow {\nclass OpKernelContext;\n}\n\n#define REGISTER_GPU_KERNEL(NAME, T)                 \\\n  REGISTER_KERNEL_BUILDER(Name(#NAME)                \\\n                            .Device(DEVICE_GPU)      \\\n                            .TypeConstraint<T>(\"R\"), \\\n                          NAME##Op<T>)\n\ncublasHandle_t GetCublasHandle(tensorflow::OpKernelContext* context);\nconst cudaStream_t& GetCudaStream(tensorflow::OpKernelContext* context);\n"
  },
  {
    "path": "frameworks/tf/weight_config.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\nclass WeightConfig:\n  __slots__ = ['initializer', 'constraint', 'transform']\n\n  def __init__(self, initializer=None, constraint=None, transform=None):\n    self.initializer = initializer\n    self.constraint = constraint\n    self.transform = transform\n\n  def override(self, initializer, constraint, transform):\n    if initializer is not None:\n      self.initializer = initializer\n    if constraint is not None:\n      self.constraint = constraint\n    if transform is not None:\n      self.transform = transform\n    return self\n"
  },
  {
    "path": "frameworks/tf/zoneout_wrapper.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"An RNN cell wrapper that applies Zoneout.\"\"\"\n\n\nimport tensorflow as tf\n\nfrom tensorflow.compat.v1.nn import rnn_cell\n\n\n__all__ = [\n    'ZoneoutWrapper'\n]\n\n\nclass ZoneoutWrapper(rnn_cell.RNNCell):\n  \"\"\"\n  An LSTM/GRU cell wrapper that applies zoneout to the inner cell's hidden state.\n\n  The zoneout paper applies zoneout to both the cell state and hidden state,\n  each with its own zoneout rate. 
This class (and the `LSTM` implementation in Haste)\n  applies zoneout to the hidden state and not the cell state.\n  \"\"\"\n\n  def __init__(self, cell, rate, training):\n    \"\"\"\n    Initialize the parameters of the zoneout wrapper.\n\n    Arguments:\n      cell: RNNCell, an instance of {`BasicLSTMCell`, `LSTMCell`,\n        `LSTMBlockCell`, `haste_tf.GRUCell`} on which to apply zoneout.\n      rate: float, 0 <= rate <= 1, the fraction of hidden units to zone out per\n        time step.\n      training: bool, `True` if used during training, `False` if used during\n        inference.\n    \"\"\"\n    super(ZoneoutWrapper, self).__init__()\n    self.cell = cell\n    self.rate = rate\n    self.training = training\n\n  @property\n  def state_size(self):\n    return self.cell.state_size\n\n  @property\n  def output_size(self):\n    return self.cell.output_size\n\n  def __call__(self, inputs, state, scope=None):\n    \"\"\"\n    Runs one step of the RNN cell with zoneout applied.\n\n    Arguments:\n      see documentation for the inner cell.\n    \"\"\"\n\n    output, new_state = self.cell(inputs, state, scope)\n\n    # Zoneout disabled\n    if not self.rate:\n      return output, new_state\n\n    if isinstance(new_state, rnn_cell.LSTMStateTuple):\n      zoned_out_h = self._apply_zoneout(new_state.h, state.h)\n      return output, rnn_cell.LSTMStateTuple(new_state.c, zoned_out_h)\n    elif isinstance(new_state, list) and len(new_state) == 1:\n      return output, self._apply_zoneout(new_state[0], state[0])\n    elif isinstance(new_state, tf.Tensor):\n      return output, self._apply_zoneout(new_state, state)\n    else:\n      raise ValueError(('ZoneoutWrapper wraps cells that return LSTMStateTuple or '\n          'unnested state Tensors. 
Please use one of the following cell types:\\n'\n          '  tf.nn.rnn_cell.BasicLSTMCell\\n'\n          '  tf.nn.rnn_cell.LSTMCell\\n'\n          '  tf.contrib.rnn.LSTMBlockCell\\n'\n          '  haste_tf.GRUCell'))\n\n  def _apply_zoneout(self, new_tensor, old_tensor):\n    if self.training:\n      mask = self._build_mask(tf.shape(new_tensor))\n      zoned_out = (new_tensor - old_tensor) * mask + old_tensor\n    else:\n      zoned_out = self.rate * old_tensor + (1.0 - self.rate) * new_tensor\n    return zoned_out\n\n  def _build_mask(self, shape):\n    mask = 1 - self.rate\n    mask += tf.random.uniform(shape)\n    return tf.floor(mask)\n"
  },
  {
    "path": "lib/blas.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cublas_v2.h>\n\ntemplate<typename T>\nstruct blas {\n  struct set_pointer_mode {\n    set_pointer_mode(cublasHandle_t handle) : handle_(handle) {\n      cublasGetPointerMode(handle_, &old_mode_);\n      cublasSetPointerMode(handle_, CUBLAS_POINTER_MODE_HOST);\n    }\n    ~set_pointer_mode() {\n      cublasSetPointerMode(handle_, old_mode_);\n    }\n    private:\n      cublasHandle_t handle_;\n      cublasPointerMode_t old_mode_;\n  };\n  struct enable_tensor_cores {\n    enable_tensor_cores(cublasHandle_t handle) : handle_(handle) {\n      cublasGetMathMode(handle_, &old_mode_);\n      cublasSetMathMode(handle_, CUBLAS_TENSOR_OP_MATH);\n    }\n    ~enable_tensor_cores() {\n      cublasSetMathMode(handle_, old_mode_);\n    }\n    private:\n      cublasHandle_t handle_;\n      cublasMath_t old_mode_;\n  };\n};\n\ntemplate<>\nstruct blas<__half> {\n  static constexpr decltype(cublasHgemm)* gemm = &cublasHgemm;\n};\n\ntemplate<>\nstruct blas<float> {\n  static constexpr decltype(cublasSgemm)* gemm = &cublasSgemm;\n};\n\ntemplate<>\nstruct blas<double> {\n  static constexpr decltype(cublasDgemm)* gemm = &cublasDgemm;\n};\n"
  },
  {
    "path": "lib/device_assert.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\nextern \"C\"\n__host__ __device__\nvoid __assertfail(\n    const char * __assertion,\n    const char *__file,\n    unsigned int __line,\n    const char *__function,\n    size_t charsize);\n\n#define device_assert_fail(msg) \\\n      __assertfail((msg), __FILE__, __LINE__, __PRETTY_FUNCTION__, sizeof(char))\n"
  },
  {
    "path": "lib/gru_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"device_assert.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* h,\n                         const T* v,\n                         const T* dh_new,\n                         T* dbx_out,\n                         T* dbr_out,\n                         T* dh_inout,\n                         T* dp_out,\n                         T* dq_out,\n                         const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_dim || col >= batch_dim)\n    return;\n\n  const int base_idx = col * hidden_dim + row;\n\n  T dh_total = dh_new[base_idx] + dh_inout[base_idx];\n\n  const int stride4_base_idx = col * (hidden_dim * 4) + row;\n  const int z_idx = stride4_base_idx + 0 * hidden_dim;\n  const int r_idx = stride4_base_idx + 1 * hidden_dim;\n  const int g_idx = stride4_base_idx + 2 * hidden_dim;\n  const int q_g_idx = 
stride4_base_idx + 3 * hidden_dim;\n\n  const T z = v[z_idx];\n  const T r = v[r_idx];\n  const T g = v[g_idx];\n  const T q_g = v[q_g_idx];\n\n  if (ApplyZoneout) {\n    const T mask = zoneout_mask[base_idx];\n    dh_inout[base_idx] = (static_cast<T>(1.0) - mask) * dh_total;\n    dh_total = mask * dh_total;\n    dh_inout[base_idx] += z * dh_total;\n  } else {\n    dh_inout[base_idx] = z * dh_total;\n  }\n\n  const T dg = (static_cast<T>(1.0) - z) * dh_total;\n  const T dz = (h[base_idx] - g) * dh_total;\n  const T dp_g = d_tanh(g) * dg;\n  const T dq_g = dp_g * r;\n  const T dr = dp_g * q_g;\n  const T dp_r = d_sigmoid(r) * dr;\n  const T dq_r = dp_r;\n  const T dp_z = d_sigmoid(z) * dz;\n  const T dq_z = dp_z;\n\n  const int idx = col * (hidden_dim * 3) + row;\n\n  dp_out[idx + 0 * hidden_dim] = dp_z;\n  dp_out[idx + 1 * hidden_dim] = dp_r;\n  dp_out[idx + 2 * hidden_dim] = dp_g;\n\n  dq_out[idx + 0 * hidden_dim] = dq_z;\n  dq_out[idx + 1 * hidden_dim] = dq_r;\n  dq_out[idx + 2 * hidden_dim] = dq_g;\n\n  atomicAdd(&dbx_out[row + 0 * hidden_dim], dp_z);\n  atomicAdd(&dbx_out[row + 1 * hidden_dim], dp_r);\n  atomicAdd(&dbx_out[row + 2 * hidden_dim], dp_g);\n\n  atomicAdd(&dbr_out[row + 0 * hidden_dim], dq_z);\n  atomicAdd(&dbr_out[row + 1 * hidden_dim], dq_r);\n  atomicAdd(&dbr_out[row + 2 * hidden_dim], dq_g);\n}\n\n#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700)\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const half* h,\n                         const half* v,\n                         const half* dh_new,\n                         half* dbx_out,\n                         half* dbr_out,\n                         half* dh_inout,\n                         half* dp_out,\n                         half* dq_out,\n                         const half* zoneout_mask) {\n  device_assert_fail(\"FP16 is not supported on compute capability < 
7.0.\");\n}\n#endif\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace gru {\n\ntemplate<typename T>\nstruct BackwardPass<T>::private_data {\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[2];\n  cudaEvent_t event;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nBackwardPass<T>::~BackwardPass() {\n  if (data_->sync_stream) {\n    cudaEventRecord(data_->event, data_->stream[1]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[0]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n  } else {\n    cudaStreamSynchronize(data_->stream[1]);\n    cudaStreamSynchronize(data_->stream[0]);\n  }\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Iterate(\n    const T* W_t,     // [H*3,C]\n    const T* R_t,     // [H*3,H]\n    const T* bx,      // [H*3]\n    const T* br,      // [H*3]\n    const T* x_t,     // [C,N]\n    const T* h,       // [N,H]\n    const T* v,       // [N,H*4]\n    const T* dh_new,  // [N,H]\n    T* dx,            // [N,C]\n    T* dW,            // [C,H*3]\n    T* dR,            // [H,H*3]\n    T* dbx,           // [H*3]\n    T* dbr,           // [H*3]\n    T* dh,            // [N,H]\n    T* dp,    
        // [N,H*3]\n    T* dq,            // [N,H*3]\n    const T* zoneout_mask) {  // [N,H]\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);\n  const T beta_assign = static_cast<T>(0.0);\n\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const int input_size = data_->input_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  IterateInternal(\n      R_t,\n      h,\n      v,\n      dh_new,\n      dbx,\n      dbr,\n      dh,\n      dp,\n      dq,\n      zoneout_mask);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, input_size, batch_size,\n      &alpha,\n      dp, hidden_size * 3,\n      x_t, batch_size,\n      &beta_sum,\n      dW, hidden_size * 3);\n\n  // Wait for pointwise operations to complete since there's a\n  // data dependency between its output (`dp`, `dq`) and the following matmuls.\n  cudaStreamWaitEvent(stream2, event, 0);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, batch_size, hidden_size * 3,\n      &alpha,\n      W_t, input_size,\n      dp, hidden_size * 3,\n      &beta_assign,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_T,\n      hidden_size * 3, hidden_size, batch_size,\n      &alpha,\n      dq, hidden_size * 3,\n      h, hidden_size,\n      &beta_sum,\n      dR, hidden_size * 3);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::IterateInternal(\n    const T* R_t,     // 
[H*3,H]\n    const T* h,       // [N,H]\n    const T* v,       // [N,H*4]\n    const T* dh_new,  // [N,H]\n    T* dbx,           // [H*3]\n    T* dbr,           // [H*3]\n    T* dh,            // [N,H]\n    T* dp,            // [N,H*3]\n    T* dq,            // [N,H*3]\n    const T* zoneout_mask) {  // [N,H]\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);\n\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(32, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  if (zoneout_mask) {\n    PointwiseOperations<T, true><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        h,\n        v,\n        dh_new,\n        dbx,\n        dbr,\n        dh,\n        dp,\n        dq,\n        zoneout_mask\n    );\n  } else {\n    PointwiseOperations<T, false><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        h,\n        v,\n        dh_new,\n        dbx,\n        dbr,\n        dh,\n        dp,\n        dq,\n        nullptr\n    );\n  }\n  cudaEventRecord(event, stream1);\n\n  cublasSetStream(blas_handle,  stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, batch_size, hidden_size * 3,\n      &alpha,\n      R_t, hidden_size,\n      dq, hidden_size * 3,\n      &beta_sum,\n      dh, hidden_size);\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(\n    const int steps,\n    const T* W_t,\n    const T* R_t,\n    const T* bx,\n    const T* br,\n    const T* x_t,\n    const T* h,\n    const T* v,\n    const T* dh_new,\n    T* dx,\n    T* dW,\n    T* dR,\n    T* dbx,\n    T* 
dbr,\n    T* dh,\n    T* dp,\n    T* dq,\n    const T* zoneout_mask) {\n  const blas<void>::enable_tensor_cores scoped0(data_->blas_handle);\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);\n  const T beta_assign = static_cast<T>(0.0);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  const int NH = batch_size * hidden_size;\n  for (int i = steps - 1; i >= 0; --i) {\n    IterateInternal(\n        R_t,\n        h + i * NH,\n        v + i * NH * 4,\n        dh_new + (i + 1) * NH,\n        dbx,\n        dbr,\n        dh,\n        dp + i * NH * 3,\n        dq + i * NH * 3,\n        zoneout_mask ? 
zoneout_mask + i * NH : nullptr );\n  }\n\n  // Wait for pointwise operations to complete since there's a\n  // data dependency between its output (`dp`, `dq`) and the following matmuls.\n  cudaStreamWaitEvent(stream2, event, 0);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, batch_size * steps, hidden_size * 3,\n      &alpha,\n      W_t, input_size,\n      dp, hidden_size * 3,\n      &beta_assign,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_T,\n      hidden_size * 3, hidden_size, batch_size * steps,\n      &alpha,\n      dq, hidden_size * 3,\n      h, hidden_size,\n      &beta_sum,\n      dR, hidden_size * 3);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, input_size, batch_size * steps,\n      &alpha,\n      dp, hidden_size * 3,\n      x_t, batch_size * steps,\n      &beta_sum,\n      dW, hidden_size * 3);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct BackwardPass<half>;\ntemplate struct BackwardPass<float>;\ntemplate struct BackwardPass<double>;\n\n}  // namespace gru\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/gru_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n#include <cuda_fp16.h>\n\n#include \"blas.h\"\n#include \"device_assert.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* Wx,\n                         const T* Rh,\n                         const T* bx,\n                         const T* br,\n                         const T* h,\n                         T* h_out,\n                         T* v,\n                         const T zoneout_prob,\n                         const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_dim || col >= batch_dim)\n    return;\n\n  const int weight_idx = col * (hidden_dim * 3) + row;\n\n  // Index into the `h` and `h_out` vectors (they have a stride of `hidden_dim`).\n  const int output_idx = col * hidden_dim + row;\n\n  // Indices into the Wx and Rh matrices (for each of the z, r, and g components).\n  const int z_idx = weight_idx + 0 * 
hidden_dim;\n  const int r_idx = weight_idx + 1 * hidden_dim;\n  const int g_idx = weight_idx + 2 * hidden_dim;\n\n  // Indices into the bias vectors (for each of the z, r, and g components).\n  const int bz_idx = row + 0 * hidden_dim;\n  const int br_idx = row + 1 * hidden_dim;\n  const int bg_idx = row + 2 * hidden_dim;\n\n  const T z = sigmoid(Wx[z_idx] + Rh[z_idx] + bx[bz_idx] + br[bz_idx]);\n  const T r = sigmoid(Wx[r_idx] + Rh[r_idx] + bx[br_idx] + br[br_idx]);\n  const T g = tanh   (Wx[g_idx] + r * (Rh[g_idx] + br[bg_idx]) + bx[bg_idx]);\n\n  // Store internal activations if we're eventually going to backprop.\n  if (Training) {\n    const int base_v_idx = col * (hidden_dim * 4) + row;\n    v[base_v_idx + 0 * hidden_dim] = z;\n    v[base_v_idx + 1 * hidden_dim] = r;\n    v[base_v_idx + 2 * hidden_dim] = g;\n    v[base_v_idx + 3 * hidden_dim] = Rh[g_idx] + br[bg_idx];\n  }\n\n  T cur_h_value = z * h[output_idx] + (static_cast<T>(1.0) - z) * g;\n\n  if (ApplyZoneout) {\n    if (Training) {\n      cur_h_value = (cur_h_value - h[output_idx]) * zoneout_mask[output_idx] + h[output_idx];\n    } else {\n      cur_h_value = (zoneout_prob * h[output_idx]) + ((static_cast<T>(1.0) - zoneout_prob) * cur_h_value);\n    }\n  }\n\n  h_out[output_idx] = cur_h_value;\n}\n\n#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700)\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const half* Wx,\n                         const half* Rh,\n                         const half* bx,\n                         const half* br,\n                         const half* h,\n                         half* h_out,\n                         half* v,\n                         const half zoneout_prob,\n                         const half* zoneout_mask) {\n  device_assert_fail(\"FP16 is not supported on compute capability < 7.0.\");\n}\n#endif\n\n}  // anonymous 
namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace gru {\n\ntemplate<typename T>\nstruct ForwardPass<T>::private_data {\n  bool training;\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[2];\n  cudaEvent_t event;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const bool training,\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->training = training;\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nForwardPass<T>::~ForwardPass() {\n  if (data_->sync_stream) {\n    cudaEventRecord(data_->event, data_->stream[1]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[0]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n  } else {\n    cudaStreamSynchronize(data_->stream[1]);\n    cudaStreamSynchronize(data_->stream[0]);\n  }\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Iterate(\n    const T* W,  // [C,H*3]\n    const T* R,  // [H,H*3]\n    const T* bx, // [H*3]\n    const T* br, // [H*3]\n    const T* x,  // [N,C]\n    const T* h,  // [N,H]\n    T* h_out,    // [N,H]\n    T* v,        // [N,H*4]\n    T* tmp_Wx,   // [N,H*3]\n    T* tmp_Rh,   // [N,H*3]\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = 
static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, batch_size, input_size,\n      &alpha,\n      W, hidden_size * 3,\n      x, input_size,\n      &beta,\n      tmp_Wx, hidden_size * 3);\n  cudaEventRecord(event, stream2);\n\n  IterateInternal(\n      R,\n      bx,\n      br,\n      h,\n      h_out,\n      v,\n      tmp_Wx,\n      tmp_Rh,\n      zoneout_prob,\n      zoneout_mask);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::IterateInternal(\n    const T* R,  // [H,H*3]\n    const T* bx, // [H*3]\n    const T* br, // [H*3]\n    const T* h,  // [N,H]\n    T* h_out,    // [N,H]\n    T* v,        // [N,H*4]\n    T* tmp_Wx,   // [N,H*3]\n    T* tmp_Rh,   // [N,H*3]\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  // Constants for GEMM\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const bool training = data_->training;\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, batch_size, hidden_size,\n      &alpha,\n      R, hidden_size * 3,\n      h, hidden_size,\n      &beta,\n      tmp_Rh, 
hidden_size * 3);\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(32, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  cudaStreamWaitEvent(stream1, event, 0);\n\n  if (training) {\n    if (zoneout_prob && zoneout_mask) {\n      PointwiseOperations<T, true, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx,\n          tmp_Rh,\n          bx,\n          br,\n          h,\n          h_out,\n          v,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      PointwiseOperations<T, true, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx,\n          tmp_Rh,\n          bx,\n          br,\n          h,\n          h_out,\n          v,\n          0.0f,\n          nullptr);\n    }\n  } else {\n    if (zoneout_prob && zoneout_mask) {\n      PointwiseOperations<T, false, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx,\n          tmp_Rh,\n          bx,\n          br,\n          h,\n          h_out,\n          nullptr,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      PointwiseOperations<T, false, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx,\n          tmp_Rh,\n          bx,\n          br,\n          h,\n          h_out,\n          nullptr,\n          0.0f,\n          nullptr);\n    }\n  }\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Run(\n    const int steps,\n    const T* W,  // [C,H*3]\n    const T* R,  // [H,H*3]\n    const T* bx, // [H*3]\n    const T* br, // [H*3]\n    const T* x,  // [N,C]\n    T* h,        // [N,H]\n    T* v,        // [N,H*4]\n    T* tmp_Wx,   // [N,H*3]\n    T* tmp_Rh,   // [N,H*3]\n    const float zoneout_prob,\n    const T* zoneout_mask) { // 
Zoneout mask [N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::enable_tensor_cores scoped0(data_->blas_handle);\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, steps * batch_size, input_size,\n      &alpha,\n      W, hidden_size * 3,\n      x, input_size,\n      &beta,\n      tmp_Wx, hidden_size * 3);\n  cudaEventRecord(event, stream2);\n\n  const int NH = batch_size * hidden_size;\n  for (int i = 0; i < steps; ++i) {\n    IterateInternal(\n        R,\n        bx,\n        br,\n        h + i * NH,\n        h + (i + 1) * NH,\n        v + i * NH * 4,\n        tmp_Wx + i * NH * 3,\n        tmp_Rh,\n        zoneout_prob,\n        zoneout_mask ? zoneout_mask + i * NH : nullptr);\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct ForwardPass<half>;\ntemplate struct ForwardPass<float>;\ntemplate struct ForwardPass<double>;\n\n}  // namespace gru\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/gru.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace gru {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    // training: `true` if the caller intends to perform a backward pass to compute gradients.\n    // batch_size: the number of training/inference inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    ForwardPass(\n        const bool training,\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~ForwardPass();\n\n    // Performs one forward iteration of the GRU cell.\n    //\n    // W: [C,H*3] the input weight matrix.\n    // R: [H,H*3] the recurrent weight matrix.\n    // bx: [H*3] the bias for the input weight matrix.\n    // br: [H*3] the bias for the recurrent weight matrix.\n    // x: [N,C] the GRU input for this iteration (N vectors, each with dimension 
C).\n    // h: [N,H] the t-1 iteration's `h_out` or the initial hidden state if this is the\n    //     t=0 iteration (typically zeros).\n    // h_out: [N,H] the GRU's output, and the input to the next iteration's `h`. This\n    //     pointer may be the same as `h`. Each iteration may reuse the same memory region.\n    // v: [N,H*4] if `training` is `false`, this can be a null pointer. If `training` is\n    //     `true`, this vector will contain intermediate activations for this iteration which\n    //     must be provided as-is to the corresponding backward iteration. The caller must\n    //     provide a new memory region for each iteration.\n    // tmp_Wx: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector, and must provide a new memory region for\n    //     each iteration.\n    // tmp_Rh: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. The same memory region may be provided\n    //     for each iteration.\n    // zoneout_prob: 0.0 <= zoneout_prob <= 1.0; specifies the probability of a hidden\n    //     activation being randomly zoned out. If zoneout was used during training, this\n    //     parameter must also be specified during inference with the same value.\n    // zoneout_mask: [N,H] may be null to disable zoneout. This is a random binary mask\n    //     following a Bernoulli(1-zoneout_prob) distribution. 
A different mask is typically\n    //     used for each iteration.\n    void Iterate(\n        const T* W,\n        const T* R,\n        const T* bx,\n        const T* br,\n        const T* x,\n        const T* h,\n        T* h_out,\n        T* v,\n        T* tmp_Wx,\n        T* tmp_Rh,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n    void Run(\n        const int steps,\n        const T* W,\n        const T* R,\n        const T* bx,\n        const T* br,\n        const T* x,\n        T* h,\n        T* v,\n        T* tmp_Wx,\n        T* tmp_Rh,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R,\n        const T* bx,\n        const T* br,\n        const T* h,\n        T* h_out,\n        T* v,\n        T* tmp_Wx,\n        T* tmp_Rh,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n    struct private_data;\n    private_data* data_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    // batch_size: the number of training inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    BackwardPass(\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~BackwardPass();\n\n    // Performs one backward iteration of the GRU cell.\n    //\n    // Note that BackwardPass must be iterated in the reverse order of ForwardPass.\n    // If ForwardPass iterates from 0 to T-1, BackwardPass needs to iterate from\n    // T-1 down to 0. 
When iteration numbers are described, they will be based on the\n    // iteration index (i.e., the T-1'th iteration of the forward pass is the last call\n    // to ForwardPass::Iterate, whereas it is the first call to BackwardPass::Iterate).\n    //\n    // W_t: [H*3,C] the transpose of the input weight matrix.\n    // R_t: [H*3,H] the transpose of the recurrent weight matrix.\n    // bx: [H*3] the bias vector for the input weight matrix.\n    // br: [H*3] the bias vector for the recurrent weight matrix.\n    // x_t: [C,N] the transpose of the GRU input for this iteration.\n    // h: [N,H] the t-1 iteration's `h_out` or the initial hidden state if this is the t=0\n    //     iteration (typically zeros).\n    // v: [N,H*4] the same vector as returned by ForwardPass::Iterate on its corresponding\n    //     iteration.\n    // dh_new: [N,H] the gradient of `h_out` with respect to the loss at this iteration.\n    // dx: [N,C] the gradient of the input at this time step with respect to the loss.\n    // dW: [C,H*3] the gradient of the input weight matrix with respect to the loss.\n    // dR: [H,H*3] the gradient of the recurrent weight matrix with respect to the loss.\n    // dbx: [H*3] the gradient of the bias vector for the input weight matrix with respect to\n    //     the loss.\n    // dbr: [H*3] the gradient of the bias vector for the recurrent weight matrix with respect\n    //     to the loss.\n    // dh: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros\n    //     for the T-1'th iteration and the same pointer should be passed in for each\n    //     iteration. After a complete backward pass, this vector will contain the gradient\n    //     of the initial hidden state with respect to the loss.\n    // dp: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. 
A new memory region must be provided\n    //     for each iteration.\n    // dq: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. A new memory region must be provided\n    //     for each iteration.\n    // zoneout_mask: [N,H] may be null if zoneout was disabled in the forward pass. This vector\n    //     must be the same as the one provided during the corresponding forward iteration.\n    void Iterate(\n        const T* W_t,\n        const T* R_t,\n        const T* bx,\n        const T* br,\n        const T* x_t,\n        const T* h,\n        const T* v,\n        const T* dh_new,\n        T* dx,\n        T* dW,\n        T* dR,\n        T* dbx,\n        T* dbr,\n        T* dh,\n        T* dp,\n        T* dq,\n        const T* zoneout_mask);\n\n    void Run(\n        const int steps,\n        const T* W_t,\n        const T* R_t,\n        const T* bx,\n        const T* br,\n        const T* x_t,\n        const T* h,\n        const T* v,\n        const T* dh_new,\n        T* dx,\n        T* dW,\n        T* dR,\n        T* dbx,\n        T* dbr,\n        T* dh,\n        T* dp,\n        T* dq,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R_t,\n        const T* h,\n        const T* v,\n        const T* dh_new,\n        T* dbx,\n        T* dbr,\n        T* dh,\n        T* dp,\n        T* dq,\n        const T* zoneout_mask);\n\n    struct private_data;\n    private_data* data_;\n};\n\n}  // namespace gru\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/indrnn.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cuda_runtime_api.h>\n#include <cublas_v2.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace indrnn {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    ForwardPass(\n        const bool training,\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    ~ForwardPass();\n\n    void Run(\n        const int steps,\n        const T* W,\n        const T* u,\n        const T* b,\n        const T* x,\n        T* h,\n        T* workspace,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n  private:\n    struct private_data;\n    private_data* data_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    BackwardPass(\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    ~BackwardPass();\n\n    void Run(\n        const int steps,\n        const T* W_t,\n        const T* u,\n        const T* b,\n        const T* x_t,\n        const T* h,\n        const T* dh_new,\n        T* dx,\n        T* dW,\n        T* du,\n        T* db,\n        T* 
dh,\n        T* workspace,\n        const T* zoneout_mask);\n\n  private:\n    struct private_data;\n    private_data* data_;\n};\n\n}  // namespace indrnn\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/layer_norm.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cuda_runtime_api.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    // gamma: [H]\n    // beta: [H]\n    // cache: [N,2]\n    ForwardPass(\n        const int batch_size,\n        const int hidden_size,\n        const T* gamma,\n        const T* beta,\n        T* cache);\n\n    // Computes the layer norm of an input tensor `x` over its innermost (fastest changing)\n    // dimension. 
The layer norm is defined as: \\(\\frac{x-\\mu}{\\sigma} \\gamma + \\beta\\)\n    // where `\\gamma` and `\\beta` are trainable parameters.\n    //\n    // x: [N,H]\n    // y: [N,H]\n    void Run(const cudaStream_t& stream, const T* x, T* y);\n\n    void RunPartial(\n        const cudaStream_t& stream,\n        const int minibatch,\n        const T* x,\n        T* y);\n\n  private:\n    const int batch_size_;\n    const int hidden_size_;\n    const T* gamma_;\n    const T* beta_;\n    T* cache_;\n    int partial_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    BackwardPass(\n        const int batch_size,\n        const int hidden_size,\n        const T* gamma,\n        const T* beta,\n        const T* x,\n        T* dgamma,\n        T* dbeta,\n        T* cache);\n\n    void Run(const cudaStream_t& stream, const T* dy, T* dx);\n\n    void RunPartial(\n        const cudaStream_t& stream,\n        const int minibatch,\n        const T* dy,\n        T* dx);\n\n  private:\n    const int batch_size_;\n    const int hidden_size_;\n    const T* gamma_;\n    const T* beta_;\n    const T* x_;\n    T* dgamma_;\n    T* dbeta_;\n    T* cache_;\n    int partial_;\n};\n\n}  // namespace layer_norm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/layer_norm_gru.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_gru {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    // training: `true` if the caller intends to perform a backward pass to compute gradients.\n    // batch_size: the number of training/inference inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    ForwardPass(\n        const bool training,\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~ForwardPass();\n\n    // Performs one forward iteration of the GRU cell.\n    //\n    // W: [C,H*3] the input weight matrix.\n    // R: [H,H*3] the recurrent weight matrix.\n    // bx: [H*3] the bias for the input weight matrix.\n    // br: [H*3] the bias for the recurrent weight matrix.\n    // x: [N,C] the GRU input for this iteration (N vectors, each with 
dimension C).\n    // h: [N,H] the t-1 iteration's `h_out` or the initial hidden state if this is the\n    //     t=0 iteration (typically zeros).\n    // h_out: [N,H] the GRU's output, and the input to the next iteration's `h`. This\n    //     pointer may be the same as `h`. Each iteration may reuse the same memory region.\n    // v: [N,H*4] if `training` is `false`, this can be a null pointer. If `training` is\n    //     `true`, this vector will contain intermediate activations for this iteration which\n    //     must be provided as-is to the corresponding backward iteration. The caller must\n    //     provide a new memory region for each iteration.\n    // tmp_Wx: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector, and must provide a new memory region for\n    //     each iteration.\n    // tmp_Rh: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. The same memory region may be provided\n    //     for each iteration.\n    // zoneout_prob: 0.0 <= zoneout_prob <= 1.0; specifies the probability of a hidden\n    //     activation being randomly zoned out. If zoneout was used during training, this\n    //     parameter must also be specified during inference with the same value.\n    // zoneout_mask: [N,H] may be null to disable zoneout. This is a random binary mask\n    //     following a Bernoulli(1-zoneout_prob) distribution. 
A different mask is typically\n    //     used for each iteration.\n    void Run(\n        const int steps,\n        const T* W,\n        const T* R,\n        const T* bx,\n        const T* br,\n        const T* x,\n        T* h,\n        T* v,\n        T* act_Wx,\n        layer_norm::ForwardPass<T>& layer_norm1,\n        T* tmp_Wx_norm,\n        T* act_Rh,\n        layer_norm::ForwardPass<T>& layer_norm2,\n        T* tmp_Rh_norm,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R,\n        const T* bx,\n        const T* br,\n        const T* h,\n        T* h_out,\n        T* v,\n        T* tmp_Wx_norm,\n        T* act_Rh,\n        layer_norm::ForwardPass<T>& layer_norm2,\n        T* tmp_Rh_norm,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n    struct private_data;\n    private_data* data_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    // batch_size: the number of training inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    BackwardPass(\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~BackwardPass();\n\n    // Performs one backward iteration of the GRU cell.\n    //\n    // Note that BackwardPass must be iterated in the reverse order of ForwardPass.\n    // If ForwardPass iterates from 0 to T-1, BackwardPass needs to iterate from\n    // T-1 down to 0. 
When iteration numbers are described, they will be based on the\n    // iteration index (i.e., the T-1'th iteration of the forward pass is the last call\n    // to ForwardPass::Iterate, whereas it is the first call to BackwardPass::Iterate).\n    //\n    // W_t: [H*3,C] the transpose of the input weight matrix.\n    // R_t: [H*3,H] the transpose of the recurrent weight matrix.\n    // bx: [H*3] the bias vector for the input weight matrix.\n    // br: [H*3] the bias vector for the recurrent weight matrix.\n    // x_t: [C,N] the transpose of the GRU input for this iteration.\n    // h: [N,H] the t-1 iteration's `h_out` or the initial hidden state if this is the t=0\n    //     iteration (typically zeros).\n    // v: [N,H*4] the same vector as returned by ForwardPass::Iterate on its corresponding\n    //     iteration.\n    // dh_new: [N,H] the gradient of `h_out` with respect to the loss at this iteration.\n    // dx: [N,C] the gradient of the input at this time step with respect to the loss.\n    // dW: [C,H*3] the gradient of the input weight matrix with respect to the loss.\n    // dR: [H,H*3] the gradient of the recurrent weight matrix with respect to the loss.\n    // dbx: [H*3] the gradient of the bias vector for the input weight matrix with respect to\n    //     the loss.\n    // dbr: [H*3] the gradient of the bias vector for the recurrent weight matrix with respect\n    //     to the loss.\n    // dh: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros\n    //     for the T-1'th iteration and the same pointer should be passed in for each\n    //     iteration. After a complete backward pass, this vector will contain the gradient\n    //     of the initial hidden state with respect to the loss.\n    // dp: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. 
A new memory region must be provided\n    //     for each iteration.\n    // dq: [N,H*3] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. A new memory region must be provided\n    //     for each iteration.\n    // zoneout_mask: [N,H] may be null if zoneout was disabled in the forward pass. This vector\n    //     must be the same as the one provided during the corresponding forward iteration.\n    void Run(\n        const int steps,\n        const T* W_t,\n        const T* R_t,\n        const T* bx,\n        const T* br,\n        const T* x_t,\n        const T* h,\n        const T* v,\n        const T* dh_new,\n        T* dx,\n        T* dW,\n        T* dR,\n        T* dbx,\n        T* dbr,\n        T* dh,\n        T* dp,\n        T* dq,\n        layer_norm::BackwardPass<T>& layer_norm1,\n        layer_norm::BackwardPass<T>& layer_norm2,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R_t,\n        const T* h,\n        const T* v,\n        const T* dh_new,\n        T* dbx,\n        T* dbr,\n        T* dh,\n        T* dp,\n        T* dq,\n        layer_norm::BackwardPass<T>& layer_norm2,\n        const T* zoneout_mask);\n\n    struct private_data;\n    private_data* data_;\n};\n\n}  // namespace layer_norm_gru\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/layer_norm_indrnn.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cuda_runtime_api.h>\n#include <cublas_v2.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_indrnn {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    ForwardPass(\n        const bool training,\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    ~ForwardPass();\n\n    void Run(\n        const int steps,\n        const T* W,\n        const T* u,\n        const T* b,\n        const T* x,\n        T* h,\n        T* workspace,\n        T* act_Wx,\n        layer_norm::ForwardPass<T>& layer_norm1,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n  private:\n    struct private_data;\n    private_data* data_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    BackwardPass(\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    ~BackwardPass();\n\n    void Run(\n        const int steps,\n        const T* W_t,\n        const T* u,\n        const T* b,\n        const T* x_t,\n        const T* h,\n        const T* 
dh_new,\n        T* dx,\n        T* dW,\n        T* du,\n        T* db,\n        T* dh,\n        T* workspace,\n        layer_norm::BackwardPass<T>& layer_norm1,\n        const T* zoneout_mask);\n\n  private:\n    struct private_data;\n    private_data* data_;\n};\n\n}  // namespace layer_norm_indrnn\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/layer_norm_lstm.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_lstm {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    // training: `true` if the caller intends to perform a backward pass to compute gradients.\n    // batch_size: the number of training/inference inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    ForwardPass(\n        const bool training,\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~ForwardPass();\n\n    // Runs the LSTM over all time steps. This method is faster than using a per-step\n    // `Iterate` but requires that the entire input sequence be available upfront. In some\n    // situations, this constraint may not be satisfiable (e.g. 
autoregressive models).\n    // Users should prefer calling `Run` over `Iterate` whenever possible.\n    //\n    // steps: the number of iterations to run (i.e. T).\n    // W: [C,H*4] the input weight matrix.\n    // R: [H,H*4] the recurrent weight matrix.\n    // b: [H*4] the bias vector.\n    // x: [T,N,C] the LSTM input for this iteration (N vectors, each with dimension C).\n    // h: [T+1,N,H] the hidden state vectors across all time steps. The t=0'th vector should\n    //      be set to the desired initial hidden state (typically zeros). The rest of the\n    //      vectors will be set by this function. `h[1:,:,:]` forms the output of this LSTM\n    //      layer.\n    // c: [T+1,N,H] the cell state vectors across all time steps. The t=0'th vector should be\n    //      set to the desired initial cell state (typically zeros). The rest of the vectors\n    //      will be set by this function.\n    // v: [T,N,H*4] if `training` is `false`, this is scratch space and should not be used by\n    //     the caller. If `training` is `true`, this parameter will contain intermediate\n    //     activations which must be provided as-is to `BackwardPass::Run` or manually unrolled\n    //     for `BackwardPass::Iterate`.\n    // tmp_Rh: [N,H*4] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. The same memory region may be provided\n    //     for each iteration.\n    // zoneout_prob: 0.0 <= zoneout_prob <= 1.0; specifies the probability of a hidden\n    //     activation being randomly zoned out. If zoneout was used during training, this\n    //     parameter must also be specified during inference with the same value.\n    // zoneout_mask: [T,N,H] may be null to disable zoneout. This is a random binary mask\n    //     following a Bernoulli(1-zoneout_prob) distribution. 
A different mask is typically\n    //     used for each iteration.\n    void Run(\n        const int steps,\n        const T* W,\n        const T* R,\n        const T* b,\n        const T* x,\n        T* h,\n        T* c,\n        T* act_Wx,\n        T* tmp_Rh,\n        layer_norm::ForwardPass<T>& layer_norm1,\n        T* act_Wx_norm,\n        T* act_Rh,\n        layer_norm::ForwardPass<T>& layer_norm2,\n        layer_norm::ForwardPass<T>& layer_norm3,\n        T* act_c_norm,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R,\n        const T* b,\n        const T* h,\n        const T* c,\n        T* h_out,\n        T* c_out,\n        T* v,\n        T* tmp_Rh,\n        T* act_Rh,\n        layer_norm::ForwardPass<T>& layer_norm2,\n        layer_norm::ForwardPass<T>& layer_norm3,\n        T* act_c_norm,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n    struct private_data;\n    private_data* data_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    // batch_size: the number of training inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    BackwardPass(\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~BackwardPass();\n\n    // Runs the LSTM backward pass over all time steps. This method is faster than using a\n    // per-step `Iterate` but requires that the entire input sequence be available upfront.\n    // In some situations, this constraint may not be satisfiable (e.g. 
autoregressive models).\n    // Users should prefer calling `Run` over `Iterate` whenever possible.\n    //\n    // steps: the number of iterations to run (i.e. T).\n    // W_t: [H*4,C] the transpose of the input weight matrix.\n    // R_t: [H*4,H] the transpose of the recurrent weight matrix.\n    // b: [H*4] the bias vector.\n    // x_t: [C,T,N] the transpose of the LSTM input for this iteration.\n    // h: [T+1,N,H] the hidden state vectors after running `ForwardPass::Run`.\n    // c: [T+1,N,H] the cell state vectors after running `ForwardPass::Run`.\n    // dh_new: [T+1,N,H] the gradient of the loss with respect to `h`.\n    // dc_new: [T+1,N,H] the gradient of the loss with respect to `c`.\n    // dx: [T,N,C] the gradient of the loss with respect to the input.\n    // dW: [C,H*4] the gradient of the loss with respect to the input weight matrix.\n    // dR: [H,H*4] the gradient of the loss with respect to the recurrent weight matrix.\n    // db: [H*4] the gradient of the loss with respect to the bias vector.\n    // dh: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros.\n    //     When this function returns, `dh` will contain the gradient of the loss with respect\n    //     to the initial hidden state.\n    // dc: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros.\n    //     When this function returns, `dc` will contain the gradient of the loss with respect\n    //     to the initial cell state.\n    // v: [T,N,H*4] the same tensor that was passed to `ForwardPass::Run`.\n    // zoneout_mask: [T,N,H] may be null if zoneout was disabled in the forward pass. 
This\n    //     vector must be the same as the one provided during the forward pass.\n    void Run(\n        const int steps,\n        const T* W_t,\n        const T* R_t,\n        const T* b,\n        const T* x_t,\n        const T* h,\n        const T* c,\n        const T* dh_new,\n        const T* dc_new,\n        T* dx,\n        T* dW,\n        T* dR,\n        T* db,\n        T* dh,\n        T* dc,\n        T* act_Wx,\n        layer_norm::BackwardPass<T>& layer_norm1,\n        T* act_Wx_norm,\n        T* act_Rh,\n        layer_norm::BackwardPass<T>& layer_norm2,\n        layer_norm::BackwardPass<T>& layer_norm3,\n        T* act_c_norm,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R_t,\n        const T* c,\n        const T* c_new,\n        const T* dh_new,\n        const T* dc_new,\n        T* db,\n        T* dh,\n        T* dc,\n        T* v,\n        T* act_Rh,\n        layer_norm::BackwardPass<T>& layer_norm2,\n        layer_norm::BackwardPass<T>& layer_norm3,\n        T* act_c_norm,\n        const T* zoneout_mask);\n    struct private_data;\n    private_data* data_;\n};\n\n}  // namespace layer_norm_lstm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste/lstm.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\nnamespace haste {\nnamespace v0 {\nnamespace lstm {\n\ntemplate<typename T>\nclass ForwardPass {\n  public:\n    // training: `true` if the caller intends to perform a backward pass to compute gradients.\n    // batch_size: the number of training/inference inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    ForwardPass(\n        const bool training,\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~ForwardPass();\n\n    // Performs one forward iteration of the LSTM cell.\n    //\n    // W: [C,H*4] the input weight matrix.\n    // R: [H,H*4] the recurrent weight matrix.\n    // b: [H*4] the bias vector.\n    // x: [N,C] the LSTM input for this iteration (N vectors, each with dimension C).\n    // h: [N,H] the t-1 iteration's `h_out` or the initial hidden state if 
this is the\n    //     t=0 iteration (typically zeros).\n    // c: [N,H] the t-1 iteration's `c_out` or the initial cell state if this is the\n    //     t=0 iteration (typically zeros).\n    // h_out: [N,H] the LSTM's output, and the input to the next iteration's `h`. This\n    //     pointer may be the same as `h`. Each iteration may reuse the same memory region.\n    // c_out: [N,H] the LSTM's internal cell state after this iteration is complete. This\n    //     will become the input to the next iteration's `c`.\n    // v: [N,H*4] if `training` is `false`, this is scratch space and should not be used by\n    //     the caller. If `training` is `true`, this vector will contain intermediate\n    //     activations for this iteration which must be provided as-is to the corresponding\n    //     backward iteration. In either case, a new memory region must be provided for each\n    //     iteration.\n    // tmp_Rh: [N,H*4] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. The same memory region may be provided\n    //     for each iteration.\n    // zoneout_prob: 0.0 <= zoneout_prob <= 1.0; specifies the probability of a hidden\n    //     activation being randomly zoned out. If zoneout was used during training, this\n    //     parameter must also be specified during inference with the same value.\n    // zoneout_mask: [N,H] may be null to disable zoneout. This is a random binary mask\n    //     following a Bernoulli(1-zoneout_prob) distribution. A different mask is typically\n    //     used for each iteration.\n    void Iterate(\n        const cudaStream_t& stream,\n        const T* W,\n        const T* R,\n        const T* b,\n        const T* x,\n        const T* h,\n        const T* c,\n        T* h_out,\n        T* c_out,\n        T* v,\n        T* tmp_Rh,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n    // Runs the LSTM over all time steps. 
This method is faster than using a per-step\n    // `Iterate` but requires that the entire input sequence be available upfront. In some\n    // situations, this constraint may not be satisfiable (e.g. autoregressive models).\n    // Users should prefer calling `Run` over `Iterate` whenever possible.\n    //\n    // steps: the number of iterations to run (i.e. T).\n    // W: [C,H*4] the input weight matrix.\n    // R: [H,H*4] the recurrent weight matrix.\n    // b: [H*4] the bias vector.\n    // x: [T,N,C] the LSTM input sequence (T time steps, each with N vectors of dimension C).\n    // h: [T+1,N,H] the hidden state vectors across all time steps. The t=0'th vector should\n    //      be set to the desired initial hidden state (typically zeros). The rest of the\n    //      vectors will be set by this function. `h[1:,:,:]` forms the output of this LSTM\n    //      layer.\n    // c: [T+1,N,H] the cell state vectors across all time steps. The t=0'th vector should be\n    //      set to the desired initial cell state (typically zeros). The rest of the vectors\n    //      will be set by this function.\n    // v: [T,N,H*4] if `training` is `false`, this is scratch space and should not be used by\n    //     the caller. If `training` is `true`, this parameter will contain intermediate\n    //     activations which must be provided as-is to `BackwardPass::Run` or manually unrolled\n    //     for `BackwardPass::Iterate`.\n    // tmp_Rh: [N,H*4] additional temporary work space required for this iteration. The caller\n    //     should not use the contents of this vector. The same memory region may be provided\n    //     for each iteration.\n    // zoneout_prob: 0.0 <= zoneout_prob <= 1.0; specifies the probability of a hidden\n    //     activation being randomly zoned out. If zoneout was used during training, this\n    //     parameter must also be specified during inference with the same value.\n    // zoneout_mask: [T,N,H] may be null to disable zoneout. 
This is a random binary mask\n    //     following a Bernoulli(1-zoneout_prob) distribution. A different mask is typically\n    //     used for each iteration.\n    void Run(\n        const int steps,\n        const T* W,\n        const T* R,\n        const T* b,\n        const T* x,\n        T* h,\n        T* c,\n        T* v,\n        T* tmp_Rh,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R,\n        const T* b,\n        const T* h,\n        const T* c,\n        T* h_out,\n        T* c_out,\n        T* v,\n        T* tmp_Rh,\n        const float zoneout_prob,\n        const T* zoneout_mask);\n\n    struct private_data;\n    private_data* data_;\n};\n\ntemplate<typename T>\nclass BackwardPass {\n  public:\n    // batch_size: the number of training inputs provided in each tensor.\n    // input_size: the dimension of each input vector.\n    // hidden_size: the expected dimension of each output vector.\n    // blas_handle: an initialized cuBLAS handle (see `cublasCreate`).\n    BackwardPass(\n        const int batch_size,\n        const int input_size,\n        const int hidden_size,\n        const cublasHandle_t& blas_handle,\n        const cudaStream_t& stream = 0);\n\n    // Releases internal resources.\n    // Blocks until all iterations have completed executing on the GPU.\n    ~BackwardPass();\n\n    // Performs one backward iteration of the LSTM cell.\n    //\n    // Note that BackwardPass must be iterated in the reverse order of ForwardPass.\n    // If ForwardPass iterates from 0 to T-1, BackwardPass needs to iterate from\n    // T-1 down to 0. 
When iteration numbers are described, they will be based on the\n    // iteration index (i.e., the T-1'th iteration of the forward pass is the last call\n    // to ForwardPass::Iterate, whereas it is the first call to BackwardPass::Iterate).\n    //\n    // W_t: [H*4,C] the transpose of the input weight matrix.\n    // R_t: [H*4,H] the transpose of the recurrent weight matrix.\n    // b: [H*4] the bias vector.\n    // x_t: [C,N] the transpose of the LSTM input for this iteration.\n    // h: [N,H] the hidden state of the t'th iteration or the initial hidden state if this is\n    //     the t=0 iteration (typically zeros).\n    // c: [N,H] the t-1'th forward iteration's `c_out` or the initial cell state if this is\n    //     the t=0 iteration (typically zeros).\n    // c_new: [N,H] the t'th forward iteration's `c_out` vector.\n    // dh_new: [N,H] the gradient of the loss with respect to `h_out` at this iteration.\n    // dc_new: [N,H] the gradient of the loss with respect to `c_out` at this iteration.\n    // dx: [N,C] the gradient of the loss with respect to the input at this time step.\n    // dW: [C,H*4] the gradient of the loss with respect to the input weight matrix.\n    // dR: [H,H*4] the gradient of the loss with respect to the recurrent weight matrix.\n    // db: [H*4] the gradient of the loss with respect to the bias vector.\n    // dh: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros\n    //     for the T-1'th iteration and the same pointer should be passed in for each\n    //     iteration. After a complete backward pass, this vector will contain the gradient\n    //     of the loss with respect to the initial hidden state.\n    // dc: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros\n    //     for the T-1'th iteration and the same pointer should be passed in for each\n    //     iteration. 
After a complete backward pass, this vector will contain the gradient\n    //     of the loss with respect to the initial cell state.\n    // v: [N,H*4] the same tensor that was passed to `ForwardPass::Iterate` on its corresponding\n    //     iteration.\n    // zoneout_mask: [N,H] may be null if zoneout was disabled in the forward pass. This vector\n    //     must be the same as the one provided during the corresponding forward iteration.\n    void Iterate(\n        const cudaStream_t& stream,\n        const T* W_t,\n        const T* R_t,\n        const T* b,\n        const T* x_t,\n        const T* h,\n        const T* c,\n        const T* c_new,\n        const T* dh_new,\n        const T* dc_new,\n        T* dx,\n        T* dW,\n        T* dR,\n        T* db,\n        T* dh,\n        T* dc,\n        T* v,\n        const T* zoneout_mask);\n\n    // Runs the LSTM backward pass over all time steps. This method is faster than using a\n    // per-step `Iterate` but requires that the entire input sequence be available upfront.\n    // In some situations, this constraint may not be satisfiable (e.g. autoregressive models).\n    // Users should prefer calling `Run` over `Iterate` whenever possible.\n    //\n    // steps: the number of iterations to run (i.e. 
T).\n    // W_t: [H*4,C] the transpose of the input weight matrix.\n    // R_t: [H*4,H] the transpose of the recurrent weight matrix.\n    // b: [H*4] the bias vector.\n    // x_t: [C,T,N] the transpose of the full LSTM input sequence.\n    // h: [T+1,N,H] the hidden state vectors after running `ForwardPass::Run`.\n    // c: [T+1,N,H] the cell state vectors after running `ForwardPass::Run`.\n    // dh_new: [T+1,N,H] the gradient of the loss with respect to `h`.\n    // dc_new: [T+1,N,H] the gradient of the loss with respect to `c`.\n    // dx: [T,N,C] the gradient of the loss with respect to the input.\n    // dW: [C,H*4] the gradient of the loss with respect to the input weight matrix.\n    // dR: [H,H*4] the gradient of the loss with respect to the recurrent weight matrix.\n    // db: [H*4] the gradient of the loss with respect to the bias vector.\n    // dh: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros.\n    //     When this function returns, `dh` will contain the gradient of the loss with respect\n    //     to the initial hidden state.\n    // dc: [N,H] NOTE: this is an input and output parameter. Should be initialized to zeros.\n    //     When this function returns, `dc` will contain the gradient of the loss with respect\n    //     to the initial cell state.\n    // v: [T,N,H*4] the same tensor that was passed to `ForwardPass::Run`.\n    // zoneout_mask: [T,N,H] may be null if zoneout was disabled in the forward pass. 
This\n    //     vector must be the same as the one provided during the forward pass.\n    void Run(\n        const int steps,\n        const T* W_t,\n        const T* R_t,\n        const T* b,\n        const T* x_t,\n        const T* h,\n        const T* c,\n        const T* dh_new,\n        const T* dc_new,\n        T* dx,\n        T* dW,\n        T* dR,\n        T* db,\n        T* dh,\n        T* dc,\n        T* v,\n        const T* zoneout_mask);\n\n  private:\n    void IterateInternal(\n        const T* R_t,\n        const T* c,\n        const T* c_new,\n        const T* dh_new,\n        const T* dc_new,\n        T* db,\n        T* dh,\n        T* dc,\n        T* v,\n        const T* zoneout_mask);\n    struct private_data;\n    private_data* data_;\n};\n\n}  // namespace lstm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/haste.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n// GENERAL NOTES:\n// No pointers may be null unless otherwise specified.\n// All pointers are expected to point to device memory.\n// The square brackets below describe tensor shapes, where\n//     T = number of RNN time steps\n//     N = batch size\n//     C = input size\n//     H = hidden size\n// and the rightmost dimension changes the fastest.\n\n#include \"haste/gru.h\"\n#include \"haste/indrnn.h\"\n#include \"haste/layer_norm.h\"\n#include \"haste/layer_norm_gru.h\"\n#include \"haste/layer_norm_indrnn.h\"\n#include \"haste/layer_norm_lstm.h\"\n#include \"haste/lstm.h\"\n"
  },
  {
    "path": "lib/indrnn_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid IndrnnBwdOps(\n    const int steps,\n    const int batch_size,\n    const int hidden_size,\n    const T* u,\n    const T* h_prev,\n    const T* h,\n    const T* dh_new,\n    T* du_out,\n    T* db_out,\n    T* dh_inout,\n    T* dk_out,\n    const T* zoneout_mask) {\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  const int NH = batch_size * hidden_size;\n  const int idx = col * hidden_size + row;\n\n  const T u_row = u[row];\n  T dh_inout_idx = dh_inout[idx];\n  T du_sum = static_cast<T>(0.0);\n  T db_sum = static_cast<T>(0.0);\n\n  for (int i = (steps - 1) * NH; i >= 0; i -= NH) {\n    T dh_total = dh_new[idx + i] + dh_inout_idx;\n    T dh = static_cast<T>(0.0);\n    if (ApplyZoneout) {\n      const T mask = zoneout_mask[idx + i];\n      dh = (static_cast<T>(1.0) - mask) * dh_total;\n      dh_total = mask * dh_total;\n    }\n\n    const T dk = d_tanh(h[idx + i]) * dh_total;\n\n    dk_out[idx + i] = dk;\n    dh_inout_idx = 
dh + u_row * dk;\n    du_sum += h_prev[idx + i] * dk;\n    db_sum += dk;\n  }\n\n  dh_inout[idx] = dh_inout_idx;\n  atomicAdd(&du_out[row], du_sum);\n  atomicAdd(&db_out[row], db_sum);\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace indrnn {\n\ntemplate<typename T>\nstruct BackwardPass<T>::private_data {\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream);\n}\n\ntemplate<typename T>\nBackwardPass<T>::~BackwardPass() {\n  if (data_->sync_stream) {\n    cudaEvent_t event;\n    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);\n    cudaEventRecord(event, data_->stream);\n    cudaStreamWaitEvent(data_->sync_stream, event, 0);\n    cudaEventDestroy(event);\n  } else {\n    cudaStreamSynchronize(data_->stream);\n  }\n  cudaStreamDestroy(data_->stream);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(\n    const int steps,\n    const T* W_t,\n    const T* u,\n    const T* b,\n    const T* x_t,\n    const T* h,\n    const T* dh_new,\n    T* dx,\n    T* dW,\n    T* du,\n    T* db,\n    T* dh,\n    T* workspace,\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = 
data_->blas_handle;\n  const cudaStream_t stream = data_->stream;\n\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n  const int NH = batch_size * hidden_size;\n  if (zoneout_mask) {\n    IndrnnBwdOps<T, true><<<gridDim, blockDim, 0, stream>>>(\n        steps,\n        batch_size,\n        hidden_size,\n        u,\n        h,\n        h + NH,\n        dh_new + NH,\n        du,\n        db,\n        dh,\n        workspace,\n        zoneout_mask);\n  } else {\n    IndrnnBwdOps<T, false><<<gridDim, blockDim, 0, stream>>>(\n        steps,\n        batch_size,\n        hidden_size,\n        u,\n        h,\n        h + NH,\n        dh_new + NH,\n        du,\n        db,\n        dh,\n        workspace,\n        nullptr);\n  }\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, input_size, batch_size * steps,\n      &alpha,\n      workspace, hidden_size,\n      x_t, batch_size * steps,\n      &beta,\n      dW, hidden_size);\n\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, steps * batch_size, hidden_size,\n      &alpha,\n      W_t, input_size,\n      workspace, hidden_size,\n      &beta,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate class BackwardPass<float>;\ntemplate class BackwardPass<double>;\n\n}  // namespace indrnn\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/indrnn_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid IndrnnFwdOps(\n    const int steps,\n    const int batch_size,\n    const int hidden_size,\n    const T* Wx,\n    const T* u,\n    const T* b,\n    const T* h,\n    T* h_out,\n    const float zoneout_prob,\n    const T* zoneout_mask) {\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  const int idx = col * hidden_size + row;\n  const int NH = batch_size * hidden_size;\n  const T u_row = u[row];\n  const T b_row = b[row];\n\n  for (int i = 0; i < steps * NH; i += NH) {\n    const T a = Wx[idx + i] + u_row * h[idx + i] + b_row;\n    T cur_h_value = tanh(a);\n\n    if (ApplyZoneout) {\n      if (Training) {\n        cur_h_value = (cur_h_value - h[idx + i]) * zoneout_mask[idx + i] + h[idx + i];\n      } else {\n        cur_h_value = (zoneout_prob * h[idx + i]) + ((1.0f - zoneout_prob) * cur_h_value);\n      }\n    }\n\n    h_out[idx + i] = cur_h_value;\n  }\n}\n\n}  // anonymous namespace\n\nnamespace haste 
{\nnamespace v0 {\nnamespace indrnn {\n\ntemplate<typename T>\nstruct ForwardPass<T>::private_data {\n  bool training;\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const bool training,\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->training = training;\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream);\n}\n\ntemplate<typename T>\nForwardPass<T>::~ForwardPass() {\n  if (data_->sync_stream) {\n    cudaEvent_t event;\n    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);\n    cudaEventRecord(event, data_->stream);\n    cudaStreamWaitEvent(data_->sync_stream, event, 0);\n    cudaEventDestroy(event);\n  } else {\n    cudaStreamSynchronize(data_->stream);\n  }\n  cudaStreamDestroy(data_->stream);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Run(\n    const int steps,\n    const T* W,\n    const T* u,\n    const T* b,\n    const T* x,\n    T* h,\n    T* workspace,\n    const float zoneout_prob,\n    const T* zoneout_mask) {\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const bool training = data_->training;\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream = data_->stream;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream);\n  
blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, steps * batch_size, input_size,\n      &alpha,\n      W, hidden_size,\n      x, input_size,\n      &beta,\n      workspace, hidden_size);\n\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n  const int NH = batch_size * hidden_size;\n  if (training) {\n    if (zoneout_prob && zoneout_mask) {\n      IndrnnFwdOps<T, true, true><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      IndrnnFwdOps<T, true, false><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          0.0f,\n          nullptr);\n    }\n  } else {\n    if (zoneout_prob && zoneout_mask) {\n      IndrnnFwdOps<T, false, true><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      IndrnnFwdOps<T, false, false><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          0.0f,\n          nullptr);\n    }\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate class ForwardPass<float>;\ntemplate class ForwardPass<double>;\n\n}  // namespace indrnn\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/inline_ops.h",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#pragma once\n\n#include <cuda_fp16.h>\n\ntemplate<typename T>\n__device__ __forceinline__\nT sigmoid(const T x) {\n  return static_cast<T>(1.0) / (static_cast<T>(1.0) + exp(-x));\n}\n\ntemplate<typename T>\n__device__ __forceinline__\nT tanh(const T x) {\n  return std::tanh(x);\n}\n\ntemplate<typename T>\n__device__ __forceinline__\nT d_sigmoid(const T sigmoid_output) {\n  return sigmoid_output * (static_cast<T>(1.0) - sigmoid_output);\n}\n\ntemplate<typename T>\n__device__ __forceinline__\nT d_tanh(const T tanh_output) {\n  return (static_cast<T>(1.0) - tanh_output * tanh_output);\n}\n\n#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600)\n\n__device__ __forceinline__\ndouble atomicAdd(double* address, double val) {\n  unsigned long long int* address_as_ull = (unsigned long long int*)address;\n  unsigned long long int old = *address_as_ull, assumed;\n  do {\n      assumed = old;\n      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));\n  } while (assumed != old);\n  return __longlong_as_double(old);\n}\n\n#endif\n\n#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 600)\n\ntemplate<>\n__device__ __forceinline__\nhalf sigmoid(const half x) {\n  return static_cast<half>(1.0) / (static_cast<half>(1.0) + 
hexp(-x));\n}\n\ntemplate<>\n__device__ __forceinline__\nhalf tanh(const half x) {\n  return std::tanh(float(x));\n}\n\n#endif\n"
  },
  {
    "path": "lib/layer_norm_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cassert>\n\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyBeta>\n__global__\nvoid LayerNormGrad(\n    const int batch_size,\n    const int hidden_size,\n    const T* gamma,\n    const T* x,\n    const T* dy,\n    T* dgamma,\n    T* dbeta,\n    T* dx,\n    T* cache) {\n  const int batch = blockDim.x * blockIdx.x + threadIdx.x;\n  if (batch >= batch_size)\n    return;\n\n  extern __shared__ int shared_var[];\n  T* shared = reinterpret_cast<T*>(shared_var);\n  const int index = threadIdx.y;\n  const int stride = blockDim.y;\n  const int batch_idx = batch * hidden_size;\n  const int batch_block_idx = threadIdx.x * stride * 3;\n\n  const T mean   = cache[batch * 2 + 0];\n  const T invstd = cache[batch * 2 + 1];\n\n  T dsigma_tmp = static_cast<T>(0.0);\n  T dmu1_tmp = static_cast<T>(0.0);\n  T dmu2_tmp = static_cast<T>(0.0);\n  for (int i = index; i < hidden_size; i += stride) {\n    const T cur_dy = dy[batch_idx + i];\n    const T centered_x = x[batch_idx + i] - mean;\n    const T z = centered_x * invstd;\n\n    atomicAdd(&dgamma[i], z * cur_dy);\n    if (ApplyBeta)\n      atomicAdd(&dbeta[i], cur_dy);\n\n    const T db = gamma[i] * cur_dy;\n    dsigma_tmp += centered_x * db;\n    dmu1_tmp += 
centered_x;\n    dmu2_tmp += db;\n  }\n  shared[batch_block_idx + index * 3 + 0] = dsigma_tmp;\n  shared[batch_block_idx + index * 3 + 1] = dmu1_tmp;\n  shared[batch_block_idx + index * 3 + 2] = dmu2_tmp;\n  __syncthreads();\n\n  for (int s = stride / 2; s > 0; s >>= 1) {\n    if (index < s) {\n      shared[batch_block_idx + index * 3 + 0] += shared[batch_block_idx + (index + s) * 3 + 0];\n      shared[batch_block_idx + index * 3 + 1] += shared[batch_block_idx + (index + s) * 3 + 1];\n      shared[batch_block_idx + index * 3 + 2] += shared[batch_block_idx + (index + s) * 3 + 2];\n    }\n    __syncthreads();\n  }\n\n  const T dsigma = static_cast<T>(-0.5) * shared[batch_block_idx + 0] * invstd * invstd * invstd;\n  const T dmu = (static_cast<T>(-2.0) * shared[batch_block_idx + 1] * dsigma / hidden_size) -\n                (shared[batch_block_idx + 2] * invstd);\n\n  for (int i = index; i < hidden_size; i += stride) {\n    const T cur_dy = dy[batch_idx + i];\n    const T centered_x = x[batch_idx + i] - mean;\n\n    const T db = gamma[i] * cur_dy;\n    dx[batch_idx + i] = (static_cast<T>(2.0) * centered_x * dsigma / hidden_size) +\n                        (invstd * db) +\n                        (dmu / hidden_size);\n  }\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm {\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int hidden_size,\n    const T* gamma,\n    const T* beta,\n    const T* x,\n    T* dgamma,\n    T* dbeta,\n    T* cache)\n        : batch_size_(batch_size),\n          hidden_size_(hidden_size),\n          gamma_(gamma),\n          beta_(beta),\n          x_(x),\n          dgamma_(dgamma),\n          dbeta_(dbeta),\n          cache_(cache),\n          partial_(batch_size) {\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(const cudaStream_t& stream, const T* dy, T* dx) {\n  RunPartial(stream, batch_size_, dy, dx);\n}\n\ntemplate<typename T>\nvoid 
BackwardPass<T>::RunPartial(\n    const cudaStream_t& stream,\n    const int minibatch,\n    const T* dy,\n    T* dx) {\n  assert(partial_ - minibatch >= 0);\n\n  dim3 blockDim(4, 256);\n  dim3 gridDim;\n  gridDim.x = (minibatch + blockDim.x - 1) / blockDim.x;\n  const int shared_mem_size = sizeof(T) * blockDim.x * blockDim.y * 3;\n\n  if (beta_ && dbeta_) {\n    LayerNormGrad<T, true><<<gridDim, blockDim, shared_mem_size, stream>>>(\n        minibatch,\n        hidden_size_,\n        gamma_,\n        x_ + (partial_ - minibatch) * hidden_size_,\n        dy,\n        dgamma_,\n        dbeta_,\n        dx,\n        cache_ + (partial_ - minibatch) * 2);\n  } else {\n    LayerNormGrad<T, false><<<gridDim, blockDim, shared_mem_size, stream>>>(\n        minibatch,\n        hidden_size_,\n        gamma_,\n        x_ + (partial_ - minibatch) * hidden_size_,\n        dy,\n        dgamma_,\n        nullptr,\n        dx,\n        cache_ + (partial_ - minibatch) * 2);\n  }\n\n  partial_ -= minibatch;\n}\n\ntemplate class BackwardPass<float>;\ntemplate class BackwardPass<double>;\n\n}  // namespace layer_norm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/layer_norm_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cassert>\n\n#include \"haste.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyBeta>\n__global__\nvoid LayerNorm(\n    const int batch_size,\n    const int hidden_size,\n    const T* gamma,\n    const T* beta,\n    const T* x,\n    T* y,\n    T* cache) {\n  const int batch = blockDim.x * blockIdx.x + threadIdx.x;\n  if (batch >= batch_size)\n    return;\n\n  extern __shared__ int shared_var[];\n  T* shared = reinterpret_cast<T*>(shared_var);\n  const int index = threadIdx.y;\n  const int stride = blockDim.y;\n  const int batch_idx = batch * hidden_size;\n  const int batch_block_idx = threadIdx.x * stride;\n\n  // TODO: use parallel single-pass algorithm to compute moments.\n\n  // Reduce sum\n  T sum = static_cast<T>(0.0);\n  for (int i = index; i < hidden_size; i += stride)\n    sum += x[batch_idx + i];\n  shared[batch_block_idx + index] = sum;\n  __syncthreads();\n\n  for (int s = stride / 2; s > 0; s >>= 1) {\n    if (index < s)\n      shared[batch_block_idx + index] += shared[batch_block_idx + index + s];\n    __syncthreads();\n  }\n\n  // Make sure this read completes before we start writing to `shared` again below.\n  const T mean = shared[batch_block_idx] / hidden_size;\n  __syncthreads();\n\n  // Reduce squared difference\n  T sumsq 
= static_cast<T>(0.0);\n  for (int i = index; i < hidden_size; i += stride) {\n    const T diff = x[batch_idx + i] - mean;\n    sumsq += diff * diff;\n  }\n  shared[batch_block_idx + index] = sumsq;\n  __syncthreads();\n\n  for (int s = stride / 2; s > 0; s >>= 1) {\n    if (index < s)\n      shared[batch_block_idx + index] += shared[batch_block_idx + index + s];\n    __syncthreads();\n  }\n\n  const T invstd = rsqrt(shared[batch_block_idx] / hidden_size + static_cast<T>(1e-5));\n\n  for (int i = index; i < hidden_size; i += stride) {\n    if (ApplyBeta)\n      y[batch_idx + i] = (x[batch_idx + i] - mean) * invstd * gamma[i] + beta[i];\n    else\n      y[batch_idx + i] = (x[batch_idx + i] - mean) * invstd * gamma[i];\n  }\n\n  cache[batch * 2 + 0] = mean;\n  cache[batch * 2 + 1] = invstd;\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm {\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const int batch_size,\n    const int hidden_size,\n    const T* gamma,\n    const T* beta,\n    T* cache)\n        : batch_size_(batch_size),\n          hidden_size_(hidden_size),\n          gamma_(gamma),\n          beta_(beta),\n          cache_(cache),\n          partial_(0) {\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Run(const cudaStream_t& stream, const T* x, T* y) {\n  RunPartial(stream, batch_size_, x, y);\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::RunPartial(\n    const cudaStream_t& stream,\n    const int minibatch,\n    const T* x,\n    T* y) {\n  assert(partial_ + minibatch <= batch_size_);\n\n  dim3 blockDim(4, 256);\n  dim3 gridDim;\n  gridDim.x = (minibatch + blockDim.x - 1) / blockDim.x;\n  const int shared_mem_size = sizeof(T) * blockDim.x * blockDim.y;\n\n  if (beta_) {\n    LayerNorm<T, true><<<gridDim, blockDim, shared_mem_size, stream>>>(\n        minibatch,\n        hidden_size_,\n        gamma_,\n        beta_,\n        x,\n        y,\n        cache_ + partial_ * 2);\n  } else {\n    LayerNorm<T, 
false><<<gridDim, blockDim, shared_mem_size, stream>>>(\n        minibatch,\n        hidden_size_,\n        gamma_,\n        nullptr,\n        x,\n        y,\n        cache_ + partial_ * 2);\n  }\n\n  partial_ += minibatch;\n}\n\ntemplate class ForwardPass<float>;\ntemplate class ForwardPass<double>;\n\n}  // namespace layer_norm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/layer_norm_gru_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* h,\n                         const T* v,\n                         const T* dh_new,\n                         T* dbx_out,\n                         T* dbr_out,\n                         T* dh_inout,\n                         T* dp_out,\n                         T* dq_out,\n                         const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_dim || col >= batch_dim)\n    return;\n\n  const int base_idx = col * hidden_dim + row;\n\n  T dh_total = dh_new[base_idx] + dh_inout[base_idx];\n\n  const int stride4_base_idx = col * (hidden_dim * 4) + row;\n  const int z_idx = stride4_base_idx + 0 * hidden_dim;\n  const int r_idx = stride4_base_idx + 1 * hidden_dim;\n  const int g_idx = stride4_base_idx + 2 * hidden_dim;\n  const int q_g_idx = stride4_base_idx + 3 * 
hidden_dim;\n\n  const T z = v[z_idx];\n  const T r = v[r_idx];\n  const T g = v[g_idx];\n  const T q_g = v[q_g_idx];\n\n  if (ApplyZoneout) {\n    const T mask = zoneout_mask[base_idx];\n    dh_inout[base_idx] = (static_cast<T>(1.0) - mask) * dh_total;\n    dh_total = mask * dh_total;\n    dh_inout[base_idx] += z * dh_total;\n  } else {\n    dh_inout[base_idx] = z * dh_total;\n  }\n\n  const T dg = (static_cast<T>(1.0) - z) * dh_total;\n  const T dz = (h[base_idx] - g) * dh_total;\n  const T dp_g = d_tanh(g) * dg;\n  const T dq_g = dp_g * r;\n  const T dr = dp_g * q_g;\n  const T dp_r = d_sigmoid(r) * dr;\n  const T dq_r = dp_r;\n  const T dp_z = d_sigmoid(z) * dz;\n  const T dq_z = dp_z;\n\n  const int idx = col * (hidden_dim * 3) + row;\n\n  dp_out[idx + 0 * hidden_dim] = dp_z;\n  dp_out[idx + 1 * hidden_dim] = dp_r;\n  dp_out[idx + 2 * hidden_dim] = dp_g;\n\n  dq_out[idx + 0 * hidden_dim] = dq_z;\n  dq_out[idx + 1 * hidden_dim] = dq_r;\n  dq_out[idx + 2 * hidden_dim] = dq_g;\n\n  atomicAdd(&dbx_out[row + 0 * hidden_dim], dp_z);\n  atomicAdd(&dbx_out[row + 1 * hidden_dim], dp_r);\n  atomicAdd(&dbx_out[row + 2 * hidden_dim], dp_g);\n\n  atomicAdd(&dbr_out[row + 0 * hidden_dim], dq_z);\n  atomicAdd(&dbr_out[row + 1 * hidden_dim], dq_r);\n  atomicAdd(&dbr_out[row + 2 * hidden_dim], dq_g);\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_gru {\n\ntemplate<typename T>\nstruct BackwardPass<T>::private_data {\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[2];\n  cudaEvent_t event;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  
data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nBackwardPass<T>::~BackwardPass() {\n  if (data_->sync_stream) {\n    cudaEventRecord(data_->event, data_->stream[1]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[0]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n  } else {\n    cudaStreamSynchronize(data_->stream[1]);\n    cudaStreamSynchronize(data_->stream[0]);\n  }\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::IterateInternal(\n    const T* R_t,     // [H*3,H]\n    const T* h,       // [N,H]\n    const T* v,       // [N,H*4]\n    const T* dh_new,  // [N,H]\n    T* dbx,           // [H*3]\n    T* dbr,           // [H*3]\n    T* dh,            // [N,H]\n    T* dp,            // [N,H*3]\n    T* dq,            // [N,H*3]\n    layer_norm::BackwardPass<T>& layer_norm2,\n    const T* zoneout_mask) {  // [N,H]\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);\n\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(32, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  if (zoneout_mask) {\n    PointwiseOperations<T, true><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        h,\n        v,\n        dh_new,\n        dbx,\n        dbr,\n    
    dh,\n        dp,\n        dq,\n        zoneout_mask\n    );\n  } else {\n    PointwiseOperations<T, false><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        h,\n        v,\n        dh_new,\n        dbx,\n        dbr,\n        dh,\n        dp,\n        dq,\n        nullptr\n    );\n  }\n  cudaEventRecord(event, stream1);\n\n  cublasSetStream(blas_handle,  stream1);\n  layer_norm2.RunPartial(stream1, batch_size, dq, dq);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, batch_size, hidden_size * 3,\n      &alpha,\n      R_t, hidden_size,\n      dq, hidden_size * 3,\n      &beta_sum,\n      dh, hidden_size);\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(\n    const int steps,\n    const T* W_t,\n    const T* R_t,\n    const T* bx,\n    const T* br,\n    const T* x_t,\n    const T* h,\n    const T* v,\n    const T* dh_new,\n    T* dx,\n    T* dW,\n    T* dR,\n    T* dbx,\n    T* dbr,\n    T* dh,\n    T* dp,\n    T* dq,\n    layer_norm::BackwardPass<T>& layer_norm1,\n    layer_norm::BackwardPass<T>& layer_norm2,\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);\n  const T beta_assign = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  const int NH = batch_size * hidden_size;\n  for (int i = steps - 1; i >= 0; --i) {\n    IterateInternal(\n        R_t,\n        h + i * NH,\n        v + i * NH * 4,\n        dh_new + (i + 1) * NH,\n        dbx,\n        dbr,\n        dh,\n    
    dp + i * NH * 3,\n        dq + i * NH * 3,\n        layer_norm2,\n        zoneout_mask ? zoneout_mask + i * NH : nullptr);\n  }\n\n  // Wait for pointwise operations to complete since there's a\n  // data dependency between its output (`dp`, `dq`) and the following matmuls.\n  cudaStreamWaitEvent(stream2, event, 0);\n\n  cublasSetStream(blas_handle, stream2);\n  layer_norm1.Run(stream2, dp, dp);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, batch_size * steps, hidden_size * 3,\n      &alpha,\n      W_t, input_size,\n      dp, hidden_size * 3,\n      &beta_assign,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_T,\n      hidden_size * 3, hidden_size, batch_size * steps,\n      &alpha,\n      dq, hidden_size * 3,\n      h, hidden_size,\n      &beta_sum,\n      dR, hidden_size * 3);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, input_size, batch_size * steps,\n      &alpha,\n      dp, hidden_size * 3,\n      x_t, batch_size * steps,\n      &beta_sum,\n      dW, hidden_size * 3);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct BackwardPass<float>;\ntemplate struct BackwardPass<double>;\n\n}  // namespace layer_norm_gru\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/layer_norm_gru_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* Wx,\n                         const T* Rh,\n                         const T* bx,\n                         const T* br,\n                         const T* h,\n                         T* h_out,\n                         T* v,\n                         const float zoneout_prob,\n                         const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_dim || col >= batch_dim)\n    return;\n\n  const int weight_idx = col * (hidden_dim * 3) + row;\n\n  // Index into the `h` and `h_out` vectors (they have a stride of `hidden_dim`).\n  const int output_idx = col * hidden_dim + row;\n\n  // Indicies into the Wx and Rh matrices (for each of the u, r, and e components).\n  const int z_idx = weight_idx + 0 * hidden_dim;\n  const int r_idx = weight_idx + 1 * 
hidden_dim;\n  const int g_idx = weight_idx + 2 * hidden_dim;\n\n  // Indices into the bias vectors (for each of the u, r, and e components).\n  const int bz_idx = row + 0 * hidden_dim;\n  const int br_idx = row + 1 * hidden_dim;\n  const int bg_idx = row + 2 * hidden_dim;\n\n  const T z = sigmoid(Wx[z_idx] + Rh[z_idx] + bx[bz_idx] + br[bz_idx]);\n  const T r = sigmoid(Wx[r_idx] + Rh[r_idx] + bx[br_idx] + br[br_idx]);\n  const T g = tanh   (Wx[g_idx] + r * (Rh[g_idx] + br[bg_idx]) + bx[bg_idx]);\n\n  // Store internal activations if we're eventually going to backprop.\n  if (Training) {\n    const int base_v_idx = col * (hidden_dim * 4) + row;\n    v[base_v_idx + 0 * hidden_dim] = z;\n    v[base_v_idx + 1 * hidden_dim] = r;\n    v[base_v_idx + 2 * hidden_dim] = g;\n    v[base_v_idx + 3 * hidden_dim] = Rh[g_idx] + br[bg_idx];\n  }\n\n  T cur_h_value = z * h[output_idx] + (static_cast<T>(1.0) - z) * g;\n\n  if (ApplyZoneout) {\n    if (Training) {\n      cur_h_value = (cur_h_value - h[output_idx]) * zoneout_mask[output_idx] + h[output_idx];\n    } else {\n      cur_h_value = (zoneout_prob * h[output_idx]) + ((1.0f - zoneout_prob) * cur_h_value);\n    }\n  }\n\n  h_out[output_idx] = cur_h_value;\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_gru {\n\ntemplate<typename T>\nstruct ForwardPass<T>::private_data {\n  bool training;\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[2];\n  cudaEvent_t event;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const bool training,\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->training = training;\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = 
blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nForwardPass<T>::~ForwardPass() {\n  if (data_->sync_stream) {\n    cudaEventRecord(data_->event, data_->stream[1]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[0]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n  } else {\n    cudaStreamSynchronize(data_->stream[1]);\n    cudaStreamSynchronize(data_->stream[0]);\n  }\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::IterateInternal(\n    const T* R,     // [H,H*3]\n    const T* bx,    // [H*3]\n    const T* br,    // [H*3]\n    const T* h,     // [N,H]\n    T* h_out,       // [N,H]\n    T* v,           // [N,H*4]\n    T* tmp_Wx_norm, // [N,H*3]\n    T* act_Rh,      // [N,H*3]\n    layer_norm::ForwardPass<T>& layer_norm2,\n    T* tmp_Rh_norm,\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  // Constants for GEMM\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const bool training = data_->training;\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, batch_size, hidden_size,\n      &alpha,\n      R, hidden_size * 3,\n      h, hidden_size,\n      &beta,\n      act_Rh, hidden_size * 3);\n  layer_norm2.RunPartial(stream1, batch_size, act_Rh, tmp_Rh_norm);\n\n  // Compute launch 
configuration for pointwise operations kernel.\n  const dim3 blockDim(32, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  cudaStreamWaitEvent(stream1, event, 0);\n\n  if (training) {\n    if (zoneout_prob && zoneout_mask) {\n      PointwiseOperations<T, true, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx_norm,\n          tmp_Rh_norm,\n          bx,\n          br,\n          h,\n          h_out,\n          v,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      PointwiseOperations<T, true, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx_norm,\n          tmp_Rh_norm,\n          bx,\n          br,\n          h,\n          h_out,\n          v,\n          0.0f,\n          nullptr);\n    }\n  } else {\n    if (zoneout_prob && zoneout_mask) {\n      PointwiseOperations<T, false, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx_norm,\n          tmp_Rh_norm,\n          bx,\n          br,\n          h,\n          h_out,\n          nullptr,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      PointwiseOperations<T, false, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          tmp_Wx_norm,\n          tmp_Rh_norm,\n          bx,\n          br,\n          h,\n          h_out,\n          nullptr,\n          0.0f,\n          nullptr);\n    }\n  }\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Run(\n    const int steps,\n    const T* W,  // [C,H*3]\n    const T* R,  // [H,H*3]\n    const T* bx, // [H*3]\n    const T* br, // [H*3]\n    const T* x,  // [N,C]\n    T* h,        // [N,H]\n    T* v,        // [N,H*4]\n    T* act_Wx,   // [N,H*3]\n    layer_norm::ForwardPass<T>& layer_norm1,\n    T* tmp_Wx_norm,\n    T* act_Rh,   // 
[N,H*3]\n    layer_norm::ForwardPass<T>& layer_norm2,\n    T* tmp_Rh_norm,\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 3, steps * batch_size, input_size,\n      &alpha,\n      W, hidden_size * 3,\n      x, input_size,\n      &beta,\n      act_Wx, hidden_size * 3);\n  layer_norm1.Run(stream2, act_Wx, tmp_Wx_norm);\n  cudaEventRecord(event, stream2);\n\n  const int NH = batch_size * hidden_size;\n  for (int i = 0; i < steps; ++i) {\n    IterateInternal(\n        R,\n        bx,\n        br,\n        h + i * NH,\n        h + (i + 1) * NH,\n        v + i * NH * 4,\n        tmp_Wx_norm + i * NH * 3,\n        act_Rh + i * NH * 3,\n        layer_norm2,\n        tmp_Rh_norm,\n        zoneout_prob,\n        zoneout_mask ? zoneout_mask + i * NH : nullptr);\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct ForwardPass<float>;\ntemplate struct ForwardPass<double>;\n\n}  // namespace layer_norm_gru\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/layer_norm_indrnn_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid LayerNormIndrnnBwdOps(\n    const int steps,\n    const int batch_size,\n    const int hidden_size,\n    const T* u,\n    const T* h_prev,\n    const T* h,\n    const T* dh_new,\n    T* du_out,\n    T* db_out,\n    T* dh_inout,\n    T* dk_out,\n    const T* zoneout_mask) {\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  const int NH = batch_size * hidden_size;\n  const int idx = col * hidden_size + row;\n\n  const T u_row = u[row];\n  T dh_inout_idx = dh_inout[idx];\n  T du_sum = static_cast<T>(0.0);\n  T db_sum = static_cast<T>(0.0);\n\n  for (int i = (steps - 1) * NH; i >= 0; i -= NH) {\n    T dh_total = dh_new[idx + i] + dh_inout_idx;\n    T dh = static_cast<T>(0.0);\n    if (ApplyZoneout) {\n      const T mask = zoneout_mask[idx + i];\n      dh = (static_cast<T>(1.0) - mask) * dh_total;\n      dh_total = mask * dh_total;\n    }\n\n    const T dk = d_tanh(h[idx + i]) * dh_total;\n\n    dk_out[idx + i] = dk;\n    
dh_inout_idx = dh + u_row * dk;\n    du_sum += h_prev[idx + i] * dk;\n    db_sum += dk;\n  }\n\n  dh_inout[idx] = dh_inout_idx;\n  atomicAdd(&du_out[row], du_sum);\n  atomicAdd(&db_out[row], db_sum);\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_indrnn {\n\ntemplate<typename T>\nstruct BackwardPass<T>::private_data {\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream);\n}\n\ntemplate<typename T>\nBackwardPass<T>::~BackwardPass() {\n  if (data_->sync_stream) {\n    cudaEvent_t event;\n    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);\n    cudaEventRecord(event, data_->stream);\n    cudaStreamWaitEvent(data_->sync_stream, event, 0);\n    cudaEventDestroy(event);\n  } else {\n    cudaStreamSynchronize(data_->stream);\n  }\n  cudaStreamDestroy(data_->stream);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(\n    const int steps,\n    const T* W_t,\n    const T* u,\n    const T* b,\n    const T* x_t,\n    const T* h,\n    const T* dh_new,\n    T* dx,\n    T* dW,\n    T* du,\n    T* db,\n    T* dh,\n    T* workspace,\n    layer_norm::BackwardPass<T>& layer_norm1,\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = 
data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream = data_->stream;\n\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n  const int NH = batch_size * hidden_size;\n  if (zoneout_mask) {\n    LayerNormIndrnnBwdOps<T, true><<<gridDim, blockDim, 0, stream>>>(\n        steps,\n        batch_size,\n        hidden_size,\n        u,\n        h,\n        h + NH,\n        dh_new + NH,\n        du,\n        db,\n        dh,\n        workspace,\n        zoneout_mask);\n  } else {\n    LayerNormIndrnnBwdOps<T, false><<<gridDim, blockDim, 0, stream>>>(\n        steps,\n        batch_size,\n        hidden_size,\n        u,\n        h,\n        h + NH,\n        dh_new + NH,\n        du,\n        db,\n        dh,\n        workspace,\n        nullptr);\n  }\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream);\n  layer_norm1.Run(stream, workspace, workspace);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, input_size, batch_size * steps,\n      &alpha,\n      workspace, hidden_size,\n      x_t, batch_size * steps,\n      &beta,\n      dW, hidden_size);\n\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, steps * batch_size, hidden_size,\n      &alpha,\n      W_t, input_size,\n      workspace, hidden_size,\n      &beta,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate class BackwardPass<float>;\ntemplate class BackwardPass<double>;\n\n}  // namespace layer_norm_indrnn\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/layer_norm_indrnn_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid LayerNormIndrnnFwdOps(\n    const int steps,\n    const int batch_size,\n    const int hidden_size,\n    const T* Wx,\n    const T* u,\n    const T* b,\n    const T* h,\n    T* h_out,\n    const float zoneout_prob,\n    const T* zoneout_mask) {\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  const int idx = col * hidden_size + row;\n  const int NH = batch_size * hidden_size;\n  const T u_row = u[row];\n  const T b_row = b[row];\n\n  for (int i = 0; i < steps * NH; i += NH) {\n    const T a = Wx[idx + i] + u_row * h[idx + i] + b_row;\n    T cur_h_value = tanh(a);\n\n    if (ApplyZoneout) {\n      if (Training) {\n        cur_h_value = (cur_h_value - h[idx + i]) * zoneout_mask[idx + i] + h[idx + i];\n      } else {\n        cur_h_value = (zoneout_prob * h[idx + i]) + ((1.0f - zoneout_prob) * cur_h_value);\n      }\n    }\n\n    h_out[idx + i] = cur_h_value;\n  }\n}\n\n}  // anonymous namespace\n\nnamespace 
haste {\nnamespace v0 {\nnamespace layer_norm_indrnn {\n\ntemplate<typename T>\nstruct ForwardPass<T>::private_data {\n  bool training;\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const bool training,\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->training = training;\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream);\n}\n\ntemplate<typename T>\nForwardPass<T>::~ForwardPass() {\n  if (data_->sync_stream) {\n    cudaEvent_t event;\n    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);\n    cudaEventRecord(event, data_->stream);\n    cudaStreamWaitEvent(data_->sync_stream, event, 0);\n    cudaEventDestroy(event);\n  } else {\n    cudaStreamSynchronize(data_->stream);\n  }\n  cudaStreamDestroy(data_->stream);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Run(\n    const int steps,\n    const T* W,\n    const T* u,\n    const T* b,\n    const T* x,\n    T* h,\n    T* workspace,\n    T* act_Wx,\n    layer_norm::ForwardPass<T>& layer_norm1,\n    const float zoneout_prob,\n    const T* zoneout_mask) {\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const bool training = data_->training;\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream = data_->stream;\n\n  cudaStream_t save_stream;\n  
cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, steps * batch_size, input_size,\n      &alpha,\n      W, hidden_size,\n      x, input_size,\n      &beta,\n      act_Wx, hidden_size);\n  layer_norm1.Run(stream, act_Wx, workspace);\n\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n  const int NH = batch_size * hidden_size;\n  if (training) {\n    if (zoneout_prob && zoneout_mask) {\n      LayerNormIndrnnFwdOps<T, true, true><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      LayerNormIndrnnFwdOps<T, true, false><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          0.0f,\n          nullptr);\n    }\n  } else {\n    if (zoneout_prob && zoneout_mask) {\n      LayerNormIndrnnFwdOps<T, false, true><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      LayerNormIndrnnFwdOps<T, false, false><<<gridDim, blockDim, 0, stream>>>(\n          steps,\n          batch_size,\n          hidden_size,\n          workspace,\n          u,\n          b,\n          h,\n          h + NH,\n          0.0f,\n          nullptr);\n    }\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate class ForwardPass<float>;\ntemplate class ForwardPass<double>;\n\n}  // namespace layer_norm_indrnn\n}  // namespace v0\n}  // 
namespace haste\n"
  },
  {
    "path": "lib/layer_norm_lstm_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <algorithm>\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n#include <vector>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid ComputeOutputGrad(\n    const int batch_size,\n    const int hidden_size,\n    const T* act_c_norm,\n    const T* dh_new,\n    T* dh_inout,\n    T* dlayer_norm,\n    T* v,\n    const T* zoneout_mask) {\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  const int base_idx = col * hidden_size + row;\n  const int stride4_base_idx = col * (hidden_size * 4) + row;\n  const int o_idx = stride4_base_idx + 3 * hidden_size;\n\n  T dh_total = dh_new[base_idx] + dh_inout[base_idx];\n  if (ApplyZoneout) {\n    const T mask = zoneout_mask[base_idx];\n    dh_inout[base_idx] = (static_cast<T>(1.0) - mask) * dh_total;\n    dh_total = mask * dh_total;\n  } else {\n    dh_inout[base_idx] = static_cast<T>(0.0);\n  }\n\n  const T c_tanh = tanh(act_c_norm[base_idx]);\n  const T o = v[o_idx];\n\n  const T do_ = c_tanh * dh_total;\n  const T dc_tanh = o * dh_total;\n\n  dlayer_norm[base_idx] = d_tanh(c_tanh) 
* dc_tanh;\n  v[o_idx] = d_sigmoid(o) * do_;\n}\n\ntemplate<typename T>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* c,\n                         const T* v,\n                         const T* dc_new,\n                         const T* dlayer_norm,\n                         T* db_out,\n                         T* dc_inout,\n                         T* dv_out) {\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_dim || col >= batch_dim)\n    return;\n\n  const int base_idx = col * hidden_dim + row;\n  const int stride4_base_idx = col * (hidden_dim * 4) + row;\n  const int i_idx = stride4_base_idx + 0 * hidden_dim;\n  const int g_idx = stride4_base_idx + 1 * hidden_dim;\n  const int f_idx = stride4_base_idx + 2 * hidden_dim;\n  const int o_idx = stride4_base_idx + 3 * hidden_dim;\n\n  const T i = v[i_idx];\n  const T g = v[g_idx];\n  const T f = v[f_idx];\n  const T o = v[o_idx];\n\n  const T dc_total = dc_new[base_idx] + dc_inout[base_idx] + dlayer_norm[base_idx];\n  const T df = c[base_idx] * dc_total;\n  const T dc = f * dc_total;\n  const T di = g * dc_total;\n  const T dg = i * dc_total;\n  const T dv_g = d_tanh(g) * dg;\n  const T dv_o = o;\n  const T dv_i = d_sigmoid(i) * di;\n  const T dv_f = d_sigmoid(f) * df;\n\n  // TODO: performance optimization opportunity on this reduce operation.\n  atomicAdd(&db_out[row + 0 * hidden_dim], dv_i);\n  atomicAdd(&db_out[row + 1 * hidden_dim], dv_g);\n  atomicAdd(&db_out[row + 2 * hidden_dim], dv_f);\n  atomicAdd(&db_out[row + 3 * hidden_dim], dv_o);\n\n  dc_inout[base_idx] = dc;\n\n  dv_out[i_idx] = dv_i;\n  dv_out[g_idx] = dv_g;\n  dv_out[f_idx] = dv_f;\n  dv_out[o_idx] = dv_o;\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_lstm {\n\ntemplate<typename T>\nstruct BackwardPass<T>::private_data {\n  
int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[3];\n  cudaEvent_t event;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaStreamCreate(&data_->stream[2]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nBackwardPass<T>::~BackwardPass() {\n  if (data_->sync_stream) {\n    cudaEventRecord(data_->event, data_->stream[2]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[1]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[0]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n  } else {\n    cudaStreamSynchronize(data_->stream[2]);\n    cudaStreamSynchronize(data_->stream[1]);\n    cudaStreamSynchronize(data_->stream[0]);\n  }\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[2]);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::IterateInternal(\n    const T* R_t,     // [H*4,H]\n    const T* c,       // [N,H]\n    const T* c_new,   // [N,H]\n    const T* dh_new,  // [N,H]\n    const T* dc_new,  // [N,H]\n    T* db,            // [H*4]\n    T* dh,            // [N,H]\n    T* dc,            // [N,H]\n    T* v,             // [N,H*4]\n    T* act_Rh,\n    layer_norm::BackwardPass<T>& layer_norm2,\n    layer_norm::BackwardPass<T>& 
layer_norm3,\n    T* act_c_norm,\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);  // Accumulate into output matrix!\n\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  if (zoneout_mask) {\n    ComputeOutputGrad<T, true><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        act_c_norm,\n        dh_new,\n        dh,\n        act_c_norm,\n        v,\n        zoneout_mask);\n  } else {\n    ComputeOutputGrad<T, false><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        act_c_norm,\n        dh_new,\n        dh,\n        act_c_norm,\n        v,\n        nullptr);\n  }\n  layer_norm3.RunPartial(stream1, batch_size, act_c_norm, act_c_norm);\n  PointwiseOperations<T><<<gridDim, blockDim, 0, stream1>>>(\n      batch_size,\n      hidden_size,\n      c,\n      v,\n      dc_new,\n      act_c_norm,\n      db,\n      dc,\n      v);\n\n  // Signal completion of pointwise operations for data-dependent streams.\n  cudaEventRecord(event, stream1);\n\n  cublasSetStream(blas_handle, stream1);\n  layer_norm2.RunPartial(stream1, batch_size, v, act_Rh);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, batch_size, hidden_size * 4,\n      &alpha,\n      R_t, hidden_size,\n      act_Rh, hidden_size * 4,\n      &beta_sum,\n      dh, hidden_size);\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(\n    const int steps,\n    const T* W_t,     // [H*4,C]\n    const T* R_t,     // [H*4,H]\n    const T* b,    
   // [H*4]\n    const T* x_t,     // [C,T,N]\n    const T* h,       // [T+1,N,H]\n    const T* c,       // [T+1,N,H]\n    const T* dh_new,  // [T+1,N,H]\n    const T* dc_new,  // [T+1,N,H]\n    T* dx,            // [T,N,C]\n    T* dW,            // [C,H*4]\n    T* dR,            // [H,H*4]\n    T* db,            // [H*4]\n    T* dh,            // [N,H]\n    T* dc,            // [N,H]\n    T* act_Wx,        // [T,N,H*4]\n    layer_norm::BackwardPass<T>& layer_norm1,\n    T* act_Wx_norm,   // [T,N,H*4]\n    T* act_Rh,\n    layer_norm::BackwardPass<T>& layer_norm2,\n    layer_norm::BackwardPass<T>& layer_norm3,\n    T* act_c_norm,\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);  // Accumulate into output matrix!\n  const T beta_assign = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaStream_t stream3 = data_->stream[2];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  const int NH = batch_size * hidden_size;\n  for (int i = steps - 1; i >= 0; --i) {\n    IterateInternal(\n        R_t,\n        c + i * NH,\n        c + (i + 1) * NH,\n        dh_new + (i + 1) * NH,\n        dc_new + (i + 1) * NH,\n        db,\n        dh,\n        dc,\n        act_Wx_norm + i * NH * 4,\n        act_Rh + i * NH * 4,\n        layer_norm2,\n        layer_norm3,\n        act_c_norm + i * NH,\n        zoneout_mask ? 
zoneout_mask + i * NH : nullptr);\n  }\n  cudaEventRecord(event, stream1);\n\n  cudaStreamWaitEvent(stream2, event, 0);\n  layer_norm1.Run(stream2, act_Wx_norm, act_Wx);\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, input_size, batch_size * steps,\n      &alpha,\n      act_Wx, hidden_size * 4,\n      x_t, batch_size * steps,\n      &beta_sum,\n      dW, hidden_size * 4);\n\n  cudaStreamWaitEvent(stream3, event, 0);\n  cublasSetStream(blas_handle, stream3);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_T,\n      hidden_size * 4, hidden_size, batch_size * steps,\n      &alpha,\n      act_Rh, hidden_size * 4,\n      h, hidden_size,\n      &beta_sum,\n      dR, hidden_size * 4);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, steps * batch_size, hidden_size * 4,\n      &alpha,\n      W_t, input_size,\n      act_Wx, hidden_size * 4,\n      &beta_assign,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct BackwardPass<float>;\ntemplate struct BackwardPass<double>;\n\n}  // namespace layer_norm_lstm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/layer_norm_lstm_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\n// `c` and `c_out` may be aliased.\ntemplate<typename T, bool Training>\n__global__\nvoid ComputeCellState(\n    const int batch_size,\n    const int hidden_size,\n    const T* Wx,  // Precomputed (Wx) vector\n    const T* Rh,  // Precomputed (Rh) vector\n    const T* b,   // Bias for gates\n    const T* c,   // Input cell state\n    T* c_out,     // Output cell state\n    T* v_out) {   // Output vector v (Wx + Rh + b)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  // Base index into the Wx and Rh matrices.\n  const int weight_idx = col * (hidden_size * 4) + row;\n\n  // Base index into the output matrix. 
This is different from `weight_idx` because\n  // the number of rows differs between the two sets of matrices.\n  const int output_idx = col * hidden_size + row;\n\n  const int i_idx = weight_idx + 0 * hidden_size;\n  const int g_idx = weight_idx + 1 * hidden_size;\n  const int f_idx = weight_idx + 2 * hidden_size;\n  const int o_idx = weight_idx + 3 * hidden_size;\n\n  const T i = sigmoid(Wx[i_idx] + Rh[i_idx] + b[row + 0 * hidden_size]);\n  const T g = tanh   (Wx[g_idx] + Rh[g_idx] + b[row + 1 * hidden_size]);\n  const T f = sigmoid(Wx[f_idx] + Rh[f_idx] + b[row + 2 * hidden_size]);\n  const T o = sigmoid(Wx[o_idx] + Rh[o_idx] + b[row + 3 * hidden_size]);\n\n  // Compile-time constant branch should be eliminated by the compiler so we have\n  // straight-through code.\n  if (Training) {\n    v_out[i_idx] = i;\n    v_out[g_idx] = g;\n    v_out[f_idx] = f;\n    v_out[o_idx] = o;\n  } else {\n    v_out[o_idx] = o;\n  }\n\n  c_out[output_idx] = (f * c[output_idx]) + (i * g);\n}\n\n// `h` and `h_out` may be aliased.\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid ComputeCellOutput(\n    const int batch_size,\n    const int hidden_size,\n    const T* h,   // Input recurrent state\n    const T* c,   // Input cell state\n    const T* v,\n    T* h_out,     // Output recurrent state\n    const float zoneout_prob,\n    const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_size || col >= batch_size)\n    return;\n\n  const int weight_idx = col * (hidden_size * 4) + row;\n  const int output_idx = col * hidden_size + row;\n\n  const T o = v[weight_idx + 3 * hidden_size];\n  const T cur_c_value = c[output_idx];\n\n  T cur_h_value = o * tanh(cur_c_value);\n\n  if (ApplyZoneout) {\n    if (Training) {\n      cur_h_value = (cur_h_value - h[output_idx]) * zoneout_mask[output_idx] + 
h[output_idx];\n    } else {\n      cur_h_value = (zoneout_prob * h[output_idx]) + ((1.0f - zoneout_prob) * cur_h_value);\n    }\n  }\n\n  h_out[output_idx] = cur_h_value;\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace layer_norm_lstm {\n\ntemplate<typename T>\nstruct ForwardPass<T>::private_data {\n  bool training;\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[2];\n  cudaEvent_t event;\n  cudaStream_t sync_stream;\n};\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const bool training,\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->training = training;\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  data_->sync_stream = stream;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nForwardPass<T>::~ForwardPass() {\n  if (data_->sync_stream) {\n    cudaEventRecord(data_->event, data_->stream[1]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n    cudaEventRecord(data_->event, data_->stream[0]);\n    cudaStreamWaitEvent(data_->sync_stream, data_->event, 0);\n  } else {\n    cudaStreamSynchronize(data_->stream[1]);\n    cudaStreamSynchronize(data_->stream[0]);\n  }\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::IterateInternal(\n    const T* R,  // Weight matrix for recurrent state (Rh) [H,H*4]\n    const T* b,  // Bias for gates (Wx + Rh + b) [H*4]\n    const T* h,  // Recurrent state [N,H]\n    const T* c,  // Cell state [N,H]\n    T* h_out,    // 
Output recurrent state [N,H]\n    T* c_out,    // Output cell state [N,H]\n    T* v,        // Output vector (Wx + Rh + b) [N,H*4]\n    T* tmp_Rh,   // Temporary storage for Rh vector [N,H*4]\n    T* act_Rh,\n    layer_norm::ForwardPass<T>& layer_norm2,\n    layer_norm::ForwardPass<T>& layer_norm3,\n    T* act_c_norm,\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const bool training = data_->training;\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, batch_size, hidden_size,\n      &alpha,\n      R, hidden_size * 4,\n      h, hidden_size,\n      &beta,\n      act_Rh, hidden_size * 4);\n  layer_norm2.RunPartial(stream1, batch_size, act_Rh, tmp_Rh);\n  cudaStreamWaitEvent(stream1, event, 0);\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  if (training) {\n    ComputeCellState<T, true><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        v,\n        tmp_Rh,\n        b,\n        c,\n        c_out,\n        v);\n    layer_norm3.RunPartial(stream1, batch_size, c_out, act_c_norm);\n    if (zoneout_prob && zoneout_mask) {\n      ComputeCellOutput<T, true, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          h,\n          act_c_norm,\n          v,\n          h_out,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      ComputeCellOutput<T, 
true, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          h,\n          act_c_norm,\n          v,\n          h_out,\n          0.0f,\n          nullptr);\n    }\n  } else {\n    ComputeCellState<T, false><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        v,\n        tmp_Rh,\n        b,\n        c,\n        c_out,\n        v);\n    layer_norm3.RunPartial(stream1, batch_size, c_out, act_c_norm);\n    if (zoneout_prob && zoneout_mask) {\n      ComputeCellOutput<T, false, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          h,\n          act_c_norm,\n          v,\n          h_out,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      ComputeCellOutput<T, false, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          h,\n          act_c_norm,\n          v,\n          h_out,\n          0.0f,\n          nullptr);\n    }\n  }\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Run(\n    const int steps,\n    const T* W,  // Weight matrix for input (Wx) [C,H*4]\n    const T* R,  // Weight matrix for recurrent state (Rh) [H,H*4]\n    const T* b,  // Bias for gates (Wx + Rh + b) [H*4]\n    const T* x,  // Input vector [T,N,C]\n    T* h,        // Recurrent state [T+1,N,H]\n    T* c,        // Cell state [T+1,N,H]\n    T* act_Wx,   // Output vector (Wx + Rh + b) [T,N,H*4]\n    T* tmp_Rh,   // Temporary storage for Rh vector [N,H*4]\n    layer_norm::ForwardPass<T>& layer_norm1,\n    T* act_Wx_norm,\n    T* act_Rh,\n    layer_norm::ForwardPass<T>& layer_norm2,\n    layer_norm::ForwardPass<T>& layer_norm3,\n    T* act_c_norm,\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [T,N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int 
batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, steps * batch_size, input_size,\n      &alpha,\n      W, hidden_size * 4,\n      x, input_size,\n      &beta,\n      act_Wx, hidden_size * 4);\n  layer_norm1.Run(stream1, act_Wx, act_Wx_norm);\n\n  for (int i = 0; i < steps; ++i) {\n    const int NH = batch_size * hidden_size;\n    IterateInternal(\n        R,\n        b,\n        h + i * NH,\n        c + i * NH,\n        h + (i + 1) * NH,\n        c + (i + 1) * NH,\n        act_Wx_norm + i * NH * 4,\n        tmp_Rh,\n        act_Rh + i * NH * 4,\n        layer_norm2,\n        layer_norm3,\n        act_c_norm + i * NH,\n        zoneout_prob,\n        zoneout_mask ? zoneout_mask + i * NH : nullptr);\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct ForwardPass<float>;\ntemplate struct ForwardPass<double>;\n\n}  // namespace layer_norm_lstm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/lstm_backward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <algorithm>\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n#include <vector>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\ntemplate<typename T, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* c,\n                         const T* v,\n                         const T* c_new,\n                         const T* dh_new,\n                         const T* dc_new,\n                         T* db_out,\n                         T* dh_inout,\n                         T* dc_inout,\n                         T* dv_out,\n                         const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if (row >= hidden_dim || col >= batch_dim)\n    return;\n\n  const int base_idx = col * hidden_dim + row;\n\n        T dc_total = dc_new[base_idx] + dc_inout[base_idx];\n        T dh_total = dh_new[base_idx] + dh_inout[base_idx];\n  const T c_tanh = tanh(c_new[base_idx]);\n\n  const int stride4_base_idx = col * (hidden_dim * 4) + row;\n  const int i_idx = 
stride4_base_idx + 0 * hidden_dim;\n  const int g_idx = stride4_base_idx + 1 * hidden_dim;\n  const int f_idx = stride4_base_idx + 2 * hidden_dim;\n  const int o_idx = stride4_base_idx + 3 * hidden_dim;\n\n  const T i = v[i_idx];\n  const T g = v[g_idx];\n  const T f = v[f_idx];\n  const T o = v[o_idx];\n\n  if (ApplyZoneout) {\n    const T mask = zoneout_mask[base_idx];\n    dh_inout[base_idx] = (static_cast<T>(1.0) - mask) * dh_total;\n    dh_total = mask * dh_total;\n  } else {\n    dh_inout[base_idx] = static_cast<T>(0.0);\n  }\n\n  const T do_ = c_tanh * dh_total;\n  const T dc_tanh = o * dh_total;\n          dc_total += d_tanh(c_tanh) * dc_tanh;\n  const T df = c[base_idx] * dc_total;\n  const T dc = f * dc_total;\n  const T di = g * dc_total;\n  const T dg = i * dc_total;\n  const T dv_g = d_tanh(g) * dg;\n  const T dv_o = d_sigmoid(o) * do_;\n  const T dv_i = d_sigmoid(i) * di;\n  const T dv_f = d_sigmoid(f) * df;\n\n  // TODO: performance optimization opportunity on this reduce operation.\n  atomicAdd(&db_out[row + 0 * hidden_dim], dv_i);\n  atomicAdd(&db_out[row + 1 * hidden_dim], dv_g);\n  atomicAdd(&db_out[row + 2 * hidden_dim], dv_f);\n  atomicAdd(&db_out[row + 3 * hidden_dim], dv_o);\n\n  dc_inout[base_idx] = dc;\n\n  dv_out[i_idx] = dv_i;\n  dv_out[g_idx] = dv_g;\n  dv_out[f_idx] = dv_f;\n  dv_out[o_idx] = dv_o;\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace lstm {\n\ntemplate<typename T>\nstruct BackwardPass<T>::private_data {\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[3];\n  cudaEvent_t event;\n};\n\ntemplate<typename T>\nBackwardPass<T>::BackwardPass(\n    const int batch_size,\n    const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  
data_->blas_handle = blas_handle;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaStreamCreate(&data_->stream[2]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nBackwardPass<T>::~BackwardPass() {\n  cudaStreamSynchronize(data_->stream[2]);\n  cudaStreamSynchronize(data_->stream[1]);\n  cudaStreamSynchronize(data_->stream[0]);\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[2]);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Iterate(\n    const cudaStream_t& stream,\n    const T* W_t,     // [H*4,C]\n    const T* R_t,     // [H*4,H]\n    const T* b,       // [H*4]\n    const T* x_t,     // [C,N]\n    const T* h,       // [N,H]\n    const T* c,       // [N,H]\n    const T* c_new,   // [N,H]\n    const T* dh_new,  // [N,H]\n    const T* dc_new,  // [N,H]\n    T* dx,            // [N,C]\n    T* dW,            // [C,H*4]\n    T* dR,            // [H,H*4]\n    T* db,            // [H*4]\n    T* dh,            // [N,H]\n    T* dc,            // [N,H]\n    T* v,             // [N,H*4]\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);  // Accumulate into output matrix!\n  const T beta_assign = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaStream_t stream3 = data_->stream[2];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  // Make sure inputs are ready before using them.\n  if 
(stream) {\n    cudaEventRecord(event, stream);\n    cudaStreamWaitEvent(stream1, event, 0);\n  }\n\n  IterateInternal(\n      R_t,\n      c,\n      c_new,\n      dh_new,\n      dc_new,\n      db,\n      dh,\n      dc,\n      v,\n      zoneout_mask);\n\n  // Wait for pointwise operations to complete since there's a\n  // data dependency between its output (`v`) and the following matmuls.\n  cudaStreamWaitEvent(stream2, event, 0);\n  cudaStreamWaitEvent(stream3, event, 0);\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, batch_size, hidden_size * 4,\n      &alpha,\n      W_t, input_size,\n      v, hidden_size * 4,\n      &beta_assign,\n      dx, input_size);\n\n  // We can get away with only waiting for the `dx` and `dh` outputs and\n  // let the `dR` and `dW` matrices complete whenever they complete. It's\n  // a little unsafe, but we make the assumption that callers won't have\n  // upstream data-dependencies on those matrices.\n  if (stream) {\n    cudaEventRecord(event, stream2);\n    cudaStreamWaitEvent(stream, event, 0);\n  }\n\n  cublasSetStream(blas_handle, stream3);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_T,\n      hidden_size * 4, hidden_size, batch_size,\n      &alpha,\n      v, hidden_size * 4,\n      h, hidden_size,\n      &beta_sum,\n      dR, hidden_size * 4);\n\n  cublasSetStream(blas_handle, stream3);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, input_size, batch_size,\n      &alpha,\n      v, hidden_size * 4,\n      x_t, batch_size,\n      &beta_sum,\n      dW, hidden_size * 4);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::IterateInternal(\n    const T* R_t,     // [H*4,H]\n    const T* c,       // [N,H]\n    const T* c_new,   // [N,H]\n    const T* dh_new,  // [N,H]\n    const T* dc_new,  // [N,H]\n    T* db,            // [H*4]\n    T* dh,            // 
[N,H]\n    T* dc,            // [N,H]\n    T* v,             // [N,H*4]\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);  // Accumulate into output matrix!\n\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  if (zoneout_mask) {\n    PointwiseOperations<T, true><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        c,\n        v,\n        c_new,\n        dh_new,\n        dc_new,\n        db,\n        dh,\n        dc,\n        v,\n        zoneout_mask\n    );\n  } else {\n    PointwiseOperations<T, false><<<gridDim, blockDim, 0, stream1>>>(\n        batch_size,\n        hidden_size,\n        c,\n        v,\n        c_new,\n        dh_new,\n        dc_new,\n        db,\n        dh,\n        dc,\n        v,\n        nullptr\n    );\n  }\n\n  // Signal completion of pointwise operations for data-dependent streams.\n  cudaEventRecord(event, stream1);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size, batch_size, hidden_size * 4,\n      &alpha,\n      R_t, hidden_size,\n      v, hidden_size * 4,\n      &beta_sum,\n      dh, hidden_size);\n}\n\ntemplate<typename T>\nvoid BackwardPass<T>::Run(\n    const int steps,\n    const T* W_t,     // [H*4,C]\n    const T* R_t,     // [H*4,H]\n    const T* b,       // [H*4]\n    const T* x_t,     // [C,T,N]\n    const T* h,       // [T+1,N,H]\n    const T* c,       // [T+1,N,H]\n    const T* dh_new,  // [T+1,N,H]\n    const T* dc_new,  // 
[T+1,N,H]\n    T* dx,            // [T,N,C]\n    T* dW,            // [C,H*4]\n    T* dR,            // [H,H*4]\n    T* db,            // [H*4]\n    T* dh,            // [N,H]\n    T* dc,            // [N,H]\n    T* v,            // [T,N,H*4]\n    const T* zoneout_mask) {\n  const T alpha = static_cast<T>(1.0);\n  const T beta_sum = static_cast<T>(1.0);  // Accumulate into output matrix!\n  const T beta_assign = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaStream_t stream3 = data_->stream[2];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  const int NH = batch_size * hidden_size;\n  for (int i = steps - 1; i >= 0; --i) {\n    IterateInternal(\n        R_t,\n        c + i * NH,\n        c + (i + 1) * NH,\n        dh_new + (i + 1) * NH,\n        dc_new + (i + 1) * NH,\n        db,\n        dh,\n        dc,\n        v + i * NH * 4,\n        zoneout_mask ? 
zoneout_mask + i * NH : nullptr);\n  }\n  cudaEventRecord(event, stream1);\n\n  cudaStreamWaitEvent(stream2, event, 0);\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, input_size, batch_size * steps,\n      &alpha,\n      v, hidden_size * 4,\n      x_t, batch_size * steps,\n      &beta_sum,\n      dW, hidden_size * 4);\n\n  cudaStreamWaitEvent(stream3, event, 0);\n  cublasSetStream(blas_handle, stream3);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_T,\n      hidden_size * 4, hidden_size, batch_size * steps,\n      &alpha,\n      v, hidden_size * 4,\n      h, hidden_size,\n      &beta_sum,\n      dR, hidden_size * 4);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      input_size, steps * batch_size, hidden_size * 4,\n      &alpha,\n      W_t, input_size,\n      v, hidden_size * 4,\n      &beta_assign,\n      dx, input_size);\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct BackwardPass<float>;\ntemplate struct BackwardPass<double>;\n\n}  // namespace lstm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "lib/lstm_forward_gpu.cu.cc",
    "content": "// Copyright 2020 LMNT, Inc. All Rights Reserved.\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//    http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n// ==============================================================================\n\n#include <cublas_v2.h>\n#include <cuda_runtime_api.h>\n\n#include \"blas.h\"\n#include \"haste.h\"\n#include \"inline_ops.h\"\n\nnamespace {\n\n// `h` and `h_out` may be aliased.\n// `c` and `c_out` may be aliased.\ntemplate<typename T, bool Training, bool ApplyZoneout>\n__global__\nvoid PointwiseOperations(const int batch_dim,\n                         const int hidden_dim,\n                         const T* Wx,  // Precomputed (Wx) vector\n                         const T* Rh,  // Precomputed (Rh) vector\n                         const T* b,   // Bias for gates\n                         const T* h,   // Input recurrent state\n                         const T* c,   // Input cell state\n                         T* h_out,     // Output recurrent state\n                         T* c_out,     // Output cell state\n                         T* v_out,     // Output vector v (Wx + Rh + b) (only used if Training==true)\n                         const float zoneout_prob,\n                         const T* zoneout_mask) {  // Zoneout mask (only used if ApplyZoneout==true)\n  // We're in column-major order here, so increase x => increase row.\n  const int row = blockDim.x * blockIdx.x + threadIdx.x;\n  const int col = blockDim.y * blockIdx.y + threadIdx.y;\n\n  if 
(row >= hidden_dim || col >= batch_dim)\n    return;\n\n  // Base index into the Wx and Rh matrices.\n  const int weight_idx = col * (hidden_dim * 4) + row;\n\n  // Base index into the output matrix. This is different from `weight_idx` because\n  // the number of rows are different between the two sets of matrices.\n  const int output_idx = col * hidden_dim + row;\n\n  const int i_idx = weight_idx + 0 * hidden_dim;\n  const int g_idx = weight_idx + 1 * hidden_dim;\n  const int f_idx = weight_idx + 2 * hidden_dim;\n  const int o_idx = weight_idx + 3 * hidden_dim;\n\n  const T i = sigmoid(Wx[i_idx] + Rh[i_idx] + b[row + 0 * hidden_dim]);\n  const T g = tanh   (Wx[g_idx] + Rh[g_idx] + b[row + 1 * hidden_dim]);\n  const T f = sigmoid(Wx[f_idx] + Rh[f_idx] + b[row + 2 * hidden_dim]);\n  const T o = sigmoid(Wx[o_idx] + Rh[o_idx] + b[row + 3 * hidden_dim]);\n\n  // Compile-time constant branch should be eliminated by compiler so we have\n  // straight-through code.\n  if (Training) {\n    v_out[i_idx] = i;\n    v_out[g_idx] = g;\n    v_out[f_idx] = f;\n    v_out[o_idx] = o;\n  }\n\n  T cur_c_value = (f * c[output_idx]) + (i * g);\n  T cur_h_value = o * tanh(cur_c_value);\n\n  if (ApplyZoneout) {\n    if (Training) {\n      cur_h_value = (cur_h_value - h[output_idx]) * zoneout_mask[output_idx] + h[output_idx];\n    } else {\n      cur_h_value = (zoneout_prob * h[output_idx]) + ((1.0f - zoneout_prob) * cur_h_value);\n    }\n  }\n\n  c_out[output_idx] = cur_c_value;\n  h_out[output_idx] = cur_h_value;\n}\n\n}  // anonymous namespace\n\nnamespace haste {\nnamespace v0 {\nnamespace lstm {\n\ntemplate<typename T>\nstruct ForwardPass<T>::private_data {\n  bool training;\n  int batch_size;\n  int input_size;\n  int hidden_size;\n  cublasHandle_t blas_handle;\n  cudaStream_t stream[2];\n  cudaEvent_t event;\n  cudaEvent_t ready_event;\n  cudaEvent_t finished_event;\n};\n\ntemplate<typename T>\nForwardPass<T>::ForwardPass(\n    const bool training,\n    const int batch_size,\n    
const int input_size,\n    const int hidden_size,\n    const cublasHandle_t& blas_handle,\n    const cudaStream_t& stream) : data_(new private_data) {\n  data_->training = training;\n  data_->batch_size = batch_size;\n  data_->input_size = input_size;\n  data_->hidden_size = hidden_size;\n  data_->blas_handle = blas_handle;\n  cudaStreamCreate(&data_->stream[0]);\n  cudaStreamCreate(&data_->stream[1]);\n  cudaEventCreateWithFlags(&data_->event, cudaEventDisableTiming);\n  cudaEventCreateWithFlags(&data_->ready_event, cudaEventDisableTiming);\n  cudaEventCreateWithFlags(&data_->finished_event, cudaEventDisableTiming);\n}\n\ntemplate<typename T>\nForwardPass<T>::~ForwardPass() {\n  cudaStreamSynchronize(data_->stream[1]);\n  cudaStreamSynchronize(data_->stream[0]);\n  cudaEventDestroy(data_->finished_event);\n  cudaEventDestroy(data_->ready_event);\n  cudaEventDestroy(data_->event);\n  cudaStreamDestroy(data_->stream[1]);\n  cudaStreamDestroy(data_->stream[0]);\n  delete data_;\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::Iterate(\n    const cudaStream_t& stream,\n    const T* W,  // Weight matrix for input (Wx) [C,H*4]\n    const T* R,  // Weight matrix for recurrent state (Rh) [H,H*4]\n    const T* b,  // Bias for gates (Wx + Rh + b) [H*4]\n    const T* x,  // Input vector [N,C]\n    const T* h,  // Recurrent state [N,H]\n    const T* c,  // Cell state [N,H]\n    T* h_out,    // Output recurrent state [N,H]\n    T* c_out,    // Output cell state [N,H]\n    T* v,        // Output vector (Wx + Rh + b) [N,H*4]\n    T* tmp_Rh,   // Temporary storage for Rh vector [N,H*4]\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  // Constants for GEMM\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = 
data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n  const cudaStream_t stream2 = data_->stream[1];\n  const cudaEvent_t event = data_->event;\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  // Make sure inputs are ready before we use them.\n  if (stream) {\n    cudaEventRecord(event, stream);\n    cudaStreamWaitEvent(stream2, event, 0);\n  }\n\n  cublasSetStream(blas_handle, stream2);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, batch_size, input_size,\n      &alpha,\n      W, hidden_size * 4,\n      x, input_size,\n      &beta,\n      v, hidden_size * 4);\n  cudaEventRecord(event, stream2);\n\n  IterateInternal(\n      R,\n      b,\n      h,\n      c,\n      h_out,\n      c_out,\n      v,\n      tmp_Rh,\n      zoneout_prob,\n      zoneout_mask);\n\n  // Make sure outputs have settled.\n  if (stream) {\n    cudaEventRecord(event, stream1);\n    cudaStreamWaitEvent(stream, event, 0);\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate<typename T>\nvoid ForwardPass<T>::IterateInternal(\n    const T* R,  // Weight matrix for recurrent state (Rh) [H,H*4]\n    const T* b,  // Bias for gates (Wx + Rh + b) [H*4]\n    const T* h,  // Recurrent state [N,H]\n    const T* c,  // Cell state [N,H]\n    T* h_out,    // Output recurrent state [N,H]\n    T* c_out,    // Output cell state [N,H]\n    T* v,        // Output vector (Wx + Rh + b) [N,H*4]\n    T* tmp_Rh,   // Temporary storage for Rh vector [N,H*4]\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const bool training = data_->training;\n  const int batch_size = data_->batch_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = 
data_->stream[0];\n  const cudaEvent_t event = data_->event;\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, batch_size, hidden_size,\n      &alpha,\n      R, hidden_size * 4,\n      h, hidden_size,\n      &beta,\n      tmp_Rh, hidden_size * 4);\n\n  cudaStreamWaitEvent(stream1, event, 0);\n\n  // Compute launch configuration for pointwise operations kernel.\n  const dim3 blockDim(64, 16);\n  const dim3 gridDim(\n      (hidden_size + blockDim.x - 1) / blockDim.x,\n      (batch_size + blockDim.y - 1) / blockDim.y);\n\n  if (training) {\n    if (zoneout_prob && zoneout_mask) {\n      PointwiseOperations<T, true, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          v,\n          tmp_Rh,\n          b,\n          h,\n          c,\n          h_out,\n          c_out,\n          v,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      PointwiseOperations<T, true, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          v,\n          tmp_Rh,\n          b,\n          h,\n          c,\n          h_out,\n          c_out,\n          v,\n          0.0f,\n          nullptr);\n    }\n  } else {\n    if (zoneout_prob && zoneout_mask) {\n      PointwiseOperations<T, false, true><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          v,\n          tmp_Rh,\n          b,\n          h,\n          c,\n          h_out,\n          c_out,\n          nullptr,\n          zoneout_prob,\n          zoneout_mask);\n    } else {\n      PointwiseOperations<T, false, false><<<gridDim, blockDim, 0, stream1>>>(\n          batch_size,\n          hidden_size,\n          v,\n          tmp_Rh,\n          b,\n          h,\n          c,\n          h_out,\n          c_out,\n          nullptr,\n          0.0f,\n          nullptr);\n    }\n  }\n}\n\ntemplate<typename T>\nvoid 
ForwardPass<T>::Run(\n    const int steps,\n    const T* W,  // Weight matrix for input (Wx) [C,H*4]\n    const T* R,  // Weight matrix for recurrent state (Rh) [H,H*4]\n    const T* b,  // Bias for gates (Wx + Rh + b) [H*4]\n    const T* x,  // Input vector [T,N,C]\n    T* h,        // Recurrent state [T+1,N,H]\n    T* c,        // Cell state [T+1,N,H]\n    T* v,        // Output vector (Wx + Rh + b) [T,N,H*4]\n    T* tmp_Rh,   // Temporary storage for Rh vector [N,H*4]\n    const float zoneout_prob,\n    const T* zoneout_mask) { // Zoneout mask [T,N,H]\n  static const T alpha = static_cast<T>(1.0);\n  static const T beta = static_cast<T>(0.0);\n\n  const blas<void>::set_pointer_mode scoped1(data_->blas_handle);\n\n  const int batch_size = data_->batch_size;\n  const int input_size = data_->input_size;\n  const int hidden_size = data_->hidden_size;\n  const cublasHandle_t blas_handle = data_->blas_handle;\n  const cudaStream_t stream1 = data_->stream[0];\n\n  cudaStream_t save_stream;\n  cublasGetStream(blas_handle, &save_stream);\n\n  cublasSetStream(blas_handle, stream1);\n  blas<T>::gemm(blas_handle,\n      CUBLAS_OP_N, CUBLAS_OP_N,\n      hidden_size * 4, steps * batch_size, input_size,\n      &alpha,\n      W, hidden_size * 4,\n      x, input_size,\n      &beta,\n      v, hidden_size * 4);\n\n  for (int i = 0; i < steps; ++i) {\n    const int NH = batch_size * hidden_size;\n    IterateInternal(\n        R,\n        b,\n        h + i * NH,\n        c + i * NH,\n        h + (i + 1) * NH,\n        c + (i + 1) * NH,\n        v + i * NH * 4,\n        tmp_Rh,\n        zoneout_prob,\n        zoneout_mask ? zoneout_mask + i * NH : nullptr);\n  }\n\n  cublasSetStream(blas_handle, save_stream);\n}\n\ntemplate struct ForwardPass<float>;\ntemplate struct ForwardPass<double>;\n\n}  // namespace lstm\n}  // namespace v0\n}  // namespace haste\n"
  },
  {
    "path": "validation/pytorch.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport argparse\nfrom unittest import mock\nimport torch\nimport haste_pytorch as haste\n\n\nRNN_MAP = {\n    'gru': haste.GRU,\n    'indrnn': haste.IndRNN,\n    'layer_norm_gru': haste.LayerNormGRU,\n    'layer_norm_indrnn': haste.LayerNormIndRNN,\n    'layer_norm_lstm': haste.LayerNormLSTM,\n    'lstm': haste.LSTM,\n}\n\nHASTE_TO_NATIVE = {\n    haste.GRU: torch.nn.GRU,\n    haste.LSTM: torch.nn.LSTM,\n}\n\nbatch_size = 32\ntime_steps = 250\ninput_size = 128\nhidden_size = 256\n\n\ndef self_consistency(rnn, x):\n  x_cuda = x.clone().cuda()\n  x_cuda_torch = x_cuda.detach().clone()\n  x_cuda.requires_grad_(True)\n  x_cuda_torch.requires_grad_(True)\n\n  rnn.cuda()\n  seed = 5566\n  torch.manual_seed(seed)\n  y1, _ = rnn.forward(x_cuda)\n  y1.backward(torch.ones_like(y1))\n  torch.manual_seed(seed)\n  with mock.patch.object(rnn, \"_is_cuda\", lambda: False):\n    y2, _ = rnn.forward(x_cuda_torch)\n  y2.backward(torch.ones_like(y2))\n\n  g1 = x_cuda_torch.grad.data\n  g2 = x_cuda.grad.data\n\n  print(torch.max(torch.abs(y1.cpu()-y2.cpu())))\n  print(torch.max(torch.abs(g1.cpu()-g2.cpu())))\n\n\ndef native_consistency(haste_rnn, pytorch_rnn, x):\n  pytorch_rnn.cuda()\n  haste_rnn.cuda()\n  haste_rnn.from_native_weights(\n      pytorch_rnn.weight_ih_l0,\n      
pytorch_rnn.weight_hh_l0,\n      pytorch_rnn.bias_ih_l0,\n      pytorch_rnn.bias_hh_l0)\n\n  x1 = x.clone().cuda()\n  x2 = x.clone().cuda()\n  x1.requires_grad_(True)\n  x2.requires_grad_(True)\n\n  y1, _ = haste_rnn.forward(x1)\n  y1.backward(torch.ones_like(y1))\n\n  y2, _ = pytorch_rnn.forward(x2)\n  y2.backward(torch.ones_like(y2))\n\n  g1 = x1.grad.data\n  g2 = x2.grad.data\n\n  print(torch.max(torch.abs(y1-y2)))\n  print(torch.max(torch.abs(g1-g2)))\n\n\ndef _run_rnn(rnn_type, x, **kwargs):\n  rnn = rnn_type(input_size, hidden_size, **kwargs)\n  self_consistency(rnn, x)\n  if rnn_type in HASTE_TO_NATIVE and not kwargs:\n    pytorch_rnn = HASTE_TO_NATIVE[rnn_type](input_size, hidden_size)\n    native_consistency(rnn, pytorch_rnn, x)\n\n\ndef run_rnn(rnn_type, x):\n  for kwargs in [dict(), dict(zoneout=0.5)]:\n    _run_rnn(rnn_type, x, **kwargs)\n\n\ndef main(args):\n  x = torch.rand(time_steps, batch_size, input_size)\n  if args.rnn_type == 'all':\n    for type_name, rnn_type in RNN_MAP.items():\n      print(f'[{type_name}]')\n      run_rnn(rnn_type, x)\n      print('')\n  else:\n    print(f'[{args.rnn_type}]')\n    rnn_type = RNN_MAP[args.rnn_type]\n    run_rnn(rnn_type, x)\n\n\nif __name__ == '__main__':\n  parser = argparse.ArgumentParser()\n  parser.add_argument(\n      'rnn_type',\n      nargs='?',\n      default='all',\n      choices=list(RNN_MAP.keys()) + ['all'])\n  main(parser.parse_args())\n"
  },
  {
    "path": "validation/pytorch_speed.py",
    "content": "import torch\nimport haste_pytorch as haste\n\nfrom time import time\n\n\nseq_len = 2500\nbatch_size = 64\ninput_size = 256\nhidden_size = 4096\n\nrnn = haste.IndRNN(input_size, hidden_size).cuda()\n\nx = torch.rand(seq_len, batch_size, input_size).cuda()\n\n# CUDA kernels launch asynchronously; synchronize around the timed region\n# so wall-clock time reflects actual GPU work.\ntorch.cuda.synchronize()\nstart = time()\nfor _ in range(10):\n  y, _ = rnn(x)\n  y.backward(torch.ones_like(y))\ntorch.cuda.synchronize()\nend = time()\nprint(f'{end-start}')\n"
  },
  {
    "path": "validation/tf.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport argparse\nimport haste_tf as haste\nimport tensorflow as tf\n\n\ndef stfu():\n  import os\n  os.environ['TF_CPP_MIN_LOG_LEVEL'] = '4'\n  tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)\n\n\ndef NativeGRUBuilder(hidden_size):\n  return tf.keras.layers.GRU(\n      hidden_size,\n      implementation=2,\n      activation='tanh',\n      recurrent_activation='sigmoid',\n      return_sequences=True,\n      reset_after=True)\n\n\ndef NativeLSTMBuilder(hidden_size):\n  return tf.keras.layers.LSTM(\n      hidden_size,\n      implementation=2,\n      activation='tanh',\n      unit_forget_bias=False,\n      recurrent_activation='sigmoid',\n      return_sequences=True)\n\n\ndef NativeGRUWeights(native_gru, haste_gru):\n  weights = haste_gru.fw_layer.get_weights()\n  native_gru.variables[0].assign(weights['kernel'])\n  native_gru.variables[1].assign(weights['recurrent_kernel'])\n  native_gru.variables[2].assign(tf.stack([weights['bias'], weights['recurrent_bias']], axis=0))\n\n\ndef NativeLSTMWeights(native_lstm, haste_lstm):\n  def swapple(x):\n    i, g, f, o = tf.split(x, 4, axis=-1)\n    return tf.concat([i, f, g, o], axis=-1)\n  weights = haste_lstm.fw_layer.get_weights()\n  native_lstm.variables[0].assign(swapple(weights['kernel']))\n  
native_lstm.variables[1].assign(swapple(weights['recurrent_kernel']))\n  native_lstm.variables[2].assign(swapple(weights['bias']))\n\n\nRNN_MAP = {\n    'gru': haste.GRU,\n    'indrnn': haste.IndRNN,\n    'layer_norm_gru': haste.LayerNormGRU,\n    'layer_norm_lstm': haste.LayerNormLSTM,\n    'lstm': haste.LSTM,\n}\n\nHASTE_TO_NATIVE = {\n    haste.GRU: NativeGRUBuilder,\n    haste.LSTM: NativeLSTMBuilder,\n}\n\nHASTE_TO_NATIVE_WEIGHTS = {\n    haste.GRU: NativeGRUWeights,\n    haste.LSTM: NativeLSTMWeights,\n}\n\n\nbatch_size = 32\ntime_steps = 250\ninput_size = 128\nhidden_size = 256\n\n\ndef native_consistency(haste_rnn, native_rnn, x):\n  with tf.GradientTape() as tape:\n    tape.watch(x)\n    y1, _ = haste_rnn(x, training=True)\n    g1 = tape.gradient(y1, x)\n\n  native_rnn.build(x.shape)\n  HASTE_TO_NATIVE_WEIGHTS[type(haste_rnn)](native_rnn, haste_rnn)\n\n  with tf.GradientTape() as tape:\n    tape.watch(x)\n    y2 = native_rnn(x, training=True)\n    g2 = tape.gradient(y2, x)\n\n  print(tf.reduce_max(tf.abs(y2-y1)))\n  print(tf.reduce_max(tf.abs(g2-g1)))\n\n\ndef run_rnn(rnn_type, x):\n  rnn = rnn_type(hidden_size)\n  if rnn_type in HASTE_TO_NATIVE:\n    native_rnn = HASTE_TO_NATIVE[rnn_type](hidden_size)\n    native_consistency(rnn, native_rnn, x)\n\n\ndef main(args):\n  tf.compat.v1.enable_eager_execution()\n  stfu()\n\n  x = tf.random.normal([batch_size, time_steps, input_size])\n  if args.rnn_type == 'all':\n    for type_name, rnn_type in RNN_MAP.items():\n      print(f'[{type_name}]')\n      run_rnn(rnn_type, x)\n      print('')\n  else:\n    print(f'[{args.rnn_type}]')\n    rnn_type = RNN_MAP[args.rnn_type]\n    rnn = run_rnn(rnn_type, x)\n\n\nif __name__ == '__main__':\n  parser = argparse.ArgumentParser()\n  parser.add_argument(\n      'rnn_type',\n      nargs='?',\n      default='all',\n      choices=list(RNN_MAP.keys()) + ['all'])\n  main(parser.parse_args())\n"
  },
  {
    "path": "validation/tf_pytorch.py",
    "content": "# Copyright 2020 LMNT, Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\nimport argparse\nimport numpy as np\nimport tensorflow as tf\nimport haste_tf\nimport torch\nimport torch.nn as nn\nimport haste_pytorch\n\n\ndef stfu():\n  import os\n  os.environ['TF_CPP_MIN_LOG_LEVEL'] = '4'\n  tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)\n\n\ndef copy_weights_gru(rnn_tf, rnn_pt):\n  weights = rnn_tf.fw_layer.get_weights()\n  kernel = torch.Tensor(weights['kernel'].numpy())\n  recurrent_kernel = torch.Tensor(weights['recurrent_kernel'].numpy())\n  bias = torch.Tensor(weights['bias'].numpy())\n  recurrent_bias = torch.Tensor(weights['recurrent_bias'].numpy())\n\n  rnn_pt.kernel = nn.Parameter(kernel)\n  rnn_pt.recurrent_kernel = nn.Parameter(recurrent_kernel)\n  rnn_pt.bias = nn.Parameter(bias)\n  rnn_pt.recurrent_bias = nn.Parameter(recurrent_bias)\n\n\ndef copy_weights_indrnn(rnn_tf, rnn_pt):\n  weights = rnn_tf.fw_layer.get_weights()\n  kernel = torch.Tensor(weights['kernel'].numpy())\n  recurrent_scale = torch.Tensor(weights['recurrent_scale'].numpy())\n  bias = torch.Tensor(weights['bias'].numpy())\n\n  rnn_pt.kernel = nn.Parameter(kernel)\n  rnn_pt.recurrent_scale = nn.Parameter(recurrent_scale)\n  rnn_pt.bias = nn.Parameter(bias)\n\n\ndef copy_weights_layer_norm_gru(rnn_tf, rnn_pt):\n  weights = 
rnn_tf.fw_layer.get_weights()\n  kernel = torch.Tensor(weights['kernel'].numpy())\n  recurrent_kernel = torch.Tensor(weights['recurrent_kernel'].numpy())\n  bias = torch.Tensor(weights['bias'].numpy())\n  recurrent_bias = torch.Tensor(weights['recurrent_bias'].numpy())\n  gamma = torch.Tensor(weights['gamma'].numpy())\n\n  rnn_pt.kernel = nn.Parameter(kernel)\n  rnn_pt.recurrent_kernel = nn.Parameter(recurrent_kernel)\n  rnn_pt.bias = nn.Parameter(bias)\n  rnn_pt.recurrent_bias = nn.Parameter(recurrent_bias)\n  rnn_pt.gamma = nn.Parameter(gamma)\n\n\ndef copy_weights_layer_norm_indrnn(rnn_tf, rnn_pt):\n  weights = rnn_tf.fw_layer.get_weights()\n  kernel = torch.Tensor(weights['kernel'].numpy())\n  recurrent_scale = torch.Tensor(weights['recurrent_scale'].numpy())\n  bias = torch.Tensor(weights['bias'].numpy())\n  gamma = torch.Tensor(weights['gamma'].numpy())\n\n  rnn_pt.kernel = nn.Parameter(kernel)\n  rnn_pt.recurrent_scale = nn.Parameter(recurrent_scale)\n  rnn_pt.bias = nn.Parameter(bias)\n  rnn_pt.gamma = nn.Parameter(gamma)\n\n\ndef copy_weights_layer_norm_lstm(rnn_tf, rnn_pt):\n  weights = rnn_tf.fw_layer.get_weights()\n  kernel = torch.Tensor(weights['kernel'].numpy())\n  recurrent_kernel = torch.Tensor(weights['recurrent_kernel'].numpy())\n  bias = torch.Tensor(weights['bias'].numpy())\n  gamma = torch.Tensor(weights['gamma'].numpy())\n  gamma_h = torch.Tensor(weights['gamma_h'].numpy())\n  beta_h = torch.Tensor(weights['beta_h'].numpy())\n\n  rnn_pt.kernel = nn.Parameter(kernel)\n  rnn_pt.recurrent_kernel = nn.Parameter(recurrent_kernel)\n  rnn_pt.bias = nn.Parameter(bias)\n  rnn_pt.gamma = nn.Parameter(gamma)\n  rnn_pt.gamma_h = nn.Parameter(gamma_h)\n  rnn_pt.beta_h = nn.Parameter(beta_h)\n\n\ndef copy_weights_lstm(rnn_tf, rnn_pt):\n  weights = rnn_tf.fw_layer.get_weights()\n  kernel = torch.Tensor(weights['kernel'].numpy())\n  recurrent_kernel = torch.Tensor(weights['recurrent_kernel'].numpy())\n  bias = torch.Tensor(weights['bias'].numpy())\n\n  
rnn_pt.kernel = nn.Parameter(kernel)\n  rnn_pt.recurrent_kernel = nn.Parameter(recurrent_kernel)\n  rnn_pt.bias = nn.Parameter(bias)\n\n\nbatch_size = 32\ntime_steps = 250\ninput_size = 128\nhidden_size = 256\n\nRNN_MAP = {\n    'gru': haste_tf.GRU,\n    'indrnn': haste_tf.IndRNN,\n    'layer_norm_gru': haste_tf.LayerNormGRU,\n    'layer_norm_indrnn': haste_tf.LayerNormIndRNN,\n    'layer_norm_lstm': haste_tf.LayerNormLSTM,\n    'lstm': haste_tf.LSTM,\n}\n\nTF_TO_PT = {\n    haste_tf.GRU: haste_pytorch.GRU,\n    haste_tf.IndRNN: haste_pytorch.IndRNN,\n    haste_tf.LayerNormGRU: haste_pytorch.LayerNormGRU,\n    haste_tf.LayerNormIndRNN: haste_pytorch.LayerNormIndRNN,\n    haste_tf.LayerNormLSTM: haste_pytorch.LayerNormLSTM,\n    haste_tf.LSTM: haste_pytorch.LSTM,\n}\n\nWEIGHT_COPY_MAP = {\n    haste_tf.GRU: copy_weights_gru,\n    haste_tf.IndRNN: copy_weights_indrnn,\n    haste_tf.LayerNormGRU: copy_weights_layer_norm_gru,\n    haste_tf.LayerNormIndRNN: copy_weights_layer_norm_indrnn,\n    haste_tf.LayerNormLSTM: copy_weights_layer_norm_lstm,\n    haste_tf.LSTM: copy_weights_lstm,\n}\n\n\ndef run_rnn(rnn_type, x):\n  rnn_tf = rnn_type(hidden_size)\n  rnn_pt = TF_TO_PT[rnn_type](input_size, hidden_size, batch_first=True)\n\n  rnn_tf.build(x.shape)\n  WEIGHT_COPY_MAP[type(rnn_tf)](rnn_tf, rnn_pt)\n\n  x1 = tf.convert_to_tensor(x)\n  x2 = torch.Tensor(x)\n  x2.requires_grad_(True)\n  with tf.GradientTape() as tape:\n    tape.watch(x1)\n    y1, _ = rnn_tf(x1, training=True)\n    g1 = tape.gradient(y1, x1)\n\n  y2, _ = rnn_pt(x2)\n  y2.backward(torch.ones_like(y2))\n\n  print(np.amax(np.abs(y1.numpy() - y2.detach().numpy())))\n  print(np.amax(np.abs(g1.numpy() - x2.grad.data.numpy())))\n\n\ndef main(args):\n  tf.compat.v1.enable_eager_execution()\n  stfu()\n\n  x = np.random.normal(size=[time_steps, batch_size, input_size]).astype(np.float32)\n  if args.rnn_type == 'all':\n    for type_name, rnn_type in RNN_MAP.items():\n      print(f'[{type_name}]')\n      
run_rnn(rnn_type, x)\n      print('')\n  else:\n    print(f'[{args.rnn_type}]')\n    rnn_type = RNN_MAP[args.rnn_type]\n    run_rnn(rnn_type, x)\n\n\nif __name__ == '__main__':\n  parser = argparse.ArgumentParser()\n  parser.add_argument(\n      'rnn_type',\n      nargs='?',\n      default='all',\n      choices=list(RNN_MAP.keys()) + ['all'])\n  main(parser.parse_args())\n"
  }
]