Repository: a16z-infra/cog-llama-template
Branch: main
Commit: 845e24f626bb
Files: 120
Total size: 28.8 MB
Directory structure:
gitextract_722afjzh/
├── .gitignore
├── .gitmodules
├── CONTRIBUTING.md
├── LICENSE.txt
├── Makefile
├── README.md
├── __init__.py
├── base-schema.json
├── chat-schema.json
├── cog.yaml
├── examples/
│   └── alpaca/
│       ├── README.md
│       ├── process_data.py
│       └── replicate_alpaca_data.json
├── llama_recipes/
│   ├── LICENSE
│   ├── __init__.py
│   ├── configs/
│   │   ├── __init__.py
│   │   ├── datasets.py
│   │   ├── fsdp.py
│   │   ├── peft.py
│   │   └── training.py
│   ├── ft_datasets/
│   │   ├── __init__.py
│   │   ├── alpaca_dataset.py
│   │   ├── completion_dataset.py
│   │   ├── grammar_dataset/
│   │   │   ├── __init__.py
│   │   │   ├── grammar_dataset.py
│   │   │   └── grammar_dataset_process.ipynb
│   │   ├── samsum_dataset.py
│   │   └── utils.py
│   ├── llama_finetuning.py
│   ├── model_checkpointing/
│   │   ├── __init__.py
│   │   └── checkpoint_handler.py
│   ├── multi_node.slurm
│   ├── policies/
│   │   ├── __init__.py
│   │   ├── activation_checkpointing_functions.py
│   │   ├── anyprecision_optimizer.py
│   │   ├── mixed_precision.py
│   │   └── wrapping.py
│   ├── quickstart.ipynb
│   ├── requirements.txt
│   ├── scripts/
│   │   ├── markdown_link_check_config.json
│   │   ├── spellcheck.sh
│   │   └── spellcheck_conf/
│   │       ├── spellcheck.yaml
│   │       └── wordlist.txt
│   └── utils/
│       ├── __init__.py
│       ├── config_utils.py
│       ├── dataset_utils.py
│       ├── fsdp_utils.py
│       ├── memory_utils.py
│       └── train_utils.py
├── mistral-schema.json
├── model_templates/
│   └── config.py
├── models/
│   ├── dockerignore
│   ├── llama-2-13b/
│   │   └── config.py
│   ├── llama-2-13b-chat/
│   │   └── config.py
│   ├── llama-2-13b-chat-hf-mlc/
│   │   └── config.py
│   ├── llama-2-13b-mlc/
│   │   └── config.py
│   ├── llama-2-70b/
│   │   ├── config.py
│   │   └── model_artifacts/
│   │       └── tokenizer/
│   │           ├── special_tokens_map.json
│   │           ├── tokenizer.model
│   │           ├── tokenizer_checklist.chk
│   │           └── tokenizer_config.json
│   ├── llama-2-70b-chat/
│   │   └── config.py
│   ├── llama-2-70b-chat-hf-mlc/
│   │   └── config.py
│   ├── llama-2-70b-mlc/
│   │   └── config.py
│   ├── llama-2-7b/
│   │   └── config.py
│   ├── llama-2-7b-chat/
│   │   └── config.py
│   ├── llama-2-7b-chat-hf-mlc/
│   │   └── config.py
│   ├── llama-2-7b-mlc/
│   │   └── config.py
│   ├── llama-2-7b-transformers/
│   │   ├── config.py
│   │   └── model_artifacts/
│   │       └── tokenizer/
│   │           ├── special_tokens_map.json
│   │           ├── tokenizer.model
│   │           ├── tokenizer_checklist.chk
│   │           └── tokenizer_config.json
│   ├── llama-2-7b-vllm/
│   │   └── config.py
│   ├── mistral-7b-instruct-v0.1-mlc/
│   │   └── config.py
│   └── mistral-7b-v0.1-mlc/
│       └── config.py
├── notes/
│   └── new_model_notes.md
├── predict.py
├── pyproject.toml
├── requirements-dev.txt
├── scripts/
│   ├── benchmark_token_latency.py
│   ├── load_secrets.sh
│   ├── test_fast_llama.py
│   ├── test_load_unload_lora.py
│   ├── train_multi_gpu.sh
│   └── train_single_gpu.sh
├── src/
│   ├── __init__.py
│   ├── config_utils.py
│   ├── download.py
│   ├── inference_engines/
│   │   ├── __init__.py
│   │   ├── engine.py
│   │   ├── exllama.py
│   │   ├── mlc_engine.py
│   │   ├── mlc_vllm_engine.py
│   │   ├── transformers_engine.py
│   │   ├── vllm_engine.py
│   │   ├── vllm_exllama_engine.py
│   │   └── vllm_transformers.py
│   ├── more_utils.py
│   └── utils.py
├── tests/
│   ├── __init__.py
│   ├── assets/
│   │   └── llama_tokenizer/
│   │       ├── special_tokens_map.json
│   │       ├── tokenizer.model
│   │       ├── tokenizer_checklist.chk
│   │       └── tokenizer_config.json
│   ├── conftest.py
│   ├── data/
│   │   └── 200_samples.jsonl
│   ├── run_local_tests.sh
│   ├── test_e2e.py
│   ├── test_predict.py
│   ├── test_predict_with_trained_weights.py
│   ├── test_remote_predict.py
│   ├── test_remote_train.py
│   ├── test_train.py
│   ├── test_train_predict.py
│   ├── test_utils.py
│   ├── timing.py
│   └── unit_tests/
│       ├── test_completion_dataset.py
│       └── test_utils.py
└── train.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
**/__pycache__/**
flan-t5**
checkpoints/**
tmp/**
unconverted-weights
unconverted-weights/
weights
weights/
.DS_STORE
*.safetensors
.cog/
llama_weights/
.env
exllama/
llama-recipes/
orig-llama-recipes/
vllm/
.pytest_cache
.dockerignore
*.egg-info/
================================================
FILE: .gitmodules
================================================
[submodule "exllama"]
path = exllama
url = https://github.com/technillogue/exllama.git
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing
Thanks for taking the time to contribute to this project!
## Releases
This section documents the process used internally at Replicate to deploy the many variant Llama models.
Model variants live in the [models](models) directory, and deployment is managed by a [Makefile](Makefile).
To release a new model:
1. Run `make select model=<model-name>`, where `<model-name>` corresponds to the name of a folder in the [models](models) directory, like `llama-2-7b`. This will copy files around and set the local state of the repo to say "use this model".
1. Run `make test-local` to test locally (assuming you're on a machine with a GPU).
1. Run `make stage test-stage model=<model-name>` to push to staging. If this passes, the model is ready to be promoted to production.
1. Run `REPLICATE_USER=replicate make push test-prod model=<model-name>`. This runs the same tests as staging.
After releasing to production:
1. Search for old instances of the previous version's Docker image id in documentation and replace them with the new version.
================================================
FILE: LICENSE.txt
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2022, Replicate, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: Makefile
================================================
.PHONY: init
.PHONY: select
.PHONY: test-local
.PHONY: push
.PHONY: push-and-test
.PHONY: clean
# this is required to build sentencepiece for py3.11
# requires cog > 0.9.0-beta1
# get it at https://github.com/replicate/cog/releases/download/v0.9.0-beta1/cog_linux_x86_64
export COG_EXPERIMENTAL_BUILD_STAGE_DEPS = apt update && apt install -yy cmake google-perftools
export FAKE_COG_VERSION = 0.8.1
CURRENT_DIR := $(shell basename $(PWD))
ifeq ($(findstring cog,$(CURRENT_DIR)),cog)
IMAGE_NAME := $(CURRENT_DIR)
else
IMAGE_NAME := cog-$(CURRENT_DIR)
endif
REPLICATE_USER ?= replicate-internal
model ?= $(SELECTED_MODEL)
PROD_MODEL ?= $(model)
ifeq ($(findstring chat,$(model)),chat)
schema := chat-schema.json
else ifeq ($(model),mistral-7b-instruct-v0.1-mlc)
schema := mistral-schema.json
else
schema := base-schema.json
endif
base-schema.json:
$(MAKE) select model=llama-2-7b-mlc
cog run --use-cuda-base-image=false python3 -m cog.command.openapi_schema | jq > base-schema.json
chat-schema.json:
$(MAKE) select model=llama-2-7b-chat-hf-mlc
cog run --use-cuda-base-image=false python3 -m cog.command.openapi_schema | jq > chat-schema.json
mistral-schema.json:
$(MAKE) select model=mistral-7b-instruct-v0.1-mlc
cog run --use-cuda-base-image=false python3 -m cog.command.openapi_schema | jq > mistral-schema.json
init:
@if [ -z "$(model)" ]; then \
echo "Error: 'model' argument must be specified or 'MODEL_ENV' environment variable must be set. E.g., make select model=your_model_name or export MODEL_ENV=your_model_name"; \
exit 1; \
fi
# Initialize directory for model
mkdir -p models/$(model)
cp -r model_templates/* models/$(model)
if [ -e model_templates/.env ]; then cp model_templates/.env models/$(model) ; fi
if [ -e model_templates/.dockerignore ]; then \
cp model_templates/.dockerignore models/$(model); \
else \
touch models/$(model)/.dockerignore; \
fi
printf "\n# Generated by 'make init'\n" >> models/$(model)/.dockerignore
printf "/models/*/\n" >> models/$(model)/.dockerignore
printf "!/models/$(model)/\n" >> models/$(model)/.dockerignore
printf "/models/$(model)/model_artifacts/**\n" >> models/$(model)/.dockerignore
printf "!/models/$(model)/model_artifacts/tokenizer/\n" >> models/$(model)/.dockerignore
mkdir -p models/$(model)/model_artifacts/tokenizer
cp -r llama_weights/tokenizer/* models/$(model)/model_artifacts/tokenizer
update:
@if [ -z "$(model)" ]; then \
echo "Error: 'model' argument must be specified or 'MODEL_ENV' environment variable must be set. E.g., make select model=your_model_name or export MODEL_ENV=your_model_name"; \
exit 1; \
fi
cp -r model_templates/* models/$(model)
model_dir=models/$(model)
select:
@if [ -z "$(model)" ]; then \
echo "Error: 'model' argument must be specified or 'MODEL_ENV' environment variable must be set. E.g., make select model=your_model_name or export MODEL_ENV=your_model_name"; \
exit 1; \
fi
# this approach makes copies
# rsync -av --exclude 'model_artifacts/' models/$(model)/ .
# this approach behaves the same way but makes symlinks
# # if we also wanted to copy directory structure we could do this, but we only need one dir deep
# rsync -av --exclude 'model_artifacts/' --include '*/' --exclude '*' $(model_dir)/ .
# For symlinking files
find $(model_dir) -type f ! -path "$(model_dir)/model_artifacts/*" -exec ln -sf {} . \;
# For specific files like .env and .dockerignore, we link them if they exist
[ -e $(model_dir)/.env ] && ln -sf $(model_dir)/.env .env || true
# rm .dockerignore || true
cp models/dockerignore .dockerignore
echo "!$(model_dir)" >> .dockerignore
# [ -e $(model_dir)/dockerignore ] && cat $(model_dir)/dockerignore > .dockerignore
#cog build
@echo "#########Selected model: $(model)########"
clean: select
if [ -e models/$(model)/model_artifacts/default_inference_weights ]; then sudo rm -rf models/$(model)/model_artifacts/default_inference_weights; fi
if [ -e models/$(model)/model_artifacts/training_weights ]; then sudo rm -rf models/$(model)/model_artifacts/training_weights; fi
if [ -e training_output.zip ]; then sudo rm -rf training_output.zip; fi
build-local: select
cog build --openapi-schema=$(schema) --use-cuda-base-image=false --progress plain
serve: select
docker run \
-ti \
-p 5000:5000 \
--gpus=all \
-e COG_WEIGHTS=http://$(HOST_NAME):8000/training_output.zip \
-v `pwd`/training_output.zip:/src/local_weights.zip \
$(IMAGE_NAME)
test-local-predict: build-local
@if [ "$(verbose)" = "true" ]; then \
pytest ./tests/test_predict.py -s; \
else \
pytest ./tests/test_predict.py; \
fi
test-local-train: build-local
rm -rf training_output.zip
@if [ "$(verbose)" = "true" ]; then \
pytest ./tests/test_train.py -s; \
else \
pytest ./tests/test_train.py; \
fi
test-local-train-predict: build-local
@if [ "$(verbose)" = "true" ]; then \
pytest ./tests/test_train_predict.py -s; \
else \
pytest ./tests/test_train_predict.py; \
fi
test-local: select test-local-predict test-local-train test-local-train-predict
stage: select
@echo "Pushing $(model) to r8.im/$(REPLICATE_USER)/staging-$(model)..."
cog push --openapi-schema=$(schema) --use-cuda-base-image=false --progress plain r8.im/$(REPLICATE_USER)/staging-$(model)
test-stage-predict:
@if [ "$(verbose)" = "true" ]; then \
pytest tests/test_remote_predict.py -s --model $(REPLICATE_USER)/staging-$(model); \
else \
pytest tests/test_remote_predict.py --model $(REPLICATE_USER)/staging-$(model); \
fi
test-stage-train-predict:
@if [ "$(verbose)" = "true" ]; then \
pytest tests/test_remote_train.py -s --model $(REPLICATE_USER)/staging-$(model); \
else \
pytest tests/test_remote_train.py --model $(REPLICATE_USER)/staging-$(model); \
fi
test-stage: test-stage-predict test-stage-train-predict
stage-and-test-models:
$(foreach model, $(subst ,, $(models)), \
$(MAKE) select model=$(model); \
$(MAKE) stage model=$(model); \
$(MAKE) test-stage model=$(model); \
)
push: select
cog push --openapi-schema=$(schema) --use-cuda-base-image=false --progress plain r8.im/$(REPLICATE_USER)/$(PROD_MODEL)
test-prod-predict:
@if [ "$(verbose)" = "true" ]; then \
pytest tests/test_remote_predict.py -s --model $(REPLICATE_USER)/$(PROD_MODEL); \
else \
pytest tests/test_remote_predict.py --model $(REPLICATE_USER)/$(PROD_MODEL); \
fi
test-prod-train-predict:
@if [ "$(verbose)" = "true" ]; then \
pytest tests/test_remote_train.py -s --model $(REPLICATE_USER)/$(PROD_MODEL); \
else \
pytest tests/test_remote_train.py --model $(REPLICATE_USER)/$(PROD_MODEL); \
fi
test-prod: test-prod-predict test-prod-train-predict
format:
python3 -m ruff format .
lint:
python3 -m ruff .
python3 -m ruff format --check .
help:
@echo "Available targets:\n\n"
@echo "init: Create the model directory."
@echo " e.g., \`make init dir=<model_dir>\`"
================================================
FILE: README.md
================================================
# LLaMA Cog template 🦙
This is a monorepo for building multiple Llama models using Cog:
- llama-2-13b
- llama-2-13b-chat
- llama-2-13b-transformers
- llama-2-70b
- llama-2-70b-chat
- llama-2-7b
- llama-2-7b-chat
- llama-2-7b-transformers
- llama-2-7b-vllm
See [replicate.com/meta](https://replicate.com/meta).
---
**NOTE: This is an experimental branch that depends on exllama**
For now, you should:
```sh
git clone https://github.com/turboderp/exllama
cd exllama
git checkout e8a544f95b3fd64dfa5549eeeafb85b1ac71a793
```
We're working on a proper integration.
**This Cog template works with LLaMA 1 & 2 versions.**
LLaMA is a [new open-source language model from Meta Research](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) that performs as well as closed-source models.
This is a guide to running LLaMA in the cloud using Replicate. You'll use the [Cog](https://github.com/replicate/cog) command-line tool to package the model and push it to Replicate as a web interface and API.
This template can be used to run the `7B`, `13B`, and `70B` versions of LLaMA and LLaMA 2, and it also works with fine-tuned models.
**Note: Please verify the system prompt for LLaMA or LLaMA 2 and update it accordingly.**
**Note: LLaMA is for research purposes only. It is not intended for commercial use. Check the license of LLaMA & LLaMA 2 on the official LLaMA website of Meta Platforms, Inc.**
## Prerequisites
- **LLaMA weights**. The weights for LLaMA have not yet been released publicly. To apply for access, fill out the Meta Research form to be able to download the weights.
- **GPU machine**. You'll need a Linux machine with an NVIDIA GPU attached and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) installed. If you don't already have access to a machine with a GPU, check out our [guide to getting a GPU machine](https://replicate.com/docs/guides/get-a-gpu-machine).
- **Docker**. You'll be using the [Cog](https://github.com/replicate/cog) command-line tool to build and push a model. Cog uses Docker to create containers for models.
## Step 0: Install Cog
First, [install Cog](https://github.com/replicate/cog#install):
```
sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog
```
## Step 1: Set up weights
Replicate currently supports the `7B` model size.
Put your downloaded weights in a folder called `unconverted-weights`. The folder hierarchy should look something like this:
```
unconverted-weights
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk
```
Convert the weights from a PyTorch checkpoint to a transformers-compatible format using this command:
```
cog run python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir unconverted-weights --model_size 7B --output_dir weights
```
Your final directory structure should look like this:
```
weights
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.model
└── tokenizer_config.json
```
Once you've done this, you should uncomment `unconverted-weights` in your `.dockerignore` file. This ensures that the `unconverted-weights` directory isn't built into the resulting Cog image.
## Step 2: Tensorize the weights from the transformers-compatible/huggingface format (this allows cold starts to happen much faster)
Run `convert_to_tensors.py` to tensorize the weights from the transformers-compatible/huggingface format produced in the previous step:
```
cog run python convert_to_tensors.py
```
This will tensorize your weights and write the tensorized weights to `./llama_weights/llama-7b/llama_7b_fp16.tensors` if you have a GPU available and `.../llama_7b_fp32.tensors` if you don't.
(To load the tensorized model instead of the transformers-compatible/huggingface weights, verify that `DEFAULT_MODEL_NAME` in `config.py` is set to the path of your tensorized weights.)
- Make sure `**.tensors` is not in your `.dockerignore`:
In your `.dockerignore` file, remove the `**.tensors` line if present. That pattern ignores all files that end with `.tensors`, no matter where they are in the directory structure.
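For reference, here is a minimal sketch of what such a tensorizing step can look like using the `tensorizer` library. This is an illustration under assumptions (the `weights` input directory and output paths from the steps above), not necessarily identical to `convert_to_tensors.py`:
```
# Hypothetical sketch of the tensorizing step; convert_to_tensors.py may differ.
import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM

# fp16 on GPU, fp32 on CPU, matching the output names described above
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
suffix = "fp16" if dtype == torch.float16 else "fp32"

model = AutoModelForCausalLM.from_pretrained("weights", torch_dtype=dtype)

# Serialize all module tensors into a single .tensors file for fast, streamed loading
serializer = TensorSerializer(f"llama_weights/llama-7b/llama_7b_{suffix}.tensors")
serializer.write_module(model)
serializer.close()
```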
## Step 3: Run the model
You can run the model locally to test it:
```
cog predict -i prompt="Simply put, the theory of relativity states that"
```
LLaMA is not fine-tuned to answer questions. You should construct your prompt so that the expected answer is the natural continuation of your prompt.
Here are a few examples from the [LLaMA FAQ](https://github.com/facebookresearch/llama/blob/57b0eb62de0636e75af471e49e2f1862d908d9d8/FAQ.md#2-generations-are-bad):
- Do not prompt with "What is the meaning of life? Be concise and do not repeat yourself." but with "I believe the meaning of life is"
- Do not prompt with "Explain the theory of relativity." but with "Simply put, the theory of relativity states that"
- Do not prompt with "Ten easy steps to build a website..." but with "Building a website can be done in 10 simple steps:\n"
## Step 4: Create a model on Replicate
Go to [replicate.com/create](https://replicate.com/create) to create a Replicate model.
Make sure to specify "private" to keep the model private.
## Step 5: Configure the model to run on A100 GPUs
Replicate supports running models on a variety of GPUs. The default GPU type is a T4, but for best performance you'll want to configure your model to run on an A100.
Click on the "Settings" tab on your model page, scroll down to "GPU hardware", and select "A100". Then click "Save".
## Step 6: Push the model to Replicate
Log in to Replicate:
```
sudo cog login
```
Push the contents of your current directory to Replicate, using the model name you specified in step 4:
```
sudo cog push r8.im/username/modelname
```
Note: if you get an error while pushing your model indicating that your model does not exist on Replicate (even if it was successfully created on the Replicate dashboard), make sure you also used `sudo` when running `cog login`.
[Learn more about pushing models to Replicate.](https://replicate.com/docs/guides/push-a-model)
## Step 7: Run the model on Replicate
Now that you've pushed the model to Replicate, you can run it from the website or with an API.
To use your model in the browser, go to your model page.
To use your model with an API, click on the "API" tab on your model page. You'll see commands to run the model with cURL, Python, etc.
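For example, here is a minimal sketch using the [Replicate Python client](https://github.com/replicate/replicate-python); the model name and version hash are placeholders for your own, and `REPLICATE_API_TOKEN` must be set in your environment:
```
# Minimal sketch: run the pushed model via the Replicate Python client.
import replicate

output = replicate.run(
    "username/modelname:VERSION_HASH",  # placeholder model reference
    input={"prompt": "Simply put, the theory of relativity states that"},
)
# This model streams output, so iterate over the generated tokens.
for token in output:
    print(token, end="")
```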
To learn more about how to use Replicate, [check out our documentation](https://replicate.com/docs).
## Contributors ✨
This template was generated by Marco Mascorro (@mascobot), with some modifications to the original cog LLaMA template, and with the help of the Cog and Replicate documentation that wonderful people put together. See all contributors below.
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!
================================================
FILE: __init__.py
================================================
================================================
FILE: base-schema.json
================================================
{
"openapi": "3.0.2",
"info": {
"title": "Cog",
"version": "0.1.0"
},
"paths": {
"/": {
"get": {
"summary": "Root",
"operationId": "root__get",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Root Get"
}
}
}
}
}
}
},
"/health-check": {
"get": {
"summary": "Healthcheck",
"operationId": "healthcheck_health_check_get",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Healthcheck Health Check Get"
}
}
}
}
}
}
},
"/predictions": {
"post": {
"summary": "Predict",
"description": "Run a single prediction on the model",
"operationId": "predict_predictions_post",
"parameters": [
{
"required": false,
"schema": {
"title": "Prefer",
"type": "string"
},
"name": "prefer",
"in": "header"
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionRequest"
}
}
}
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/predictions/{prediction_id}": {
"put": {
"summary": "Predict Idempotent",
"description": "Run a single prediction on the model (idempotent creation).",
"operationId": "predict_idempotent_predictions__prediction_id__put",
"parameters": [
{
"required": true,
"schema": {
"title": "Prediction ID",
"type": "string"
},
"name": "prediction_id",
"in": "path"
},
{
"required": false,
"schema": {
"title": "Prefer",
"type": "string"
},
"name": "prefer",
"in": "header"
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"title": "Prediction Request",
"allOf": [
{
"$ref": "#/components/schemas/PredictionRequest"
}
]
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/predictions/{prediction_id}/cancel": {
"post": {
"summary": "Cancel",
"description": "Cancel a running prediction",
"operationId": "cancel_predictions__prediction_id__cancel_post",
"parameters": [
{
"required": true,
"schema": {
"title": "Prediction ID",
"type": "string"
},
"name": "prediction_id",
"in": "path"
}
],
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Cancel Predictions Prediction Id Cancel Post"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/shutdown": {
"post": {
"summary": "Start Shutdown",
"operationId": "start_shutdown_shutdown_post",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Start Shutdown Shutdown Post"
}
}
}
}
}
}
}
},
"components": {
"schemas": {
"HTTPValidationError": {
"title": "HTTPValidationError",
"type": "object",
"properties": {
"detail": {
"title": "Detail",
"type": "array",
"items": {
"$ref": "#/components/schemas/ValidationError"
}
}
}
},
"Input": {
"title": "Input",
"required": [
"prompt"
],
"type": "object",
"properties": {
"prompt": {
"title": "Prompt",
"type": "string",
"description": "Prompt to send to the model.",
"x-order": 0
},
"max_new_tokens": {
"title": "Max New Tokens",
"minimum": 1,
"type": "integer",
"description": "Maximum number of tokens to generate. A word is generally 2-3 tokens",
"default": 128,
"x-order": 1
},
"min_new_tokens": {
"title": "Min New Tokens",
"minimum": -1,
"type": "integer",
"description": "Minimum number of tokens to generate. To disable, set to -1. A word is generally 2-3 tokens.",
"default": -1,
"x-order": 2
},
"temperature": {
"title": "Temperature",
"maximum": 5,
"minimum": 0.01,
"type": "number",
"description": "Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.",
"default": 0.7,
"x-order": 3
},
"top_p": {
"title": "Top P",
"maximum": 1,
"minimum": 0,
"type": "number",
"description": "When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens",
"default": 0.95,
"x-order": 4
},
"repetition_penalty": {
"title": "Repetition Penalty",
"minimum": 0,
"type": "number",
"description": "A parameter that controls how repetitive text can be. Lower means more repetitive, while higher means less repetitive. Set to 1.0 to disable.",
"default": 1.15,
"x-order": 5
},
"stop_sequences": {
"title": "Stop Sequences",
"type": "string",
"description": "A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of 'end' or '<stop>'.",
"x-order": 6
},
"seed": {
"title": "Seed",
"type": "integer",
"description": "Random seed. Leave blank to randomize the seed",
"x-order": 7
},
"debug": {
"title": "Debug",
"type": "boolean",
"description": "provide debugging output in logs",
"default": false,
"x-order": 8
},
"prompt_template": {
"title": "Prompt Template",
"type": "string",
"description": "Template for formatting the prompt",
"default": "{prompt}",
"x-order": 9
},
"replicate_weights": {
"title": "Replicate Weights",
"type": "string",
"description": "Path to fine-tuned weights produced by a Replicate fine-tune job.",
"x-order": 10
}
}
},
"Output": {
"title": "Output",
"type": "array",
"items": {
"type": "string"
},
"x-cog-array-display": "concatenate",
"x-cog-array-type": "iterator"
},
"PredictionRequest": {
"title": "PredictionRequest",
"type": "object",
"properties": {
"input": {
"$ref": "#/components/schemas/Input"
},
"id": {
"title": "Id",
"type": "string"
},
"created_at": {
"title": "Created At",
"type": "string",
"format": "date-time"
},
"output_file_prefix": {
"title": "Output File Prefix",
"type": "string"
},
"webhook": {
"title": "Webhook",
"maxLength": 65536,
"minLength": 1,
"type": "string",
"format": "uri"
},
"webhook_events_filter": {
"type": "array",
"items": {
"$ref": "#/components/schemas/WebhookEvent"
},
"default": [
"start",
"output",
"logs",
"completed"
]
}
}
},
"PredictionResponse": {
"title": "PredictionResponse",
"type": "object",
"properties": {
"input": {
"$ref": "#/components/schemas/Input"
},
"output": {
"$ref": "#/components/schemas/Output"
},
"id": {
"title": "Id",
"type": "string"
},
"version": {
"title": "Version",
"type": "string"
},
"created_at": {
"title": "Created At",
"type": "string",
"format": "date-time"
},
"started_at": {
"title": "Started At",
"type": "string",
"format": "date-time"
},
"completed_at": {
"title": "Completed At",
"type": "string",
"format": "date-time"
},
"logs": {
"title": "Logs",
"type": "string",
"default": ""
},
"error": {
"title": "Error",
"type": "string"
},
"status": {
"$ref": "#/components/schemas/Status"
},
"metrics": {
"title": "Metrics",
"type": "object"
}
}
},
"Status": {
"title": "Status",
"enum": [
"starting",
"processing",
"succeeded",
"canceled",
"failed"
],
"type": "string",
"description": "An enumeration."
},
"ValidationError": {
"title": "ValidationError",
"required": [
"loc",
"msg",
"type"
],
"type": "object",
"properties": {
"loc": {
"title": "Location",
"type": "array",
"items": {
"anyOf": [
{
"type": "string"
},
{
"type": "integer"
}
]
}
},
"msg": {
"title": "Message",
"type": "string"
},
"type": {
"title": "Error Type",
"type": "string"
}
}
},
"WebhookEvent": {
"title": "WebhookEvent",
"enum": [
"start",
"output",
"logs",
"completed"
],
"type": "string",
"description": "An enumeration."
}
}
}
}
================================================
FILE: chat-schema.json
================================================
{
"openapi": "3.0.2",
"info": {
"title": "Cog",
"version": "0.1.0"
},
"paths": {
"/": {
"get": {
"summary": "Root",
"operationId": "root__get",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Root Get"
}
}
}
}
}
}
},
"/health-check": {
"get": {
"summary": "Healthcheck",
"operationId": "healthcheck_health_check_get",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Healthcheck Health Check Get"
}
}
}
}
}
}
},
"/predictions": {
"post": {
"summary": "Predict",
"description": "Run a single prediction on the model",
"operationId": "predict_predictions_post",
"parameters": [
{
"required": false,
"schema": {
"title": "Prefer",
"type": "string"
},
"name": "prefer",
"in": "header"
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionRequest"
}
}
}
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/predictions/{prediction_id}": {
"put": {
"summary": "Predict Idempotent",
"description": "Run a single prediction on the model (idempotent creation).",
"operationId": "predict_idempotent_predictions__prediction_id__put",
"parameters": [
{
"required": true,
"schema": {
"title": "Prediction ID",
"type": "string"
},
"name": "prediction_id",
"in": "path"
},
{
"required": false,
"schema": {
"title": "Prefer",
"type": "string"
},
"name": "prefer",
"in": "header"
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"title": "Prediction Request",
"allOf": [
{
"$ref": "#/components/schemas/PredictionRequest"
}
]
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/predictions/{prediction_id}/cancel": {
"post": {
"summary": "Cancel",
"description": "Cancel a running prediction",
"operationId": "cancel_predictions__prediction_id__cancel_post",
"parameters": [
{
"required": true,
"schema": {
"title": "Prediction ID",
"type": "string"
},
"name": "prediction_id",
"in": "path"
}
],
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Cancel Predictions Prediction Id Cancel Post"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/shutdown": {
"post": {
"summary": "Start Shutdown",
"operationId": "start_shutdown_shutdown_post",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Start Shutdown Shutdown Post"
}
}
}
}
}
}
}
},
"components": {
"schemas": {
"HTTPValidationError": {
"title": "HTTPValidationError",
"type": "object",
"properties": {
"detail": {
"title": "Detail",
"type": "array",
"items": {
"$ref": "#/components/schemas/ValidationError"
}
}
}
},
"Input": {
"title": "Input",
"required": [
"prompt"
],
"type": "object",
"properties": {
"prompt": {
"title": "Prompt",
"type": "string",
"description": "Prompt to send to the model.",
"x-order": 0
},
"system_prompt": {
"title": "System Prompt",
"type": "string",
"description": "System prompt to send to the model. This is prepended to the prompt and helps guide system behavior. Should not be blank.",
"default": "You are a helpful, respectful and honest assistant.",
"x-order": 1
},
"max_new_tokens": {
"title": "Max New Tokens",
"minimum": 1,
"type": "integer",
"description": "Maximum number of tokens to generate. A word is generally 2-3 tokens",
"default": 128,
"x-order": 2
},
"min_new_tokens": {
"title": "Min New Tokens",
"minimum": -1,
"type": "integer",
"description": "Minimum number of tokens to generate. To disable, set to -1. A word is generally 2-3 tokens.",
"default": -1,
"x-order": 3
},
"temperature": {
"title": "Temperature",
"maximum": 5,
"minimum": 0.01,
"type": "number",
"description": "Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.",
"default": 0.7,
"x-order": 4
},
"top_p": {
"title": "Top P",
"maximum": 1,
"minimum": 0,
"type": "number",
"description": "When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens",
"default": 0.95,
"x-order": 5
},
"repetition_penalty": {
"title": "Repetition Penalty",
"minimum": 0,
"type": "number",
"description": "A parameter that controls how repetitive text can be. Lower means more repetitive, while higher means less repetitive. Set to 1.0 to disable.",
"default": 1.15,
"x-order": 6
},
"stop_sequences": {
"title": "Stop Sequences",
"type": "string",
"description": "A comma-separated list of sequences to stop generation at. For example, '<end>,<stop>' will stop generation at the first instance of 'end' or '<stop>'.",
"x-order": 7
},
"seed": {
"title": "Seed",
"type": "integer",
"description": "Random seed. Leave blank to randomize the seed",
"x-order": 8
},
"debug": {
"title": "Debug",
"type": "boolean",
"description": "provide debugging output in logs",
"default": false,
"x-order": 9
},
"prompt_template": {
"title": "Prompt Template",
"type": "string",
"description": "Template for formatting the prompt",
"default": "[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{prompt} [/INST]",
"x-order": 10
},
"replicate_weights": {
"title": "Replicate Weights",
"type": "string",
"description": "Path to fine-tuned weights produced by a Replicate fine-tune job.",
"x-order": 11
}
}
},
"Output": {
"title": "Output",
"type": "array",
"items": {
"type": "string"
},
"x-cog-array-type": "iterator",
"x-cog-array-display": "concatenate"
},
"PredictionRequest": {
"title": "PredictionRequest",
"type": "object",
"properties": {
"input": {
"$ref": "#/components/schemas/Input"
},
"id": {
"title": "Id",
"type": "string"
},
"created_at": {
"title": "Created At",
"type": "string",
"format": "date-time"
},
"output_file_prefix": {
"title": "Output File Prefix",
"type": "string"
},
"webhook": {
"title": "Webhook",
"maxLength": 65536,
"minLength": 1,
"type": "string",
"format": "uri"
},
"webhook_events_filter": {
"type": "array",
"items": {
"$ref": "#/components/schemas/WebhookEvent"
},
"default": [
"start",
"output",
"logs",
"completed"
]
}
}
},
"PredictionResponse": {
"title": "PredictionResponse",
"type": "object",
"properties": {
"input": {
"$ref": "#/components/schemas/Input"
},
"output": {
"$ref": "#/components/schemas/Output"
},
"id": {
"title": "Id",
"type": "string"
},
"version": {
"title": "Version",
"type": "string"
},
"created_at": {
"title": "Created At",
"type": "string",
"format": "date-time"
},
"started_at": {
"title": "Started At",
"type": "string",
"format": "date-time"
},
"completed_at": {
"title": "Completed At",
"type": "string",
"format": "date-time"
},
"logs": {
"title": "Logs",
"type": "string",
"default": ""
},
"error": {
"title": "Error",
"type": "string"
},
"status": {
"$ref": "#/components/schemas/Status"
},
"metrics": {
"title": "Metrics",
"type": "object"
}
}
},
"Status": {
"title": "Status",
"enum": [
"starting",
"processing",
"succeeded",
"canceled",
"failed"
],
"type": "string",
"description": "An enumeration."
},
"ValidationError": {
"title": "ValidationError",
"required": [
"loc",
"msg",
"type"
],
"type": "object",
"properties": {
"loc": {
"title": "Location",
"type": "array",
"items": {
"anyOf": [
{
"type": "string"
},
{
"type": "integer"
}
]
}
},
"msg": {
"title": "Message",
"type": "string"
},
"type": {
"title": "Error Type",
"type": "string"
}
}
},
"WebhookEvent": {
"title": "WebhookEvent",
"enum": [
"start",
"output",
"logs",
"completed"
],
"type": "string",
"description": "An enumeration."
}
}
}
}
================================================
FILE: cog.yaml
================================================
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
# set to true if your model requires a GPU
gpu: true
cuda: "11.8"
# python version in the form '3.8' or '3.8.12'
python_version: "3.11"
# a list of packages in the format <package-name>==<version>
python_packages:
- "numpy==1.24.2"
- "sentencepiece==0.1.99"
- "jinja2==3.1.2"
- "scipy==1.11.1"
- "safetensors>=0.3.1"
- "python-dotenv"
- "fire"
- "datasets"
- "transformers==4.33.2"
- "peft==0.4.0"
- "accelerate"
- "bitsandbytes"
- "trl==0.5.0"
- "aiohttp[speedups]"
- "triton" # hm
- "fastapi<0.99.0"
# uncomment these when we go back to 12.1
# - "https://r2.drysys.workers.dev/torch/torch-2.1.0-cp311-cp311-linux_x86_64.whl"
# - "https://weights.replicate.delivery/default/wheels/vllm-0.2a0-cp311-cp311-linux_x86_64.whl"
- "https://r2.drysys.workers.dev/torch/11.8/torch-2.1.0-cp311-cp311-linux_x86_64.whl"
# This wheel can be built by running `TORCH_CUDA_ARCH_LIST="8.0;8.6" pip wheel .` in https://github.com/replicate/vllm-with-loras
- "https://r2.drysys.workers.dev/vllm/11.8/vllm-0.2a0-cp311-cp311-linux_x86_64.whl"
- "https://r2.drysys.workers.dev/xformers/11.8/xformers-0.0.23+b4c853d.d20231107-cp311-cp311-linux_x86_64.whl"
- "--pre -f https://mlc.ai/wheels"
- "mlc-chat-nightly-cu118"
- "mlc-ai-nightly-cu118"
# - "mlc-chat-nightly-cu121"
# - "mlc-ai-nightly-cu121"
run:
- curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.1.1/pget" && chmod +x /usr/local/bin/pget
# since we can't do LD_LIBRARY_PATH=torch/lib, use this to make sure mlc can access the cuda libs bundled with torch
- bash -c 'ln -s /usr/local/lib/python3.11/site-packages/torch/lib/lib{nv,cu}* /usr/lib'
# predict.py defines how predictions are run on your model
predict: "predict.py:Predictor"
train: "train.py:train"
================================================
FILE: examples/alpaca/README.md
================================================
Example code for parsing the dataset needed to train [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca).
This directory contains a script, `process_data.py`, which shows how to transform the [given alpaca data](https://github.com/gururise/AlpacaDataCleaned) into the format expected by `cog train`. It also contains an example parsed dataset as a reference for that `{'prompt': ..., 'completion': ...}` format.
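For illustration, a single parsed record in that format looks roughly like this (the contents here are a made-up example, not taken from the dataset):
```
record = {
    "prompt": "Give three tips for staying healthy.",
    "completion": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}
```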
================================================
FILE: examples/alpaca/process_data.py
================================================
from transformers import T5Tokenizer
import json
PROMPT_DICT = {
"prompt_input": (
"Below is an instruction that describes a task, paired with an input that provides further context. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
),
"prompt_no_input": (
"Below is an instruction that describes a task. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Response:"
),
}
class Preprocessor:
"""Simple class to parse alpaca data into format expected by trainer. Run this offline to build your dataset."""
def __init__(self, tokenizer):
self.prompt_dict = PROMPT_DICT
self.tokenizer = tokenizer
def batch_tokenize(self, texts):
"""Tokenizes text. Presently doesn't pad inputs, just returns input ids."""
tokenized = [
self.tokenizer(
prompt,
return_tensors="pt",
padding="longest",
).input_ids
for prompt in texts
]
return tokenized
def make_prompt(self, input_row):
if len(input_row["input"]) > 1:
return self.prompt_dict["prompt_input"].format_map(input_row)
return self.prompt_dict["prompt_no_input"].format_map(input_row)
def make_short_prompt(self, input_row):
if len(input_row["input"]) > 1:
return f"""{input_row['instruction']}\n{input_row['input']}"""
return input_row["instruction"]
def construct_dataset(self, input_data):
prompts = [self.make_short_prompt(val) for val in input_data]
return [
{"prompt": val[0], "completion": val[1]}
for val in zip(prompts, [val["output"] for val in input_data])
]
if __name__ == "__main__":
proc = Preprocessor(T5Tokenizer.from_pretrained("google/flan-t5-xl"))
with open("alpaca_data.json", "r") as f:
data = json.load(f)
data_out = proc.construct_dataset(data)
with open("short_alpaca_data.json", "w") as f:
json.dump(data_out, f, indent=2)
================================================
FILE: examples/alpaca/replicate_alpaca_data.json
================================================
[File too large to display: 28.3 MB]
================================================
FILE: llama_recipes/LICENSE
================================================
LLAMA 2 COMMUNITY LICENSE AGREEMENT
Llama 2 Version Release Date: July 18, 2023
"Agreement" means the terms and conditions for use, reproduction, distribution and
modification of the Llama Materials set forth herein.
"Documentation" means the specifications, manuals and documentation
accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-
libraries/llama-downloads/.
"Licensee" or "you" means you, or your employer or any other person or entity (if
you are entering into this Agreement on such person or entity's behalf), of the age
required under applicable laws, rules or regulations to provide legal consent and that
has legal authority to bind your employer or such other person or entity if you are
entering in this Agreement on their behalf.
"Llama 2" means the foundational large language models and software and
algorithms, including machine-learning model code, trained model weights,
inference-enabling code, training-enabling code, fine-tuning enabling code and other
elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-
libraries/llama-downloads/.
"Llama Materials" means, collectively, Meta's proprietary Llama 2 and
Documentation (and any portion thereof) made available under this Agreement.
"Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you
are an entity, your principal place of business is in the EEA or Switzerland) and Meta
Platforms, Inc. (if you are located outside of the EEA or Switzerland).
By clicking "I Accept" below or by using or distributing any portion or element of the
Llama Materials, you agree to be bound by this Agreement.
1. License Rights and Redistribution.
a. Grant of Rights. You are granted a non-exclusive, worldwide, non-
transferable and royalty-free limited license under Meta's intellectual property or
other rights owned by Meta embodied in the Llama Materials to use, reproduce,
distribute, copy, create derivative works of, and make modifications to the Llama
Materials.
b. Redistribution and Use.
i. If you distribute or make the Llama Materials, or any derivative works
thereof, available to a third party, you shall provide a copy of this Agreement to such
third party.
ii. If you receive Llama Materials, or any derivative works thereof, from
a Licensee as part of an integrated end user product, then Section 2 of this
Agreement will not apply to you.
iii. You must retain in all copies of the Llama Materials that you
distribute the following attribution notice within a "Notice" text file distributed as a
part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved."
iv. Your use of the Llama Materials must comply with applicable laws
and regulations (including trade compliance laws and regulations) and adhere to the
Acceptable Use Policy for the Llama Materials (available at
https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into
this Agreement.
v. You will not use the Llama Materials or any output or results of the
Llama Materials to improve any other large language model (excluding Llama 2 or
derivative works thereof).
2. Additional Commercial Terms. If, on the Llama 2 version release date, the
monthly active users of the products or services made available by or for Licensee,
or Licensee's affiliates, is greater than 700 million monthly active users in the
preceding calendar month, you must request a license from Meta, which Meta may
grant to you in its sole discretion, and you are not authorized to exercise any of the
rights under this Agreement unless or until Meta otherwise expressly grants you
such rights.
3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE
LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
ANY OF THE FOREGOING.
5. Intellectual Property.
a. No trademark licenses are granted under this Agreement, and in
connection with the Llama Materials, neither Meta nor Licensee may use any name
or mark owned by or associated with the other or any of its affiliates, except as
required for reasonable and customary use in describing and redistributing the
Llama Materials.
b. Subject to Meta's ownership of Llama Materials and derivatives made by or
for Meta, with respect to any derivative works and modifications of the Llama
Materials that are made by you, as between you and Meta, you are and will be the
owner of such derivative works and modifications.
c. If you institute litigation or other proceedings against Meta or any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the Llama
Materials or Llama 2 outputs or results, or any portion of any of the foregoing,
constitutes infringement of intellectual property or other rights owned or licensable
by you, then any licenses granted to you under this Agreement shall terminate as of
the date such litigation or claim is filed or instituted. You will indemnify and hold
harmless Meta from and against any claim by any third party arising out of or related
to your use or distribution of the Llama Materials.
6. Term and Termination. The term of this Agreement will commence upon your
acceptance of this Agreement or access to the Llama Materials and will continue in
full force and effect until terminated in accordance with the terms and conditions
herein. Meta may terminate this Agreement if you are in breach of any term or
condition of this Agreement. Upon termination of this Agreement, you shall delete
and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the
termination of this Agreement.
7. Governing Law and Jurisdiction. This Agreement will be governed and
construed under the laws of the State of California without regard to choice of law
principles, and the UN Convention on Contracts for the International Sale of Goods
does not apply to this Agreement. The courts of California shall have exclusive
jurisdiction of any dispute arising out of this Agreement.
================================================
FILE: llama_recipes/__init__.py
================================================
================================================
FILE: llama_recipes/configs/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from .peft import (
lora_config,
llama_adapter_config,
prefix_config,
qlora_config,
bitsandbytes_config,
)
from .fsdp import fsdp_config
from .training import train_config
================================================
FILE: llama_recipes/configs/datasets.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from dataclasses import dataclass
@dataclass
class samsum_dataset:
dataset: str = "samsum_dataset"
train_split: str = "train"
test_split: str = "validation"
input_length: int = 2048
@dataclass
class grammar_dataset:
dataset: str = "grammar_dataset"
train_split: str = "ft_datasets/grammar_dataset/gtrain_10k.csv"
test_split: str = "ft_datasets/grammar_dataset/grammar_validation.csv"
input_length: int = 2048
@dataclass
class alpaca_dataset:
dataset: str = "alpaca_dataset"
train_split: str = "train"
test_split: str = "val"
data_path: str = "ft_datasets/alpaca_data.json"
@dataclass
class completion:
"""
A generic class for completion format datasets. Format is expected
to be JSONL like:
```
{"text": "..."}
```
or
```
{"text": "prompt ...", "completion": "..."}
```
"""
dataset: str = "completion"
train_split: str = "train"
test_split: str = "val"
data_path: str = None
num_validation_samples: int = 100
run_validation: bool = True
validation_data_path: str = None
pack_sequences: bool = True
wrap_packed_sequences: bool = False
chunk_size: int = 2048
max_seq_length: int = 4096
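
# Usage sketch (hypothetical path, for illustration only): each JSONL line holds
# either {"text": ...} or {"prompt": ..., "completion": ...}, and fields are
# overridden at construction time, e.g.
#
#   cfg = completion(data_path="data/train.jsonl", pack_sequences=False)
#   assert cfg.chunk_size == 2048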
================================================
FILE: llama_recipes/configs/fsdp.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from dataclasses import dataclass
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
@dataclass
class fsdp_config:
mixed_precision: bool = True
use_fp16: bool = False
sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
    checkpoint_type: StateDictType = StateDictType.SHARDED_STATE_DICT  # SHARDED_STATE_DICT saves one file per rank and allows resizing the world size; FULL_STATE_DICT saves a single consolidated file instead.
fsdp_activation_checkpointing: bool = True
pure_bf16: bool = False
optimizer: str = "AdamW"
================================================
FILE: llama_recipes/configs/peft.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from dataclasses import dataclass
from typing import ClassVar, List
import torch
@dataclass
class lora_config:
r: int = 8
lora_alpha: int = 16
target_modules: ClassVar[List[str]] = ["q_proj", "v_proj"]
bias = "none"
task_type: str = "CAUSAL_LM"
lora_dropout: float = 0.05
inference_mode: bool = False
@dataclass
class llama_adapter_config:
adapter_len: int = 10
adapter_layers: int = 30
task_type: str = "CAUSAL_LM"
@dataclass
class prefix_config:
num_virtual_tokens: int = 30
task_type: str = "CAUSAL_LM"
@dataclass
class bitsandbytes_config:
load_in_4bit: bool = True
bnb_4bit_quant_type: str = "nf4"
bnb_4bit_use_double_quant: bool = True
bnb_4bit_compute_dtype: torch.dtype = torch.bfloat16
@dataclass
class qlora_config:
r: int = 8
lora_alpha: int = 32
target_modules: ClassVar[List[str]] = ["q_proj", "v_proj"]
bias = "none"
task_type: str = "CAUSAL_LM"
lora_dropout: float = 0.05
inference_mode: bool = False
================================================
FILE: llama_recipes/configs/training.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from dataclasses import dataclass
@dataclass
class train_config:
model_name: str = "llama_weights/llama-2-7b"
enable_fsdp: bool = False
run_validation: bool = True
batch_size_training: int = 4
num_epochs: int = 3
num_workers_dataloader: int = 1
gradient_accumulation_steps: int = 1
lr: float = 1e-4
weight_decay: float = 0.0
gamma: float = 0.85
seed: int = 42
use_fp16: bool = False
mixed_precision: bool = True
val_batch_size: int = 1
dataset = "completion"
peft_method: str = "lora" # None , llama_adapter, prefix
use_peft: bool = False
output_dir: str = "PATH/to/save/PEFT/model"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str = (
"PATH/to/save/FSDP/model" # will be used if using FSDP
)
dist_checkpoint_folder: str = "fine-tuned" # will be used if using FSDP
save_optimizer: bool = False # will be used if using FSDP
data_path: str = None
num_validation_samples: int = 100
validation_data_path: str = None
validation_prompt: str = None
wrap_packed_sequences: bool = False
pack_sequences: bool = True
chunk_size: int = 2048
# optim: Optional[str] = field(
# default="paged_adamw_32bit",
# metadata={"help": "The optimizer to use."},
# )
# lr_scheduler_type: str = field(
# default="constant",
# metadata={"help": "Learning rate schedule. Constant a bit better than cosine, and has advantage for analysis"},
# )
# max_steps: int = field(default=10000, metadata={"help": "How many optimizer update steps to take"})
# warmup_ratio
# save_steps: int = field(default=100, metadata={"help": "Save checkpoint every X updates steps."})
# logging_steps: int = field(default=10, metadata={"help": "Log every X updates steps."})
# eval_steps: int = field(default=None, metadata={"help": "Run evaluation every X steps"})
# evaluation_strateg
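
# Usage sketch (illustrative flags, not a shipped command): these fields are
# overridden from the command line via update_config, e.g.
#
#   torchrun --nproc_per_node 1 llama_finetuning.py --use_peft --peft_method lora \
#       --lora_rank 8 --data_path data/train.jsonl --num_epochs 1 --output_dir ./lora_out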
================================================
FILE: llama_recipes/ft_datasets/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from .grammar_dataset import get_dataset as get_grammar_dataset
from .alpaca_dataset import InstructionDataset as get_alpaca_dataset
from .samsum_dataset import get_preprocessed_samsum as get_samsum_dataset
from .completion_dataset import get_completion_dataset as get_completion_dataset
================================================
FILE: llama_recipes/ft_datasets/alpaca_dataset.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# For dataset details visit: https://crfm.stanford.edu/2023/03/13/alpaca.html
import copy
import json
import torch
from torch.utils.data import Dataset
PROMPT_DICT = {
"prompt_input": (
"Below is an instruction that describes a task, paired with an input that provides further context. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
),
"prompt_no_input": (
"Below is an instruction that describes a task. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Response:"
),
}
class InstructionDataset(Dataset):
def __init__(self, dataset_config, tokenizer, partition="train", max_words=30):
self.ann = json.load(open(dataset_config.data_path))
        if partition != "train":
            self.ann = self.ann[:200]
self.max_words = max_words
# tokenizer = Tokenizer(model_path=model_path + "./tokenizer.model")
self.tokenizer = tokenizer
# self.tokenizer1 = tokenizer
def __len__(self):
return len(self.ann)
def __getitem__(self, index):
ann = self.ann[index]
if ann.get("input", "") == "":
prompt = PROMPT_DICT["prompt_no_input"].format_map(ann)
else:
prompt = PROMPT_DICT["prompt_input"].format_map(ann)
example = prompt + ann["output"]
prompt = torch.tensor(self.tokenizer.encode(prompt), dtype=torch.int64)
example = self.tokenizer.encode(example)
example.append(self.tokenizer.eos_token_id)
example = torch.tensor(example, dtype=torch.int64)
padding = self.max_words - example.shape[0]
if padding > 0:
example = torch.cat((example, torch.zeros(padding, dtype=torch.int64) - 1))
elif padding < 0:
example = example[: self.max_words]
labels = copy.deepcopy(example)
labels[: len(prompt)] = -1
example_mask = example.ge(0)
label_mask = labels.ge(0)
example[~example_mask] = 0
labels[~label_mask] = 0
example_mask = example_mask.float()
label_mask = label_mask.float()
return {
"input_ids": example,
"labels": labels,
"attention_mask": example_mask,
}
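
# Worked illustration of the masking above (toy ids, not real tokens): with a
# 3-token prompt, a 2-token response, and max_words=8, the intermediate tensors are
#
#   example: [p0, p1, p2, r0, r1, -1, -1, -1]  -> -1 pads zeroed, attention_mask 0.0 there
#   labels:  [-1, -1, -1, r0, r1, -1, -1, -1]  -> prompt and pads zeroed, label_mask 0.0
#
# so only the response positions contribute supervised targets.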
================================================
FILE: llama_recipes/ft_datasets/completion_dataset.py
================================================
from .utils import Concatenator
import json
from datasets import Dataset
def load_data(
dataset_config,
split,
):
data_path = dataset_config.data_path
num_validation_samples = int(dataset_config.num_validation_samples)
run_validation = dataset_config.run_validation
validation_data_path = dataset_config.validation_data_path
def _load_data(path):
data = []
with open(path, "r") as file:
for line in file:
data.append(json.loads(line))
dataset = Dataset.from_dict(
{key: [item[key] for item in data] for key in data[0]},
)
return dataset
if not validation_data_path:
dataset = _load_data(data_path)
if run_validation and split == "train":
print(
f"Selecting observations 0 through {len(dataset)-num_validation_samples} from data for training..."
)
end_index = len(dataset) - num_validation_samples
indices = list(range(end_index))
dataset = dataset.select(indices)
elif run_validation and split == "val":
print(
f"Selecting observations {len(dataset)-num_validation_samples} through {len(dataset)} from data for validation..."
)
start_index = len(dataset) - num_validation_samples
indices = list(range(start_index, len(dataset)))
dataset = dataset.select(indices)
else:
if split == "train":
dataset = _load_data(data_path)
elif split == "val":
dataset = _load_data(validation_data_path)
return dataset
def format_data(dataset, tokenizer, config=None):
def apply_text_template(sample):
return {"text": sample["text"] + tokenizer.eos_token}
def apply_prompt_template(sample):
return {
"text": sample["prompt"] + "\n" + sample["completion"] + tokenizer.eos_token
}
# Assume - all "text" or all "prompt/completion"
if "text" in dataset[0]:
dataset = dataset.map(
apply_text_template, remove_columns=list(dataset.features)
)
elif "prompt" in dataset[0] and "completion" in dataset[0]:
dataset = dataset.map(
apply_prompt_template, remove_columns=list(dataset.features)
)
else:
raise Exception(
"Dataset did not contain `text` or `prompt` and `completion` inputs. Example row:",
dataset[0],
)
return dataset
def tokenize_data(dataset, tokenizer, config=None):
    try:
        max_length = config.max_seq_length
    except AttributeError:
        max_length = tokenizer.model_max_length
dataset = dataset.map(
lambda sample: tokenizer(
sample["text"], max_length=max_length, truncation=True
),
batched=True,
remove_columns=list(dataset.features),
).map(lambda sample: {"labels": sample["input_ids"]}, batched=True)
if config.pack_sequences:
dataset = dataset.map(
Concatenator(
chunk_size=config.chunk_size,
wrap_packed_sequences=config.wrap_packed_sequences,
),
batched=True,
)
return dataset
def get_completion_dataset(config, tokenizer, split: str = "train"):
dataset = load_data(config, split)
dataset = format_data(dataset, tokenizer, config)
dataset = tokenize_data(dataset, tokenizer, config)
return dataset
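
# End-to-end usage sketch (assumed tokenizer path and data file, illustration only):
#
#   from transformers import LlamaTokenizer
#   from llama_recipes.configs.datasets import completion
#
#   tokenizer = LlamaTokenizer.from_pretrained("llama_weights/llama-2-7b", legacy=False)
#   cfg = completion(data_path="data/train.jsonl")
#   train_set = get_completion_dataset(cfg, tokenizer, split="train")
#   # each element now carries input_ids, attention_mask and labels, packed to cfg.chunk_size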
================================================
FILE: llama_recipes/ft_datasets/grammar_dataset/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from .grammar_dataset import get_dataset
================================================
FILE: llama_recipes/ft_datasets/grammar_dataset/grammar_dataset.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# For dataset details visit: https://huggingface.co/datasets/jfleg
# For download and preparation see: recipes/ft_datasets/grammar_dataset/grammar_dataset_process.ipynb
from torch.utils.data import Dataset
from datasets import load_dataset
from pathlib import Path
from ..utils import ConcatDataset
class grammar(Dataset):
def __init__(
self,
tokenizer,
csv_name=None,
):
try:
self.dataset = load_dataset(
"csv",
data_files={"train": [csv_name]}, # "eval": "grammar_validation.csv"},
delimiter=",",
)
except Exception as e:
print(
"Loading of grammar dataset failed! Please see recipes/ft_datasets/grammar_dataset/grammar_dataset_process.ipynb for details on how to download the dataset."
)
raise e
# self.dataset = load_dataset("wikihow", "all", data_dir="data/", split=type_path)
# if num_samples:
# self.dataset = self.dataset.select(list(range(0, num_samples)))
self.tokenizer = tokenizer
self.print_text = False # print_text
def __len__(self):
return self.dataset["train"].shape[0]
def convert_to_features(self, example_batch):
# Create prompt and tokenize contexts and questions
if self.print_text:
print("Input Text: ", self.clean_text(example_batch["text"]))
input_ = example_batch["input"]
target_ = example_batch["target"]
prompt = (
f"Correct this to standard English: {input_}\n---\nCorrected: {target_}"
)
sample = self.tokenizer(prompt)
return sample
def __getitem__(self, index):
sample = self.convert_to_features(self.dataset["train"][index])
source_ids = sample["input_ids"]
src_mask = sample["attention_mask"]
return {
"input_ids": source_ids,
"attention_mask": src_mask,
"labels": source_ids.copy(),
}
def get_dataset(dataset_config, tokenizer, csv_name=None):
"""cover function for handling loading the working dataset"""
"""dataset loading"""
if csv_name is None:
currPath = Path.cwd() / "datasets_grammar" / "grammar_train.csv"
print(f"Loading dataset {currPath}")
csv_name = str(currPath)
dataset = grammar(
tokenizer=tokenizer,
csv_name=csv_name,
)
return ConcatDataset(dataset, chunk_size=dataset_config.input_length)
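
# Usage sketch (assumes the CSVs produced by grammar_dataset_process.ipynb below):
#
#   from llama_recipes.configs.datasets import grammar_dataset
#   cfg = grammar_dataset()
#   train_set = get_dataset(cfg, tokenizer, csv_name=cfg.train_split)
#
# samples come back pre-chunked to cfg.input_length tokens by ConcatDataset.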
================================================
FILE: llama_recipes/ft_datasets/grammar_dataset/grammar_dataset_process.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Meta Platforms, Inc. and affiliates.\n",
"This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.\n",
"\n",
"Use this notebook to pull in datasets and apply pre-processing. Most grammar datasets unfortunately require preprocessing before being usable in training. (example - jfleg has 4 targets per input, so we have to rematch as 1:1 pairings) "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"from datasets import load_metric, load_dataset\n",
"from pathlib import Path"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"list_replacements = [\n",
" (\" .\", \".\"), \n",
" (\" ,\", \",\"),\n",
" (\" '\", \"'\"),\n",
" (\" ?\", \"?\"),\n",
" (\" !\", \"!\"),\n",
" (\" :\", \"!\"),\n",
" (\" ;\", \"!\"),\n",
" (\" n't\", \"n't\"),\n",
" (\" v\", \"n't\"),\n",
" (\"2 0 0 6\", \"2006\"),\n",
" (\"5 5\", \"55\"),\n",
" (\"4 0 0\", \"400\"),\n",
" (\"1 7-5 0\", \"1750\"),\n",
" (\"2 0 %\", \"20%\"),\n",
" (\"5 0\", \"50\"),\n",
" (\"1 2\", \"12\"),\n",
" (\"1 0\", \"10\"),\n",
" ('\" ballast water', '\"ballast water')\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def correct_spacing(item):\n",
" \"\"\" we iterate through the list of all replacements per each item in dataset\"\"\"\n",
" for fix in list_replacements:\n",
" item = item.replace(fix[0], fix[1])\n",
" return item\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def generate_csv(csv_path, dataset):\n",
" \"\"\" apply spacing corrections and save out matched pairs to csv file as dataset\"\"\"\n",
" with open(csv_path, 'w', newline='') as csvfile:\n",
" writer = csv.writer(csvfile)\n",
" writer.writerow([\"input\", \"target\"])\n",
" for case in dataset:\n",
" \t # Adding the t5 task indication prefix to input \n",
" input_text = case[\"sentence\"]\n",
" input_text = correct_spacing(input_text)\n",
"\n",
" for correction in case[\"corrections\"]:\n",
" correction = correct_spacing(correction)\n",
" # a few of the cases contain blank strings. \n",
" if input_text and correction:\n",
" writer.writerow([input_text, correction])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In Jfleg - validation will be used as 'train', test will be 'validation'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset jfleg (/data/home/mreso/.cache/huggingface/datasets/jfleg/default/1.0.0/ed4ab2367351fe31949f48849ae6732b164f0d5ea6bb5d4357ff4293ac89511b)\n",
"Found cached dataset jfleg (/data/home/mreso/.cache/huggingface/datasets/jfleg/default/1.0.0/ed4ab2367351fe31949f48849ae6732b164f0d5ea6bb5d4357ff4293ac89511b)\n"
]
}
],
"source": [
"train_dataset = load_dataset(\"jfleg\", split='validation[:]') \n",
"eval_dataset = load_dataset(\"jfleg\", split='test[:]')\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset({\n",
" features: ['sentence', 'corrections'],\n",
" num_rows: 755\n",
"})\n",
"Dataset({\n",
" features: ['sentence', 'corrections'],\n",
" num_rows: 748\n",
"})\n"
]
}
],
"source": [
"print(train_dataset)\n",
"print(eval_dataset)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Students can focus on only a few subjects they are intwerested in and they will become an experts in those areas . \n",
"['Students can focus on only a few subjects they are interested in and they will become experts in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become experts in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become an expert in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become an expert in those areas . ']\n"
]
}
],
"source": [
"print(train_dataset['sentence'][22])\n",
"print(train_dataset['corrections'][22])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Students can focus on only a few subjects they are intwerested in and they will become an experts in those areas. '"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clean22 = correct_spacing(train_dataset['sentence'][22])\n",
"clean22"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"jfleg_dir = Path.cwd()/'jfleg_dataset' # if you only use 'jfleg', hf will try and use that and complain\n",
"jfleg_dir.mkdir(parents=True,exist_ok=True)\n",
"c4_dir = Path.cwd()/'c4_dataset'\n",
"c4_dir.mkdir(parents=True,exist_ok=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Process Jfleg data "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"j_train_file = jfleg_dir/'jtrain.csv'\n",
"j_eval_file = jfleg_dir/'jeval.csv'"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"generate_csv(j_train_file, train_dataset)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"generate_csv(j_eval_file, eval_dataset)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Process C4_200M (!) - we'll pull 10K to start"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"c4_dataset = load_dataset(\"liweili/c4_200m\", streaming = True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"iterator = iter(c4_dataset['train'])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def c4_generate_csv(csv_path, iterator, num_examples):\n",
" with open(csv_path, 'w', newline='') as csvfile:\n",
" writer = csv.writer(csvfile)\n",
" writer.writerow([\"input\", \"target\"])\n",
" for i in range(0,num_examples):\n",
" data = next(iterator)\n",
" input_text = data[\"input\"]\n",
" input_text = correct_spacing(input_text)\n",
" correction = correct_spacing(data[\"output\"])\n",
" if input_text and correction:\n",
" writer.writerow([input_text, correction])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"c4_dir = Path.cwd()/'c4_dataset'\n",
"c4_dir.mkdir(parents=True,exist_ok=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can modify the following to make the csv file with desired number of instances, here we go for 10k to make a quick test"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"c4_filename = c4_dir/'c4train_10k.csv'"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"c4_generate_csv(c4_filename, iterator, num_examples=10000)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a single training file by combining jtrain and c4train"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"merge_list = [j_train_file, c4_filename, ]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"combined_csv = pd.concat([pd.read_csv(fn) for fn in merge_list])\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"merged_name = \"gtrain_10k.csv\""
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"combined_csv.to_csv(merged_name, index=False, encoding = 'utf-8-sig', )"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"eval_name = \"grammar_validation.csv\""
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"eval_csv = pd.read_csv(j_eval_file)\n",
"eval_csv.to_csv(eval_name, index=False, encoding = 'utf-8-sig', )"
]
}
],
"metadata": {
"interpreter": {
"hash": "5b2c14c5f2a3b21e6c2412c8196f5145870350e81c0b737cae3e5c60eb1e1eac"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: llama_recipes/ft_datasets/samsum_dataset.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# For dataset details visit: https://huggingface.co/datasets/samsum
import datasets
from .utils import Concatenator
def get_preprocessed_samsum(dataset_config, tokenizer, split):
dataset = datasets.load_dataset("samsum", split=split)
prompt = (
"Summarize this dialog:\n{dialog}\n---\nSummary:\n{summary}{eos_token}"
)
def apply_prompt_template(sample):
return {
"text": prompt.format(
dialog=sample["dialogue"],
summary=sample["summary"],
eos_token=tokenizer.eos_token,
)
}
dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))
dataset = dataset.map(
lambda sample: tokenizer(sample["text"]),
batched=True,
remove_columns=list(dataset.features),
).map(Concatenator(), batched=True)
return dataset
================================================
FILE: llama_recipes/ft_datasets/utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from tqdm import tqdm
from itertools import chain
from torch.utils.data import Dataset
class Concatenator(object):
def __init__(self, chunk_size=2048, wrap_packed_sequences=False):
self.chunk_size = chunk_size
self.residual = {"input_ids": [], "attention_mask": []}
self.wrap_packed_sequences = wrap_packed_sequences
def _wrap_concat(self, batch):
"""
When we pack samples into a single sequence, it's possible that the final
sample's sequence will exceed `chunk_size`. In this case, the `_wrap_concat`
method will wrap the final sample around to the beginning of the next sequence.
This breaks the sample into two parts and may introduce samples that violate prompt formats.
However, it allows us to strictly enforce chunk size.
"""
concatenated_samples = {
k: v + list(chain(*batch[k])) for k, v in self.residual.items()
}
total_length = len(concatenated_samples[list(concatenated_samples.keys())[0]])
if total_length >= self.chunk_size:
chunk_num = total_length // self.chunk_size
result = {
k: [
v[i : i + self.chunk_size]
for i in range(0, chunk_num * self.chunk_size, self.chunk_size)
]
for k, v in concatenated_samples.items()
}
self.residual = {
k: v[(chunk_num * self.chunk_size) :]
for k, v in concatenated_samples.items()
}
else:
result = concatenated_samples
self.residual = {k: [] for k in concatenated_samples.keys()}
# result["labels"] = result["input_ids"].copy()
return result
def _concat(self, batch):
"""
When we pack samples into a single sequence, it's possible that the final
sample's sequence will exceed `chunk_size`. In this case, the `_concat` method
will simply promote the final sample to the next sequence. This may introduce
sequences with variable lengths, e.g. some that are below `chunk_size`,
but it allows us to pack sequences while strictly respecting formatting.
"""
# Initialize current sequences from residual or empty if none exists
keys = batch.keys()
current_sequences = {key: self.residual.get(key, []) for key in keys}
        # We'll store packed sequences in results
        results = {key: [] for key in keys}
num_samples = len(batch[next(iter(keys))])
for idx in range(num_samples):
# Check if adding next sample will exceed the chunk size for any key
len_current_sequences = len(current_sequences[list(keys)[0]])
len_batch_sequence = len(batch[list(keys)[0]][idx])
will_exceed = len_current_sequences + len_batch_sequence > self.chunk_size
if will_exceed:
if len_current_sequences > 0:
for key in keys:
results[key].append(current_sequences[key])
current_sequences[key] = []
# After appending to results, extend current_sequences with the sample for all keys
for key in keys:
current_sequences[key].extend(batch[key][idx])
else:
for key in keys:
current_sequences[key].extend(batch[key][idx])
# Store unappended sequences as residual
self.residual = current_sequences
# results["labels"] = results["input_ids"].copy()
return results
def __call__(self, batch):
if self.wrap_packed_sequences:
return self._wrap_concat(batch)
else:
return self._concat(batch)
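
# Toy illustration of the two packing modes (illustrative token ids): with
# chunk_size=4 and two 3-token samples, wrapping emits one exact 4-token chunk and
# carries the 2-token remainder in `residual`, while the non-wrapping path emits
# the first sample unchanged and holds the second for the next chunk:
#
#   batch = {"input_ids": [[1, 2, 3], [4, 5, 6]], "attention_mask": [[1, 1, 1], [1, 1, 1]]}
#   Concatenator(chunk_size=4, wrap_packed_sequences=True)(batch)
#   # -> {"input_ids": [[1, 2, 3, 4]], ...}; residual keeps [5, 6]
#   Concatenator(chunk_size=4, wrap_packed_sequences=False)(batch)
#   # -> {"input_ids": [[1, 2, 3]], ...}; residual keeps [4, 5, 6]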
class ConcatDataset(Dataset):
def __init__(self, dataset, chunk_size=4096):
self.dataset = dataset
self.chunk_size = chunk_size
self.samples = []
buffer = {
"input_ids": [],
"attention_mask": [],
"labels": [],
}
for sample in tqdm(self.dataset, desc="Preprocessing dataset"):
buffer = {k: v + sample[k] for k, v in buffer.items()}
while len(next(iter(buffer.values()))) > self.chunk_size:
self.samples.append(
{k: v[: self.chunk_size] for k, v in buffer.items()}
)
buffer = {k: v[self.chunk_size :] for k, v in buffer.items()}
def __getitem__(self, idx):
return self.samples[idx]
def __len__(self):
return len(self.samples)
================================================
FILE: llama_recipes/llama_finetuning.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import fire
import torch
# Unused imports removed
from utils import fsdp_auto_wrap_policy
from transformers import (
LlamaForCausalLM,
LlamaTokenizer,
AutoModelForCausalLM,
DataCollatorForTokenClassification,
)
import torch.distributed as dist
# Unused imports removed
from utils.train_utils import (
train,
freeze_transformer_layers,
setup,
setup_environ_flags,
print_model_size,
get_policies,
)
from utils.dataset_utils import get_preprocessed_dataset
from utils.config_utils import (
update_config,
generate_peft_config,
generate_dataset_config,
)
from peft import (
get_peft_model,
prepare_model_for_int8_training,
prepare_model_for_kbit_training,
)
from torch.distributed.fsdp import (
FullyShardedDataParallel as FSDP,
)
from torch.utils.data import DistributedSampler
import policies
from policies import AnyPrecisionAdamW
from configs import fsdp_config, train_config
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
def main(**kwargs):
# Update the configuration for the training and sharding process
update_config((train_config, fsdp_config), **kwargs)
# Set the seeds for reproducibility
torch.cuda.manual_seed(train_config.seed)
torch.manual_seed(train_config.seed)
#########################################################
# CONFIGURE DISTRIBUTED TRAINING -----------------------
#########################################################
if train_config.enable_fsdp:
setup()
# torchrun specific
import os
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
if torch.distributed.is_initialized():
            torch.cuda.set_device(local_rank)
setup_environ_flags(rank)
#########################################################
    # INITIALIZE TOKENIZER ---------------------------------
#########################################################
tokenizer = LlamaTokenizer.from_pretrained(train_config.model_name, legacy=False)
tokenizer.add_special_tokens(
{
"pad_token": "<PAD>",
}
)
#########################################################
# PREPARE TRAIN AND VALIDATION DATA --------------------
#########################################################
dataset_config = generate_dataset_config(train_config, kwargs)
update_config(
dataset_config,
**{
"data_path": train_config.data_path,
"num_validation_samples": train_config.num_validation_samples,
"validation_data_path": train_config.validation_data_path,
"run_validation": train_config.run_validation,
"pack_sequences": train_config.pack_sequences,
"wrap_packed_sequences": train_config.wrap_packed_sequences,
"chunk_size": train_config.chunk_size,
},
)
# Load and preprocess the dataset for training and validation
dataset_train = get_preprocessed_dataset(
tokenizer,
dataset_config,
split="train",
)
if not train_config.enable_fsdp or rank == 0:
print(f"--> Training Set Length = {len(dataset_train)}")
if train_config.run_validation:
dataset_val = get_preprocessed_dataset(
tokenizer,
dataset_config,
split="val",
)
if not train_config.enable_fsdp or rank == 0:
print(f"--> Validation Set Length = {len(dataset_val)}")
else:
dataset_val = None
train_sampler = None
val_sampler = None
if train_config.enable_fsdp:
train_sampler = DistributedSampler(
dataset_train,
rank=dist.get_rank(),
num_replicas=dist.get_world_size(),
shuffle=True,
)
if train_config.run_validation:
val_sampler = DistributedSampler(
dataset_val,
rank=dist.get_rank(),
num_replicas=dist.get_world_size(),
)
# Create DataLoaders for the training and validation dataset
data_collator = DataCollatorForTokenClassification(
tokenizer=tokenizer, padding="longest"
)
train_dataloader = torch.utils.data.DataLoader(
dataset_train,
batch_size=train_config.batch_size_training,
num_workers=train_config.num_workers_dataloader,
pin_memory=True,
sampler=train_sampler if train_sampler else None,
drop_last=True,
collate_fn=data_collator,
)
if train_config.run_validation:
eval_dataloader = torch.utils.data.DataLoader(
dataset_val,
batch_size=train_config.val_batch_size,
num_workers=train_config.num_workers_dataloader,
pin_memory=True,
sampler=val_sampler if val_sampler else None,
drop_last=True,
collate_fn=data_collator,
)
else:
eval_dataloader = None
if len(train_dataloader) == 0:
raise ValueError(
"Training dataloader is empty! This happens when your dataset is too small, relative to your batch size. "
"If `pack_sequences` is `True`, you're more likely to run into this issue, particularly with small datasets that "
"consist of short examples. Try setting `pack_sequences` to `False` and/or reducing your batch size."
)
#########################################################
# CONFIGURE AND INITIALIZE MODEL ------------------------
#########################################################
# Model preparation for full fine-tuning -------
# ----------------------------------------------
if not train_config.use_peft:
print("Loading model for peft")
model = LlamaForCausalLM.from_pretrained(
train_config.model_name,
load_in_8bit=True if train_config.quantization else None,
device_map="auto" if train_config.quantization else None,
)
print("Loaded model")
else:
kwargs["r"] = kwargs[
"lora_rank"
] # can't pass --r to the script, torchrun won't have it
peft_config = generate_peft_config(train_config.peft_method, kwargs)
# Model preparation for QLoRA fine-tuning ------
# ----------------------------------------------
if train_config.peft_method == "qlora":
print("LOADING MODEL FOR QLORA")
bnb_config = generate_peft_config("bitsandbytes_config", kwargs)
import os
print(
f"Loading model from {train_config.model_name}, which contains the following files:"
)
print(os.listdir(train_config.model_name))
model = AutoModelForCausalLM.from_pretrained(
train_config.model_name,
quantization_config=bnb_config,
device_map="auto", # dispatch efficiently the model on the available ressources
# max_memory = {i: max_memory for i in range(num_gpus)},
)
print("Loaded model")
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
# Model preparation for LoRA fine-tuning ------
# ----------------------------------------------
else:
model = LlamaForCausalLM.from_pretrained(
train_config.model_name,
load_in_8bit=True if train_config.quantization else None,
device_map="auto" if train_config.quantization else None,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# We added a special token for padding, so we need to resize the token embeddings
model.resize_token_embeddings(model.config.vocab_size + 1)
print_model_size(model, train_config, rank if train_config.enable_fsdp else 0)
# Prepare the model for int8 training if quantization is enabled
if train_config.quantization:
model = prepare_model_for_int8_training(model)
# Convert the model to bfloat16 if fsdp and pure_bf16 is enabled
if train_config.enable_fsdp and fsdp_config.pure_bf16:
model.to(torch.bfloat16)
# setting up FSDP if enable_fsdp is enabled
if train_config.enable_fsdp:
if not train_config.use_peft and train_config.freeze_layers:
freeze_transformer_layers(train_config.num_freeze_layers)
mixed_precision_policy, wrapping_policy = get_policies(fsdp_config, rank)
my_auto_wrapping_policy = fsdp_auto_wrap_policy(model, LlamaDecoderLayer)
model = FSDP(
model,
auto_wrap_policy=my_auto_wrapping_policy
if train_config.use_peft
else wrapping_policy,
mixed_precision=mixed_precision_policy
if not fsdp_config.pure_bf16
else None,
sharding_strategy=fsdp_config.sharding_strategy,
device_id=torch.cuda.current_device(),
limit_all_gathers=True,
)
if fsdp_config.fsdp_activation_checkpointing:
policies.apply_fsdp_checkpointing(model)
    # Note: When we use QLoRA, we load directly to devices with device_map="auto", so we don't need to move the model to cuda here.
elif (
not train_config.quantization
and not train_config.enable_fsdp
and not train_config.peft_method == "qlora"
):
model.to("cuda")
# Initialize the optimizer and learning rate scheduler
if not train_config.peft_method == "qlora":
if fsdp_config.pure_bf16 and fsdp_config.optimizer == "anyprecision":
optimizer = AnyPrecisionAdamW(
model.parameters(),
lr=train_config.lr,
momentum_dtype=torch.bfloat16,
variance_dtype=torch.bfloat16,
use_kahan_summation=False,
)
else:
optimizer = optim.AdamW(
model.parameters(),
lr=train_config.lr,
weight_decay=0.0,
)
gradient_accumulation_steps = train_config.gradient_accumulation_steps
if not train_config.peft_method == "qlora":
scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
# Start the training process
results = train(
model,
train_dataloader,
eval_dataloader,
tokenizer,
optimizer,
scheduler,
gradient_accumulation_steps,
train_config,
fsdp_config if train_config.enable_fsdp else None,
local_rank if train_config.enable_fsdp else None,
rank if train_config.enable_fsdp else None,
)
if not train_config.enable_fsdp or rank == 0:
[print(f"Key: {k}, Value: {v}") for k, v in results.items()]
else:
from transformers import TrainingArguments, Trainer
from trl.trainer.utils import PeftSavingCallback
training_args = TrainingArguments(
output_dir=train_config.output_dir,
per_device_train_batch_size=train_config.batch_size_training,
per_device_eval_batch_size=train_config.val_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
learning_rate=train_config.lr,
bf16=True,
log_level="info",
logging_steps=10,
optim="paged_adamw_32bit",
warmup_ratio=0.03,
save_strategy="no",
num_train_epochs=train_config.num_epochs,
gradient_checkpointing=True,
do_eval=True,
)
trainer = Trainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset_train,
eval_dataset=dataset_val,
data_collator=data_collator,
# peft_config=peft_config,
args=training_args,
compute_metrics=None,
callbacks=[PeftSavingCallback],
)
trainer.train()
trainer.model.save_pretrained(train_config.output_dir)
if __name__ == "__main__":
fire.Fire(main)
================================================
FILE: llama_recipes/model_checkpointing/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from .checkpoint_handler import (
load_model_checkpoint,
save_model_checkpoint,
load_optimizer_checkpoint,
save_optimizer_checkpoint,
save_model_and_optimizer_sharded,
load_model_sharded,
load_sharded_model_single_gpu,
)
================================================
FILE: llama_recipes/model_checkpointing/checkpoint_handler.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from pathlib import Path
from datetime import datetime
import torch
import time
from torch.distributed.fsdp import (
FullyShardedDataParallel as FSDP,
StateDictType,
FullStateDictConfig, # general model non-sharded, non-flattened params
)
from torch.distributed._shard.checkpoint import (
FileSystemReader,
)
from torch.distributed.checkpoint.default_planner import (
DefaultSavePlanner,
)
import torch.distributed._shard.checkpoint as dist_cp
import torch.distributed as dist
def get_date_of_run():
"""create date and time for file save uniqueness
    example: 2022-05-07-08:31:12_PM
"""
date_of_run = datetime.now().strftime("%Y-%m-%d-%I:%M:%S_%p")
print(f"--> current date and time of run = {date_of_run}")
return date_of_run
# create singleton saving policies to avoid making over and over
fullstate_save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
def load_model_sharded(model, rank, cfg):
# torch.manual_seed(103)
folder_name = (
cfg.dist_checkpoint_root_folder
+ "/"
+ cfg.dist_checkpoint_folder
+ "-"
+ cfg.model_name
)
load_dir = Path.cwd() / folder_name
if not load_dir.exists():
if rank == 0:
print("No sharded_state_dict checkpoint directory found...skipping")
return
if rank == 0:
print(f"loading model from model path: {load_dir} ")
reader = FileSystemReader(load_dir)
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
checkpoint = model.state_dict()
if rank == 0:
ck = checkpoint.keys()
print(f" checkpoint key len = {len(ck)} and \n keys = {ck}")
dist_cp.load_state_dict(
state_dict=checkpoint,
storage_reader=reader,
)
if rank == 0:
print("checkpoint after load_state_dict()")
ck = checkpoint.keys()
print(f" checkpoint key len = {len(ck)} and \n keys = {ck}")
model.load_state_dict(checkpoint)
if rank == 0:
print(f"Sharded state checkpoint loaded from {load_dir}")
def save_model_and_optimizer_sharded(model, rank, cfg, optim=None):
"""save model and optimizer via sharded_state_dict to save_dir"""
folder_name = (
cfg.dist_checkpoint_root_folder
+ "/"
+ cfg.dist_checkpoint_folder
+ "-"
+ cfg.model_name
)
save_dir = Path.cwd() / folder_name
if rank == 0:
print(f"Saving model to {save_dir}")
distributed_writer = dist_cp.FileSystemWriter(
save_dir,
)
t0 = time.perf_counter()
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
state_dict = {"model": model.state_dict()}
if optim is not None:
state_dict["optim"] = FSDP.optim_state_dict(model, optim)
dist_cp.save_state_dict(
state_dict=state_dict,
storage_writer=distributed_writer,
planner=DefaultSavePlanner(),
)
dist.barrier()
t1 = time.perf_counter()
if rank == 0:
print(f"Sharded state checkpoint saved to {save_dir}")
print(f"Checkpoint Time = {t1-t0:.4f}\n")
def save_model_checkpoint(
model,
optimizer,
rank,
cfg,
epoch=1,
):
"""saving model via rank0 cpu streaming and full_state_dict"""
with FSDP.state_dict_type(
model, StateDictType.FULL_STATE_DICT, fullstate_save_policy
):
cpu_state = model.state_dict()
print(f"saving process: rank {rank} done w model state_dict\n")
if rank == 0:
print("--> saving model ...")
# create save path
folder_name = (
cfg.dist_checkpoint_root_folder
+ "/"
+ cfg.dist_checkpoint_folder
+ "-"
+ cfg.model_name
)
save_dir = Path.cwd() / folder_name
save_dir.mkdir(parents=True, exist_ok=True)
save_name = cfg.model_name + "-" + str(epoch) + ".pt"
save_full_path = str(save_dir) + "/" + save_name
# save model
torch.save(cpu_state, save_full_path)
print(f"model checkpoint saved for epoch {epoch} at {save_full_path}\n")
def load_model_checkpoint(model, rank, cfg):
"""load local checkpoint to rank0 cpu
must be called * before * passing to FSDP"""
if rank != 0:
return
# where is the checkpoint at...
full_state_dict_model_path = (
Path.cwd() / cfg.checkpoint_folder / cfg.checkpoint_model_filename
)
# is it present...
if not full_state_dict_model_path.is_file():
print(
f"model checkpoint {full_state_dict_model_path} not present. Returning..."
)
return
model_checkpoint = torch.load(full_state_dict_model_path)
# integrate into loaded model
model.load_state_dict(model_checkpoint)
print("model checkpoint loaded to rank0 cpu")
def save_optimizer_checkpoint(model, optimizer, rank, cfg, epoch=1):
"""save optimizer state via full state dict"""
print(f"--> optim state call on rank {rank}\n")
# pull all sharded optimizer states to rank0 cpu...
optim_state = FSDP.full_optim_state_dict(model, optimizer)
print(f"optim state dict ready on {rank} and len of {len(optim_state)}\n")
if rank == 0:
folder_name = (
cfg.dist_checkpoint_root_folder
+ "/"
+ cfg.dist_checkpoint_folder
+ "-"
+ cfg.model_name
)
save_dir = Path.cwd() / folder_name
save_dir.mkdir(parents=True, exist_ok=True)
opt_save_name = "optimizer" + "-" + cfg.model_name + "-" + str(epoch) + ".pt"
opt_save_full_path = save_dir / opt_save_name
print("--> saving optimizer state...")
torch.save(optim_state, opt_save_full_path)
print(f"--> saved {opt_save_full_path} to disk")
def load_optimizer_checkpoint(model, optimizer_checkpoint_path, rank):
"""load an fsdp optimizer full_state checkpoint using scatter method
this ensures only rank 0 loads the optimizer state dict and scatters to other ranks
"""
if not optimizer_checkpoint_path.is_file():
print(
f"warning - optimizer checkpoint not present {optimizer_checkpoint_path}. Returning. "
)
return
full_osd = None
if rank == 0:
full_osd = torch.load(optimizer_checkpoint_path)
# called from all ranks, though only rank0 has a valid param for full_osd
sharded_osd = FSDP.scatter_full_optim_state_dict(full_osd, model)
print(f"optimizer shard loaded on rank {rank}")
def load_sharded_model_single_gpu(model, model_path):
reader = FileSystemReader(model_path)
state_dict = {"model": model.state_dict()}
dist_cp.load_state_dict(
state_dict=state_dict,
        storage_reader=reader,
no_dist=True,
)
model.load_state_dict(state_dict["model"])
print(f"Sharded state checkpoint loaded from {model_path}")
return model
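
# Usage sketch (illustration only; cfg is a train_config-style object supplying
# dist_checkpoint_root_folder, dist_checkpoint_folder and model_name):
#
#   save_model_and_optimizer_sharded(model, rank, cfg, optim=optimizer)
#   ...
#   load_model_sharded(model, rank, cfg)
#
# both calls are collective and must run on every rank, against the same
# FSDP-wrapped model.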
================================================
FILE: llama_recipes/multi_node.slurm
================================================
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the GNU General Public License version 3.
#SBATCH --job-name=Nano-2d-trainer-20b-8nodes
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --gpus-per-task=4
#SBATCH --partition=train
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Enable for A100
export FI_PROVIDER="efa"
echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="ens"
export FI_EFA_USE_DEVICE_RDMA=1
srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora
================================================
FILE: llama_recipes/policies/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from .mixed_precision import *
from .wrapping import *
from .activation_checkpointing_functions import apply_fsdp_checkpointing
from .anyprecision_optimizer import AnyPrecisionAdamW
================================================
FILE: llama_recipes/policies/activation_checkpointing_functions.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
checkpoint_wrapper,
CheckpointImpl,
apply_activation_checkpointing,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from functools import partial
non_reentrant_wrapper = partial(
checkpoint_wrapper,
checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
check_fn = lambda submodule: isinstance(submodule, LlamaDecoderLayer)
def apply_fsdp_checkpointing(model):
"""apply activation checkpointing to model
returns None as model is updated directly
"""
print("--> applying fsdp activation checkpointing...")
apply_activation_checkpointing(
model, checkpoint_wrapper_fn=non_reentrant_wrapper, check_fn=check_fn
)
================================================
FILE: llama_recipes/policies/anyprecision_optimizer.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# AnyPrecisionAdamW: a flexible precision AdamW optimizer
# with optional Kahan summation for high precision weight updates.
# Allows direct control over momentum, variance and auxiliary compensation
# buffer dtypes.
# Optional Kahan summation is used to offset precision reduction for
# the weight updates. This allows full training in BFloat16 (equal or
# better than FP32 results in many cases) due to high precision weight updates.
import torch
from torch.optim.optimizer import Optimizer
class AnyPrecisionAdamW(Optimizer):
def __init__(
self,
params,
lr=1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0.0,
use_kahan_summation=False,
momentum_dtype=torch.bfloat16,
variance_dtype=torch.bfloat16,
compensation_buffer_dtype=torch.bfloat16,
):
"""
Args:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing
running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve
numerical stability (default: 1e-8)
            weight_decay (float, optional): weight decay coefficient (default: 0.0)
# Any Precision specific
use_kahan_summation = creates auxiliary buffer to ensure high precision
model param updates (default: False)
            momentum_dtype = dtype for momentum (default: BFloat16)
variance_dtype = dtype for uncentered variance (default: BFloat16)
compensation_buffer_dtype = dtype for Kahan summation
buffer (default: BFloat16)
# Usage
This optimizer implements optimizer states, and Kahan summation
for high precision updates, all in user controlled dtypes.
        Defaults are momentum and variance in BF16.
This can be run in FSDP mixed precision, amp, or full precision,
depending on what training pipeline you wish to work with.
Setting to use_kahan_summation = False, and changing momentum and
variance dtypes to FP32, reverts this to a standard AdamW optimizer.
"""
defaults = dict(
lr=lr,
betas=betas,
eps=eps,
weight_decay=weight_decay,
use_kahan_summation=use_kahan_summation,
momentum_dtype=momentum_dtype,
variance_dtype=variance_dtype,
compensation_buffer_dtype=compensation_buffer_dtype,
)
super().__init__(params, defaults)
@torch.no_grad()
def step(self, closure=None):
"""Performs a single optimization step.
Args:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
"""
if closure is not None:
with torch.enable_grad():
# to fix linter, we do not keep the returned loss for use atm.
closure()
for group in self.param_groups:
beta1, beta2 = group["betas"]
lr = group["lr"]
weight_decay = group["weight_decay"]
eps = group["eps"]
use_kahan_summation = group["use_kahan_summation"]
momentum_dtype = group["momentum_dtype"]
variance_dtype = group["variance_dtype"]
compensation_buffer_dtype = group["compensation_buffer_dtype"]
for p in group["params"]:
if p.grad is None:
continue
if p.grad.is_sparse:
raise RuntimeError(
"AnyPrecisionAdamW does not support sparse gradients"
)
state = self.state[p]
# State initialization
if len(state) == 0:
state["step"] = torch.tensor(0.0)
# momentum - EMA of gradient values
state["exp_avg"] = torch.zeros_like(
p,
dtype=momentum_dtype,
)
# variance uncentered - EMA of squared gradient values
state["exp_avg_sq"] = torch.zeros_like(
p,
dtype=variance_dtype,
)
# optional Kahan summation - accumulated error tracker
if use_kahan_summation:
state["compensation"] = torch.zeros_like(
p,
dtype=compensation_buffer_dtype,
)
# main processing -------------------------
# update the steps for each param group update
state["step"] += 1
step = state["step"]
exp_avg = state["exp_avg"]
exp_avg_sq = state["exp_avg_sq"]
grad = p.grad
# weight decay, AdamW style
if weight_decay:
p.data.mul_(1 - lr * weight_decay)
# update momentum
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
# update uncentered variance
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
# adjust using bias1
bias_correction1 = 1 - beta1**step
step_size = lr / bias_correction1
# adjust using bias2
denom_correction = (1 - beta2**step) ** 0.5 # avoids math import
centered_variance = (exp_avg_sq.sqrt() / denom_correction).add_(
eps, alpha=1
)
# lr update to compensation
if use_kahan_summation:
compensation = state["compensation"]
compensation.addcdiv_(exp_avg, centered_variance, value=-step_size)
# update weights with compensation (Kahan summation)
# save error back to compensation for next iteration
temp_buffer = p.detach().clone()
p.data.add_(compensation)
compensation.add_(temp_buffer.sub_(p.data))
else:
# usual AdamW updates
p.data.addcdiv_(exp_avg, centered_variance, value=-step_size)
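
# Standalone illustration of the Kahan compensation used above (pure Python floats,
# not part of the optimizer): adding 1.0 to 1e16 is below float64 resolution, so a
# naive loop never moves, while the compensation buffer recovers the lost updates.
#
#   total, compensation = 1e16, 0.0
#   for _ in range(10):
#       update = 1.0 + compensation
#       new_total = total + update                   # low-order bits may round away
#       compensation = update - (new_total - total)  # capture the rounding error
#       total = new_total
#   # total == 1e16 + 10, whereas ten naive `total += 1.0` steps leave 1e16 unchanged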
================================================
FILE: llama_recipes/policies/mixed_precision.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import torch
from torch.distributed.fsdp import (
# FullyShardedDataParallel as FSDP,
# CPUOffload,
MixedPrecision,
# BackwardPrefetch,
# ShardingStrategy,
)
# requires grad scaler in main loop
fpSixteen = MixedPrecision(
param_dtype=torch.float16,
# Gradient communication precision.
reduce_dtype=torch.float16,
# Buffer precision.
buffer_dtype=torch.float16,
)
bfSixteen = MixedPrecision(
param_dtype=torch.bfloat16,
# Gradient communication precision.
reduce_dtype=torch.bfloat16,
# Buffer precision.
buffer_dtype=torch.bfloat16,
cast_forward_inputs=True,
)
bfSixteen_mixed = MixedPrecision(
param_dtype=torch.float32,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
)
fp32_policy = MixedPrecision(
param_dtype=torch.float32,
reduce_dtype=torch.float32,
buffer_dtype=torch.float32,
)
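
# Usage sketch (illustration only): a policy is handed to the FSDP constructor,
# mirroring how llama_finetuning.py wires in the result of get_policies:
#
#   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
#   model = FSDP(model, mixed_precision=bfSixteen)
#
# bfSixteen needs hardware bf16 support; fpSixteen additionally requires a gradient
# scaler in the training loop, per the comment above.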
================================================
FILE: llama_recipes/policies/wrapping.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from torch.distributed.fsdp.wrap import (
transformer_auto_wrap_policy,
size_based_auto_wrap_policy,
)
import functools
def get_size_policy(min_params=1e8):
num_wrap_policy = functools.partial(
size_based_auto_wrap_policy, min_num_params=min_params
)
return num_wrap_policy
def get_llama_wrapper():
"""we register our main layer class and use the fsdp transformer wrapping policy
ensures embedding layers are in the root fsdp unit for shared access and that fsdp units map to transformer layers
"""
# ==== use new transformer wrapper
llama_auto_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
transformer_layer_cls={
LlamaDecoderLayer,
},
)
return llama_auto_wrap_policy
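# Usage sketch (not part of the file above): the returned policy plugs into
# FSDP's `auto_wrap_policy` argument; `model` is assumed to be a Hugging Face
# LlamaForCausalLM with the process group already initialized.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_llama(model):
    # Each LlamaDecoderLayer becomes its own FSDP unit; everything else,
    # including the embeddings, stays in the root unit.
    return FSDP(model, auto_wrap_policy=get_llama_wrapper())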
================================================
FILE: llama_recipes/quickstart.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Meta Platforms, Inc. and affiliates.\n",
"This software may be used and distributed according to the terms of the Llama 2 Community License Agreement."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quick Start Notebook\n",
"\n",
"This notebook shows how to train a Llama 2 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA.\n",
"\n",
"### Step 0: Install pre-requirements and convert checkpoint\n",
"\n",
"The example uses the Hugging Face trainer and model which means that the checkpoint has to be converted from its original format into the dedicated Hugging Face format.\n",
"The conversion can be achieved by running the `convert_llama_weights_to_hf.py` script provided with the transformer package.\n",
"Given that the original checkpoint resides under `models/7B` we can install all requirements and convert the checkpoint with:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %%bash\n",
"# pip install transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler ipywidgets\n",
"# TRANSFORM=`python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"`\n",
"# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Load the model\n",
"\n",
"Point model_id to model weight folder"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"===================================BUG REPORT===================================\n",
"Welcome to bitsandbytes. For bug reports, please run\n",
"\n",
"python -m bitsandbytes\n",
"\n",
" and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
"================================================================================\n",
"bin /data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112.so\n",
"CUDA SETUP: CUDA runtime path found: /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so\n",
"CUDA SETUP: Highest compute capability among GPUs detected: 8.0\n",
"CUDA SETUP: Detected CUDA version 112\n",
"CUDA SETUP: Loading binary /data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112.so...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /data/home/mreso/miniconda3/envs/llama did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...\n",
" warn(msg)\n",
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/efa/lib')}\n",
" warn(msg)\n",
"The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.\n",
"Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.09s/it]\n"
]
}
],
"source": [
"import torch\n",
"from transformers import LlamaForCausalLM, LlamaTokenizer\n",
"\n",
"model_id=\"./models_hf/7B\"\n",
"\n",
"tokenizer = LlamaTokenizer.from_pretrained(model_id)\n",
"\n",
"model =LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Load the preprocessed dataset\n",
"\n",
"We load and preprocess the samsum dataset which consists of curated pairs of dialogs and their summarization:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset samsum (/data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)\n",
"Loading cached processed dataset at /data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-b14554a76c1c7ecd.arrow\n",
"Loading cached processed dataset at /data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-e40e61e15ebeb527.arrow\n",
"Loading cached processed dataset at /data/home/mreso/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-e08ac9e1b792e7ba.arrow\n"
]
}
],
"source": [
"from pathlib import Path\n",
"import os\n",
"import sys\n",
"from utils.dataset_utils import get_preprocessed_dataset\n",
"from configs.datasets import samsum_dataset\n",
"\n",
"train_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'train')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Check base model\n",
"\n",
"Run the base model on an example input:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Summarize this dialog:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
"A: I'll get him one of those little dogs.\n",
"B: One that won't grow up too big;-)\n",
"A: And eat too much;-))\n",
"B: Do you know which one he would like?\n",
"A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
"B: I bet you had to drag him away.\n",
"A: He wanted to take it home right away ;-).\n",
"B: I wonder what he'll name it.\n",
"A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))\n",
"---\n",
"Summary:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B\n"
]
}
],
"source": [
"eval_prompt = \"\"\"\n",
"Summarize this dialog:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
"A: I'll get him one of those little dogs.\n",
"B: One that won't grow up too big;-)\n",
"A: And eat too much;-))\n",
"B: Do you know which one he would like?\n",
"A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
"B: I bet you had to drag him away.\n",
"A: He wanted to take it home right away ;-).\n",
"B: I wonder what he'll name it.\n",
"A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))\n",
"---\n",
"Summary:\n",
"\"\"\"\n",
"\n",
"model_input = tokenizer(eval_prompt, return_tensors=\"pt\").to(\"cuda\")\n",
"\n",
"model.eval()\n",
"with torch.no_grad():\n",
" print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the base model only repeats the conversation.\n",
"\n",
"### Step 4: Prepare model for PEFT\n",
"\n",
"Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199\n"
]
}
],
"source": [
"model.train()\n",
"\n",
"def create_peft_config(model):\n",
" from peft import (\n",
" get_peft_model,\n",
" LoraConfig,\n",
" TaskType,\n",
" prepare_model_for_int8_training,\n",
" )\n",
"\n",
" peft_config = LoraConfig(\n",
" task_type=TaskType.CAUSAL_LM,\n",
" inference_mode=False,\n",
" r=8,\n",
" lora_alpha=32,\n",
" lora_dropout=0.05,\n",
" target_modules = [\"q_proj\", \"v_proj\"]\n",
" )\n",
"\n",
" # prepare int-8 model for training\n",
" model = prepare_model_for_int8_training(model)\n",
" model = get_peft_model(model, peft_config)\n",
" model.print_trainable_parameters()\n",
" return model, peft_config\n",
"\n",
"# create peft config\n",
"model, lora_config = create_peft_config(model)\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"### Step 5: Define an optional profiler"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from transformers import TrainerCallback\n",
"from contextlib import nullcontext\n",
"enable_profiler = False\n",
"output_dir = \"tmp/llama-output\"\n",
"\n",
"config = {\n",
" 'lora_config': lora_config,\n",
" 'learning_rate': 1e-4,\n",
" 'num_train_epochs': 1,\n",
" 'gradient_accumulation_steps': 2,\n",
" 'per_device_train_batch_size': 2,\n",
" 'gradient_checkpointing': False,\n",
"}\n",
"\n",
"# Set up profiler\n",
"if enable_profiler:\n",
" wait, warmup, active, repeat = 1, 1, 2, 1\n",
" total_steps = (wait + warmup + active) * (1 + repeat)\n",
" schedule = torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)\n",
" profiler = torch.profiler.profile(\n",
" schedule=schedule,\n",
" on_trace_ready=torch.profiler.tensorboard_trace_handler(f\"{output_dir}/logs/tensorboard\"),\n",
" record_shapes=True,\n",
" profile_memory=True,\n",
" with_stack=True)\n",
" \n",
" class ProfilerCallback(TrainerCallback):\n",
" def __init__(self, profiler):\n",
" self.profiler = profiler\n",
" \n",
" def on_step_end(self, *args, **kwargs):\n",
" self.profiler.step()\n",
"\n",
" profiler_callback = ProfilerCallback(profiler)\n",
"else:\n",
" profiler = nullcontext()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Fine tune the model\n",
"\n",
"Here, we fine tune the model for a single epoch which takes a bit more than an hour on a A100."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...\n",
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:321: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization\n",
" warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n",
"/data/home/mreso/miniconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:321: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n",
" warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n"
]
},
{
"data": {
"text/html": [
"\n",
" <div>\n",
" \n",
" <progress value='389' max='389' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [389/389 1:12:06, Epoch 1/1]\n",
" </div>\n",
" <table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>Step</th>\n",
" <th>Training Loss</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>10</td>\n",
" <td>1.965000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>20</td>\n",
" <td>1.845600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>30</td>\n",
" <td>1.801100</td>\n",
" </tr>\n",
" <tr>\n",
" <td>40</td>\n",
" <td>1.780900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>50</td>\n",
" <td>1.715400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>60</td>\n",
" <td>1.697800</td>\n",
" </tr>\n",
" <tr>\n",
" <td>70</td>\n",
" <td>1.707600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>80</td>\n",
" <td>1.713300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>90</td>\n",
" <td>1.663900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>100</td>\n",
" <td>1.702700</td>\n",
" </tr>\n",
" <tr>\n",
" <td>110</td>\n",
" <td>1.658800</td>\n",
" </tr>\n",
" <tr>\n",
" <td>120</td>\n",
" <td>1.692400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>130</td>\n",
" <td>1.644900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>140</td>\n",
" <td>1.687900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>150</td>\n",
" <td>1.686600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>160</td>\n",
" <td>1.649600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>170</td>\n",
" <td>1.666900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>180</td>\n",
" <td>1.709200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>190</td>\n",
" <td>1.670400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>200</td>\n",
" <td>1.662700</td>\n",
" </tr>\n",
" <tr>\n",
" <td>210</td>\n",
" <td>1.681300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>220</td>\n",
" <td>1.685500</td>\n",
" </tr>\n",
" <tr>\n",
" <td>230</td>\n",
" <td>1.663400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>240</td>\n",
" <td>1.638300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>250</td>\n",
" <td>1.627400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>260</td>\n",
" <td>1.654300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>270</td>\n",
" <td>1.640900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>280</td>\n",
" <td>1.674700</td>\n",
" </tr>\n",
" <tr>\n",
" <td>290</td>\n",
" <td>1.657300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>300</td>\n",
" <td>1.660200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>310</td>\n",
" <td>1.666600</td>\n",
" </tr>\n",
" <tr>\n",
" <td>320</td>\n",
" <td>1.674500</td>\n",
" </tr>\n",
" <tr>\n",
" <td>330</td>\n",
" <td>1.656200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>340</td>\n",
" <td>1.684300</td>\n",
" </tr>\n",
" <tr>\n",
" <td>350</td>\n",
" <td>1.667900</td>\n",
" </tr>\n",
" <tr>\n",
" <td>360</td>\n",
" <td>1.661400</td>\n",
" </tr>\n",
" <tr>\n",
" <td>370</td>\n",
" <td>1.676800</td>\n",
" </tr>\n",
" <tr>\n",
" <td>380</td>\n",
" <td>1.628100</td>\n",
" </tr>\n",
" </tbody>\n",
"</table><p>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import default_data_collator, Trainer, TrainingArguments\n",
"\n",
"\n",
"\n",
"# Define training args\n",
"training_args = TrainingArguments(\n",
" output_dir=output_dir,\n",
" overwrite_output_dir=True,\n",
" bf16=True, # Use BF16 if available\n",
" # logging strategies\n",
" logging_dir=f\"{output_dir}/logs\",\n",
" logging_strategy=\"steps\",\n",
" logging_steps=10,\n",
" save_strategy=\"no\",\n",
" optim=\"adamw_torch_fused\",\n",
" max_steps=total_steps if enable_profiler else -1,\n",
" **{k:v for k,v in config.items() if k != 'lora_config'}\n",
")\n",
"\n",
"with profiler:\n",
" # Create Trainer instance\n",
" trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=train_dataset,\n",
" data_collator=default_data_collator,\n",
" callbacks=[profiler_callback] if enable_profiler else [],\n",
" )\n",
"\n",
" # Start training\n",
" trainer.train()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 7:\n",
"Save model checkpoint"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"model.save_pretrained(output_dir)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 8:\n",
"Try the fine tuned model on the same example again to see the learning progress:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Summarize this dialog:\n",
"A: Hi Tom, are you busy tomorrow’s afternoon?\n",
"B: I’m pretty sure I am. What’s up?\n",
"A: Can you go with me to the animal shelter?.\n",
"B: What do you want to do?\n",
"A: I want to get a puppy for my son.\n",
"B: That will make him so happy.\n",
"A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
"B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
"A: I'll get him one of those little dogs.\n",
"B: One that won't grow up too big;-)\n",
"A: And eat too much;-))\n",
"B: Do you know which one he would like?\n",
"A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
"B: I bet you had to drag him away.\n",
"A: He wanted to take it home right away ;-).\n",
"B: I wonder what he'll name it.\n",
"A: He said he’d name it after his dead hamster – Lemmy - he's a great Motorhead fan :-)))\n",
"---\n",
"Summary:\n",
"A wants to get a puppy for his son. He took him to the animal shelter last Monday. He showed him one that he really liked. A will name it after his dead hamster - Lemmy.\n"
]
}
],
"source": [
"model.eval()\n",
"with torch.no_grad():\n",
" print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}
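Step 7 above saves only the LoRA adapter weights via `save_pretrained`. A hedged sketch (not a notebook cell) of reloading that adapter onto a fresh base model in a later session; the paths mirror the notebook's `model_id` and `output_dir`:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "./models_hf/7B", load_in_8bit=True, device_map="auto", torch_dtype=torch.float16
)
tokenizer = LlamaTokenizer.from_pretrained("./models_hf/7B")
model = PeftModel.from_pretrained(base, "tmp/llama-output")  # adapter from Step 7
model.eval()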
================================================
FILE: llama_recipes/requirements.txt
================================================
-f https://download.pytorch.org/whl/torch_stable.html
torch==2.0.1+cu118
accelerate
appdirs
loralib
bitsandbytes==0.39.1
black
black[jupyter]
datasets
fire
git+https://github.com/huggingface/peft.git
transformers>=4.31.0
sentencepiece
py7zr
scipy
================================================
FILE: llama_recipes/scripts/markdown_link_check_config.json
================================================
{
"retryOn429": true,
"retryCount": 5,
"fallbackRetryDelay": "10s",
"httpHeaders": [
{
"urls": [
"https://docs.github.com/",
"https://help.github.com/"
],
"headers": {
"Accept-Encoding": "zstd, br, gzip, deflate"
}
}
],
"ignorePatterns": [
{
"pattern": "^http(s)?://127.0.0.1.*"
},
{
"pattern": "^http(s)?://localhost.*"
},
{
"pattern": "https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html"
}
]
}
================================================
FILE: llama_recipes/scripts/spellcheck.sh
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
# Source: https://github.com/pytorch/torchx/blob/main/scripts/spellcheck.sh
set -ex
sudo apt-get install aspell
if [[ -z "$@" ]]; then
sources=$(find -name '*.md')
else
sources=$@
fi
sources_arg=""
for src in $sources; do
sources_arg="${sources_arg} -S $src"
done
if [ ! "$sources_arg" ]; then
echo "No files to spellcheck"
else
pyspelling -c scripts/spellcheck_conf/spellcheck.yaml --name Markdown $sources_arg
fi
================================================
FILE: llama_recipes/scripts/spellcheck_conf/spellcheck.yaml
================================================
matrix:
- name: Markdown
  aspell:
lang: en
d: en_US
dictionary:
wordlists:
- scripts/spellcheck_conf/wordlist.txt
output: scripts/spellcheck_conf/wordlist.dic
encoding: utf-8
pipeline:
- pyspelling.filters.context:
context_visible_first: true
delimiters:
- open: '(?s)^ *(?P<open>`{3,})[a-z0-9]*?$'
close: '^(?P=open)$'
- open: ''
content: 'https?://[-a-zA-Z0-9.]+?\.[a-z]{2,6}[-?=&%.0-9a-zA-Z/_#]*'
close: ''
- pyspelling.filters.markdown:
markdown_extensions:
- markdown.extensions.extra:
================================================
FILE: llama_recipes/scripts/spellcheck_conf/wordlist.txt
================================================
BaseHandler
ImageNet
RGB
TorchServe
archiver
dataset
github
href
https
json
li
py
pytorch
segmenter
torchvision
ul
usecase
CUDA
JDK
NVIDIA
WSL
bashrc
cd
githubusercontent
html
microsoft
ol
openjdk
OpenJDK
pre
psutil
sentencepiece
src
sudo
torchtext
ubuntu
wget
APIs
Eg
MilliSeconds
URI
YAML
dataflow
func
lt
md
params
postprocess
postprocessing
preprocess
preprocessing
serializable
tbody
td
th
thead
unregister
url
CONFIG
MNIST
README
hotdogs
ncs
squeezenet
vgg
TorchServe's
cfg
configs
runtime
yyyyMMddHHmmssSSS
AWS
Benchmarking
Captum
Grafana
HuggingFace
JMeter
KMS
Kubeflow
Kubernetes
MMF
contrib
ddb
gRPC
ipynb
mlflow
nmt
performant
torschripted
API's
ASG
Django
Dockerfile
ELB
LoadBalancer
OpenAPI
PyPi
SDK
SageMaker
blockquote
cli
cloudformation
cmd
dev
dir
io
issuecomment
lxning
netty
perf
presigned
tagname
txt
ConfigManager
GPL
NVSMI
Powershell
Redistributable
env
exe
frontend
msi
nodejs
npm
prebuilt
smi
stackoverflow
util
AlexNet
DeepLabV
Densenet
FCN
RCNN
ResNet
Torchscripted
fastrcnn
jpg
maskrcnn
png
KFServing
Seldon
ai
analytics
orchestrator
PMD
backend
checkstyle
cov
gradlew
htmlcov
node.js
pylint
pylintrc
pytest
rcfile
tcort
ut
localhost
myworkflow
wfpredict
Bytearray
CN
CORS
EventLoopGroup
EventLoops
GPUs
JVM
MaxDirectMemorySize
OU
OpenSSL
PCI
PIL
PKCS
PYTHONPATH
Palo
RSA
SSL
WorkerThread
amazonaws
async
batchSize
changeit
dalay
defaultVersion
dep
dname
envvars
genkey
gpu
gz
keyalg
keyout
keysize
keystore
keytool
livebook
marName
maxBatchDelay
maxWorkers
minWorkers
modelName
msec
mycert
mykey
natively
newkey
noop
parameterName
parameterNameN
parameterValue
parameterValueN
pathname
pem
preflight
readthedocs
req
responseTimeout
scalability
storepass
storetype
urls
utf
vmargs
wlm
www
yourdomain
nextPageToken
subfolder
unregistering
workflowDag
workflowName
workflowUrl
Javascript
RESTful
codegen
Args
CustomImageClassifier
DefaultHandlerClass
ImageClassifier
Init
LayerIntegratedGradients
ModelHandler
NDArray
PredictionException
Preprocessed
RuntimeError
Waveglow
cpu
embeddings
fp
ie
isfile
isinstance
jit
kwargs
os
param
pred
pth
pyt
serializedFile
str
tacotron
utils
vCPUs
waveglowpyt
DL
LJO
MiB
cv
dockerd
entrypoint
gpuId
gpuUsage
inferencing
loadedAtStartup
memoryUsage
milli
modelUrl
modelVersion
pid
startTime
Captum's
InferenceAPIsService
ModelServer
br
kf
proto
CPUUtilization
DiskAvailable
DiskUsage
DiskUsed
DiskUtilization
DistanceInKM
HostName
InferenceTime
JSONLayout
LoopCount
MemoryAvailable
MemoryUsed
MemoryUtilization
MetricName
SizeOfImage
StatsD
appender
dimN
etsy
formatter
idx
img
kB
DescribeModel
ListModels
RegisterModel
ScaleWorker
SetDefault
UnregisterModel
gRPCs
grpcio
mkdir
protobuf
protoc
repo
BackendWorker
ConversionPattern
Dlog
MaxBackupIndex
MaxFileSize
PatternLayout
RollingFileAppender
WorkerLifeCycle
apache
nnvm
stderr
stdout
ConflictStatusException
DownloadModelException
InvalidSnapshotException
ModelNotFoundException
NoSuchMethodError
ServiceUnavailableException
lang
mb
ntl
PrometheusServer
globoff
noopversioned
systemctl
uuid
yml
AWSS
AmazonS
IAM
ManagementAPIsService
ReadOnlyAccess
UserGuide
UsingKMSEncryption
acknowledgement
macOS
sse
fairseq
libs
mv
pretrained
publically
ready-made
tmp
torchscript
torchvision's
handerl
Bitte
Bonjour
Hallo
Hause
Ich
Ihnen
Ihren
Je
Namen
Sie
TransformerEn
Und
WMT
Wie
allez
arxiv
auf
bien
chez
danke
dataclasses
dich
du
english
erinnere
et
fb
geht
german
komm
kommst
le
leid
läuft
m'excuser
merci
mich
mir
monde
möglich
nFine
nIt’s
nPlease
nach
ne
nicht
nom
prie
quand
rentrerez
selbst
sich
sind
souviens
tôt
va
venir
votre
vous
wann
warte
Ça
BERTQA
BERTSeqClassification
BERTTokenClassification
MFreidank
RoBERTA
XLM
distilbert
does't
finetuning
num
tc
tokenizer
vidhya
vocabs
AutoConfig
Huggingface's
ScriptFunction
transfomers
BBM
BaseDataset
BaseDatasetBuilder
BaseModel
FNSio
MMFTransformer
MultiModal
OmegaConfing
Pyav
REU
TextCaps
TextVQA
Tochserve
csv
datasets
facebook
facebookresearch
fbclid
getitem
lables
len
mc
mmfartifacts
EmbeddingBag
TextHandler
overriden
DBUILD
DCMAKE
DSM
EFFT
FasterTransformer
NGC
Transfomer
bytedance
cmake
cp
geforce
libpyt
nvcr
oauthtoken
turing
volta
xlarge
DeepLearningExamples
SpeechSynthesis
WaveGlow's
librosa
numpy
rb
scipy
unidecode
wav
wb
Interoperability
Mtail
Sart
chmod
cnn
mtailtarget
progs
rc
timeseries
xvzf
cuda
jdk
nvidia
torchserve
wsl
yaml
api
config
http
mnist
resnet
Huggingface
PyTorch
benchmarking
bert
captum
grpc
kubeflow
kubernetes
Torchserve's
asg
aws
elb
readme
sdk
apis
powershell
alexnet
deeplabv
densenet
fcn
kfserving
seldon
excuted
findbugs
HTTPs
cors
openssl
prometheus
rsa
ssl
gpus
init
waveglow
hostname
statsd
grafana
kms
userguide
readymade
torchscripted
rcnn
roberta
xlm
Basedataset
mmf
multimodal
preprocessed
batchsize
download
fastertransformer
ngc
deeplearningexamples
mtail
scarpe
NVidia
WaveGlow
huggingface
torchServe
CProfile
KSERVE
apachelounge
args
jmeter
kserve
latencies
snakeviz
codec
loadbalancer
torchserves
xml
Conda
autoscaling
conda
GPUMemoryUsed
GPUMemoryUtilization
GPUUtilization
JSONPatternLayout
MXNetModelServer
QLog
QLogLayout
QLogsetupModelDependencies
abc
dda
patternlayout
qlog
IPEX
ORT
PROFILER
TensorRT
ValueToSet
kineto
profiler
pypi
runtimes
torchprep
GPT
KServe
LMHeadModel
Parallelize
Textgeneration
gpt
kserve
parallelize
tx
xl
DCGAN
DLRM
GAN
NN
Recommender
ScriptModule
Scriptable
TorchRec
TorchScript
Torchrec
dcgan
dlrm
fashiongen
FashionGen
fashionGen
gan
nn
scriptable
torchrec
AVX
Allocator
BLOCKTIME
BertModel
CONDA
JeMalloc
KMP
LD
NUMA
Numa
OMP
OpenMP
PRELOAD
PTMalloc
TCMalloc
Xeon
afeeb
affinitized
allocator
args
eval
gif
hyperthreaded
hyperthreading
inplace
inputPath
intel
iomp
ipex
iter
jemalloc
libiomp
libtcmalloc
numa
numactl
pdt
qconfig
randint
randn
tcmalloc
tunable
unix
unutilized
usr
CONTAINERD
DaemonSet
GKE
Gcloud
Gi
GoogleCloudPlatform
Ki
NFS
PV
PersistentVolume
RWX
STORAGECLASS
VPC
allocatable
auth
autoupgrade
bcc
cidr
clusterIP
creationTimestamp
daemonset
drwx
drwxr
fsSL
gcloud
ggc
gke
googleapis
ip
ipv
jsonpath
kubeconfig
kubectl
lR
mynfs
namespaces
nfs
nodePools
persistentvolume
persistentvolumeclaim
po
preloaded
provisioner
pv
pvc
quickstart
rw
svc
tesla
tty
unformatted
AAAAAElFTkSuQmCC
Autoscaler
BUILDKIT
GOR
InferenceService
Knative
Rollout
inferenceservice
ingressgateway
istio
kfs
knative
loadBalancer
mnt
modelCount
readmes
rollout
serverless
recommender
HandlerTime
customizedMetadata
environ
ContentType
kservev
tobytes
CustomHandler
GH
OSS
PRs
ctx
onnx
ClusterConfig
EBS
EFS
EKS
apiVersion
desiredCapacity
efs
eks
eksctl
instanceTypes
instancesDistribution
maxSize
minSize
namespace
ng
nodeGroups
onDemandBaseCapacity
onDemandPercentageAboveBaseCapacity
pvpod
spotInstancePools
storagehttps
subnet
subnets
vpc
MMS
commandline
filepath
jmx
rampup
requestdefaults
scaleup
tearDown
testplan
JProfiler
JProfiler's
SqueezeNet
TSBenchmark
apos
cProfile
dockerhub
filesystem
filterresults
gradle
homebrew
imageFilePath
jpgc
linuxbrew
mergeresults
modelN
perfmon
urlN
Arg
KFserving
arg
authn
authz
dicts
dockerfiles
enum
eventloop
hashmap
lifecycles
sagemaker
startServer
threadpool
mGPU
socio
gridfs
NLP
TorchScript's
Meta's
criteo
personalization
NMTBackTranslate
NMTDualTranslate
nlp
DogCatBreed
DogCatBreedClassification
CloudWatch
LogGroup
TorchServeInferenceURL
TorchServeManagementURL
cloudwatch
keypair
spinup
ReactApp
logdir
tensorboard
DenseNet
pytorchbot
Validator
comparator
validator
validators
Datafile
UI
buildspec
cmds
AKS
PVCs
DockerHub
jq
HPA
HPG
targetValue
totensor
KFServer
TSModelRepository
TorchserveModel
Torchservemodel
kfserve
kfserver
KFModel
marfile
AKS
Balancer
EFK
Liveness
autoscale
datasource
helmignore
lookingup
mountpath
Az
VM
aks
az
ds
eastus
myAKSCluster
myResourceGroup
sc
vm
CODEBUILD
CodeBuild
Dockerfiles
bt
buildtype
codebuild
cudaversion
cudnn
memlock
shm
ulimit
Cresta's
DAGs
Dynabench
Dynaboard
MLFlow
MLOps
MLflow
Operationalize
Sagemaker
Streamlit
Inferentia
opensource
operationalising
Wadhwani
modelarchive
eagermode
AttributeName
AttributeType
DDBEndPoint
DDBSnapshotSerializer
DefaultCredentialsProvider
FS
IndexName
KeySchema
KeyType
PluginsManager
ProjectionType
ProvisionedThroughput
ReadCapacityUnits
SDKs
WriteCapacityUnits
createdOn
createdOnMonth
dynamodb
impl
serializer
servingsdk
snapshotName
behaviour
teardown
tg
udv
dataN
backendgroup
sexualized
ecbe
grayscale
bz
marsgen
efft
envvar
Roadmap
fff
pvd
whl
ss
dn
rn
De
ec
VQA
xxxx
Affero
MinIO
fs
fsspec
minioadmin
pythonic
DeepSpeed
MII
deepspeed
mii
Diffusers
diffusers
AzureML
Largemodels
bigscience
mem
sharded
NVfuser
fuser
ort
sess
dali
BetterTransformer
TransformerEncoder
InferenceTimeInMS
MetricTypes
MetricsCache
TIMM
backends
inductor
Integrations
integrations
UseCases
usecases
Explainability
TorchData
px
svg
nvfuser
noborder
datapipes
tensorrt
vec
torchdata
CodeQL
Dependabot
Snyk
pythonversion
StreamPredictions
LLMs
MPS
mps
deviceIds
rpc
pippy
MBS
MicroBatching
MicroBatchingHandler
QPS
PiPPy
Microbatching
Micro-batching
microbatch
microbatching
DeviceId
PredictionTime
QueueTime
WorkerLoadTime
WorkerName
WorkerThreadTime
MicroSoft
lmi
torchrun
nproc
largemodels
torchpippy
InferenceSession
maxRetryTimeoutInSec
neuronx
AMI
DLAMI
XLA
inferentia
ActionSLAM
statins
ci
chatGPT
Llama
PEFT
LORA
FSDP
AuditNLG
finetune
fsdp
ineference
lora
peft
samsum
vLLM
TGI
vLLM
vLLM's
OOM
RTX
SKU
TPUs
checkpointing
enviroment
fragmentations
intra
nightlies
recenly
uncomment
================================================
FILE: llama_recipes/utils/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from .memory_utils import MemoryTrace
from .dataset_utils import *
from .fsdp_utils import fsdp_auto_wrap_policy
from .train_utils import *
================================================
FILE: llama_recipes/utils/config_utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import inspect
from dataclasses import fields
from peft import (
LoraConfig,
AdaptionPromptConfig,
PrefixTuningConfig,
)
from transformers import BitsAndBytesConfig
import configs.datasets as datasets
from configs import (
lora_config,
llama_adapter_config,
prefix_config,
train_config,
qlora_config,
bitsandbytes_config,
)
from .dataset_utils import DATASET_PREPROC
def update_config(config, **kwargs):
if isinstance(config, (tuple, list)):
for c in config:
update_config(c, **kwargs)
else:
for k, v in kwargs.items():
if hasattr(config, k):
setattr(config, k, v)
elif "." in k:
# allow --some_config.some_param=True
config_name, param_name = k.split(".")
if type(config).__name__ == config_name:
if hasattr(config, param_name):
setattr(config, param_name, v)
else:
                        # In case of a specialized config we can warn the user
print(f"Warning: {config_name} does not accept parameter: {k}")
elif isinstance(config, train_config):
print(f"Warning: unknown parameter {k}")
def generate_peft_config(peft_method, kwargs):
# Config mapping for train_config.peft_method to its corresponding config class
config_mapping = {
"lora": lora_config,
"llama_adapter": llama_adapter_config,
"prefix": prefix_config,
"bitsandbytes_config": bitsandbytes_config,
"qlora": qlora_config,
# Add other mappings as needed
}
# Mapping from config class to its corresponding PEFT config
peft_config_mapping = {
lora_config: LoraConfig,
llama_adapter_config: AdaptionPromptConfig,
prefix_config: PrefixTuningConfig,
bitsandbytes_config: BitsAndBytesConfig,
qlora_config: LoraConfig,
# Add other mappings as needed
}
    # Verify that the requested PEFT method has a known config
assert peft_method in config_mapping.keys(), f"Peft config not found: {peft_method}"
    # Fetch the configuration class for the requested method and apply overrides
config = config_mapping[peft_method]
update_config(config, **kwargs)
params = {k.name: getattr(config, k.name) for k in fields(config)}
    # Instantiate the corresponding PEFT config from the populated fields
peft_config_class = peft_config_mapping[config]
peft_config = peft_config_class(**params)
return peft_config
# def generate_peft_config(train_config, kwargs):
# configs = (lora_config, llama_adapter_config, prefix_config, qlora_config)
# peft_configs = (LoraConfig, AdaptionPromptConfig, PrefixTuningConfig)
# names = tuple(c.__name__.rstrip("_config") for c in configs)
# assert train_config.peft_method in names, f"Peft config not found: {train_config.peft_method}"
# config = configs[names.index(train_config.peft_method)]
# update_config(config, **kwargs)
# params = {k.name: getattr(config, k.name) for k in fields(config)}
# peft_config = peft_configs[names.index(train_config.peft_method)](**params)
# return peft_config
def generate_dataset_config(train_config, kwargs):
names = tuple(DATASET_PREPROC.keys())
assert train_config.dataset in names, f"Unknown dataset: {train_config.dataset}"
dataset_config = {k: v for k, v in inspect.getmembers(datasets)}[
train_config.dataset
]
update_config(dataset_config, **kwargs)
return dataset_config
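# Usage sketch (not part of the file above): overrides usually arrive as CLI
# kwargs, and a dotted key scopes an override to one config class by name.
# The `lr` and `r` fields are assumed from the dataclasses in `configs/`.
if __name__ == "__main__":
    cfg = train_config()
    update_config(cfg, lr=1e-4)                      # plain override, set if present
    update_config(cfg, **{"train_config.lr": 2e-4})  # scoped to train_config only
    peft_cfg = generate_peft_config("lora", {"r": 16})  # returns a LoraConfig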
================================================
FILE: llama_recipes/utils/dataset_utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import torch
from functools import partial
from ft_datasets import (
get_grammar_dataset,
get_alpaca_dataset,
get_samsum_dataset,
get_completion_dataset,
)
DATASET_PREPROC = {
"alpaca_dataset": partial(get_alpaca_dataset, max_words=224),
"grammar_dataset": get_grammar_dataset,
"samsum_dataset": get_samsum_dataset,
"completion": get_completion_dataset,
}
def get_preprocessed_dataset(
tokenizer, dataset_config, split: str = "train"
) -> torch.utils.data.Dataset:
if dataset_config.dataset not in DATASET_PREPROC:
raise NotImplementedError(f"{dataset_config.dataset} is not (yet) implemented")
def get_split():
return (
dataset_config.train_split
if split == "train"
else dataset_config.test_split
)
return DATASET_PREPROC[dataset_config.dataset](
dataset_config,
tokenizer,
get_split(),
)
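# Usage sketch (not part of the file above), assuming a Llama tokenizer is
# available locally; the tokenizer path is illustrative.
if __name__ == "__main__":
    from transformers import LlamaTokenizer
    from configs.datasets import samsum_dataset

    tokenizer = LlamaTokenizer.from_pretrained("./models_hf/7B")  # illustrative path
    train_ds = get_preprocessed_dataset(tokenizer, samsum_dataset, split="train")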
================================================
FILE: llama_recipes/utils/fsdp_utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
def fsdp_auto_wrap_policy(model, transformer_layer_name):
import functools
from torch.distributed.fsdp.wrap import (
_or_policy,
lambda_auto_wrap_policy,
transformer_auto_wrap_policy,
)
from peft.tuners import PrefixEncoder, PromptEmbedding, PromptEncoder
def lambda_policy_fn(module):
if (
len(list(module.named_children())) == 0
and getattr(module, "weight", None) is not None
and module.weight.requires_grad
):
return True
return False
lambda_policy = functools.partial(
lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn
)
transformer_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
transformer_layer_cls=(
PrefixEncoder,
PromptEncoder,
PromptEmbedding,
transformer_layer_name,
# FullyShardedDataParallelPlugin.get_module_class_from_name(
# model, transformer_layer_name
# ),
),
)
auto_wrap_policy = functools.partial(
_or_policy, policies=[lambda_policy, transformer_wrap_policy]
)
return auto_wrap_policy
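# Usage sketch (not part of the file above): wiring the policy into FSDP for a
# PEFT-wrapped Llama model; assumes an initialized process group.
def shard_peft_model(model):
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    # Trainable adapter leaves are wrapped via the lambda policy; decoder
    # blocks and prompt/prefix encoders via the transformer policy.
    policy = fsdp_auto_wrap_policy(model, LlamaDecoderLayer)
    return FSDP(model, auto_wrap_policy=policy)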
================================================
FILE: llama_recipes/utils/memory_utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import gc
import threading
import psutil
import torch
def byte2gb(x):
return int(x / 2**30)
# This context manager is used to track the peak memory usage of the process
class MemoryTrace:
def __enter__(self):
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated() # reset the peak gauge to zero
self.begin = byte2gb(torch.cuda.memory_allocated())
self.process = psutil.Process()
self.cpu_begin = byte2gb(self.cpu_mem_used())
self.peak_monitoring = True
peak_monitor_thread = threading.Thread(target=self.peak_monitor_func)
peak_monitor_thread.daemon = True
peak_monitor_thread.start()
return self
def cpu_mem_used(self):
"""get resident set size memory for the current process"""
return self.process.memory_info().rss
def peak_monitor_func(self):
self.cpu_peak = -1
while True:
self.cpu_peak = max(self.cpu_mem_used(), self.cpu_peak)
# can't sleep or will not catch the peak right (this comment is here on purpose)
# time.sleep(0.001) # 1msec
if not self.peak_monitoring:
break
def __exit__(self, *exc):
self.peak_monitoring = False
gc.collect()
torch.cuda.empty_cache()
self.end = byte2gb(torch.cuda.memory_allocated())
self.peak = byte2gb(torch.cuda.max_memory_allocated())
        cuda_info = torch.cuda.memory_stats()
        self.peak_active_gb = byte2gb(cuda_info["active_bytes.all.peak"])
        self.cuda_malloc_retries = cuda_info.get("num_alloc_retries", 0)
        self.m_cuda_ooms = cuda_info.get("num_ooms", 0)
self.used = byte2gb(self.end - self.begin)
self.peaked = byte2gb(self.peak - self.begin)
self.max_reserved = byte2gb(torch.cuda.max_memory_reserved())
        self.cpu_end = byte2gb(self.cpu_mem_used())  # GB, to match cpu_begin
        self.cpu_used = self.cpu_end - self.cpu_begin
        self.cpu_peaked = byte2gb(self.cpu_peak) - self.cpu_begin
# print(f"delta used/peak {self.used:4d}/{self.peaked:4d}")
================================================
FILE: llama_recipes/utils/train_utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
import os
import sys
import yaml
import torch
from tqdm import tqdm
"""
Unused imports:
import torch.nn as nn
import bitsandbytes as bnb
"""
from transformers import LlamaTokenizer
from torch.distributed.fsdp import StateDictType
import torch.distributed as dist
from pkg_resources import packaging
from .memory_utils import MemoryTrace
import model_checkpointing
import torch.cuda.nccl as nccl
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
from pathlib import Path
sys.path.append(str(Path(__file__).resolve().parent.parent))
from policies import fpSixteen, bfSixteen_mixed, get_llama_wrapper
def set_tokenizer_params(tokenizer: LlamaTokenizer):
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
# Converting Bytes to Megabytes
def byte2mb(x):
return int(x / 2**20)
def train(
model,
train_dataloader,
eval_dataloader,
tokenizer,
optimizer,
lr_scheduler,
gradient_accumulation_steps,
train_config,
fsdp_config=None,
local_rank=None,
rank=None,
):
"""
Trains the model on the given dataloader
    Args:
        model: The model to be trained
        train_dataloader: The dataloader containing the training data
        eval_dataloader: The dataloader containing the eval data
        tokenizer: tokenizer used in the eval for decoding the predictions
        optimizer: The optimizer used for training
        lr_scheduler: The learning rate scheduler
        gradient_accumulation_steps: The number of steps to accumulate gradients before performing a backward/update operation
        train_config: The training configuration
        fsdp_config: The FSDP configuration (used when FSDP is enabled)
        local_rank: The rank of the current node in a distributed setting
        rank: The global rank of the current process
    Returns: results dictionary containing average training and validation perplexity and loss
"""
# Create a gradient scaler for fp16
if train_config.use_fp16 and train_config.enable_fsdp:
scaler = ShardedGradScaler()
elif train_config.use_fp16 and not train_config.enable_fsdp:
scaler = torch.cuda.amp.GradScaler()
if train_config.enable_fsdp:
world_size = int(os.environ["WORLD_SIZE"])
train_prep = []
train_loss = []
val_prep = []
val_loss = []
results = {}
best_val_loss = float("inf")
for epoch in range(train_config.num_epochs):
with MemoryTrace() as memtrace: # track the memory usage
model.train()
total_loss = 0.0
for step, batch in enumerate(
tqdm(train_dataloader, colour="blue", desc=f"Training Epoch{epoch}")
):
for key in batch.keys():
if train_config.enable_fsdp:
batch[key] = batch[key].to(local_rank)
else:
batch[key] = batch[key].to("cuda:0")
loss = model(**batch).loss
loss = loss / gradient_accumulation_steps
total_loss += loss.detach().float()
if train_config.use_fp16:
# if fp16 is enabled, use gradient scaler to handle gradient update
scaler.scale(loss).backward()
if (step + 1) % gradient_accumulation_steps == 0 or step == len(
train_dataloader
) - 1:
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
else:
# regular backpropagation when fp16 is not used
loss.backward()
if (step + 1) % gradient_accumulation_steps == 0 or step == len(
train_dataloader
) - 1:
optimizer.step()
optimizer.zero_grad()
if train_config.enable_fsdp:
if rank == 0:
print(
f"\n step {step} is completed and loss is {loss.detach().float()}"
)
else:
print(
f"\n step {step} is completed and loss is {loss.detach().float()}"
)
# Reducing total_loss across all devices if there's more than one CUDA device
if torch.cuda.device_count() > 1 and train_config.enable_fsdp:
dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
train_epoch_loss = total_loss / len(train_dataloader)
if train_config.enable_fsdp:
train_epoch_loss = train_epoch_loss / world_size
train_perplexity = torch.exp(train_epoch_loss)
train_prep.append(train_perplexity)
train_loss.append(train_epoch_loss)
if train_config.enable_fsdp:
if rank == 0:
print(f"Max CUDA memory allocated was {memtrace.peak} GB")
print(f"Max CUDA memory reserved was {memtrace.max_reserved} GB")
print(f"Peak active CUDA memory was {memtrace.peak_active_gb} GB")
print(f"Cuda Malloc retires : {memtrace.cuda_malloc_retires}")
print(
f"CPU Total Peak Memory consumed during the train (max): {memtrace.cpu_peaked + memtrace.cpu_begin} GB"
)
else:
print(f"Max CUDA memory allocated was {memtrace.peak} GB")
print(f"Max CUDA memory reserved was {memtrace.max_reserved} GB")
print(f"Peak active CUDA memory was {memtrace.peak_active_gb} GB")
print(f"Cuda Malloc retires : {memtrace.cuda_malloc_retires}")
print(
f"CPU Total Peak Memory consumed during the train (max): {memtrace.cpu_peaked + memtrace.cpu_begin} GB"
)
# Update the learning rate as needed
lr_scheduler.step()
if train_config.run_validation:
eval_ppl, eval_epoch_loss = evaluation(
model, train_config, eval_dataloader, rank, tokenizer
)
if train_config.save_model and eval_epoch_loss < best_val_loss:
if train_config.enable_fsdp:
dist.barrier()
if train_config.use_peft:
if train_config.enable_fsdp:
if rank == 0:
print("we are about to save the PEFT modules")
else:
print("we are about to save the PEFT modules")
model.save_pretrained(train_config.output_dir)
if train_config.enable_fsdp:
if rank == 0:
print(
f"PEFT modules are saved in {train_config.output_dir} directory"
)
else:
print(
f"PEFT modules are saved in {train_config.output_dir} directory"
)
else:
if (
not train_config.use_peft
and fsdp_config.checkpoint_type == StateDictType.FULL_STATE_DICT
):
model_checkpointing.save_model_checkpoint(
model, optimizer, rank, train_config, epoch=epoch
)
elif (
not train_config.use_peft
and fsdp_config.checkpoint_type
== StateDictType.SHARDED_STATE_DICT
):
print(
" Saving the FSDP model checkpoints using SHARDED_STATE_DICT"
)
print("=====================================================")
model_checkpointing.save_model_and_optimizer_sharded(
model, rank, train_config
)
if train_config.save_optimizer:
model_checkpointing.save_model_and_optimizer_sharded(
model, rank, train_config, optim=optimizer
)
print(
" Saving the FSDP model checkpoints qnd optimizer using SHARDED_STATE_DICT"
)
print(
"====================================================="
)
if not train_config.use_peft and train_config.save_optimizer:
model_checkpointing.save_optimizer_checkpoint(
model, optimizer, rank, train_config, epoch=epoch
)
print(
" Saving the FSDP model checkpoints qnd optimizer using FULL_STATE_DICT"
)
print("=====================================================")
if train_config.enable_fsdp:
dist.barrier()
if eval_epoch_loss < best_val_loss:
best_val_loss = eval_epoch_loss
if train_config.enable_fsdp:
if rank == 0:
print(f"best eval loss on epoch {epoch} is {best_val_loss}")
else:
print(f"best eval loss on epoch {epoch} is {best_val_loss}")
val_loss.append(best_val_loss)
val_prep.append(eval_ppl)
if train_config.enable_fsdp:
if rank == 0:
print(
f"Epoch {epoch+1}: train_perplexity={train_perplexity:.4f}, train_epoch_loss={train_epoch_loss:.4f}"
)
else:
print(
f"Epoch {epoch+1}: train_perplexity={train_perplexity:.4f}, train_epoch_loss={train_epoch_loss:.4f}"
)
avg_train_prep = sum(train_prep) / len(train_prep)
avg_train_loss = sum(train_loss) / len(train_loss)
if train_config.run_validation:
avg_eval_prep = sum(val_prep) / len(val_prep)
avg_eval_loss = sum(val_loss) / len(val_loss)
results["avg_train_prep"] = avg_train_prep
results["avg_train_loss"] = avg_train_loss
if train_config.run_validation:
results["avg_eval_prep"] = avg_eval_prep
results["avg_eval_loss"] = avg_eval_loss
# saving the training params including fsdp setting for reference.
if train_config.enable_fsdp and not train_config.use_peft:
save_train_params(train_config, fsdp_config, rank)
if train_config.use_peft and not train_config.run_validation:
if train_config.enable_fsdp:
if rank == 0:
print("we are about to save the PEFT modules")
else:
print("we are about to save the PEFT modules")
model.save_pretrained(train_config.output_dir)
if train_config.enable_fsdp:
if rank == 0:
print(f"PEFT modules are saved in {train_config.output_dir} directory")
else:
print(f"PEFT modules are saved in {train_config.output_dir} directory")
return results
def evaluation(
model, train_config, eval_dataloader, local_rank, tokenizer, prompt=None
):
"""
Evaluates the model on the given dataloader
    Args:
        model: The model to evaluate
        train_config: The training configuration
        eval_dataloader: The dataloader containing the evaluation data
        local_rank: The rank of the current node in a distributed setting
        tokenizer: The tokenizer used to decode predictions
        prompt: Optional prompt used to generate a sample completion
    Returns: eval_ppl, eval_epoch_loss
"""
if train_config.enable_fsdp:
world_size = int(os.environ["WORLD_SIZE"])
model.eval()
eval_preds = []
eval_loss = 0.0 # Initialize evaluation loss
with MemoryTrace() as memtrace:
for step, batch in enumerate(
tqdm(eval_dataloader, colour="green", desc="evaluating Epoch")
):
for key in batch.keys():
if train_config.enable_fsdp:
batch[key] = batch[key].to(local_rank)
else:
batch[key] = batch[key].to("cuda:0")
# Ensure no gradients are computed for this scope to save memory
with torch.no_grad():
# Forward pass and compute loss
outputs = model(**batch)
loss = outputs.loss
eval_loss += loss.detach().float()
# Decode predictions and add to evaluation predictions list
preds = torch.argmax(outputs.logits, -1)
eval_preds.extend(
tokenizer.batch_decode(
preds.detach().cpu().numpy(), skip_special_tokens=True
)
)
# If there's more than one CUDA device, reduce evaluation loss across all devices
if torch.cuda.device_count() > 1 and train_config.enable_fsdp:
dist.all_reduce(eval_loss, op=dist.ReduceOp.SUM)
# Compute average loss and perplexity
eval_epoch_loss = eval_loss / len(eval_dataloader)
if train_config.enable_fsdp:
eval_epoch_loss = eval_epoch_loss / world_size
eval_ppl = torch.exp(eval_epoch_loss)
    # Optionally generate a sample completion from the validation prompt
if train_config.validation_prompt:
input_ids = tokenizer(train_config.validation_prompt, return_tensors="pt")[
"input_ids"
].to(local_rank)
output_ids = model.generate(
inputs=input_ids,
max_length=50,
do_sample=True,
top_k=250,
top_p=0.8,
temperature=0.75,
)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
if train_config.enable_fsdp:
if local_rank == 0:
print(f" {eval_ppl=} {eval_epoch_loss=}")
if train_config.validation_prompt:
print(
f"\n\n---- Generated Response ----\n\n{generated_text}\n----------\n"
)
else:
if train_config.validation_prompt:
print(f"\n\n---- Generated Response ----\n\n{generated_text}\n----------\n")
print(f" {eval_ppl=} {eval_epoch_loss=}")
return eval_ppl, eval_epoch_loss
def freeze_transformer_layers(model, num_layer):
for i, layer in enumerate(model.model.layers):
if i < num_layer:
for param in layer.parameters():
param.requires_grad = False
def check_frozen_layers_peft_model(model):
for i, layer in enumerate(model.base_model.model.model.layers):
for name, param in layer.named_parameters():
print(f"Layer {i}, parameter {name}: requires_grad = {param.requires_grad}")
def setup():
"""Initialize the process group for distributed training"""
dist.init_process_group("nccl")
def setup_environ_flags(rank):
"""Set environment flags for debugging purposes"""
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = str(1)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = str(1)
# os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
# This flag will help with CUDA memory fragmentations that can lead into OOM in some cases.
    # Note this is only available in PyTorch nightlies (as of July 30, 2023)
# os.environ['PYTORCH_CUDA_ALLOC_CONF']='expandable_segments:True'
if rank == 0:
print("--> Running with torch dist debug set to detail")
def cleanup():
"""Clean up the process group after training"""
dist.destroy_process_group()
def clear_gpu_cache(rank=None):
"""Clear the GPU cache for all ranks"""
if rank == 0:
print("Clearing GPU cache for all ranks")
torch.cuda.empty_cache()
def get_parameter_dtypes(model):
"""Get the data types of model parameters"""
parameter_dtypes = {}
for name, parameter in model.named_parameters():
parameter_dtypes[name] = parameter.dtype
return parameter_dtypes
def print_model_size(model, config, rank: int = 0) -> None:
"""
    Print the model name and its number of trainable parameters.
    Args:
        model: The PyTorch model.
        config: Config object providing the model name via config.model_name.
        rank (int, optional): Current process's rank. Defaults to 0.
"""
if rank == 0:
print(f"--> Model {config.model_name}")
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\n--> {config.model_name} has {total_params / 1e6} Million params\n")
def get_policies(cfg, rank):
"""Get the policies for mixed precision and fsdp wrapping"""
verify_bfloat_support = (
torch.version.cuda
and torch.cuda.is_bf16_supported()
and packaging.version.parse(torch.version.cuda).release >= (11, 0)
and dist.is_nccl_available()
and nccl.version() >= (2, 10)
)
mixed_precision_policy = None
wrapping_policy = None
# Mixed precision
if cfg.mixed_precision:
bf16_ready = verify_bfloat_support
if bf16_ready and not cfg.use_fp16:
mixed_precision_policy = bfSixteen_mixed
if rank == 0:
print("bFloat16 enabled for mixed precision - using bfSixteen policy")
elif cfg.use_fp16:
mixed_precision_policy = fpSixteen
if rank == 0:
print("FP16 enabled")
else:
print("bFloat16 support not present. Using FP32, and not mixed precision")
wrapping_policy = get_llama_wrapper()
return mixed_precision_policy, wrapping_policy
def save_train_params(train_config, fsdp_config, rank):
"""
This function saves the train_config and FSDP config into a train_params.yaml.
    This will be used by the converter script in the inference folder to fetch the HF model name or path.
    It is also helpful as a log for future reference.
"""
# Convert the train_config and fsdp_config objects to dictionaries,
# converting all values to strings to ensure they can be serialized into a YAML file
train_config_dict = {
k: str(v) for k, v in vars(train_config).items() if not k.startswith("__")
}
fsdp_config_dict = {
k: str(v) for k, v in vars(fsdp_config).items() if not k.startswith("__")
}
# Merge the two dictionaries into one
train_params_dict = {**train_config_dict, **fsdp_config_dict}
    # Construct the folder name (following FSDP checkpointing style) using properties of the train_config object
folder_name = (
train_config.dist_checkpoint_root_folder
+ "/"
+ train_config.dist_checkpoint_folder
+ "-"
+ train_config.model_name
)
save_dir = Path.cwd() / folder_name
# If the directory does not exist, create it
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Convert the dictionary to a YAML string
config_yaml = yaml.dump(train_params_dict, indent=4)
file_name = os.path.join(save_dir, "train_params.yaml")
# Check if there's a directory with the same name as the file
if os.path.isdir(file_name):
print(f"Error: {file_name} is a directory, not a file.")
else:
# Write the YAML string to the file
with open(file_name, "w") as f:
f.write(config_yaml)
if rank == 0:
print(f"training params are saved in {file_name}")
================================================
FILE: mistral-schema.json
================================================
{
"openapi": "3.0.2",
"info": {
"title": "Cog",
"version": "0.1.0"
},
"paths": {
"/": {
"get": {
"summary": "Root",
"operationId": "root__get",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Root Get"
}
}
}
}
}
}
},
"/health-check": {
"get": {
"summary": "Healthcheck",
"operationId": "healthcheck_health_check_get",
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"title": "Response Healthcheck Health Check Get"
}
}
}
}
}
}
},
"/predictions": {
"post": {
"summary": "Predict",
"description": "Run a single prediction on the model",
"operationId": "predict_predictions_post",
"parameters": [
{
"required": false,
"schema": {
"title": "Prefer",
"type": "string"
},
"name": "prefer",
"in": "header"
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionRequest"
}
}
}
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/PredictionResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/js
SYMBOL INDEX (289 symbols across 48 files)
FILE: examples/alpaca/process_data.py
class Preprocessor (line 18) | class Preprocessor:
method __init__ (line 21) | def __init__(self, tokenizer):
method batch_tokenize (line 25) | def batch_tokenize(self, texts):
method make_prompt (line 37) | def make_prompt(self, input_row):
method make_short_prompt (line 42) | def make_short_prompt(self, input_row):
method construct_dataset (line 47) | def construct_dataset(self, input_data):
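The Preprocessor above builds Alpaca-style prompts before tokenizing. For orientation, a minimal sketch of such a prompt builder; it uses the standard Stanford Alpaca template (the file's preview confirms it starts with "Below is an instruction"), though this repository's exact wording may differ:

    # Minimal sketch of an Alpaca-style prompt builder in the spirit of
    # make_prompt above; the template text follows the standard Stanford
    # Alpaca recipe and may differ from this repo's exact strings.
    def make_prompt(row: dict) -> str:
        if row.get("input"):
            return (
                "Below is an instruction that describes a task, paired with an "
                "input that provides further context. Write a response that "
                "appropriately completes the request.\n\n"
                f"### Instruction:\n{row['instruction']}\n\n"
                f"### Input:\n{row['input']}\n\n### Response:"
            )
        return (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{row['instruction']}\n\n### Response:"
        )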
FILE: llama_recipes/configs/datasets.py
class samsum_dataset (line 8) | class samsum_dataset:
class grammar_dataset (line 16) | class grammar_dataset:
class alpaca_dataset (line 24) | class alpaca_dataset:
class completion (line 32) | class completion:
FILE: llama_recipes/configs/fsdp.py
class fsdp_config (line 10) | class fsdp_config:
FILE: llama_recipes/configs/peft.py
class lora_config (line 10) | class lora_config:
class llama_adapter_config (line 21) | class llama_adapter_config:
class prefix_config (line 28) | class prefix_config:
class bitsandbytes_config (line 34) | class bitsandbytes_config:
class qlora_config (line 42) | class qlora_config:
FILE: llama_recipes/configs/training.py
class train_config (line 7) | class train_config:
FILE: llama_recipes/ft_datasets/alpaca_dataset.py
class InstructionDataset (line 26) | class InstructionDataset(Dataset):
method __init__ (line 27) | def __init__(self, dataset_config, tokenizer, partition="train", max_w...
method __len__ (line 39) | def __len__(self):
method __getitem__ (line 42) | def __getitem__(self, index):
FILE: llama_recipes/ft_datasets/completion_dataset.py
function load_data (line 6) | def load_data(
function format_data (line 54) | def format_data(dataset, tokenizer, config=None):
function tokenize_data (line 81) | def tokenize_data(dataset, tokenizer, config=None):
function get_completion_dataset (line 107) | def get_completion_dataset(config: str, tokenizer, split: str = "train"):
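The completion dataset's preview shows it builds a Hugging Face Dataset from json records. A hedged sketch of what a jsonl "completion" loader like load_data can look like, assuming records shaped {"text": ...} as in tests/data/200_samples.jsonl:

    # Hedged sketch, not the repository's load_data: read a jsonl file of
    # {"text": ...} records into a Hugging Face Dataset.
    import json
    from datasets import Dataset

    def load_jsonl_completions(path: str) -> Dataset:
        with open(path) as f:
            rows = [json.loads(line) for line in f if line.strip()]
        return Dataset.from_list(rows)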
FILE: llama_recipes/ft_datasets/grammar_dataset/grammar_dataset.py
class grammar (line 17) | class grammar(Dataset):
method __init__ (line 18) | def __init__(
method __len__ (line 41) | def __len__(self):
method convert_to_features (line 44) | def convert_to_features(self, example_batch):
method __getitem__ (line 60) | def __getitem__(self, index):
function get_dataset (line 73) | def get_dataset(dataset_config, tokenizer, csv_name=None):
FILE: llama_recipes/ft_datasets/samsum_dataset.py
function get_preprocessed_samsum (line 10) | def get_preprocessed_samsum(dataset_config, tokenizer, split):
FILE: llama_recipes/ft_datasets/utils.py
class Concatenator (line 9) | class Concatenator(object):
method __init__ (line 10) | def __init__(self, chunk_size=2048, wrap_packed_sequences=False):
method _wrap_concat (line 15) | def _wrap_concat(self, batch):
method _concat (line 50) | def _concat(self, batch):
method __call__ (line 98) | def __call__(self, batch):
class ConcatDataset (line 105) | class ConcatDataset(Dataset):
method __init__ (line 106) | def __init__(self, dataset, chunk_size=4096):
method __getitem__ (line 127) | def __getitem__(self, idx):
method __len__ (line 130) | def __len__(self):
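Concatenator and ConcatDataset implement sequence packing: short tokenized examples are concatenated and re-sliced into fixed-size chunks so training compute is not wasted on padding. A simplified, self-contained illustration of the idea (a stand-in, not the repository's implementation):

    # Simplified illustration of sequence packing behind Concatenator-style
    # collators; not the repository's actual logic.
    def pack(sequences: list[list[int]], chunk_size: int = 2048) -> list[list[int]]:
        flat = [tok for seq in sequences for tok in seq]
        chunks = [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]
        return [c for c in chunks if len(c) == chunk_size]  # drop the ragged tail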
FILE: llama_recipes/llama_finetuning.py
function main (line 54) | def main(**kwargs):
FILE: llama_recipes/model_checkpointing/checkpoint_handler.py
function get_date_of_run (line 28) | def get_date_of_run():
function load_model_sharded (line 41) | def load_model_sharded(model, rank, cfg):
function save_model_and_optimizer_sharded (line 80) | def save_model_and_optimizer_sharded(model, rank, cfg, optim=None):
function save_model_checkpoint (line 117) | def save_model_checkpoint(
function load_model_checkpoint (line 154) | def load_model_checkpoint(model, rank, cfg):
function save_optimizer_checkpoint (line 179) | def save_optimizer_checkpoint(model, optimizer, rank, cfg, epoch=1):
function load_optimizer_checkpoint (line 211) | def load_optimizer_checkpoint(model, optimizer_checkpoint_path, rank):
function load_sharded_model_single_gpu (line 233) | def load_sharded_model_single_gpu(model, model_path):
FILE: llama_recipes/policies/activation_checkpointing_functions.py
function apply_fsdp_checkpointing (line 21) | def apply_fsdp_checkpointing(model):
FILE: llama_recipes/policies/anyprecision_optimizer.py
class AnyPrecisionAdamW (line 16) | class AnyPrecisionAdamW(Optimizer):
method __init__ (line 17) | def __init__(
method step (line 73) | def step(self, closure=None):
FILE: llama_recipes/policies/wrapping.py
function get_size_policy (line 15) | def get_size_policy(min_params=1e8):
function get_llama_wrapper (line 22) | def get_llama_wrapper():
FILE: llama_recipes/utils/config_utils.py
function update_config (line 26) | def update_config(config, **kwargs):
function generate_peft_config (line 47) | def generate_peft_config(peft_method, kwargs):
function generate_dataset_config (line 98) | def generate_dataset_config(train_config, kwargs):
FILE: llama_recipes/utils/dataset_utils.py
function get_preprocessed_dataset (line 25) | def get_preprocessed_dataset(
FILE: llama_recipes/utils/fsdp_utils.py
function fsdp_auto_wrap_policy (line 5) | def fsdp_auto_wrap_policy(model, transformer_layer_name):
FILE: llama_recipes/utils/memory_utils.py
function byte2gb (line 10) | def byte2gb(x):
class MemoryTrace (line 15) | class MemoryTrace:
method __enter__ (line 16) | def __enter__(self):
method cpu_mem_used (line 29) | def cpu_mem_used(self):
method peak_monitor_func (line 33) | def peak_monitor_func(self):
method __exit__ (line 45) | def __exit__(self, *exc):
FILE: llama_recipes/utils/train_utils.py
function set_tokenizer_params (line 30) | def set_tokenizer_params(tokenizer: LlamaTokenizer):
function byte2mb (line 36) | def byte2mb(x):
function train (line 40) | def train(
function evaluation (line 276) | def evaluation(
function freeze_transformer_layers (line 364) | def freeze_transformer_layers(model, num_layer):
function check_frozen_layers_peft_model (line 371) | def check_frozen_layers_peft_model(model):
function setup (line 377) | def setup():
function setup_environ_flags (line 382) | def setup_environ_flags(rank):
function cleanup (line 394) | def cleanup():
function clear_gpu_cache (line 399) | def clear_gpu_cache(rank=None):
function get_parameter_dtypes (line 406) | def get_parameter_dtypes(model):
function print_model_size (line 414) | def print_model_size(model, config, rank: int = 0) -> None:
function get_policies (line 431) | def get_policies(cfg, rank):
function save_train_params (line 463) | def save_train_params(train_config, fsdp_config, rank):
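freeze_transformer_layers presumably freezes the first num_layer decoder blocks for partial fine-tuning. A plausible sketch, assuming a Hugging Face Llama-style module tree (model.model.layers); not verified against the source:

    # Plausible sketch of freeze_transformer_layers(model, num_layer);
    # the model.model.layers layout is an assumption (HF Llama convention).
    def freeze_first_layers(model, num_layer: int) -> None:
        for i, layer in enumerate(model.model.layers):
            if i < num_layer:
                for p in layer.parameters():
                    p.requires_grad = False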
FILE: predict.py
class Predictor (line 36) | class Predictor(BasePredictor):
method setup (line 37) | def setup(self, weights: Optional[Path] = None):
method get_lora (line 56) | def get_lora(self, replicate_weights: str) -> Any:
method initialize_peft (line 77) | def initialize_peft(self, replicate_weights: str) -> None:
method delete_lora (line 86) | def delete_lora(self):
method predict (line 94) | def predict(
method remove (line 234) | def remove(f: Callable, defaults: dict[str, Any]) -> Callable:
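For orientation, a minimal Cog predictor skeleton matching the surface indexed above; the bodies are illustrative placeholders, and the input/output types are assumptions beyond what the index shows:

    # Skeleton only; setup/predict bodies are placeholders, not the repo's logic.
    from typing import Optional
    from cog import BasePredictor, Input, Path

    class Predictor(BasePredictor):
        def setup(self, weights: Optional[Path] = None):
            # load the base engine; `weights`, when given, would point at
            # fine-tuned (e.g. LoRA) weights to apply on top
            self.engine = None  # placeholder

        def predict(self, prompt: str = Input(description="Prompt to complete")) -> str:
            return f"(generated continuation of: {prompt})"  # placeholder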
FILE: scripts/benchmark_token_latency.py
class AbstractInferenceModel (line 12) | class AbstractInferenceModel(ABC):
method __init__ (line 14) | def __init__(self, model_name_or_path, tokenizer_name_or_path):
method _load_model (line 21) | def _load_model(self):
method _load_tokenizer (line 25) | def _load_tokenizer(self):
method generate_tokens (line 29) | def generate_tokens(self, input_ids, prompt_length, output_length):
class LlamaBnB4Bit (line 33) | class LlamaBnB4Bit(AbstractInferenceModel):
method __init__ (line 34) | def __init__(self, model_name_or_path, tokenizer_name_or_path, some_ot...
method _load_model (line 37) | def _load_model(self):
method _load_tokenizer (line 49) | def _load_tokenizer(self):
method generate_tokens (line 68) | def generate_tokens(self, input_ids, prompt_length, output_length):
function measure_latency (line 75) | def measure_latency(inference_model, prompt_length, output_length):
function benchmark_model (line 112) | def benchmark_model(model_name, inference_model, prompt_lengths, output_...
FILE: scripts/test_fast_llama.py
class Engine (line 16) | class Engine(Enum):
class LoraAdapter (line 22) | class LoraAdapter:
class SpeedyReplicateGonzalez (line 27) | class SpeedyReplicateGonzalez:
method __init__ (line 28) | def __init__(self):
method replicate_model_name (line 63) | def replicate_model_name(self):
method replicate_model_name (line 67) | def replicate_model_name(self, model_name):
method get_lora (line 71) | def get_lora(self, lora_path):
method generate_replicate (line 84) | def generate_replicate(self, prompt, lora):
method generate_vllm (line 99) | def generate_vllm(self, prompt, lora):
method set_engine (line 111) | def set_engine(self, engine):
method timing_decorator (line 123) | def timing_decorator(self, prompt, lora):
method enable_timing (line 134) | def enable_timing(self, verbose: bool = False):
method disable_timing (line 138) | def disable_timing(self):
method run_long_generation (line 141) | def run_long_generation(self):
method run_base (line 147) | def run_base(self):
method run_sql (line 160) | def run_sql(self):
method run_summary (line 187) | def run_summary(self):
FILE: scripts/test_load_unload_lora.py
class vLLMLoraTest (line 11) | class vLLMLoraTest:
method __init__ (line 12) | def __init__(self):
method get_lora (line 35) | def get_lora(self, lora_path):
method generate_replicate (line 47) | def generate_replicate(self, prompt, lora_path):
method generate (line 57) | def generate(self, prompt, lora):
method run_base (line 68) | def run_base(self):
method run_sql (line 81) | def run_sql(self):
method run_summary (line 108) | def run_summary(self):
FILE: src/config_utils.py
class Weights (line 9) | class Weights(BaseModel):
function get_fp16_file_list (line 15) | def get_fp16_file_list(n_shards: int):
function get_gptq_file_list (line 35) | def get_gptq_file_list(base_model_name: str):
function get_mlc_file_list (line 52) | def get_mlc_file_list(model_name: str, n_shards: int):
function exllama_kwargs (line 70) | def exllama_kwargs(weights: Weights, config_overrides: Optional[dict] = ...
function vllm_kwargs (line 77) | def vllm_kwargs(weights: Weights, config_overrides: Optional[dict] = None):
function mlc_kwargs (line 87) | def mlc_kwargs(
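The get_*_file_list helpers evidently enumerate the artifact files an engine needs downloaded. A hypothetical helper in that spirit; the filename pattern below is an assumption (Hugging Face-style shard names), not taken from the source:

    # Hypothetical sketch in the spirit of get_fp16_file_list(n_shards);
    # the shard-name pattern is an assumption.
    def fp16_file_list(n_shards: int) -> list[str]:
        shards = [
            f"pytorch_model-{i:05d}-of-{n_shards:05d}.bin"
            for i in range(1, n_shards + 1)
        ]
        return shards + ["config.json", "tokenizer.model"]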
FILE: src/download.py
class SeekableMmap (line 28) | class SeekableMmap(mmap.mmap):
method seekable (line 29) | def seekable(self) -> bool:
class Downloader (line 33) | class Downloader:
method __init__ (line 34) | def __init__(self, concurrency: int | None = None) -> None:
method session (line 50) | def session(self) -> aiohttp.ClientSession:
method threadpool (line 61) | def threadpool(self) -> ThreadPoolExecutor:
method get_remote_file_size (line 66) | async def get_remote_file_size(self, url: str | URL) -> "tuple[URL, in...
method download_chunk (line 101) | async def download_chunk(
method download_file (line 120) | async def download_file(self, url: str | URL) -> mmap.mmap:
method download_file_to_disk (line 154) | async def download_file_to_disk(self, url: str, path: str) -> None:
method maybe_download_files_to_disk (line 163) | async def maybe_download_files_to_disk(
method sync (line 186) | def sync(f: t.Callable) -> t.Callable:
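Per the indexed signatures, Downloader probes remote file sizes and fetches chunks concurrently with aiohttp, returning an mmap. A self-contained sketch of the ranged-request idea (simplified: no retries, no memory-mapping):

    # Simplified sketch of a concurrent ranged download; not the repository's
    # Downloader, which per the index also memory-maps results.
    import asyncio
    import aiohttp

    async def fetch_in_chunks(url: str, n_chunks: int = 4) -> bytes:
        async with aiohttp.ClientSession() as session:
            async with session.head(url) as resp:  # probe Content-Length
                size = int(resp.headers["Content-Length"])
            bounds = [
                (i * size // n_chunks, (i + 1) * size // n_chunks - 1)
                for i in range(n_chunks)
            ]

            async def get(lo: int, hi: int) -> bytes:
                headers = {"Range": f"bytes={lo}-{hi}"}
                async with session.get(url, headers=headers) as r:
                    return await r.read()

            parts = await asyncio.gather(*(get(lo, hi) for lo, hi in bounds))
        return b"".join(parts)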
FILE: src/inference_engines/engine.py
class Engine (line 9) | class Engine(ABC):
method load_weights (line 14) | def load_weights(self, weights: Weights):
method load_lora (line 23) | def load_lora(self, lora_data: dict):
method set_lora (line 31) | def set_lora(self, lora: Any):
method is_lora_active (line 38) | def is_lora_active(self) -> bool:
method delete_lora (line 45) | def delete_lora(self):
method __call__ (line 52) | def __call__(self, prompt, **kwargs):
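The Engine ABC is the contract every backend below (exllama, MLC, transformers, vLLM, and the hybrids) implements. Reconstructed from the indexed signatures, with bodies elided; the real file may differ in detail:

    # Reconstruction from the index above; only the signatures are grounded.
    from abc import ABC, abstractmethod
    from typing import Any

    class Engine(ABC):
        @abstractmethod
        def load_weights(self, weights):  # weights: src.config_utils.Weights
            ...

        @abstractmethod
        def load_lora(self, lora_data: dict):
            ...

        @abstractmethod
        def set_lora(self, lora: Any):
            ...

        @abstractmethod
        def is_lora_active(self) -> bool:
            ...

        @abstractmethod
        def delete_lora(self):
            ...

        @abstractmethod
        def __call__(self, prompt, **kwargs):
            ...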
FILE: src/inference_engines/exllama.py
function next_logits (line 27) | def next_logits(
function begin (line 36) | def begin(generator):
function timer (line 44) | def timer(name, func):
class ExllamaEngine (line 52) | class ExllamaEngine(Engine):
method __init__ (line 53) | def __init__(self, weights: Weights, fused_attn=True):
method delete_lora (line 90) | def delete_lora(self):
method is_lora_active (line 94) | def is_lora_active(self) -> bool:
method load_lora (line 97) | def load_lora(self, data_ref: dict) -> ExLlamaLora:
method set_lora (line 104) | def set_lora(self, lora: ExLlamaLora | None) -> None:
method __call__ (line 107) | def __call__(
FILE: src/inference_engines/mlc_engine.py
class MLCEngine (line 11) | class MLCEngine(Engine):
method __init__ (line 16) | def __init__(
method load_weights (line 50) | def load_weights(self, weights: Weights) -> str:
method get_logits (line 69) | def get_logits(self):
method load_lora (line 75) | def load_lora(self):
method is_lora_active (line 82) | def is_lora_active(self):
method set_lora (line 88) | def set_lora(self):
method delete_lora (line 94) | def delete_lora(self):
method __call__ (line 97) | def __call__(
FILE: src/inference_engines/mlc_vllm_engine.py
class MLCvLLMEngine (line 8) | class MLCvLLMEngine(Engine):
method __init__ (line 13) | def __init__(self, mlc_args: dict, vllm_args: dict) -> None:
method load_lora (line 24) | def load_lora(self, lora_data: dict) -> Any:
method is_lora_active (line 43) | def is_lora_active(self) -> bool:
method set_lora (line 51) | def set_lora(self, lora: Any) -> None:
method delete_lora (line 61) | def delete_lora(self) -> None:
method __call__ (line 64) | def __call__(
FILE: src/inference_engines/transformers_engine.py
class ExtraStopSequence (line 23) | class ExtraStopSequence(StoppingCriteria):
method __init__ (line 29) | def __init__(self, stop_sequence: torch.Tensor, device: str):
method __call__ (line 32) | def __call__(
class TransformersEngine (line 40) | class TransformersEngine(Engine):
method __init__ (line 46) | def __init__(self, weights: Weights, tokenizer_func=None, device="cuda"):
method load_lora (line 55) | def load_lora(self, lora_weights: dict) -> Tuple[LoraConfig, Any]:
method is_lora_active (line 78) | def is_lora_active(self) -> bool:
method delete_lora (line 81) | def delete_lora(self):
method set_lora (line 90) | def set_lora(self, lora):
method get_logits (line 114) | def get_logits(self, prompt):
method __call__ (line 129) | def __call__(
FILE: src/inference_engines/vllm_engine.py
class LoRA (line 20) | class LoRA:
method __init__ (line 21) | def __init__(
method load_from_path (line 28) | def load_from_path(
method load_from_bytes (line 40) | def load_from_bytes(
class vLLMEngine (line 48) | class vLLMEngine(Engine):
method __init__ (line 53) | def __init__(self, weights: Weights, dtype: str) -> None:
method load_lora (line 63) | def load_lora(
method is_lora_active (line 122) | def is_lora_active(self) -> bool:
method set_lora (line 128) | def set_lora(self, lora: LoRA) -> None:
method delete_lora (line 138) | def delete_lora(self) -> None:
method generate_stream (line 141) | async def generate_stream(
method __call__ (line 148) | def __call__(
function run_generation (line 241) | def run_generation():
FILE: src/inference_engines/vllm_exllama_engine.py
class ExllamaVllmEngine (line 12) | class ExllamaVllmEngine(Engine):
method __init__ (line 17) | def __init__(self, vllm_args: dict, exllama_args: dict) -> None:
method load_lora (line 27) | def load_lora(self, lora_data: dict) -> Any:
method is_lora_active (line 45) | def is_lora_active(self) -> bool:
method set_lora (line 53) | def set_lora(self, lora: Any) -> None:
method delete_lora (line 63) | def delete_lora(self) -> None:
method __call__ (line 66) | def __call__(
FILE: src/inference_engines/vllm_transformers.py
class vLLMTransformersEngine (line 11) | class vLLMTransformersEngine(Engine):
method __init__ (line 16) | def __init__(
method load_lora (line 23) | def load_lora(self, lora_data: dict) -> Any:
method is_lora_active (line 43) | def is_lora_active(self) -> bool:
method set_lora (line 51) | def set_lora(self, lora: Any) -> None:
method delete_lora (line 61) | def delete_lora(self) -> None:
method __call__ (line 64) | def __call__(
FILE: src/more_utils.py
function log_memory_stuff (line 10) | def log_memory_stuff(prompt=None):
function load_tokenizer (line 20) | def load_tokenizer(tokenizer_path):
FILE: src/utils.py
function seed_all (line 11) | def seed_all(seed: int):
function get_env_var_or_default (line 23) | def get_env_var_or_default(var_name, default_value):
class Logger (line 42) | class Logger:
method __init__ (line 43) | def __init__(self, marker: str = "predict-timings"):
method log (line 48) | def log(self, *args):
function get_loop (line 60) | def get_loop() -> asyncio.AbstractEventLoop:
function download_file (line 67) | def download_file(file, local_filename):
function check_files_exist (line 79) | def check_files_exist(remote_files: list[str], local_path: str) -> list[...
function download_file_with_pget (line 89) | async def download_file_with_pget(remote_path, dest_path, pget_concurren...
function download_files_with_pget (line 116) | async def download_files_with_pget(
function maybe_download_with_pget (line 126) | def maybe_download_with_pget(
class StreamingTextStopSequenceHandler (line 166) | class StreamingTextStopSequenceHandler:
method __init__ (line 167) | def __init__(self, stop_sequences: tp.List[str] = None, eos_token: str...
method get_match_length (line 176) | def get_match_length(self, text: str, stop_sequence: str):
method process (line 191) | def process(self, token):
method __call__ (line 245) | def __call__(self, token):
method finalize (line 252) | def finalize(self):
function delay_prints (line 259) | def delay_prints(REALLY_EAT_MY_PRINT_STATEMENTS: bool = False) -> tp.Ite...
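Judging from its name and the tests in tests/unit_tests/test_utils.py, StreamingTextStopSequenceHandler holds back streamed text that might be the start of a stop sequence. A simplified, self-contained stand-in for that buffering idea (not the repository's implementation):

    # Stand-in illustrating the buffering idea; NOT the repository's class.
    import typing as tp

    def stream_until_stop(tokens: tp.Iterable[str], stop: str) -> tp.Iterator[str]:
        buf = ""
        for t in tokens:
            buf += t
            idx = buf.find(stop)
            if idx != -1:
                if idx > 0:
                    yield buf[:idx]  # emit the text before the stop sequence
                return
            # hold back the longest suffix of buf that could still begin `stop`
            hold = 0
            for k in range(1, min(len(stop), len(buf)) + 1):
                if stop.startswith(buf[-k:]):
                    hold = k
            if len(buf) > hold:
                yield buf[: len(buf) - hold]
                buf = buf[len(buf) - hold:]
        if buf:
            yield buf  # no stop sequence seen; flush the remainder

    # e.g. "".join(stream_until_stop(iter(["Hel", "lo<", "/s>!"]), "</s>")) == "Hello"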
FILE: tests/conftest.py
function pytest_addoption (line 1) | def pytest_addoption(parser):
FILE: tests/test_e2e.py
function wait_for_server_to_be_ready (line 14) | def wait_for_server_to_be_ready(url, timeout=300):
function server (line 46) | def server():
function test_health_check (line 73) | def test_health_check():
function test_prediction (line 80) | def test_prediction():
FILE: tests/test_predict.py
function server (line 21) | def server():
function test_health_check (line 52) | def test_health_check(server):
function test_simple_prediction (line 59) | def test_simple_prediction(server):
function test_input_too_long (line 77) | def test_input_too_long(server):
FILE: tests/test_remote_predict.py
function model_name (line 6) | def model_name(request):
function model (line 11) | def model(model_name):
function version (line 16) | def version(model):
function prediction_tests (line 22) | def prediction_tests():
function test_initial_predictions (line 28) | def test_initial_predictions(version, prediction_tests):
FILE: tests/test_remote_train.py
function model_name (line 7) | def model_name(request):
function model (line 12) | def model(model_name):
function version (line 17) | def version(model):
function training (line 23) | def training(model_name, version):
function prediction_tests (line 36) | def prediction_tests():
function test_training (line 62) | def test_training(training):
function trained_model_and_version (line 70) | def trained_model_and_version(training):
function test_post_training_predictions (line 75) | def test_post_training_predictions(trained_model_and_version, prediction...
FILE: tests/test_train.py
function test_train (line 42) | def test_train():
FILE: tests/test_train_predict.py
function server (line 22) | def server():
function test_health_check (line 58) | def test_health_check(server):
function test_prediction (line 65) | def test_prediction(server):
function test_input_too_long (line 83) | def test_input_too_long(server):
FILE: tests/test_utils.py
function get_image_name (line 12) | def get_image_name():
function process_log_line (line 21) | def process_log_line(line):
function capture_output (line 37) | def capture_output(pipe, print_lock, logs=None, error_detected=None):
function wait_for_server_to_be_ready (line 49) | def wait_for_server_to_be_ready(url, timeout=300):
function run_training_subprocess (line 79) | def run_training_subprocess(command):
FILE: tests/timing.py
function run (line 17) | def run(v):
FILE: tests/unit_tests/test_completion_dataset.py
function dataset_config (line 17) | def dataset_config():
function tokenizer (line 35) | def tokenizer():
function test__load_data_train (line 51) | def test__load_data_train(dataset_config):
function test__load_data_train_with_val_split (line 59) | def test__load_data_train_with_val_split(dataset_config):
function dataset (line 73) | def dataset(dataset_config):
function test_format_data (line 79) | def test_format_data(dataset, tokenizer):
function formatted_dataset (line 87) | def formatted_dataset(dataset, tokenizer):
function test_tokenize_data_with_wrapped_packing (line 91) | def test_tokenize_data_with_wrapped_packing(
function test_tokenize_data_without_wrapped_packing_small_chunk (line 121) | def test_tokenize_data_without_wrapped_packing_small_chunk(
function test_tokenize_data_without_wrapped_packing_large_chunk (line 155) | def test_tokenize_data_without_wrapped_packing_large_chunk(
function test_tokenize_data_without_packing (line 189) | def test_tokenize_data_without_packing(formatted_dataset, tokenizer, dat...
FILE: tests/unit_tests/test_utils.py
function tokenizer (line 11) | def tokenizer():
function get_decoded_prompt_tokens (line 27) | def get_decoded_prompt_tokens(tokenizer, prompt):
function test_no_stop_sequences (line 33) | def test_no_stop_sequences(tokenizer):
function test_single_stop_sequence_1 (line 70) | def test_single_stop_sequence_1(tokenizer):
function test_single_stop_sequence_2 (line 107) | def test_single_stop_sequence_2(tokenizer):
function test_multiple_stop_sequence (line 144) | def test_multiple_stop_sequence(tokenizer):
function test_adjacent_stop_sequences (line 181) | def test_adjacent_stop_sequences(tokenizer):
function test_substring_stop_sequence (line 218) | def test_substring_stop_sequence(tokenizer):
FILE: train.py
class TrainingOutput (line 29) | class TrainingOutput(BaseModel):
function train (line 33) | def train(
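train.py's TrainingOutput is a Cog output model; its fields are not visible in this index, but Cog trainings conventionally return a path to the produced weights, e.g.:

    # Hypothetical shape; the repository's actual TrainingOutput fields
    # are not shown in this index.
    from cog import BaseModel, Path

    class TrainingOutput(BaseModel):
        weights: Path  # assumed: archive of fine-tuned (LoRA) weights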
Condensed preview — 120 files, each showing path, character count, and a content snippet (584K chars of structured content in total).
[
{
"path": ".gitignore",
"chars": 248,
"preview": "**/__pycache__/**\nflan-t5**\ncheckpoints/**\ntmp/**\nunconverted-weights\nunconverted-weights/\nweights\nweights/\n.DS_STORE\n*."
},
{
"path": ".gitmodules",
"chars": 89,
"preview": "[submodule \"exllama\"]\n\tpath = exllama\n\turl = https://github.com/technillogue/exllama.git\n"
},
{
"path": "CONTRIBUTING.md",
"chars": 1047,
"preview": "# Contributing\n\nThanks for taking the time to contribute to this project!\n\n## Releases\n\nThis section documents the proce"
},
{
"path": "LICENSE.txt",
"chars": 11346,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "Makefile",
"chars": 6883,
"preview": ".PHONY: init \n.PHONY: select\n.PHONY: test-local\n.PHONY: push\n.PHONY: push-and-test\n.PHONY: clean\n\n# this is required to "
},
{
"path": "README.md",
"chars": 7175,
"preview": "# LLaMA Cog template 🦙\n\nThis is a monorepo for building multiple Llama models using Cog:\n\n- llama-2-13b\n- llama-2-13b-ch"
},
{
"path": "__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "base-schema.json",
"chars": 12960,
"preview": "{\n \"openapi\": \"3.0.2\",\n \"info\": {\n \"title\": \"Cog\",\n \"version\": \"0.1.0\"\n },\n \"paths\": {\n \"/\": {\n \"get\":"
},
{
"path": "chat-schema.json",
"chars": 13379,
"preview": "{\n \"openapi\": \"3.0.2\",\n \"info\": {\n \"title\": \"Cog\",\n \"version\": \"0.1.0\"\n },\n \"paths\": {\n \"/\": {\n \"get\":"
},
{
"path": "cog.yaml",
"chars": 1983,
"preview": "# Configuration for Cog ⚙️\n# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md\n\nbuild:\n # set to true "
},
{
"path": "examples/alpaca/README.md",
"chars": 420,
"preview": "Example code for parsing the dataset needed to train [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca).\n\nT"
},
{
"path": "examples/alpaca/process_data.py",
"chars": 2203,
"preview": "from transformers import T5Tokenizer\nimport json\n\nPROMPT_DICT = {\n \"prompt_input\": (\n \"Below is an instruction"
},
{
"path": "llama_recipes/LICENSE",
"chars": 6914,
"preview": "LLAMA 2 COMMUNITY LICENSE AGREEMENT\nLlama 2 Version Release Date: July 18, 2023\n\n\"Agreement\" means the terms and conditi"
},
{
"path": "llama_recipes/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "llama_recipes/configs/__init__.py",
"chars": 357,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/configs/datasets.py",
"chars": 1418,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/configs/fsdp.py",
"chars": 763,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/configs/peft.py",
"chars": 1183,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/configs/training.py",
"chars": 2235,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/ft_datasets/__init__.py",
"chars": 453,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/ft_datasets/alpaca_dataset.py",
"chars": 2595,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/ft_datasets/completion_dataset.py",
"chars": 3476,
"preview": "from .utils import Concatenator\nimport json\nfrom datasets import Dataset\n\n\ndef load_data(\n dataset_config,\n split,"
},
{
"path": "llama_recipes/ft_datasets/grammar_dataset/__init__.py",
"chars": 206,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/ft_datasets/grammar_dataset/grammar_dataset.py",
"chars": 2693,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/ft_datasets/grammar_dataset/grammar_dataset_process.ipynb",
"chars": 10598,
"preview": "{\n \"cells\": [\n {\n \"attachments\": {},\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"Copyright (c)"
},
{
"path": "llama_recipes/ft_datasets/samsum_dataset.py",
"chars": 1052,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/ft_datasets/utils.py",
"chars": 4958,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/llama_finetuning.py",
"chars": 12523,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/model_checkpointing/__init__.py",
"chars": 414,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/model_checkpointing/checkpoint_handler.py",
"chars": 7306,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/multi_node.slurm",
"chars": 1162,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/policies/__init__.py",
"chars": 347,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/policies/activation_checkpointing_functions.py",
"chars": 923,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/policies/anyprecision_optimizer.py",
"chars": 6884,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/policies/mixed_precision.py",
"chars": 1055,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/policies/wrapping.py",
"chars": 1025,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/quickstart.ipynb",
"chars": 22751,
"preview": "{\n \"cells\": [\n {\n \"attachments\": {},\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"Copyright (c)"
},
{
"path": "llama_recipes/requirements.txt",
"chars": 249,
"preview": "-f https://download.pytorch.org/whl/torch_stable.html \ntorch==2.0.1+cu118\naccelerate\nappdirs\nloralib\nbitsandbytes==0.39."
},
{
"path": "llama_recipes/scripts/markdown_link_check_config.json",
"chars": 533,
"preview": "{\n \"retryOn429\": true,\n \"retryCount\": 5,\n \"fallbackRetryDelay\": \"10s\",\n \"httpHeaders\": [\n {\n \"urls\": [\n "
},
{
"path": "llama_recipes/scripts/spellcheck.sh",
"chars": 601,
"preview": "\n# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms"
},
{
"path": "llama_recipes/scripts/spellcheck_conf/spellcheck.yaml",
"chars": 584,
"preview": "matrix:\n- name: Markdown\n apsell:\n lang: en\n d: en_US\n dictionary:\n wordlists:\n - scripts/spellcheck_conf/"
},
{
"path": "llama_recipes/scripts/spellcheck_conf/wordlist.txt",
"chars": 9357,
"preview": "BaseHandler\nImageNet\nRGB\nTorchServe\narchiver\ndataset\ngithub\nhref\nhttps\njson\nli\npy\npytorch\nsegmenter\ntorchvision\nul\nuseca"
},
{
"path": "llama_recipes/utils/__init__.py",
"chars": 305,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/utils/config_utils.py",
"chars": 3747,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/utils/dataset_utils.py",
"chars": 1101,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/utils/fsdp_utils.py",
"chars": 1370,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/utils/memory_utils.py",
"chars": 2359,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "llama_recipes/utils/train_utils.py",
"chars": 19680,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# This software may be used and distributed according to the terms "
},
{
"path": "mistral-schema.json",
"chars": 13499,
"preview": "{\n \"openapi\": \"3.0.2\",\n \"info\": {\n \"title\": \"Cog\",\n \"version\": \"0.1.0\"\n },\n \"paths\": {\n \"/\": {\n \"get\":"
},
{
"path": "model_templates/config.py",
"chars": 2988,
"preview": "from dotenv import load_dotenv\nfrom src.utils import get_env_var_or_default\n\nload_dotenv()\n\nMODEL_NAME = \n# INFERENCE CO"
},
{
"path": "models/dockerignore",
"chars": 582,
"preview": "*pdf\n*docx\nflan-t5**\ncheckpoints/**\nexamples/**\nweights_13/**\ntmp/**\n**.jsonl\nunconverted-weights\nunconverted-weights/\nw"
},
{
"path": "models/llama-2-13b/config.py",
"chars": 3030,
"preview": "from dotenv import load_dotenv\nfrom src.utils import get_env_var_or_default\n\nload_dotenv()\n\nMODEL_NAME = \"llama-2-13b\"\n#"
},
{
"path": "models/llama-2-13b-chat/config.py",
"chars": 3242,
"preview": "from dotenv import load_dotenv\nfrom src.utils import get_env_var_or_default\n\nload_dotenv()\n\nMODEL_NAME = \"llama-2-13b-ch"
},
{
"path": "models/llama-2-13b-chat-hf-mlc/config.py",
"chars": 1696,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/llama-2-13b-mlc/config.py",
"chars": 1676,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/llama-2-70b/config.py",
"chars": 3238,
"preview": "from dotenv import load_dotenv\nfrom src.utils import get_env_var_or_default\n\nload_dotenv()\n\nMODEL_NAME = \"llama-2-70b\"\n#"
},
{
"path": "models/llama-2-70b/model_artifacts/tokenizer/special_tokens_map.json",
"chars": 2,
"preview": "{}"
},
{
"path": "models/llama-2-70b/model_artifacts/tokenizer/tokenizer_checklist.chk",
"chars": 50,
"preview": "eeec4125e9c7560836b4873b6f8e3025 tokenizer.model\n"
},
{
"path": "models/llama-2-70b/model_artifacts/tokenizer/tokenizer_config.json",
"chars": 114,
"preview": "{\"bos_token\": \"\", \"eos_token\": \"\", \"model_max_length\": 4096, \"tokenizer_class\": \"LlamaTokenizer\", \"unk_token\": \"\"}"
},
{
"path": "models/llama-2-70b-chat/config.py",
"chars": 3037,
"preview": "from dotenv import load_dotenv\nfrom src.utils import get_env_var_or_default\n\nload_dotenv()\n\nMODEL_NAME = \"llama-2-70b-ch"
},
{
"path": "models/llama-2-70b-chat-hf-mlc/config.py",
"chars": 1664,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/llama-2-70b-mlc/config.py",
"chars": 1641,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/llama-2-7b/config.py",
"chars": 1931,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n exllama_kwargs,\n get_fp16_file_list,\n "
},
{
"path": "models/llama-2-7b-chat/config.py",
"chars": 1706,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n exllama_kwargs,\n get_fp16_file_list,\n get_gptq_f"
},
{
"path": "models/llama-2-7b-chat-hf-mlc/config.py",
"chars": 1641,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/llama-2-7b-mlc/config.py",
"chars": 1626,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/llama-2-7b-transformers/config.py",
"chars": 2430,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import Weights, get_fp16_file_list\nfrom src.utils import get_env_va"
},
{
"path": "models/llama-2-7b-transformers/model_artifacts/tokenizer/special_tokens_map.json",
"chars": 2,
"preview": "{}"
},
{
"path": "models/llama-2-7b-transformers/model_artifacts/tokenizer/tokenizer_checklist.chk",
"chars": 50,
"preview": "eeec4125e9c7560836b4873b6f8e3025 tokenizer.model\n"
},
{
"path": "models/llama-2-7b-transformers/model_artifacts/tokenizer/tokenizer_config.json",
"chars": 114,
"preview": "{\"bos_token\": \"\", \"eos_token\": \"\", \"model_max_length\": 4096, \"tokenizer_class\": \"LlamaTokenizer\", \"unk_token\": \"\"}"
},
{
"path": "models/llama-2-7b-vllm/config.py",
"chars": 1714,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import Weights, get_fp16_file_list, vllm_kwargs\n\n\nfrom src.utils im"
},
{
"path": "models/mistral-7b-instruct-v0.1-mlc/config.py",
"chars": 2036,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "models/mistral-7b-v0.1-mlc/config.py",
"chars": 1635,
"preview": "from dotenv import load_dotenv\nfrom src.config_utils import (\n Weights,\n get_fp16_file_list,\n get_mlc_file_list"
},
{
"path": "notes/new_model_notes.md",
"chars": 3939,
"preview": "# `cog-llama-template` Model Management\n\nThe `cog-llama-template` repo decomposes model management into four constructs:"
},
{
"path": "predict.py",
"chars": 11230,
"preview": "import functools\nimport inspect\nimport os\nimport socket\nimport time\nimport zipfile\nfrom typing import Any, Callable, Opt"
},
{
"path": "pyproject.toml",
"chars": 197,
"preview": "[build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"cog-llama-t"
},
{
"path": "requirements-dev.txt",
"chars": 256,
"preview": "#\n# This file is autogenerated by pip-compile with Python 3.11\n# by the following command:\n#\n# pip-compile --extra=de"
},
{
"path": "scripts/benchmark_token_latency.py",
"chars": 5610,
"preview": "import time\nimport json\nimport random\nimport torch\nimport argparse\nfrom abc import ABC, abstractmethod\n\n# Number of runs"
},
{
"path": "scripts/load_secrets.sh",
"chars": 584,
"preview": "if [ ! -d \"../official-models\" ]; then\n pushd ..\n git clone git@github.com:replicate/official-models\n popd\nfi\n\n"
},
{
"path": "scripts/test_fast_llama.py",
"chars": 9525,
"preview": "import zipfile\nfrom dataclasses import dataclass\nfrom enum import Enum\nfrom io import BytesIO\nfrom typing import Any\nimp"
},
{
"path": "scripts/test_load_unload_lora.py",
"chars": 5719,
"preview": "import zipfile\nfrom io import BytesIO\n\nimport replicate\nfrom termcolor import cprint\n\nfrom src.download import Downloade"
},
{
"path": "scripts/train_multi_gpu.sh",
"chars": 274,
"preview": "#!/bin/bash\n\npython train.py \\\n --train_data 70k_samples_prompt.jsonl \\\n --num_train_epochs 1 \\\n --learning_rat"
},
{
"path": "scripts/train_single_gpu.sh",
"chars": 325,
"preview": "#!/bin/bash\n\npython train.py \\\n --model_name_or_path google/flan-t5-base \\\n --data_path ./replicate_alpaca_data.js"
},
{
"path": "src/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "src/config_utils.py",
"chars": 2606,
"preview": "\"\"\"\nAn entirely self-contained config parsing util that should, if all goes well, dramatically simplify our configuratio"
},
{
"path": "src/download.py",
"chars": 8397,
"preview": "import asyncio\nimport functools\nimport mmap\nimport os\nimport random\nimport shutil\nimport sys\nimport time\nimport typing a"
},
{
"path": "src/inference_engines/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "src/inference_engines/engine.py",
"chars": 1566,
"preview": "import time\nfrom abc import ABC, abstractmethod\nfrom typing import Any\n\nfrom src.config_utils import Weights\nfrom src.ut"
},
{
"path": "src/inference_engines/exllama.py",
"chars": 6741,
"preview": "import io\nimport os\nimport sys\nimport glob\n\nimport torch\nimport time\nimport typing as tp\n\nfrom src.config_utils import W"
},
{
"path": "src/inference_engines/mlc_engine.py",
"chars": 6771,
"preview": "import os\n\nfrom cog import ConcatenateIterator\nfrom mlc_chat import ChatConfig, ChatModule, ConvConfig, GenerationConfig"
},
{
"path": "src/inference_engines/mlc_vllm_engine.py",
"chars": 2905,
"preview": "from typing import Any, Optional, List\nimport os\n\nfrom .engine import Engine\nfrom .vllm_engine import vLLMEngine\n\n\nclass"
},
{
"path": "src/inference_engines/transformers_engine.py",
"chars": 5913,
"preview": "import os\nimport shutil\nfrom transformers import AutoModelForCausalLM, TextIteratorStreamer, StoppingCriteria\nfrom typin"
},
{
"path": "src/inference_engines/vllm_engine.py",
"chars": 9521,
"preview": "import asyncio\nimport json\nimport os\nfrom io import BytesIO, IOBase\nfrom typing import AsyncIterator, BinaryIO, List, Op"
},
{
"path": "src/inference_engines/vllm_exllama_engine.py",
"chars": 2909,
"preview": "import gc\nfrom typing import Any, Optional, List\n\nimport torch\nimport os\n\nfrom .engine import Engine\nfrom .vllm_engine i"
},
{
"path": "src/inference_engines/vllm_transformers.py",
"chars": 2805,
"preview": "import gc\nfrom typing import Any, Optional, List\n\nimport torch\n\nfrom .engine import Engine\nfrom .vllm_engine import vLLM"
},
{
"path": "src/more_utils.py",
"chars": 879,
"preview": "import os\n\n\nDEFAULT_PAD_TOKEN = \"[PAD]\"\nDEFAULT_EOS_TOKEN = \"</s>\"\nDEFAULT_BOS_TOKEN = \"<s>\"\nDEFAULT_UNK_TOKEN = \"</s>\"\n"
},
{
"path": "src/utils.py",
"chars": 9459,
"preview": "import asyncio\nimport builtins\nimport contextlib\nimport os\nimport random\nimport subprocess\nimport time\nimport typing as "
},
{
"path": "tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/assets/llama_tokenizer/special_tokens_map.json",
"chars": 2,
"preview": "{}"
},
{
"path": "tests/assets/llama_tokenizer/tokenizer_checklist.chk",
"chars": 50,
"preview": "eeec4125e9c7560836b4873b6f8e3025 tokenizer.model\n"
},
{
"path": "tests/assets/llama_tokenizer/tokenizer_config.json",
"chars": 114,
"preview": "{\"bos_token\": \"\", \"eos_token\": \"\", \"model_max_length\": 4096, \"tokenizer_class\": \"LlamaTokenizer\", \"unk_token\": \"\"}"
},
{
"path": "tests/conftest.py",
"chars": 119,
"preview": "def pytest_addoption(parser):\n parser.addoption(\"--model\", action=\"store\", default=None, help=\"Model name to test\")\n"
},
{
"path": "tests/data/200_samples.jsonl",
"chars": 165476,
"preview": "{\"text\": \"Write a response to the following message:\\nI like to read popular science book. Does that mean that I have to"
},
{
"path": "tests/run_local_tests.sh",
"chars": 387,
"preview": "#!/bin/bash\n\n# TODO - rework this to spin up cog servers locally for prediction & training\n# this gives us the ability t"
},
{
"path": "tests/test_e2e.py",
"chars": 2555,
"preview": "import pytest\nimport requests\nimport subprocess\nimport time\n\n# Constants\nSERVER_URL = \"http://localhost:5000/predictions"
},
{
"path": "tests/test_predict.py",
"chars": 3025,
"preview": "import pytest\nimport requests\nimport subprocess\nfrom threading import Thread, Lock\n\nfrom tests.test_utils import (\n g"
},
{
"path": "tests/test_predict_with_trained_weights.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/test_remote_predict.py",
"chars": 751,
"preview": "import pytest\nimport replicate\n\n\n@pytest.fixture(scope=\"module\")\ndef model_name(request):\n return request.config.geto"
},
{
"path": "tests/test_remote_train.py",
"chars": 2781,
"preview": "import time\nimport pytest\nimport replicate\n\n\n@pytest.fixture(scope=\"module\")\ndef model_name(request):\n return request"
},
{
"path": "tests/test_train.py",
"chars": 3078,
"preview": "import pytest\nimport os\nimport re\n\nfrom tests.test_utils import run_training_subprocess\n\nERROR_PATTERN = re.compile(r\"ER"
},
{
"path": "tests/test_train_predict.py",
"chars": 3249,
"preview": "import pytest\nimport requests\nimport subprocess\nimport os\nfrom threading import Thread, Lock\n\nfrom tests.test_utils impo"
},
{
"path": "tests/test_utils.py",
"chars": 3200,
"preview": "import os\nimport json\nimport requests\nimport time\nimport re\nimport multiprocessing\nimport subprocess\n\nERROR_PATTERN = re"
},
{
"path": "tests/timing.py",
"chars": 1455,
"preview": "import time\nimport replicate\nimport os\n\nbase = \"replicate-internal/staging-llama-2-7b:8ba7b9478e1cbdde020f79f0838cd94465"
},
{
"path": "tests/unit_tests/test_completion_dataset.py",
"chars": 6540,
"preview": "import pytest\n\nimport sys\n\nsys.path.append(\".\")\n\nfrom llama_recipes.ft_datasets.completion_dataset import (\n load_dat"
},
{
"path": "tests/unit_tests/test_utils.py",
"chars": 7589,
"preview": "import pytest\n\nimport sys\n\nsys.path.append(\".\")\n\nfrom src.src.utils import StreamingTextStopSequenceHandler\n\n\n@pytest.fi"
},
{
"path": "train.py",
"chars": 10613,
"preview": "import argparse\nimport asyncio\nimport os\nimport shutil\nimport subprocess\nfrom zipfile import ZipFile\nimport psutil\n\n\nimp"
}
]
// ... and 4 more files
About this extraction
This page contains the full source code of the a16z-infra/cog-llama-template GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs): 120 files (28.8 MB, approximately 131.8k tokens), plus a symbol index of 289 extracted functions, classes, methods, constants, and types. Extracted by GitExtract.