Showing preview only (406K chars total). Download the full file or copy to clipboard to get everything.
Repository: deepseek-ai/Janus
Branch: main
Commit: 1daa72fa4090
Files: 37
Total size: 391.0 KB
Directory structure:
gitextract_8bv7suly/
├── .gitattributes
├── .gitignore
├── LICENSE-CODE
├── LICENSE-MODEL
├── Makefile
├── README.md
├── demo/
│ ├── Janus_colab_demo.ipynb
│ ├── app.py
│ ├── app_janusflow.py
│ ├── app_januspro.py
│ ├── fastapi_app.py
│ └── fastapi_client.py
├── generation_inference.py
├── inference.py
├── interactivechat.py
├── janus/
│ ├── __init__.py
│ ├── janusflow/
│ │ ├── __init__.py
│ │ └── models/
│ │ ├── __init__.py
│ │ ├── clip_encoder.py
│ │ ├── image_processing_vlm.py
│ │ ├── modeling_vlm.py
│ │ ├── processing_vlm.py
│ │ ├── siglip_vit.py
│ │ └── uvit.py
│ ├── models/
│ │ ├── __init__.py
│ │ ├── clip_encoder.py
│ │ ├── image_processing_vlm.py
│ │ ├── modeling_vlm.py
│ │ ├── processing_vlm.py
│ │ ├── projector.py
│ │ ├── siglip_vit.py
│ │ └── vq_model.py
│ └── utils/
│ ├── __init__.py
│ ├── conversation.py
│ └── io.py
├── pyproject.toml
└── requirements.txt
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
* text eol=lf
*.ipynb linguist-detectable=false
*.png binary
*.jpg binary
*.jpeg binary
*.gif binary
*.pdf binary
================================================
FILE: .gitignore
================================================
##### Python.gitignore #####
# Byte-compiled / optimized / DLL files
**/__pycache__/
*.pyc
*.pyo
*.pyd
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
wheelhouse/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
*.whl
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
docs/source/_build/
_autosummary/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# ruff
.ruff_cache/
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
##### macOS.gitignore #####
# General
.DS_Store
.AppleDouble
.LSOverride
# Icon must end with two \r
Icon
# Thumbnails
._*
# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
##### Linux.gitignore #####
*~
# Temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*
# KDE directory preferences
.directory
# Linux trash folder which might appear on any partition or disk
.Trash-*
# .nfs files are created when an open file is removed but is still being accessed
.nfs*
##### Windows.gitignore #####
# Windows thumbnail cache files
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
# Dump file
*.stackdump
# Folder config file
[Dd]esktop.ini
# Recycle Bin used on file shares
$RECYCLE.BIN/
# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp
# Windows shortcuts
*.lnk
##### Archives.gitignore #####
# It's better to unpack these files and commit the raw source because
# git has its own built in compression methods.
*.7z
*.jar
*.rar
*.zip
*.gz
*.gzip
*.tgz
*.bzip
*.bzip2
*.bz2
*.xz
*.lzma
*.cab
*.xar
# Packing-only formats
*.iso
*.tar
# Package management formats
*.dmg
*.xpi
*.gem
*.egg
*.deb
*.rpm
*.msi
*.msm
*.msp
*.txz
##### Xcode.gitignore #####
# Xcode
#
# gitignore contributors: remember to update Global/Xcode.gitignore, Objective-C.gitignore & Swift.gitignore
## User settings
xcuserdata/
## Compatibility with Xcode 8 and earlier (ignoring not required starting Xcode 9)
*.xcscmblueprint
*.xccheckout
## Compatibility with Xcode 3 and earlier (ignoring not required starting Xcode 4)
build/
DerivedData/
*.moved-aside
*.pbxuser
!default.pbxuser
*.mode1v3
!default.mode1v3
*.mode2v3
!default.mode2v3
*.perspectivev3
!default.perspectivev3
## Gcc Patch
/*.gcno
##### JetBrains.gitignore #####
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User settings
.idea/*
# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr
# CMake
cmake-build-*/
# Mongo Explorer plugin
.idea/**/mongoSettings.xml
# File-based project format
*.iws
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
# Editor-based Rest Client
.idea/httpRequests
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser
##### VisualStudioCode.gitignore #####
.vscode/*
# !.vscode/settings.json
# !.vscode/tasks.json
# !.vscode/launch.json
!.vscode/extensions.json
*.code-workspace
# Local History for Visual Studio Code
.history/
##### Vim.gitignore #####
# Swap
.*.s[a-v][a-z]
!*.svg # comment out if you don't need vector files
.*.sw[a-p]
.s[a-rt-v][a-z]
.ss[a-gi-z]
.sw[a-p]
# Session
Session.vim
Sessionx.vim
# Temporary
.netrwhist
*~
# Auto-generated tag files
tags
# Persistent undo
[._]*.un~
.vscode
.github
generated_samples/
================================================
FILE: LICENSE-CODE
================================================
MIT License
Copyright (c) 2023 DeepSeek
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: LICENSE-MODEL
================================================
DEEPSEEK LICENSE AGREEMENT
Version 1.0, 23 October 2023
Copyright (c) 2023 DeepSeek
Section I: PREAMBLE
Large generative models are being widely adopted and used, and have the potential to transform the way individuals conceive and benefit from AI or ML technologies.
Notwithstanding the current and potential benefits that these artifacts can bring to society at large, there are also concerns about potential misuses of them, either due to their technical limitations or ethical considerations.
In short, this license strives for both the open and responsible downstream use of the accompanying model. When it comes to the open character, we took inspiration from open source permissive licenses regarding the grant of IP rights. Referring to the downstream responsible use, we added use-based restrictions not permitting the use of the model in very specific scenarios, in order for the licensor to be able to enforce the license in case potential misuses of the Model may occur. At the same time, we strive to promote open and responsible research on generative models for content generation.
Even though downstream derivative versions of the model could be released under different licensing terms, the latter will always have to include - at minimum - the same use-based restrictions as the ones in the original license (this license). We believe in the intersection between open and responsible AI development; thus, this agreement aims to strike a balance between both in order to enable responsible open-science in the field of AI.
This License governs the use of the model (and its derivatives) and is informed by the model card associated with the model.
NOW THEREFORE, You and DeepSeek agree as follows:
1. Definitions
"License" means the terms and conditions for use, reproduction, and Distribution as defined in this document.
"Data" means a collection of information and/or content extracted from the dataset used with the Model, including to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under this License.
"Output" means the results of operating a Model as embodied in informational content resulting therefrom.
"Model" means any accompanying machine-learning based assemblies (including checkpoints), consisting of learnt weights, parameters (including optimizer states), corresponding to the model architecture as embodied in the Complementary Material, that have been trained or tuned, in whole or in part on the Data, using the Complementary Material.
"Derivatives of the Model" means all modifications to the Model, works based on the Model, or any other model which is created or initialized by transfer of patterns of the weights, parameters, activations or output of the Model, to the other model, in order to cause the other model to perform similarly to the Model, including - but not limited to - distillation methods entailing the use of intermediate data representations or methods based on the generation of synthetic data by the Model for training the other model.
"Complementary Material" means the accompanying source code and scripts used to define, run, load, benchmark or evaluate the Model, and used to prepare data for training or evaluation, if any. This includes any accompanying documentation, tutorials, examples, etc, if any.
"Distribution" means any transmission, reproduction, publication or other sharing of the Model or Derivatives of the Model to a third party, including providing the Model as a hosted service made available by electronic or other remote means - e.g. API-based or web access.
"DeepSeek" (or "we") means Beijing DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd., Hangzhou DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd. and/or any of their affiliates.
"You" (or "Your") means an individual or Legal Entity exercising permissions granted by this License and/or making use of the Model for whichever purpose and in any field of use, including usage of the Model in an end-use application - e.g. chatbot, translator, etc.
"Third Parties" means individuals or legal entities that are not under common control with DeepSeek or You.
Section II: INTELLECTUAL PROPERTY RIGHTS
Both copyright and patent grants apply to the Model, Derivatives of the Model and Complementary Material. The Model and Derivatives of the Model are subject to additional terms as described in Section III.
2. Grant of Copyright License. Subject to the terms and conditions of this License, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.
3. Grant of Patent License. Subject to the terms and conditions of this License and where and as applicable, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model and the Complementary Material, where such license applies only to those patent claims licensable by DeepSeek that are necessarily infringed by its contribution(s). If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model and/or Complementary Material constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for the Model and/or works shall terminate as of the date such litigation is asserted or filed.
Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
4. Distribution and Redistribution. You may host for Third Party remote access purposes (e.g. software-as-a-service), reproduce and distribute copies of the Model or Derivatives of the Model thereof in any medium, with or without modifications, provided that You meet the following conditions:
a. Use-based restrictions as referenced in paragraph 5 MUST be included as an enforceable provision by You in any type of legal agreement (e.g. a license) governing the use and/or distribution of the Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to, that the Model or Derivatives of the Model are subject to paragraph 5. This provision does not apply to the use of Complementary Material.
b. You must give any Third Party recipients of the Model or Derivatives of the Model a copy of this License;
c. You must cause any modified files to carry prominent notices stating that You changed the files;
d. You must retain all copyright, patent, trademark, and attribution notices excluding those notices that do not pertain to any part of the Model, Derivatives of the Model.
e. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions - respecting paragraph 4.a. – for use, reproduction, or Distribution of Your modifications, or for any such Derivatives of the Model as a whole, provided Your use, reproduction, and Distribution of the Model otherwise complies with the conditions stated in this License.
5. Use-based restrictions. The restrictions set forth in Attachment A are considered Use-based restrictions. Therefore You cannot use the Model and the Derivatives of the Model for the specified restricted uses. You may use the Model subject to this License, including only for lawful purposes and in accordance with the License. Use may include creating any content with, finetuning, updating, running, training, evaluating and/or reparametrizing the Model. You shall require all of Your users who use the Model or a Derivative of the Model to comply with the terms of this paragraph (paragraph 5).
6. The Output You Generate. Except as set forth herein, DeepSeek claims no rights in the Output You generate using the Model. You are accountable for the Output you generate and its subsequent uses. No use of the output can contravene any provision as stated in the License.
Section IV: OTHER PROVISIONS
7. Updates and Runtime Restrictions. To the maximum extent permitted by law, DeepSeek reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License.
8. Trademarks and related. Nothing in this License permits You to make use of DeepSeek’ trademarks, trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between the parties; and any rights not expressly granted herein are reserved by DeepSeek.
9. Personal information, IP rights and related. This Model may contain personal information and works with IP rights. You commit to complying with applicable laws and regulations in the handling of personal information and the use of such works. Please note that DeepSeek's license granted to you to use the Model does not imply that you have obtained a legitimate basis for processing the related information or works. As an independent personal information processor and IP rights user, you need to ensure full compliance with relevant legal and regulatory requirements when handling personal information and works with IP rights that may be contained in the Model, and are willing to assume solely any risks and consequences that may arise from that.
10. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, DeepSeek provides the Model and the Complementary Material on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Model, Derivatives of the Model, and the Complementary Material and assume any risks associated with Your exercise of permissions under this License.
11. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall DeepSeek be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model and the Complementary Material (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if DeepSeek has been advised of the possibility of such damages.
12. Accepting Warranty or Additional Liability. While redistributing the Model, Derivatives of the Model and the Complementary Material thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of DeepSeek, and only if You agree to indemnify, defend, and hold DeepSeek harmless for any liability incurred by, or claims asserted against, DeepSeek by reason of your accepting any such warranty or additional liability.
13. If any provision of this License is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.
14. Governing Law and Jurisdiction. This agreement will be governed and construed under PRC laws without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this agreement. The courts located in the domicile of Hangzhou DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd. shall have exclusive jurisdiction of any dispute arising out of this agreement.
END OF TERMS AND CONDITIONS
Attachment A
Use Restrictions
You agree not to use the Model or Derivatives of the Model:
- In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
- For military use in any way;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
- To generate or disseminate inappropriate content subject to applicable regulatory requirements;
- To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
- To defame, disparage or otherwise harass others;
- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
================================================
FILE: Makefile
================================================
print-% : ; @echo $* = $($*)
PROJECT_NAME = Janus
COPYRIGHT = "DeepSeek."
PROJECT_PATH = janus
SHELL = /bin/bash
SOURCE_FOLDERS = janus
PYTHON_FILES = $(shell find $(SOURCE_FOLDERS) -type f -name "*.py" -o -name "*.pyi") inference.py
COMMIT_HASH = $(shell git log -1 --format=%h)
PATH := $(HOME)/go/bin:$(PATH)
PYTHON ?= $(shell command -v python3 || command -v python)
PYTESTOPTS ?=
.PHONY: default
default: install
# Tools Installation
check_pip_install = $(PYTHON) -m pip show $(1) &>/dev/null || (cd && $(PYTHON) -m pip install $(1) --upgrade)
check_pip_install_extra = $(PYTHON) -m pip show $(1) &>/dev/null || (cd && $(PYTHON) -m pip install $(2) --upgrade)
pylint-install:
$(call check_pip_install_extra,pylint,pylint[spelling])
$(call check_pip_install,pyenchant)
flake8-install:
$(call check_pip_install,flake8)
$(call check_pip_install,flake8-bugbear)
$(call check_pip_install,flake8-comprehensions)
$(call check_pip_install,flake8-docstrings)
$(call check_pip_install,flake8-pyi)
$(call check_pip_install,flake8-simplify)
py-format-install:
$(call check_pip_install,isort)
$(call check_pip_install_extra,black,black[jupyter])
ruff-install:
$(call check_pip_install,ruff)
mypy-install:
$(call check_pip_install,mypy)
pre-commit-install:
$(call check_pip_install,pre-commit)
$(PYTHON) -m pre_commit install --install-hooks
go-install:
# requires go >= 1.16
command -v go || (sudo apt-get install -y golang && sudo ln -sf /usr/lib/go/bin/go /usr/bin/go)
addlicense-install: go-install
command -v addlicense || go install github.com/google/addlicense@latest
addlicense: addlicense-install
addlicense -c $(COPYRIGHT) -ignore tests/coverage.xml -l mit -y 2023-$(shell date +"%Y") -check $(SOURCE_FOLDERS)
# Python linters
pylint: pylint-install
$(PYTHON) -m pylint $(PROJECT_PATH)
flake8: flake8-install
$(PYTHON) -m flake8 --count --show-source --statistics
py-format: py-format-install
$(PYTHON) -m isort --project $(PROJECT_PATH) --check $(PYTHON_FILES) && \
$(PYTHON) -m black --check $(PYTHON_FILES)
black-format: py-format-install
$(PYTHON) -m black --check $(PYTHON_FILES)
ruff: ruff-install
$(PYTHON) -m ruff check .
ruff-fix: ruff-install
$(PYTHON) -m ruff check . --fix --exit-non-zero-on-fix
mypy: mypy-install
$(PYTHON) -m mypy $(PROJECT_PATH) --install-types --non-interactive
pre-commit: pre-commit-install
$(PYTHON) -m pre_commit run --all-files
# Utility functions
lint: ruff flake8 py-format mypy pylint addlicense
format: py-format-install ruff-install addlicense-install
$(PYTHON) -m isort --project $(PROJECT_PATH) $(PYTHON_FILES)
$(PYTHON) -m black $(PYTHON_FILES)
addlicense -c $(COPYRIGHT) -ignore tests/coverage.xml -l mit -y 2023-$(shell date +"%Y") $(SOURCE_FOLDERS) inference.py
clean-py:
find . -type f -name '*.py[co]' -delete
find . -depth -type d -name "__pycache__" -exec rm -r "{}" +
find . -depth -type d -name ".ruff_cache" -exec rm -r "{}" +
find . -depth -type d -name ".mypy_cache" -exec rm -r "{}" +
clean: clean-py
================================================
FILE: README.md
================================================
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->
<div align="center">
<img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />
</div>
<hr>
<div align="center">
<h1>🚀 Janus-Series: Unified Multimodal Understanding and Generation Models</h1>
</div>
<div align="center">
<a href="https://www.deepseek.com/" target="_blank">
<img alt="Homepage" src="images/badge.svg" />
</a>
</a>
<a href="https://huggingface.co/deepseek-ai" target="_blank">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
</a>
</div>
<div align="center">
<!-- <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
</a> -->
<!-- <a href="images/qr.jpeg" target="_blank">
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" />
</a> -->
<!-- <a href="https://twitter.com/deepseek_ai" target="_blank">
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
</a> -->
</div>
<div align="center">
<a href="LICENSE-CODE">
<img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">
</a>
<a href="LICENSE-MODEL">
<img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">
</a>
</div>
<p align="center">
<a href="#2-model-download"><b>📥 Model Download</b></a> |
<a href="#3-quick-start"><b>⚡ Quick Start</b></a> |
<a href="#4-license"><b>📜 License</b></a> |
<a href="#5-citation"><b>📖 Citation</b></a> <br>
<!-- 📄 Paper Link (<a href="https://arxiv.org/abs/2410.13848"><b>Janus</b></a>, <a href="https://arxiv.org/abs/2410.13848"><b>JanusFlow</b></a>) | -->
🤗 Online Demo (<a href="https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B"><b>Janus-Pro-7B</b></a>, <a href="https://huggingface.co/spaces/deepseek-ai/Janus-1.3B"><b>Janus</b></a>, <a href="https://huggingface.co/spaces/deepseek-ai/JanusFlow-1.3B"><b>JanusFlow</b></a>)
</p>
## News
**2025.01.27**: Janus-Pro is released, an advanced version of Janus, improving both multimodal understanding and visual generation significantly. See [paper](./janus_pro_tech_report.pdf)
**2024.11.13**: JanusFlow is released, a new unified model with rectified flow for image generation. See [paper](https://arxiv.org/abs/2411.07975), [demo](https://huggingface.co/spaces/deepseek-ai/JanusFlow-1.3B) and [usage](https://github.com/deepseek-ai/Janus?tab=readme-ov-file#janusflow).
**2024.10.23**: Evaluation code for reproducing the multimodal understanding results from the paper has been added to VLMEvalKit. Please refer to [this link]( https://github.com/open-compass/VLMEvalKit/pull/541).
**2024.10.20**: (1) Fix a bug in [tokenizer_config.json](https://huggingface.co/deepseek-ai/Janus-1.3B/blob/main/tokenizer_config.json). The previous version caused classifier-free guidance to not function properly, resulting in relatively poor visual generation quality. (2) Release Gradio demo ([online demo](https://huggingface.co/spaces/deepseek-ai/Janus-1.3B) and [local](#gradio-demo)).
## 1. Introduction
<a href="./janus_pro_tech_report.pdf"><b>Janus-Pro: Unified Multimodal Understanding and
Generation with Data and Model Scaling</b></a>
**Janus-Pro** is an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation.
<div align="center">
<img alt="image" src="images/teaser_januspro.png" style="width:90%;">
</div>
<a href="https://arxiv.org/abs/2410.13848"><b>Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation</b></a>
**Janus** is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
<div align="center">
<img alt="image" src="images/teaser.png" style="width:90%;">
</div>
<a href="https://arxiv.org/abs/2411.07975"><b>JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation</b></a>
**JanusFlow** introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
<div align="center">
<img alt="image" src="images/teaser_janusflow.png" style="width:90%;">
</div>
## 2. Model Download
We release Janus to the public to support a broader and more diverse range of research within both academic and commercial communities.
Please note that the use of this model is subject to the terms outlined in [License section](#5-license). Commercial usage is
permitted under these terms.
### Huggingface
| Model | Sequence Length | Download |
|-----------------------|-----------------|-----------------------------------------------------------------------------|
| Janus-1.3B | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/Janus-1.3B) |
| JanusFlow-1.3B | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/JanusFlow-1.3B) |
| Janus-Pro-1B | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/Janus-Pro-1B) |
| Janus-Pro-7B | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/Janus-Pro-7B) |
## 3. Quick Start
<details>
<summary><h3>Janus-Pro</h3></summary>
### Installation
On the basis of `Python >= 3.8` environment, install the necessary dependencies by running the following command:
```shell
pip install -e .
```
### Simple Inference Example
#### Multimodal Understanding
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "<|User|>",
"content": f"<image_placeholder>\n{question}",
"images": [image],
},
{"role": "<|Assistant|>", "content": ""},
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# # run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# # run the model to get the response
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False,
use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
#### Text-to-Image Generation
```python
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "<|User|>",
"content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
},
{"role": "<|Assistant|>", "content": ""},
]
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
conversations=conversation,
sft_format=vl_chat_processor.sft_format,
system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag
@torch.inference_mode()
def generate(
mmgpt: MultiModalityCausalLM,
vl_chat_processor: VLChatProcessor,
prompt: str,
temperature: float = 1,
parallel_size: int = 16,
cfg_weight: float = 5,
image_token_num_per_image: int = 576,
img_size: int = 384,
patch_size: int = 16,
):
input_ids = vl_chat_processor.tokenizer.encode(prompt)
input_ids = torch.LongTensor(input_ids)
tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int).cuda()
for i in range(parallel_size*2):
tokens[i, :] = input_ids
if i % 2 != 0:
tokens[i, 1:-1] = vl_chat_processor.pad_id
inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()
for i in range(image_token_num_per_image):
outputs = mmgpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=outputs.past_key_values if i != 0 else None)
hidden_states = outputs.last_hidden_state
logits = mmgpt.gen_head(hidden_states[:, -1, :])
logit_cond = logits[0::2, :]
logit_uncond = logits[1::2, :]
logits = logit_uncond + cfg_weight * (logit_cond-logit_uncond)
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_tokens[:, i] = next_token.squeeze(dim=-1)
next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
inputs_embeds = img_embeds.unsqueeze(dim=1)
dec = mmgpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int), shape=[parallel_size, 8, img_size//patch_size, img_size//patch_size])
dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
dec = np.clip((dec + 1) / 2 * 255, 0, 255)
visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
visual_img[:, :, :] = dec
os.makedirs('generated_samples', exist_ok=True)
for i in range(parallel_size):
save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
PIL.Image.fromarray(visual_img[i]).save(save_path)
generate(
vl_gpt,
vl_chat_processor,
prompt,
)
```
### Gradio Demo
We have deployed online demo in [Huggingface](https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B).
For the local gradio demo, you can run with the following command:
```
pip install -e .[gradio]
python demo/app_januspro.py
```
Have Fun!
</details>
<details>
<summary><h3>Janus</h3></summary>
### Installation
On the basis of `Python >= 3.8` environment, install the necessary dependencies by running the following command:
```shell
pip install -e .
```
### Simple Inference Example
#### Multimodal Understanding
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
# specify the path to the model
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "User",
"content": "<image_placeholder>\nConvert the formula into latex code.",
"images": ["images/equation.png"],
},
{"role": "Assistant", "content": ""},
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# # run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# # run the model to get the response
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False,
use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
#### Text-to-Image Generation
```python
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
# specify the path to the model
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "User",
"content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
},
{"role": "Assistant", "content": ""},
]
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
conversations=conversation,
sft_format=vl_chat_processor.sft_format,
system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag
@torch.inference_mode()
def generate(
mmgpt: MultiModalityCausalLM,
vl_chat_processor: VLChatProcessor,
prompt: str,
temperature: float = 1,
parallel_size: int = 16,
cfg_weight: float = 5,
image_token_num_per_image: int = 576,
img_size: int = 384,
patch_size: int = 16,
):
input_ids = vl_chat_processor.tokenizer.encode(prompt)
input_ids = torch.LongTensor(input_ids)
tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int).cuda()
for i in range(parallel_size*2):
tokens[i, :] = input_ids
if i % 2 != 0:
tokens[i, 1:-1] = vl_chat_processor.pad_id
inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()
for i in range(image_token_num_per_image):
outputs = mmgpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=outputs.past_key_values if i != 0 else None)
hidden_states = outputs.last_hidden_state
logits = mmgpt.gen_head(hidden_states[:, -1, :])
logit_cond = logits[0::2, :]
logit_uncond = logits[1::2, :]
logits = logit_uncond + cfg_weight * (logit_cond-logit_uncond)
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_tokens[:, i] = next_token.squeeze(dim=-1)
next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
inputs_embeds = img_embeds.unsqueeze(dim=1)
dec = mmgpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int), shape=[parallel_size, 8, img_size//patch_size, img_size//patch_size])
dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
dec = np.clip((dec + 1) / 2 * 255, 0, 255)
visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
visual_img[:, :, :] = dec
os.makedirs('generated_samples', exist_ok=True)
for i in range(parallel_size):
save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
PIL.Image.fromarray(visual_img[i]).save(save_path)
generate(
vl_gpt,
vl_chat_processor,
prompt,
)
```
### Gradio Demo
We have deployed online demo in [Huggingface](https://huggingface.co/spaces/deepseek-ai/Janus-1.3B).
For the local gradio demo, you can run with the following command:
```
pip install -e .[gradio]
python demo/app.py
```
Have Fun!
### FastAPI Demo
It's easy to run a FastAPI server to host an API server running the same functions as gradio.
To start FastAPI server, run the following command:
```
python demo/fastapi_app.py
```
To test the server, you can open another terminal and run:
```
python demo/fastapi_client.py
```
</details>
<details>
<summary><h3>JanusFlow</h3></summary>
### Installation
On the basis of `Python >= 3.8` environment, install the necessary dependencies by running the following command:
```shell
pip install -e .
pip install diffusers[torch]
```
### 🤗 Huggingface Online Demo
Check out the demo in [this link](https://huggingface.co/spaces/deepseek-ai/JanusFlow-1.3B).
### Simple Inference Example
#### Multimodal Understanding
```python
import torch
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
# specify the path to the model
model_path = "deepseek-ai/JanusFlow-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = MultiModalityCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "User",
"content": "<image_placeholder>\nConvert the formula into latex code.",
"images": ["images/equation.png"],
},
{"role": "Assistant", "content": ""},
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# # run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# # run the model to get the response
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False,
use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
#### Text-to-Image Generation
```python
import os
import PIL.Image
import torch
import numpy as np
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
import torchvision
# specify the path to the model
model_path = "deepseek-ai/JanusFlow-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = MultiModalityCausalLM.from_pretrained(
model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
from diffusers.models import AutoencoderKL
# remember to use bfloat16 dtype, this vae doesn't work with fp16
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
vae = vae.to(torch.bfloat16).cuda().eval()
conversation = [
{
"role": "User",
"content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
},
{"role": "Assistant", "content": ""},
]
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
conversations=conversation,
sft_format=vl_chat_processor.sft_format,
system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_gen_tag
@torch.inference_mode()
def generate(
mmgpt: MultiModalityCausalLM,
vl_chat_processor: VLChatProcessor,
prompt: str,
cfg_weight: float = 5.0,
num_inference_steps: int = 30,
batchsize: int = 5
):
input_ids = vl_chat_processor.tokenizer.encode(prompt)
input_ids = torch.LongTensor(input_ids)
tokens = torch.stack([input_ids] * 2 * batchsize).cuda()
tokens[batchsize:, 1:] = vl_chat_processor.pad_id
inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)
# we remove the last <bog> token and replace it with t_emb later
inputs_embeds = inputs_embeds[:, :-1, :]
# generate with rectified flow ode
# step 1: encode with vision_gen_enc
z = torch.randn((batchsize, 4, 48, 48), dtype=torch.bfloat16).cuda()
dt = 1.0 / num_inference_steps
dt = torch.zeros_like(z).cuda().to(torch.bfloat16) + dt
# step 2: run ode
attention_mask = torch.ones((2*batchsize, inputs_embeds.shape[1]+577)).to(vl_gpt.device)
attention_mask[batchsize:, 1:inputs_embeds.shape[1]] = 0
attention_mask = attention_mask.int()
for step in range(num_inference_steps):
# prepare inputs for the llm
z_input = torch.cat([z, z], dim=0) # for cfg
t = step / num_inference_steps * 1000.
t = torch.tensor([t] * z_input.shape[0]).to(dt)
z_enc = vl_gpt.vision_gen_enc_model(z_input, t)
z_emb, t_emb, hs = z_enc[0], z_enc[1], z_enc[2]
z_emb = z_emb.view(z_emb.shape[0], z_emb.shape[1], -1).permute(0, 2, 1)
z_emb = vl_gpt.vision_gen_enc_aligner(z_emb)
llm_emb = torch.cat([inputs_embeds, t_emb.unsqueeze(1), z_emb], dim=1)
# input to the llm
# we apply attention mask for CFG: 1 for tokens that are not masked, 0 for tokens that are masked.
if step == 0:
outputs = vl_gpt.language_model.model(inputs_embeds=llm_emb,
use_cache=True,
attention_mask=attention_mask,
past_key_values=None)
past_key_values = []
for kv_cache in past_key_values:
k, v = kv_cache[0], kv_cache[1]
past_key_values.append((k[:, :, :inputs_embeds.shape[1], :], v[:, :, :inputs_embeds.shape[1], :]))
past_key_values = tuple(past_key_values)
else:
outputs = vl_gpt.language_model.model(inputs_embeds=llm_emb,
use_cache=True,
attention_mask=attention_mask,
past_key_values=past_key_values)
hidden_states = outputs.last_hidden_state
# transform hidden_states back to v
hidden_states = vl_gpt.vision_gen_dec_aligner(vl_gpt.vision_gen_dec_aligner_norm(hidden_states[:, -576:, :]))
hidden_states = hidden_states.reshape(z_emb.shape[0], 24, 24, 768).permute(0, 3, 1, 2)
v = vl_gpt.vision_gen_dec_model(hidden_states, hs, t_emb)
v_cond, v_uncond = torch.chunk(v, 2)
v = cfg_weight * v_cond - (cfg_weight-1.) * v_uncond
z = z + dt * v
# step 3: decode with vision_gen_dec and sdxl vae
decoded_image = vae.decode(z / vae.config.scaling_factor).sample
os.makedirs('generated_samples', exist_ok=True)
save_path = os.path.join('generated_samples', "img.jpg")
torchvision.utils.save_image(decoded_image.clip_(-1.0, 1.0)*0.5+0.5, save_path)
generate(
vl_gpt,
vl_chat_processor,
prompt,
cfg_weight=2.0,
num_inference_steps=30,
batchsize=5
)
```
### Gradio Demo
For the local gradio demo, you can run with the following command:
```
pip install -e .[gradio]
python demo/app_janusflow.py
```
Have Fun!
</details>
## 4. License
This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). The use of Janus models is subject to [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL).
## 5. Citation
```bibtex
@article{chen2025janus,
title={Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling},
author={Chen, Xiaokang and Wu, Zhiyu and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong},
journal={arXiv preprint arXiv:2501.17811},
year={2025}
}
@article{wu2024janus,
title={Janus: Decoupling visual encoding for unified multimodal understanding and generation},
author={Wu, Chengyue and Chen, Xiaokang and Wu, Zhiyu and Ma, Yiyang and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong and others},
journal={arXiv preprint arXiv:2410.13848},
year={2024}
}
@misc{ma2024janusflow,
title={JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation},
author={Yiyang Ma and Xingchao Liu and Xiaokang Chen and Wen Liu and Chengyue Wu and Zhiyu Wu and Zizheng Pan and Zhenda Xie and Haowei Zhang and Xingkai yu and Liang Zhao and Yisong Wang and Jiaying Liu and Chong Ruan},
journal={arXiv preprint arXiv:2411.07975},
year={2024}
}
```
## 6. Contact
If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).
================================================
FILE: demo/Janus_colab_demo.ipynb
================================================
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"machine_shape": "hm",
"gpuType": "A100",
"authorship_tag": "ABX9TyOvbiAJgFit+gP+tHHbDMVh",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"763db730135046ad82de98efeff17d43": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_4f86e96fadab4c0d99817e7e586773c8",
"IPY_MODEL_3d5d9b536d5547f088c5b38998b86f4b",
"IPY_MODEL_e749a276bffb4136a4ca518867fcefdc"
],
"layout": "IPY_MODEL_26d64ffbb4ae49dca04faf3a4e2e55ca"
}
},
"4f86e96fadab4c0d99817e7e586773c8": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_e4c1880a3a4048da9fb06fc699d6f130",
"placeholder": "",
"style": "IPY_MODEL_c7b3782ef893473199e9140ec6208edc",
"value": "preprocessor_config.json: 100%"
}
},
"3d5d9b536d5547f088c5b38998b86f4b": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_f95595cb4b6a4f1d829cf9fd7f27e0d3",
"max": 347,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_14a1ff29167d48a7a16cf4d848861178",
"value": 347
}
},
"e749a276bffb4136a4ca518867fcefdc": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_51cb411fef0c4be4bdd47a93be2f1cf5",
"placeholder": "",
"style": "IPY_MODEL_f0bace68043e4b1c92a1e3dc4b6192b7",
"value": " 347/347 [00:00<00:00, 28.5kB/s]"
}
},
"26d64ffbb4ae49dca04faf3a4e2e55ca": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e4c1880a3a4048da9fb06fc699d6f130": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"c7b3782ef893473199e9140ec6208edc": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"f95595cb4b6a4f1d829cf9fd7f27e0d3": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"14a1ff29167d48a7a16cf4d848861178": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"51cb411fef0c4be4bdd47a93be2f1cf5": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"f0bace68043e4b1c92a1e3dc4b6192b7": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"e5de18967479416bacedefaeee42e0f7": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_f7bcc5523c57490aa4300fc8720e32b3",
"IPY_MODEL_eef53bb650af41729306ad9d9194706e",
"IPY_MODEL_e7fed22226664988ad2557265b0aa6e6"
],
"layout": "IPY_MODEL_4d9ab6f4605b465892ad6499a3172b1a"
}
},
"f7bcc5523c57490aa4300fc8720e32b3": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_ea65267f0aa341389d72635be3e75671",
"placeholder": "",
"style": "IPY_MODEL_88932da6dac441009d2ea0ce5c61e6e8",
"value": "tokenizer_config.json: 100%"
}
},
"eef53bb650af41729306ad9d9194706e": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_eb89a6efc26641e28373cfcd858293d0",
"max": 105920,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_b1588c0f92b647f0bf3a6f31dd3c1e5d",
"value": 105920
}
},
"e7fed22226664988ad2557265b0aa6e6": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_23b7a43bedb14094857a36d594f01086",
"placeholder": "",
"style": "IPY_MODEL_ef6cc43fecde4a8da2e090cde5853429",
"value": " 106k/106k [00:00<00:00, 2.36MB/s]"
}
},
"4d9ab6f4605b465892ad6499a3172b1a": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"ea65267f0aa341389d72635be3e75671": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"88932da6dac441009d2ea0ce5c61e6e8": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"eb89a6efc26641e28373cfcd858293d0": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"b1588c0f92b647f0bf3a6f31dd3c1e5d": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"23b7a43bedb14094857a36d594f01086": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"ef6cc43fecde4a8da2e090cde5853429": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"957f5ed508f843909daf9a3b75470e8f": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_89a1845ac46748d09011ef870a0dbe09",
"IPY_MODEL_1b347354adb649ba8203351c13151a0a",
"IPY_MODEL_6f2857d7a0e141a7b0643dca48d6cfbb"
],
"layout": "IPY_MODEL_d3656a2353944ce0b8756a94b37a76ba"
}
},
"89a1845ac46748d09011ef870a0dbe09": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_50be41d72df9453a802ea0b97b30b6d2",
"placeholder": "",
"style": "IPY_MODEL_1898c0ac42a642b3a72d2b50ce11e4f3",
"value": "tokenizer.json: 100%"
}
},
"1b347354adb649ba8203351c13151a0a": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_b8bce5e59e33458296b93346c8d4acf9",
"max": 4719431,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_3d085ca38040494e973cc93d5ff9bf8c",
"value": 4719431
}
},
"6f2857d7a0e141a7b0643dca48d6cfbb": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_bf52ff5a55ce44b0991e0a48e204fc56",
"placeholder": "",
"style": "IPY_MODEL_a018d2d67e8c4c4081daad6b72f3ee1a",
"value": " 4.72M/4.72M [00:00<00:00, 20.6MB/s]"
}
},
"d3656a2353944ce0b8756a94b37a76ba": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"50be41d72df9453a802ea0b97b30b6d2": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"1898c0ac42a642b3a72d2b50ce11e4f3": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"b8bce5e59e33458296b93346c8d4acf9": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"3d085ca38040494e973cc93d5ff9bf8c": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"bf52ff5a55ce44b0991e0a48e204fc56": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"a018d2d67e8c4c4081daad6b72f3ee1a": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"037700e9766646ddae7083729496b40b": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_fe8f18798bd545f995dd02eeaf77384a",
"IPY_MODEL_7c0ddb47757a44488cdd78ad684589fd",
"IPY_MODEL_ae5d6af8dae14362b3c69697ab15a72a"
],
"layout": "IPY_MODEL_6ea26650157742b587befb21d1f30d86"
}
},
"fe8f18798bd545f995dd02eeaf77384a": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_854f71de40234fe199384ae4cebf20be",
"placeholder": "",
"style": "IPY_MODEL_eb8b27ee72634878bbc67772ed1ffeca",
"value": "special_tokens_map.json: 100%"
}
},
"7c0ddb47757a44488cdd78ad684589fd": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_e57da3016ec445c1bdd3a3c1731b1152",
"max": 97,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_e2c13314f1c4490a82a3e8ade9f1f098",
"value": 97
}
},
"ae5d6af8dae14362b3c69697ab15a72a": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_4f99e07a8ecd49dd90a6a5ec39630cbb",
"placeholder": "",
"style": "IPY_MODEL_9d4da350a4494d7893c95193390e6558",
"value": " 97.0/97.0 [00:00<00:00, 8.47kB/s]"
}
},
"6ea26650157742b587befb21d1f30d86": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"854f71de40234fe199384ae4cebf20be": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"eb8b27ee72634878bbc67772ed1ffeca": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"e57da3016ec445c1bdd3a3c1731b1152": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e2c13314f1c4490a82a3e8ade9f1f098": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"4f99e07a8ecd49dd90a6a5ec39630cbb": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"9d4da350a4494d7893c95193390e6558": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"5133262f73ca4e5daf973095bf4f9b4c": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_357844d6f75c41f3a7f5b8360724e0bf",
"IPY_MODEL_56450694470e4f60a0294dd413bd3af0",
"IPY_MODEL_dfa4768c7cac41ff8e033e5d004afe3d"
],
"layout": "IPY_MODEL_e545b395dbb3424ba54c2ff1b8046048"
}
},
"357844d6f75c41f3a7f5b8360724e0bf": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_5c40e43264f041ef934f1e0aaf13ec56",
"placeholder": "",
"style": "IPY_MODEL_7a60189404d04d4391ac0e1015876a62",
"value": "processor_config.json: 100%"
}
},
"56450694470e4f60a0294dd413bd3af0": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_ff4d4dbc8d1e40d59191cefc4d4845bf",
"max": 210,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_f458a4a037634cfe99e8d504de04b1d7",
"value": 210
}
},
"dfa4768c7cac41ff8e033e5d004afe3d": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_420a613fb36c487698a313bfd456d370",
"placeholder": "",
"style": "IPY_MODEL_9c150ef7e78d4746a341f8d5740a6bca",
"value": " 210/210 [00:00<00:00, 18.0kB/s]"
}
},
"e545b395dbb3424ba54c2ff1b8046048": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"5c40e43264f041ef934f1e0aaf13ec56": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"7a60189404d04d4391ac0e1015876a62": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"ff4d4dbc8d1e40d59191cefc4d4845bf": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"f458a4a037634cfe99e8d504de04b1d7": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"420a613fb36c487698a313bfd456d370": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"9c150ef7e78d4746a341f8d5740a6bca": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"f838d4eb98b54fa0a541a1cef8946969": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_8c5bb358a34342eca7b8fd5e72d8e4aa",
"IPY_MODEL_f46e488cc6ac44c5a7f45ee965b57d07",
"IPY_MODEL_c5aeeb6e4226476a807515d06e4d7f03"
],
"layout": "IPY_MODEL_20a906bac55a41cb80d2f7d21c830263"
}
},
"8c5bb358a34342eca7b8fd5e72d8e4aa": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7d7040673e7f457388875c7f84de4180",
"placeholder": "",
"style": "IPY_MODEL_36fa5cd8d9e9422ba805951bd9066405",
"value": "config.json: 100%"
}
},
"f46e488cc6ac44c5a7f45ee965b57d07": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_e7e3164c57fb4365b6502562ed4e8b93",
"max": 1450,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_9a46e41f46b541f2a98ae78e2f487d52",
"value": 1450
}
},
"c5aeeb6e4226476a807515d06e4d7f03": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_d22516caec184f10b795d5035660029a",
"placeholder": "",
"style": "IPY_MODEL_2ac6ccae9de14d41937201819119bfb3",
"value": " 1.45k/1.45k [00:00<00:00, 113kB/s]"
}
},
"20a906bac55a41cb80d2f7d21c830263": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"7d7040673e7f457388875c7f84de4180": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"36fa5cd8d9e9422ba805951bd9066405": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"e7e3164c57fb4365b6502562ed4e8b93": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"9a46e41f46b541f2a98ae78e2f487d52": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"d22516caec184f10b795d5035660029a": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"2ac6ccae9de14d41937201819119bfb3": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"d46166ae17a14324b0315650d231fabf": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_a0209d882c9e41a39fb4c5086dfc32fe",
"IPY_MODEL_a98fe1626e154c8bb284695b262449c3",
"IPY_MODEL_8abd03a7756c4b9eb19952b33072663f"
],
"layout": "IPY_MODEL_6d8030e748004e05ba5e6ade8e444521"
}
},
"a0209d882c9e41a39fb4c5086dfc32fe": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_e5e1110b5ea4441c84641c6ea8889bc3",
"placeholder": "",
"style": "IPY_MODEL_213741a787f342b9b41848e65597fc6a",
"value": "model.safetensors: 100%"
}
},
"a98fe1626e154c8bb284695b262449c3": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_fefab3bf50c84c5c83a61e7733c39e27",
"max": 4178706382,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_3ec2079aea414a0b85ca58a0529111b9",
"value": 4178706382
}
},
"8abd03a7756c4b9eb19952b33072663f": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"model_module_version": "1.5.0",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_0d91dbd78efa4ca2bc5c8bfc7b6b3007",
"placeholder": "",
"style": "IPY_MODEL_9d4624220b1b48b99417316a072c122a",
"value": " 4.18G/4.18G [00:25<00:00, 140MB/s]"
}
},
"6d8030e748004e05ba5e6ade8e444521": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e5e1110b5ea4441c84641c6ea8889bc3": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"213741a787f342b9b41848e65597fc6a": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"fefab3bf50c84c5c83a61e7733c39e27": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"3ec2079aea414a0b85ca58a0529111b9": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"0d91dbd78efa4ca2bc5c8bfc7b6b3007": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"9d4624220b1b48b99417316a072c122a": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"model_module_version": "1.5.0",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
}
}
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/TheOneTrueGuy/Janus/blob/main/Janus_demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "s-CHPPvkXVw6",
"outputId": "e68e9d54-00cf-4fa3-edca-328acdda2463"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Cloning into 'Janus'...\n",
"remote: Enumerating objects: 35, done.\u001b[K\n",
"remote: Counting objects: 100% (35/35), done.\u001b[K\n",
"remote: Compressing objects: 100% (31/31), done.\u001b[K\n",
"remote: Total 35 (delta 4), reused 35 (delta 4), pack-reused 0 (from 0)\u001b[K\n",
"Receiving objects: 100% (35/35), 955.66 KiB | 6.73 MiB/s, done.\n",
"Resolving deltas: 100% (4/4), done.\n"
]
}
],
"source": [
"!git clone https://github.com/deepseek-ai/Janus.git\n"
]
},
{
"cell_type": "code",
"source": [
"%cd Janus\n",
"!pip install -e .\n",
"!pip install flash-attn"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BIeDQ1ClXwL_",
"outputId": "556e62de-2bac-4eec-9008-78f33515c790"
},
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content/Janus\n",
"Obtaining file:///content/Janus\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Checking if build backend supports build_editable ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build editable ... \u001b[?25l\u001b[?25hdone\n",
" Preparing editable metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: torch>=2.0.1 in /usr/local/lib/python3.10/dist-packages (from janus==1.0.0) (2.4.1+cu121)\n",
"Requirement already satisfied: transformers>=4.38.2 in /usr/local/lib/python3.10/dist-packages (from janus==1.0.0) (4.44.2)\n",
"Requirement already satisfied: timm>=0.9.16 in /usr/local/lib/python3.10/dist-packages (from janus==1.0.0) (1.0.10)\n",
"Requirement already satisfied: accelerate in /usr/local/lib/python3.10/dist-packages (from janus==1.0.0) (0.34.2)\n",
"Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from janus==1.0.0) (0.2.0)\n",
"Collecting attrdict (from janus==1.0.0)\n",
" Downloading attrdict-2.0.1-py2.py3-none-any.whl.metadata (6.7 kB)\n",
"Requirement already satisfied: einops in /usr/local/lib/python3.10/dist-packages (from janus==1.0.0) (0.8.0)\n",
"Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from timm>=0.9.16->janus==1.0.0) (0.19.1+cu121)\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from timm>=0.9.16->janus==1.0.0) (6.0.2)\n",
"Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.10/dist-packages (from timm>=0.9.16->janus==1.0.0) (0.24.7)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from timm>=0.9.16->janus==1.0.0) (0.4.5)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=2.0.1->janus==1.0.0) (3.16.1)\n",
"Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch>=2.0.1->janus==1.0.0) (4.12.2)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=2.0.1->janus==1.0.0) (1.13.3)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=2.0.1->janus==1.0.0) (3.4.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=2.0.1->janus==1.0.0) (3.1.4)\n",
"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=2.0.1->janus==1.0.0) (2024.6.1)\n",
"Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.38.2->janus==1.0.0) (1.26.4)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.38.2->janus==1.0.0) (24.1)\n",
"Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.38.2->janus==1.0.0) (2024.9.11)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers>=4.38.2->janus==1.0.0) (2.32.3)\n",
"Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.38.2->janus==1.0.0) (0.19.1)\n",
"Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.38.2->janus==1.0.0) (4.66.5)\n",
"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate->janus==1.0.0) (5.9.5)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from attrdict->janus==1.0.0) (1.16.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=2.0.1->janus==1.0.0) (3.0.1)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.38.2->janus==1.0.0) (3.4.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.38.2->janus==1.0.0) (3.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.38.2->janus==1.0.0) (2.2.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.38.2->janus==1.0.0) (2024.8.30)\n",
"Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=2.0.1->janus==1.0.0) (1.3.0)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision->timm>=0.9.16->janus==1.0.0) (10.4.0)\n",
"Downloading attrdict-2.0.1-py2.py3-none-any.whl (9.9 kB)\n",
"Building wheels for collected packages: janus\n",
" Building editable for janus (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for janus: filename=janus-1.0.0-0.editable-py3-none-any.whl size=13213 sha256=6549905c4ca7d80ffe93b29502638be9402c9b6e9ad02be1fe7c1de49807f601\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-3bjwhhza/wheels/db/43/fb/dfa5a5a63d3fcf28c66693ac6be0a596a4e95ee0b525246235\n",
"Successfully built janus\n",
"Installing collected packages: attrdict, janus\n",
"Successfully installed attrdict-2.0.1 janus-1.0.0\n",
"Collecting flash-attn\n",
" Downloading flash_attn-2.6.3.tar.gz (2.6 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.6/2.6 MB\u001b[0m \u001b[31m28.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from flash-attn) (2.4.1+cu121)\n",
"Requirement already satisfied: einops in /usr/local/lib/python3.10/dist-packages (from flash-attn) (0.8.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->flash-attn) (3.16.1)\n",
"Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch->flash-attn) (4.12.2)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->flash-attn) (1.13.3)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->flash-attn) (3.4.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->flash-attn) (3.1.4)\n",
"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->flash-attn) (2024.6.1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->flash-attn) (3.0.1)\n",
"Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->flash-attn) (1.3.0)\n",
"Building wheels for collected packages: flash-attn\n",
" Building wheel for flash-attn (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for flash-attn: filename=flash_attn-2.6.3-cp310-cp310-linux_x86_64.whl size=187309225 sha256=237ef9c6157db394e1ddde4ba609a21ebb98382377a27041edc09318801a6f24\n",
" Stored in directory: /root/.cache/pip/wheels/7e/e3/c3/89c7a2f3c4adc07cd1c675f8bb7b9ad4d18f64a72bccdfe826\n",
"Successfully built flash-attn\n",
"Installing collected packages: flash-attn\n",
"Successfully installed flash-attn-2.6.3\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"\n",
"import torch\n",
"from transformers import AutoModelForCausalLM\n",
"from janus.models import MultiModalityCausalLM, VLChatProcessor\n",
"from janus.utils.io import load_pil_images\n",
"\n",
"# specify the path to the model\n",
"model_path = \"deepseek-ai/Janus-1.3B\"\n",
"vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)\n",
"tokenizer = vl_chat_processor.tokenizer\n",
"\n",
"vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(\n",
" model_path, trust_remote_code=True\n",
")\n",
"vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()\n",
"\n",
"conversation = [\n",
" {\n",
" \"role\": \"User\",\n",
" \"content\": \"<image_placeholder>\\nConvert the formula into latex code.\",\n",
" \"images\": [\"images/equation.png\"],\n",
" },\n",
" {\"role\": \"Assistant\", \"content\": \"\"},\n",
"]\n",
"\n",
"# load images and prepare for inputs\n",
"pil_images = load_pil_images(conversation)\n",
"prepare_inputs = vl_chat_processor(\n",
" conversations=conversation, images=pil_images, force_batchify=True\n",
").to(vl_gpt.device)\n",
"\n",
"# # run image encoder to get the image embeddings\n",
"inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)\n",
"\n",
"# # run the model to get the response\n",
"outputs = vl_gpt.language_model.generate(\n",
" inputs_embeds=inputs_embeds,\n",
" attention_mask=prepare_inputs.attention_mask,\n",
" pad_token_id=tokenizer.eos_token_id,\n",
" bos_token_id=tokenizer.bos_token_id,\n",
" eos_token_id=tokenizer.eos_token_id,\n",
" max_new_tokens=512,\n",
" do_sample=False,\n",
" use_cache=True,\n",
")\n",
"\n",
"answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)\n",
"print(f\"{prepare_inputs['sft_format'][0]}\", answer)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 547,
"referenced_widgets": [
"763db730135046ad82de98efeff17d43",
"4f86e96fadab4c0d99817e7e586773c8",
"3d5d9b536d5547f088c5b38998b86f4b",
"e749a276bffb4136a4ca518867fcefdc",
"26d64ffbb4ae49dca04faf3a4e2e55ca",
"e4c1880a3a4048da9fb06fc699d6f130",
"c7b3782ef893473199e9140ec6208edc",
"f95595cb4b6a4f1d829cf9fd7f27e0d3",
"14a1ff29167d48a7a16cf4d848861178",
"51cb411fef0c4be4bdd47a93be2f1cf5",
"f0bace68043e4b1c92a1e3dc4b6192b7",
"e5de18967479416bacedefaeee42e0f7",
"f7bcc5523c57490aa4300fc8720e32b3",
"eef53bb650af41729306ad9d9194706e",
"e7fed22226664988ad2557265b0aa6e6",
"4d9ab6f4605b465892ad6499a3172b1a",
"ea65267f0aa341389d72635be3e75671",
"88932da6dac441009d2ea0ce5c61e6e8",
"eb89a6efc26641e28373cfcd858293d0",
"b1588c0f92b647f0bf3a6f31dd3c1e5d",
"23b7a43bedb14094857a36d594f01086",
"ef6cc43fecde4a8da2e090cde5853429",
"957f5ed508f843909daf9a3b75470e8f",
"89a1845ac46748d09011ef870a0dbe09",
"1b347354adb649ba8203351c13151a0a",
"6f2857d7a0e141a7b0643dca48d6cfbb",
"d3656a2353944ce0b8756a94b37a76ba",
"50be41d72df9453a802ea0b97b30b6d2",
"1898c0ac42a642b3a72d2b50ce11e4f3",
"b8bce5e59e33458296b93346c8d4acf9",
"3d085ca38040494e973cc93d5ff9bf8c",
"bf52ff5a55ce44b0991e0a48e204fc56",
"a018d2d67e8c4c4081daad6b72f3ee1a",
"037700e9766646ddae7083729496b40b",
"fe8f18798bd545f995dd02eeaf77384a",
"7c0ddb47757a44488cdd78ad684589fd",
"ae5d6af8dae14362b3c69697ab15a72a",
"6ea26650157742b587befb21d1f30d86",
"854f71de40234fe199384ae4cebf20be",
"eb8b27ee72634878bbc67772ed1ffeca",
"e57da3016ec445c1bdd3a3c1731b1152",
"e2c13314f1c4490a82a3e8ade9f1f098",
"4f99e07a8ecd49dd90a6a5ec39630cbb",
"9d4da350a4494d7893c95193390e6558",
"5133262f73ca4e5daf973095bf4f9b4c",
"357844d6f75c41f3a7f5b8360724e0bf",
"56450694470e4f60a0294dd413bd3af0",
"dfa4768c7cac41ff8e033e5d004afe3d",
"e545b395dbb3424ba54c2ff1b8046048",
"5c40e43264f041ef934f1e0aaf13ec56",
"7a60189404d04d4391ac0e1015876a62",
"ff4d4dbc8d1e40d59191cefc4d4845bf",
"f458a4a037634cfe99e8d504de04b1d7",
"420a613fb36c487698a313bfd456d370",
"9c150ef7e78d4746a341f8d5740a6bca",
"f838d4eb98b54fa0a541a1cef8946969",
"8c5bb358a34342eca7b8fd5e72d8e4aa",
"f46e488cc6ac44c5a7f45ee965b57d07",
"c5aeeb6e4226476a807515d06e4d7f03",
"20a906bac55a41cb80d2f7d21c830263",
"7d7040673e7f457388875c7f84de4180",
"36fa5cd8d9e9422ba805951bd9066405",
"e7e3164c57fb4365b6502562ed4e8b93",
"9a46e41f46b541f2a98ae78e2f487d52",
"d22516caec184f10b795d5035660029a",
"2ac6ccae9de14d41937201819119bfb3",
"d46166ae17a14324b0315650d231fabf",
"a0209d882c9e41a39fb4c5086dfc32fe",
"a98fe1626e154c8bb284695b262449c3",
"8abd03a7756c4b9eb19952b33072663f",
"6d8030e748004e05ba5e6ade8e444521",
"e5e1110b5ea4441c84641c6ea8889bc3",
"213741a787f342b9b41848e65597fc6a",
"fefab3bf50c84c5c83a61e7733c39e27",
"3ec2079aea414a0b85ca58a0529111b9",
"0d91dbd78efa4ca2bc5c8bfc7b6b3007",
"9d4624220b1b48b99417316a072c122a"
]
},
"id": "X7B6TWTvXxRu",
"outputId": "072387f9-6404-48f1-bcbb-f1acddf1c77f"
},
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Python version is above 3.10, patching the collections module.\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead\n",
" warnings.warn(\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"preprocessor_config.json: 0%| | 0.00/347 [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "763db730135046ad82de98efeff17d43"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/106k [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "e5de18967479416bacedefaeee42e0f7"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"tokenizer.json: 0%| | 0.00/4.72M [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "957f5ed508f843909daf9a3b75470e8f"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"special_tokens_map.json: 0%| | 0.00/97.0 [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "037700e9766646ddae7083729496b40b"
}
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"processor_config.json: 0%| | 0.00/210 [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "5133262f73ca4e5daf973095bf4f9b4c"
}
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"Some kwargs in processor config are unused and will not have any effect: sft_format, ignore_id, add_special_token, mask_prompt, num_image_tokens, image_tag. \n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"Add image tag = <image_placeholder> to the tokenizer\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"config.json: 0%| | 0.00/1.45k [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "f838d4eb98b54fa0a541a1cef8946969"
}
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"model.safetensors: 0%| | 0.00/4.18G [00:00<?, ?B/s]"
],
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "d46166ae17a14324b0315650d231fabf"
}
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained(\"openai/whisper-tiny\", attn_implementation=\"flash_attention_2\", torch_dtype=torch.float16)`\n",
"Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained(\"openai/whisper-tiny\", attn_implementation=\"flash_attention_2\", torch_dtype=torch.float16)`\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n",
"\n",
"User: <image_placeholder>\n",
"Convert the formula into latex code.\n",
"\n",
"Assistant: Sure, here is the LaTeX code for the formula:\n",
"\n",
"\\[ A_n = a_0 \\left[ 1 + \\frac{3}{4} \\sum_{k=1}^{n} \\left( \\frac{4}{9} \\right)^k \\right] \\]\n"
]
}
]
}
]
}
================================================
FILE: demo/app.py
================================================
import gradio as gr
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from PIL import Image
import numpy as np
# Load model and processor
model_path = "deepseek-ai/Janus-1.3B"
config = AutoConfig.from_pretrained(model_path)
language_config = config.language_config
language_config._attn_implementation = 'eager'
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path,
language_config=language_config,
trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda()
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
cuda_device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Multimodal Understanding function
@torch.inference_mode()
# Multimodal Understanding function
def multimodal_understanding(image, question, seed, top_p, temperature):
# Clear CUDA cache before generating
torch.cuda.empty_cache()
# set seed
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
conversation = [
{
"role": "User",
"content": f"<image_placeholder>\n{question}",
"images": [image],
},
{"role": "Assistant", "content": ""},
]
pil_images = [Image.fromarray(image)]
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(cuda_device, dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float16)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False if temperature == 0 else True,
use_cache=True,
temperature=temperature,
top_p=top_p,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
return answer
def generate(input_ids,
width,
height,
temperature: float = 1,
parallel_size: int = 5,
cfg_weight: float = 5,
image_token_num_per_image: int = 576,
patch_size: int = 16):
# Clear CUDA cache before generating
torch.cuda.empty_cache()
tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).to(cuda_device)
for i in range(parallel_size * 2):
tokens[i, :] = input_ids
if i % 2 != 0:
tokens[i, 1:-1] = vl_chat_processor.pad_id
inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)
generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).to(cuda_device)
pkv = None
for i in range(image_token_num_per_image):
outputs = vl_gpt.language_model.model(inputs_embeds=inputs_embeds,
use_cache=True,
past_key_values=pkv)
pkv = outputs.past_key_values
hidden_states = outputs.last_hidden_state
logits = vl_gpt.gen_head(hidden_states[:, -1, :])
logit_cond = logits[0::2, :]
logit_uncond = logits[1::2, :]
logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_tokens[:, i] = next_token.squeeze(dim=-1)
next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
img_embeds = vl_gpt.prepare_gen_img_embeds(next_token)
inputs_embeds = img_embeds.unsqueeze(dim=1)
patches = vl_gpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int),
shape=[parallel_size, 8, width // patch_size, height // patch_size])
return generated_tokens.to(dtype=torch.int), patches
def unpack(dec, width, height, parallel_size=5):
dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
dec = np.clip((dec + 1) / 2 * 255, 0, 255)
visual_img = np.zeros((parallel_size, width, height, 3), dtype=np.uint8)
visual_img[:, :, :] = dec
return visual_img
@torch.inference_mode()
def generate_image(prompt,
seed=None,
guidance=5):
# Clear CUDA cache and avoid tracking gradients
torch.cuda.empty_cache()
# Set the seed for reproducible results
if seed is not None:
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed)
width = 384
height = 384
parallel_size = 5
with torch.no_grad():
messages = [{'role': 'User', 'content': prompt},
{'role': 'Assistant', 'content': ''}]
text = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(conversations=messages,
sft_format=vl_chat_processor.sft_format,
system_prompt='')
text = text + vl_chat_processor.image_start_tag
input_ids = torch.LongTensor(tokenizer.encode(text))
output, patches = generate(input_ids,
width // 16 * 16,
height // 16 * 16,
cfg_weight=guidance,
parallel_size=parallel_size)
images = unpack(patches,
width // 16 * 16,
height // 16 * 16)
return [Image.fromarray(images[i]).resize((1024, 1024), Image.LANCZOS) for i in range(parallel_size)]
# Gradio interface
with gr.Blocks() as demo:
gr.Markdown(value="# Multimodal Understanding")
# with gr.Row():
with gr.Row():
image_input = gr.Image()
with gr.Column():
question_input = gr.Textbox(label="Question")
und_seed_input = gr.Number(label="Seed", precision=0, value=42)
top_p = gr.Slider(minimum=0, maximum=1, value=0.95, step=0.05, label="top_p")
temperature = gr.Slider(minimum=0, maximum=1, value=0.1, step=0.05, label="temperature")
understanding_button = gr.Button("Chat")
understanding_output = gr.Textbox(label="Response")
examples_inpainting = gr.Examples(
label="Multimodal Understanding examples",
examples=[
[
"explain this meme",
"images/doge.png",
],
[
"Convert the formula into latex code.",
"images/equation.png",
],
],
inputs=[question_input, image_input],
)
gr.Markdown(value="# Text-to-Image Generation")
with gr.Row():
cfg_weight_input = gr.Slider(minimum=1, maximum=10, value=5, step=0.5, label="CFG Weight")
prompt_input = gr.Textbox(label="Prompt")
seed_input = gr.Number(label="Seed (Optional)", precision=0, value=12345)
generation_button = gr.Button("Generate Images")
image_output = gr.Gallery(label="Generated Images", columns=2, rows=2, height=300)
examples_t2i = gr.Examples(
label="Text to image generation examples. (Tips for designing prompts: Adding description like 'digital art' at the end of the prompt or writing the prompt in more detail can help produce better images!)",
examples=[
"Master shifu racoon wearing drip attire as a street gangster.",
"A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
"The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.\n\nAbove the eye, there's a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail. \n\nOverall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component\u2014from the intricate designs framing the eye to the ancient-looking stone piece above\u2014contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.",
],
inputs=prompt_input,
)
understanding_button.click(
multimodal_understanding,
inputs=[image_input, question_input, und_seed_input, top_p, temperature],
outputs=understanding_output
)
generation_button.click(
fn=generate_image,
inputs=[prompt_input, seed_input, cfg_weight_input],
outputs=image_output
)
demo.launch(share=True)
================================================
FILE: demo/app_janusflow.py
================================================
import gradio as gr
import torch
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
from PIL import Image
from diffusers.models import AutoencoderKL
import numpy as np
cuda_device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load model and processor
model_path = "deepseek-ai/JanusFlow-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = MultiModalityCausalLM.from_pretrained(model_path)
vl_gpt = vl_gpt.to(torch.bfloat16).to(cuda_device).eval()
# remember to use bfloat16 dtype, this vae doesn't work with fp16
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
vae = vae.to(torch.bfloat16).to(cuda_device).eval()
# Multimodal Understanding function
@torch.inference_mode()
# Multimodal Understanding function
def multimodal_understanding(image, question, seed, top_p, temperature):
# Clear CUDA cache before generating
torch.cuda.empty_cache()
# set seed
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
conversation = [
{
"role": "User",
"content": f"<image_placeholder>\n{question}",
"images": [image],
},
{"role": "Assistant", "content": ""},
]
pil_images = [Image.fromarray(image)]
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(cuda_device, dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float16)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False if temperature == 0 else True,
use_cache=True,
temperature=temperature,
top_p=top_p,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
return answer
@torch.inference_mode()
def generate(
input_ids,
cfg_weight: float = 2.0,
num_inference_steps: int = 30
):
# we generate 5 images at a time, *2 for CFG
tokens = torch.stack([input_ids] * 10).cuda()
tokens[5:, 1:] = vl_chat_processor.pad_id
inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)
print(inputs_embeds.shape)
# we remove the last <bog> token and replace it with t_emb later
inputs_embeds = inputs_embeds[:, :-1, :]
# generate with rectified flow ode
# step 1: encode with vision_gen_enc
z = torch.randn((5, 4, 48, 48), dtype=torch.bfloat16).cuda()
dt = 1.0 / num_inference_steps
dt = torch.zeros_like(z).cuda().to(torch.bfloat16) + dt
# step 2: run ode
attention_mask = torch.ones((10, inputs_embeds.shape[1]+577)).to(vl_gpt.device)
attention_mask[5:, 1:inputs_embeds.shape[1]] = 0
attention_mask = attention_mask.int()
for step in range(num_inference_steps):
# prepare inputs for the llm
z_input = torch.cat([z, z], dim=0) # for cfg
t = step / num_inference_steps * 1000.
t = torch.tensor([t] * z_input.shape[0]).to(dt)
z_enc = vl_gpt.vision_gen_enc_model(z_input, t)
z_emb, t_emb, hs = z_enc[0], z_enc[1], z_enc[2]
z_emb = z_emb.view(z_emb.shape[0], z_emb.shape[1], -1).permute(0, 2, 1)
z_emb = vl_gpt.vision_gen_enc_aligner(z_emb)
llm_emb = torch.cat([inputs_embeds, t_emb.unsqueeze(1), z_emb], dim=1)
# input to the llm
# we apply attention mask for CFG: 1 for tokens that are not masked, 0 for tokens that are masked.
if step == 0:
outputs = vl_gpt.language_model.model(inputs_embeds=llm_emb,
use_cache=True,
attention_mask=attention_mask,
past_key_values=None)
past_key_values = []
for kv_cache in past_key_values:
k, v = kv_cache[0], kv_cache[1]
past_key_values.append((k[:, :, :inputs_embeds.shape[1], :], v[:, :, :inputs_embeds.shape[1], :]))
past_key_values = tuple(past_key_values)
else:
outputs = vl_gpt.language_model.model(inputs_embeds=llm_emb,
use_cache=True,
attention_mask=attention_mask,
past_key_values=past_key_values)
hidden_states = outputs.last_hidden_state
# transform hidden_states back to v
hidden_states = vl_gpt.vision_gen_dec_aligner(vl_gpt.vision_gen_dec_aligner_norm(hidden_states[:, -576:, :]))
hidden_states = hidden_states.reshape(z_emb.shape[0], 24, 24, 768).permute(0, 3, 1, 2)
v = vl_gpt.vision_gen_dec_model(hidden_states, hs, t_emb)
v_cond, v_uncond = torch.chunk(v, 2)
v = cfg_weight * v_cond - (cfg_weight-1.) * v_uncond
z = z + dt * v
# step 3: decode with vision_gen_dec and sdxl vae
decoded_image = vae.decode(z / vae.config.scaling_factor).sample
images = decoded_image.float().clip_(-1., 1.).permute(0,2,3,1).cpu().numpy()
images = ((images+1) / 2. * 255).astype(np.uint8)
return images
def unpack(dec, width, height, parallel_size=5):
dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
dec = np.clip((dec + 1) / 2 * 255, 0, 255)
visual_img = np.zeros((parallel_size, width, height, 3), dtype=np.uint8)
visual_img[:, :, :] = dec
return visual_img
@torch.inference_mode()
def generate_image(prompt,
seed=None,
guidance=5,
num_inference_steps=30):
# Clear CUDA cache and avoid tracking gradients
torch.cuda.empty_cache()
# Set the seed for reproducible results
if seed is not None:
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed)
with torch.no_grad():
messages = [{'role': 'User', 'content': prompt},
{'role': 'Assistant', 'content': ''}]
text = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(conversations=messages,
sft_format=vl_chat_processor.sft_format,
system_prompt='')
text = text + vl_chat_processor.image_start_tag
input_ids = torch.LongTensor(tokenizer.encode(text))
images = generate(input_ids,
cfg_weight=guidance,
num_inference_steps=num_inference_steps)
return [Image.fromarray(images[i]).resize((1024, 1024), Image.LANCZOS) for i in range(images.shape[0])]
# Gradio interface
with gr.Blocks() as demo:
gr.Markdown(value="# Multimodal Understanding")
# with gr.Row():
with gr.Row():
image_input = gr.Image()
with gr.Column():
question_input = gr.Textbox(label="Question")
und_seed_input = gr.Number(label="Seed", precision=0, value=42)
top_p = gr.Slider(minimum=0, maximum=1, value=0.95, step=0.05, label="top_p")
temperature = gr.Slider(minimum=0, maximum=1, value=0.1, step=0.05, label="temperature")
understanding_button = gr.Button("Chat")
understanding_output = gr.Textbox(label="Response")
examples_inpainting = gr.Examples(
label="Multimodal Understanding examples",
examples=[
[
"explain this meme",
"./images/doge.png",
],
[
"Convert the formula into latex code.",
"./images/equation.png",
],
],
inputs=[question_input, image_input],
)
gr.Markdown(value="# Text-to-Image Generation")
with gr.Row():
cfg_weight_input = gr.Slider(minimum=1, maximum=10, value=2, step=0.5, label="CFG Weight")
step_input = gr.Slider(minimum=1, maximum=50, value=30, step=1, label="Number of Inference Steps")
prompt_input = gr.Textbox(label="Prompt")
seed_input = gr.Number(label="Seed (Optional)", precision=0, value=12345)
generation_button = gr.Button("Generate Images")
image_output = gr.Gallery(label="Generated Images", columns=2, rows=2, height=300)
examples_t2i = gr.Examples(
label="Text to image generation examples.",
examples=[
"Master shifu racoon wearing drip attire as a street gangster.",
"A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
"The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.\n\nAbove the eye, there's a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail. \n\nOverall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component\u2014from the intricate designs framing the eye to the ancient-looking stone piece above\u2014contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.",
],
inputs=prompt_input,
)
understanding_button.click(
multimodal_understanding,
inputs=[image_input, question_input, und_seed_input, top_p, temperature],
outputs=understanding_output
)
generation_button.click(
fn=generate_image,
inputs=[prompt_input, seed_input, cfg_weight_input, step_input],
outputs=image_output
)
demo.launch(share=True)
================================================
FILE: demo/app_januspro.py
================================================
import gradio as gr
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
from PIL import Image
import numpy as np
import os
import time
# import spaces # Import spaces for ZeroGPU compatibility
# Load model and processor
model_path = "deepseek-ai/Janus-Pro-7B"
config = AutoConfig.from_pretrained(model_path)
language_config = config.language_config
language_config._attn_implementation = 'eager'
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path,
language_config=language_config,
trust_remote_code=True)
if torch.cuda.is_available():
vl_gpt = vl_gpt.to(torch.bfloat16).cuda()
else:
vl_gpt = vl_gpt.to(torch.float16)
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
cuda_device = 'cuda' if torch.cuda.is_available() else 'cpu'
@torch.inference_mode()
# @spaces.GPU(duration=120)
# Multimodal Understanding function
def multimodal_understanding(image, question, seed, top_p, temperature):
# Clear CUDA cache before generating
torch.cuda.empty_cache()
# set seed
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
conversation = [
{
"role": "<|User|>",
"content": f"<image_placeholder>\n{question}",
"images": [image],
},
{"role": "<|Assistant|>", "content": ""},
]
pil_images = [Image.fromarray(image)]
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(cuda_device, dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float16)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False if temperature == 0 else True,
use_cache=True,
temperature=temperature,
top_p=top_p,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
return answer
def generate(input_ids,
width,
height,
temperature: float = 1,
parallel_size: int = 5,
cfg_weight: float = 5,
image_token_num_per_image: int = 576,
patch_size: int = 16):
# Clear CUDA cache before generating
torch.cuda.empty_cache()
tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).to(cuda_device)
for i in range(parallel_size * 2):
tokens[i, :] = input_ids
if i % 2 != 0:
tokens[i, 1:-1] = vl_chat_processor.pad_id
inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)
generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).to(cuda_device)
pkv = None
for i in range(image_token_num_per_image):
with torch.no_grad():
outputs = vl_gpt.language_model.model(inputs_embeds=inputs_embeds,
use_cache=True,
past_key_values=pkv)
pkv = outputs.past_key_values
hidden_states = outputs.last_hidden_state
logits = vl_gpt.gen_head(hidden_states[:, -1, :])
logit_cond = logits[0::2, :]
logit_uncond = logits[1::2, :]
logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_tokens[:, i] = next_token.squeeze(dim=-1)
next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
img_embeds = vl_gpt.prepare_gen_img_embeds(next_token)
inputs_embeds = img_embeds.unsqueeze(dim=1)
patches = vl_gpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int),
shape=[parallel_size, 8, width // patch_size, height // patch_size])
return generated_tokens.to(dtype=torch.int), patches
def unpack(dec, width, height, parallel_size=5):
dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
dec = np.clip((dec + 1) / 2 * 255, 0, 255)
visual_img = np.zeros((parallel_size, width, height, 3), dtype=np.uint8)
visual_img[:, :, :] = dec
return visual_img
@torch.inference_mode()
# @spaces.GPU(duration=120) # Specify a duration to avoid timeout
def generate_image(prompt,
seed=None,
guidance=5,
t2i_temperature=1.0):
# Clear CUDA cache and avoid tracking gradients
torch.cuda.empty_cache()
# Set the seed for reproducible results
if seed is not None:
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed)
width = 384
height = 384
parallel_size = 5
with torch.no_grad():
messages = [{'role': '<|User|>', 'content': prompt},
{'role': '<|Assistant|>', 'content': ''}]
text = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(conversations=messages,
sft_format=vl_chat_processor.sft_format,
system_prompt='')
text = text + vl_chat_processor.image_start_tag
input_ids = torch.LongTensor(tokenizer.encode(text))
output, patches = generate(input_ids,
width // 16 * 16,
height // 16 * 16,
cfg_weight=guidance,
parallel_size=parallel_size,
temperature=t2i_temperature)
images = unpack(patches,
width // 16 * 16,
height // 16 * 16,
parallel_size=parallel_size)
return [Image.fromarray(images[i]).resize((768, 768), Image.LANCZOS) for i in range(parallel_size)]
# Gradio interface
with gr.Blocks() as demo:
gr.Markdown(value="# Multimodal Understanding")
with gr.Row():
image_input = gr.Image()
with gr.Column():
question_input = gr.Textbox(label="Question")
und_seed_input = gr.Number(label="Seed", precision=0, value=42)
top_p = gr.Slider(minimum=0, maximum=1, value=0.95, step=0.05, label="top_p")
temperature = gr.Slider(minimum=0, maximum=1, value=0.1, step=0.05, label="temperature")
understanding_button = gr.Button("Chat")
understanding_output = gr.Textbox(label="Response")
examples_inpainting = gr.Examples(
label="Multimodal Understanding examples",
examples=[
[
"explain this meme",
"images/doge.png",
],
[
"Convert the formula into latex code.",
"images/equation.png",
],
],
inputs=[question_input, image_input],
)
gr.Markdown(value="# Text-to-Image Generation")
with gr.Row():
cfg_weight_input = gr.Slider(minimum=1, maximum=10, value=5, step=0.5, label="CFG Weight")
t2i_temperature = gr.Slider(minimum=0, maximum=1, value=1.0, step=0.05, label="temperature")
prompt_input = gr.Textbox(label="Prompt. (Prompt in more detail can help produce better images!)")
seed_input = gr.Number(label="Seed (Optional)", precision=0, value=12345)
generation_button = gr.Button("Generate Images")
image_output = gr.Gallery(label="Generated Images", columns=2, rows=2, height=300)
examples_t2i = gr.Examples(
label="Text to image generation examples.",
examples=[
"Master shifu racoon wearing drip attire as a street gangster.",
"The face of a beautiful girl",
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
"A glass of red wine on a reflective surface.",
"A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
"The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.\n\nAbove the eye, there's a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail. \n\nOverall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component\u2014from the intricate designs framing the eye to the ancient-looking stone piece above\u2014contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.",
],
inputs=prompt_input,
)
understanding_button.click(
multimodal_understanding,
inputs=[image_input, question_input, und_seed_input, top_p, temperature],
outputs=understanding_output
)
generation_button.click(
fn=generate_image,
inputs=[prompt_input, seed_input, cfg_weight_input, t2i_temperature],
outputs=image_output
)
demo.launch(share=True)
# demo.queue(concurrency_count=1, max_size=10).launch(server_name="0.0.0.0", server_port=37906, root_path="/path")
================================================
FILE: demo/fastapi_app.py
================================================
from fastapi import FastAPI, File, Form, UploadFile, HTTPException
from fastapi.responses import JSONResponse, StreamingResponse
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from PIL import Image
import numpy as np
import io
app = FastAPI()
# Load model and processor
model_path = "deepseek-ai/Janus-1.3B"
config = AutoConfig.from_pretrained(model_path)
language_config = config.language_config
language_config._attn_implementation = 'eager'
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path,
language_config=language_config,
trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda()
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
cuda_device = 'cuda' if torch.cuda.is_available() else 'cpu'
@torch.inference_mode()
def multimodal_understanding(image_data, question, seed, top_p, temperature):
torch.cuda.empty_cache()
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
conversation = [
{
"role": "User",
"content": f"<image_placeholder>\n{question}",
"images": [image_data],
},
{"role": "Assistant", "content": ""},
]
pil_images = [Image.open(io.BytesIO(image_data))]
prepare_inputs = vl_chat_processor(
conversations=conversation, images=pil_images, force_batchify=True
).to(cuda_device, dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float16)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False if temperature == 0 else True,
use_cache=True,
temperature=temperature,
top_p=top_p,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
return answer
@app.post("/understand_image_and_question/")
async def understand_image_and_question(
file: UploadFile = File(...),
question: str = Form(...),
seed: int = Form(42),
top_p: float = Form(0.95),
temperature: float = Form(0.1)
):
image_data = await file.read()
response = multimodal_understanding(image_data, question, seed, top_p, temperature)
return JSONResponse({"response": response})
def generate(input_ids,
width,
height,
temperature: float = 1,
parallel_size: int = 5,
cfg_weight: float = 5,
image_token_num_per_image: int = 576,
patch_size: int = 16):
torch.cuda.empty_cache()
tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).to(cuda_device)
for i in range(parallel_size * 2):
tokens[i, :] = input_ids
if i % 2 != 0:
tokens[i, 1:-1] = vl_chat_processor.pad_id
inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)
generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).to(cuda_device)
pkv = None
for i in range(image_token_num_per_image):
outputs = vl_gpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=pkv)
pkv = outputs.past_key_values
hidden_states = outputs.last_hidden_state
logits = vl_gpt.gen_head(hidden_states[:, -1, :])
logit_cond = logits[0::2, :]
logit_uncond = logits[1::2, :]
logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_tokens[:, i] = next_token.squeeze(dim=-1)
next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
img_embeds = vl_gpt.prepare_gen_img_embeds(next_token)
inputs_embeds = img_embeds.unsqueeze(dim=1)
patches = vl_gpt.gen_vision_model.decode_code(
generated_tokens.to(dtype=torch.int),
shape=[parallel_size, 8, width // patch_size, height // patch_size]
)
return generated_tokens.to(dtype=torch.int), patches
def unpack(dec, width, height, parallel_size=5):
dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
dec = np.clip((dec + 1) / 2 * 255, 0, 255)
visual_img = np.zeros((parallel_size, width, height, 3), dtype=np.uint8)
visual_img[:, :, :] = dec
return visual_img
@torch.inference_mode()
def generate_image(prompt, s
gitextract_8bv7suly/ ├── .gitattributes ├── .gitignore ├── LICENSE-CODE ├── LICENSE-MODEL ├── Makefile ├── README.md ├── demo/ │ ├── Janus_colab_demo.ipynb │ ├── app.py │ ├── app_janusflow.py │ ├── app_januspro.py │ ├── fastapi_app.py │ └── fastapi_client.py ├── generation_inference.py ├── inference.py ├── interactivechat.py ├── janus/ │ ├── __init__.py │ ├── janusflow/ │ │ ├── __init__.py │ │ └── models/ │ │ ├── __init__.py │ │ ├── clip_encoder.py │ │ ├── image_processing_vlm.py │ │ ├── modeling_vlm.py │ │ ├── processing_vlm.py │ │ ├── siglip_vit.py │ │ └── uvit.py │ ├── models/ │ │ ├── __init__.py │ │ ├── clip_encoder.py │ │ ├── image_processing_vlm.py │ │ ├── modeling_vlm.py │ │ ├── processing_vlm.py │ │ ├── projector.py │ │ ├── siglip_vit.py │ │ └── vq_model.py │ └── utils/ │ ├── __init__.py │ ├── conversation.py │ └── io.py ├── pyproject.toml └── requirements.txt
SYMBOL INDEX (274 symbols across 22 files)
FILE: demo/app.py
function multimodal_understanding (line 26) | def multimodal_understanding(image, question, seed, top_p, temperature):
function generate (line 69) | def generate(input_ids,
function unpack (line 110) | def unpack(dec, width, height, parallel_size=5):
function generate_image (line 122) | def generate_image(prompt,
FILE: demo/app_janusflow.py
function multimodal_understanding (line 25) | def multimodal_understanding(image, question, seed, top_p, temperature):
function generate (line 70) | def generate(
function unpack (line 141) | def unpack(dec, width, height, parallel_size=5):
function generate_image (line 152) | def generate_image(prompt,
FILE: demo/app_januspro.py
function multimodal_understanding (line 34) | def multimodal_understanding(image, question, seed, top_p, temperature):
function generate (line 77) | def generate(input_ids,
function unpack (line 123) | def unpack(dec, width, height, parallel_size=5):
function generate_image (line 136) | def generate_image(prompt,
FILE: demo/fastapi_app.py
function multimodal_understanding (line 28) | def multimodal_understanding(image_data, question, seed, top_p, temperat...
function understand_image_and_question (line 67) | async def understand_image_and_question(
function generate (line 79) | def generate(input_ids,
function unpack (line 119) | def unpack(dec, width, height, parallel_size=5):
function generate_image (line 130) | def generate_image(prompt, seed, guidance):
function generate_images (line 156) | async def generate_images(
FILE: demo/fastapi_client.py
function understand_image_and_question (line 12) | def understand_image_and_question(image_path, question, seed=42, top_p=0...
function generate_images (line 26) | def generate_images(prompt, seed=None, guidance=5.0):
FILE: generation_inference.py
function generate (line 55) | def generate(
FILE: interactivechat.py
function create_prompt (line 21) | def create_prompt(user_input: str) -> str:
function generate (line 40) | def generate(
function interactive_image_generator (line 113) | def interactive_image_generator():
FILE: janus/janusflow/models/clip_encoder.py
class CLIPVisionTower (line 30) | class CLIPVisionTower(nn.Module):
method __init__ (line 31) | def __init__(
method build_vision_tower (line 70) | def build_vision_tower(self, vision_tower_params):
method feature_select (line 88) | def feature_select(self, image_forward_outs):
method forward (line 107) | def forward(self, images):
FILE: janus/janusflow/models/image_processing_vlm.py
function expand2square (line 41) | def expand2square(pil_img, background_color):
class VLMImageProcessorConfig (line 55) | class VLMImageProcessorConfig(PretrainedConfig):
method __init__ (line 64) | def __init__(
class VLMImageProcessor (line 92) | class VLMImageProcessor(BaseImageProcessor):
method __init__ (line 95) | def __init__(
method resize (line 127) | def resize(self, pil_img: Image) -> np.ndarray:
method preprocess (line 164) | def preprocess(self, images, return_tensors: str = "pt", **kwargs) -> ...
method default_shape (line 195) | def default_shape(self):
FILE: janus/janusflow/models/modeling_vlm.py
function model_name_to_cls (line 37) | def model_name_to_cls(cls_name):
class VisionUnderstandEncoderConfig (line 51) | class VisionUnderstandEncoderConfig(PretrainedConfig):
method __init__ (line 56) | def __init__(self, **kwargs):
class VisionGenerationEncoderConfig (line 66) | class VisionGenerationEncoderConfig(PretrainedConfig):
method __init__ (line 71) | def __init__(self, **kwargs):
class VisionGenerationDecoderConfig (line 81) | class VisionGenerationDecoderConfig(PretrainedConfig):
method __init__ (line 86) | def __init__(self, **kwargs):
class MultiModalityConfig (line 96) | class MultiModalityConfig(PretrainedConfig):
method __init__ (line 101) | def __init__(self, **kwargs):
class MultiModalityPreTrainedModel (line 125) | class MultiModalityPreTrainedModel(PreTrainedModel):
class MultiModalityCausalLM (line 132) | class MultiModalityCausalLM(MultiModalityPreTrainedModel):
method __init__ (line 134) | def __init__(self, config: MultiModalityConfig):
method prepare_inputs_embeds (line 171) | def prepare_inputs_embeds(
FILE: janus/janusflow/models/processing_vlm.py
class DictOutput (line 32) | class DictOutput(object):
method keys (line 33) | def keys(self):
method __getitem__ (line 36) | def __getitem__(self, item):
method __setitem__ (line 39) | def __setitem__(self, key, value):
class VLChatProcessorOutput (line 44) | class VLChatProcessorOutput(DictOutput):
method __len__ (line 50) | def __len__(self):
class BatchedVLChatProcessorOutput (line 55) | class BatchedVLChatProcessorOutput(DictOutput):
method to (line 63) | def to(self, device, dtype=torch.bfloat16):
class VLChatProcessor (line 72) | class VLChatProcessor(ProcessorMixin):
method __init__ (line 84) | def __init__(
method new_chat_template (line 154) | def new_chat_template(self):
method apply_sft_template_for_multi_turn_prompts (line 159) | def apply_sft_template_for_multi_turn_prompts(
method image_token (line 202) | def image_token(self):
method image_id (line 206) | def image_id(self):
method image_start_id (line 211) | def image_start_id(self):
method image_end_id (line 216) | def image_end_id(self):
method image_start_token (line 221) | def image_start_token(self):
method image_end_token (line 225) | def image_end_token(self):
method pad_id (line 229) | def pad_id(self):
method image_gen_id (line 237) | def image_gen_id(self):
method add_image_token (line 241) | def add_image_token(
method process_one (line 289) | def process_one(
method __call__ (line 352) | def __call__(
method batchify (line 387) | def batchify(
FILE: janus/janusflow/models/siglip_vit.py
function _no_grad_trunc_normal_ (line 54) | def _no_grad_trunc_normal_(tensor, mean, std, a, b):
function trunc_normal_ (line 92) | def trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0):
function init_weights (line 120) | def init_weights(self):
function init_weights_vit_timm (line 126) | def init_weights_vit_timm(module: nn.Module, name: str = "") -> None:
class Attention (line 136) | class Attention(nn.Module):
method __init__ (line 139) | def __init__(
method forward (line 164) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class LayerScale (line 194) | class LayerScale(nn.Module):
method __init__ (line 195) | def __init__(
method forward (line 205) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class Block (line 209) | class Block(nn.Module):
method __init__ (line 210) | def __init__(
method forward (line 253) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class VisionTransformer (line 259) | class VisionTransformer(nn.Module):
method __init__ (line 268) | def __init__(
method init_weights (line 434) | def init_weights(self, mode: Literal["jax", "jax_nlhb", "moco", ""] = ...
method no_weight_decay (line 443) | def no_weight_decay(self) -> Set:
method group_matcher (line 447) | def group_matcher(self, coarse: bool = False) -> Dict:
method set_grad_checkpointing (line 454) | def set_grad_checkpointing(self, enable: bool = True) -> None:
method get_classifier (line 458) | def get_classifier(self) -> nn.Module:
method reset_classifier (line 461) | def reset_classifier(self, num_classes: int, global_pool=None) -> None:
method _pos_embed (line 476) | def _pos_embed(self, x: torch.Tensor) -> torch.Tensor:
method _intermediate_layers (line 509) | def _intermediate_layers(
method get_intermediate_layers (line 531) | def get_intermediate_layers(
method forward_features (line 562) | def forward_features(self, x: torch.Tensor) -> torch.Tensor:
method forward_head (line 574) | def forward_head(self, x: torch.Tensor, pre_logits: bool = False) -> t...
method forward (line 585) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class SigLIPVisionCfg (line 593) | class SigLIPVisionCfg:
function create_siglip_vit (line 650) | def create_siglip_vit(
FILE: janus/janusflow/models/uvit.py
class ImageHead (line 35) | class ImageHead(nn.Module):
method __init__ (line 37) | def __init__(self, decoder_cfg, gpt_cfg, layer_id=None):
method forward (line 105) | def forward(
class GlobalResponseNorm (line 137) | class GlobalResponseNorm(nn.Module):
method __init__ (line 139) | def __init__(self, dim):
method forward (line 144) | def forward(self, x):
class Downsample2D (line 151) | class Downsample2D(nn.Module):
method __init__ (line 167) | def __init__(
method forward (line 219) | def forward(self, hidden_states: torch.Tensor, *args, **kwargs) -> tor...
class Upsample2D (line 239) | class Upsample2D(nn.Module):
method __init__ (line 255) | def __init__(
method forward (line 318) | def forward(
class ConvNextBlock (line 373) | class ConvNextBlock(nn.Module):
method __init__ (line 374) | def __init__(
method forward (line 405) | def forward(self, x, cond_embeds):
class Patchify (line 430) | class Patchify(nn.Module):
method __init__ (line 431) | def __init__(
method forward (line 453) | def forward(self, x):
class Unpatchify (line 461) | class Unpatchify(nn.Module):
method __init__ (line 462) | def __init__(
method forward (line 476) | def forward(self, x):
class UVitBlock (line 486) | class UVitBlock(nn.Module):
method __init__ (line 487) | def __init__(
method forward (line 559) | def forward(self, x, emb, recompute=False):
class ShallowUViTEncoder (line 572) | class ShallowUViTEncoder(nn.Module):
method __init__ (line 573) | def __init__(
method get_num_extra_tensors (line 625) | def get_num_extra_tensors(self):
method forward (line 628) | def forward(self, x, timesteps):
class ShallowUViTDecoder (line 644) | class ShallowUViTDecoder(nn.Module):
method __init__ (line 645) | def __init__(
method forward (line 702) | def forward(self, x, hs, t_emb):
FILE: janus/models/clip_encoder.py
class CLIPVisionTower (line 30) | class CLIPVisionTower(nn.Module):
method __init__ (line 31) | def __init__(
method build_vision_tower (line 70) | def build_vision_tower(self, vision_tower_params):
method feature_select (line 88) | def feature_select(self, image_forward_outs):
method forward (line 107) | def forward(self, images):
FILE: janus/models/image_processing_vlm.py
function expand2square (line 41) | def expand2square(pil_img, background_color):
class VLMImageProcessorConfig (line 55) | class VLMImageProcessorConfig(PretrainedConfig):
method __init__ (line 64) | def __init__(
class VLMImageProcessor (line 92) | class VLMImageProcessor(BaseImageProcessor):
method __init__ (line 95) | def __init__(
method resize (line 127) | def resize(self, pil_img: Image) -> np.ndarray:
method preprocess (line 164) | def preprocess(self, images, return_tensors: str = "pt", **kwargs) -> ...
method default_shape (line 195) | def default_shape(self):
FILE: janus/models/modeling_vlm.py
class vision_head (line 36) | class vision_head(torch.nn.Module):
method __init__ (line 37) | def __init__(self, params):
method forward (line 47) | def forward(self, x):
function model_name_to_cls (line 54) | def model_name_to_cls(cls_name):
class VisionConfig (line 73) | class VisionConfig(PretrainedConfig):
method __init__ (line 78) | def __init__(self, **kwargs):
class AlignerConfig (line 88) | class AlignerConfig(PretrainedConfig):
method __init__ (line 93) | def __init__(self, **kwargs):
class GenVisionConfig (line 103) | class GenVisionConfig(PretrainedConfig):
method __init__ (line 108) | def __init__(self, **kwargs):
class GenAlignerConfig (line 118) | class GenAlignerConfig(PretrainedConfig):
method __init__ (line 123) | def __init__(self, **kwargs):
class GenHeadConfig (line 133) | class GenHeadConfig(PretrainedConfig):
method __init__ (line 138) | def __init__(self, **kwargs):
class MultiModalityConfig (line 148) | class MultiModalityConfig(PretrainedConfig):
method __init__ (line 159) | def __init__(self, **kwargs):
class MultiModalityPreTrainedModel (line 183) | class MultiModalityPreTrainedModel(PreTrainedModel):
class MultiModalityCausalLM (line 190) | class MultiModalityCausalLM(MultiModalityPreTrainedModel):
method __init__ (line 191) | def __init__(self, config: MultiModalityConfig):
method prepare_inputs_embeds (line 221) | def prepare_inputs_embeds(
method prepare_gen_img_embeds (line 262) | def prepare_gen_img_embeds(self, image_ids: torch.LongTensor):
FILE: janus/models/processing_vlm.py
class DictOutput (line 32) | class DictOutput(object):
method keys (line 33) | def keys(self):
method __getitem__ (line 36) | def __getitem__(self, item):
method __setitem__ (line 39) | def __setitem__(self, key, value):
class VLChatProcessorOutput (line 44) | class VLChatProcessorOutput(DictOutput):
method __len__ (line 50) | def __len__(self):
class BatchedVLChatProcessorOutput (line 55) | class BatchedVLChatProcessorOutput(DictOutput):
method to (line 63) | def to(self, device, dtype=torch.bfloat16):
class VLChatProcessor (line 72) | class VLChatProcessor(ProcessorMixin):
method __init__ (line 84) | def __init__(
method new_chat_template (line 132) | def new_chat_template(self):
method apply_sft_template_for_multi_turn_prompts (line 137) | def apply_sft_template_for_multi_turn_prompts(
method image_token (line 180) | def image_token(self):
method image_id (line 184) | def image_id(self):
method image_start_id (line 189) | def image_start_id(self):
method image_end_id (line 194) | def image_end_id(self):
method image_start_token (line 199) | def image_start_token(self):
method image_end_token (line 203) | def image_end_token(self):
method pad_id (line 207) | def pad_id(self):
method add_image_token (line 215) | def add_image_token(
method process_one (line 260) | def process_one(
method __call__ (line 322) | def __call__(
method batchify (line 357) | def batchify(
FILE: janus/models/projector.py
class MlpProjector (line 27) | class MlpProjector(nn.Module):
method __init__ (line 28) | def __init__(self, cfg):
method forward (line 63) | def forward(
FILE: janus/models/siglip_vit.py
function _no_grad_trunc_normal_ (line 54) | def _no_grad_trunc_normal_(tensor, mean, std, a, b):
function trunc_normal_ (line 92) | def trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0):
function init_weights (line 120) | def init_weights(self):
function init_weights_vit_timm (line 126) | def init_weights_vit_timm(module: nn.Module, name: str = "") -> None:
class Attention (line 136) | class Attention(nn.Module):
method __init__ (line 139) | def __init__(
method forward (line 164) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class LayerScale (line 194) | class LayerScale(nn.Module):
method __init__ (line 195) | def __init__(
method forward (line 205) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class Block (line 209) | class Block(nn.Module):
method __init__ (line 210) | def __init__(
method forward (line 253) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class VisionTransformer (line 259) | class VisionTransformer(nn.Module):
method __init__ (line 268) | def __init__(
method init_weights (line 434) | def init_weights(self, mode: Literal["jax", "jax_nlhb", "moco", ""] = ...
method no_weight_decay (line 443) | def no_weight_decay(self) -> Set:
method group_matcher (line 447) | def group_matcher(self, coarse: bool = False) -> Dict:
method set_grad_checkpointing (line 454) | def set_grad_checkpointing(self, enable: bool = True) -> None:
method get_classifier (line 458) | def get_classifier(self) -> nn.Module:
method reset_classifier (line 461) | def reset_classifier(self, num_classes: int, global_pool=None) -> None:
method _pos_embed (line 476) | def _pos_embed(self, x: torch.Tensor) -> torch.Tensor:
method _intermediate_layers (line 509) | def _intermediate_layers(
method get_intermediate_layers (line 531) | def get_intermediate_layers(
method forward_features (line 562) | def forward_features(self, x: torch.Tensor) -> torch.Tensor:
method forward_head (line 574) | def forward_head(self, x: torch.Tensor, pre_logits: bool = False) -> t...
method forward (line 585) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class SigLIPVisionCfg (line 593) | class SigLIPVisionCfg:
function create_siglip_vit (line 640) | def create_siglip_vit(
FILE: janus/models/vq_model.py
class ModelArgs (line 32) | class ModelArgs:
class Encoder (line 46) | class Encoder(nn.Module):
method __init__ (line 47) | def __init__(
method forward (line 105) | def forward(self, x):
class Decoder (line 127) | class Decoder(nn.Module):
method __init__ (line 128) | def __init__(
method last_layer (line 190) | def last_layer(self):
method forward (line 193) | def forward(self, z):
class VectorQuantizer (line 217) | class VectorQuantizer(nn.Module):
method __init__ (line 218) | def __init__(self, n_e, e_dim, beta, entropy_loss_ratio, l2_norm, show...
method forward (line 236) | def forward(self, z):
method get_codebook_entry (line 284) | def get_codebook_entry(self, indices, shape=None, channel_first=True):
class ResnetBlock (line 302) | class ResnetBlock(nn.Module):
method __init__ (line 303) | def __init__(
method forward (line 337) | def forward(self, x):
class AttnBlock (line 355) | class AttnBlock(nn.Module):
method __init__ (line 356) | def __init__(self, in_channels, norm_type="group"):
method forward (line 366) | def forward(self, x):
function nonlinearity (line 393) | def nonlinearity(x):
function Normalize (line 398) | def Normalize(in_channels, norm_type="group"):
class Upsample (line 408) | class Upsample(nn.Module):
method __init__ (line 409) | def __init__(self, in_channels, with_conv):
method forward (line 417) | def forward(self, x):
class Downsample (line 430) | class Downsample(nn.Module):
method __init__ (line 431) | def __init__(self, in_channels, with_conv):
method forward (line 440) | def forward(self, x):
function compute_entropy_loss (line 450) | def compute_entropy_loss(affinity, loss_type="softmax", temperature=0.01):
class VQModel (line 466) | class VQModel(nn.Module):
method __init__ (line 467) | def __init__(self, config: ModelArgs):
method encode (line 494) | def encode(self, x):
method decode (line 500) | def decode(self, quant):
method decode_code (line 505) | def decode_code(self, code_b, shape=None, channel_first=True):
method forward (line 510) | def forward(self, input):
function VQ_16 (line 519) | def VQ_16(**kwargs):
FILE: janus/utils/conversation.py
class SeparatorStyle (line 29) | class SeparatorStyle(IntEnum):
class Conversation (line 52) | class Conversation:
method get_prompt (line 76) | def get_prompt(self) -> str:
method get_prompt_for_current_round (line 141) | def get_prompt_for_current_round(self, content=None):
method set_system_message (line 153) | def set_system_message(self, system_message: str):
method append_message (line 157) | def append_message(self, role: str, message: str):
method reset_message (line 161) | def reset_message(self):
method update_last_message (line 165) | def update_last_message(self, message: str):
method to_gradio_chatbot (line 173) | def to_gradio_chatbot(self):
method to_openai_api_messages (line 183) | def to_openai_api_messages(self):
method copy (line 196) | def copy(self):
method dict (line 211) | def dict(self):
function register_conv_template (line 225) | def register_conv_template(template: Conversation, override: bool = False):
function get_conv_template (line 235) | def get_conv_template(name: str) -> Conversation:
FILE: janus/utils/io.py
function load_pretrained_model (line 32) | def load_pretrained_model(model_path: str):
function load_pil_images (line 44) | def load_pil_images(conversations: List[Dict[str, str]]) -> List[PIL.Ima...
function load_json (line 86) | def load_json(filepath):
Condensed preview — 37 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (423K chars).
[
{
"path": ".gitattributes",
"chars": 115,
"preview": "* text eol=lf\n*.ipynb linguist-detectable=false\n\n*.png binary\n*.jpg binary\n*.jpeg binary\n*.gif binary\n*.pdf binary\n"
},
{
"path": ".gitignore",
"chars": 7301,
"preview": "##### Python.gitignore #####\n# Byte-compiled / optimized / DLL files\n**/__pycache__/\n*.pyc\n*.pyo\n*.pyd\n*.py[cod]\n*$py.cl"
},
{
"path": "LICENSE-CODE",
"chars": 1065,
"preview": "MIT License\n\nCopyright (c) 2023 DeepSeek\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\no"
},
{
"path": "LICENSE-MODEL",
"chars": 13712,
"preview": "DEEPSEEK LICENSE AGREEMENT\n\nVersion 1.0, 23 October 2023\n\nCopyright (c) 2023 DeepSeek\n\nSection I: PREAMBLE\n\nLarge genera"
},
{
"path": "Makefile",
"chars": 3069,
"preview": "print-% : ; @echo $* = $($*)\nPROJECT_NAME = Janus\nCOPYRIGHT = \"DeepSeek.\"\nPROJECT_PATH = janus\nSHELL "
},
{
"path": "README.md",
"chars": 26742,
"preview": "<!-- markdownlint-disable first-line-h1 -->\n<!-- markdownlint-disable html -->\n<!-- markdownlint-disable no-duplicate-he"
},
{
"path": "demo/Janus_colab_demo.ipynb",
"chars": 110351,
"preview": "{\n \"nbformat\": 4,\n \"nbformat_minor\": 0,\n \"metadata\": {\n \"colab\": {\n \"provenance\": [],\n \"machine_shape\": "
},
{
"path": "demo/app.py",
"chars": 9870,
"preview": "import gradio as gr\nimport torch\nfrom transformers import AutoConfig, AutoModelForCausalLM\nfrom janus.models import Mult"
},
{
"path": "demo/app_janusflow.py",
"chars": 10861,
"preview": "import gradio as gr\nimport torch\nfrom janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor\nfrom PIL impo"
},
{
"path": "demo/app_januspro.py",
"chars": 10706,
"preview": "import gradio as gr\nimport torch\nfrom transformers import AutoConfig, AutoModelForCausalLM\nfrom janus.models import Mult"
},
{
"path": "demo/fastapi_app.py",
"chars": 6571,
"preview": "from fastapi import FastAPI, File, Form, UploadFile, HTTPException\nfrom fastapi.responses import JSONResponse, Streaming"
},
{
"path": "demo/fastapi_client.py",
"chars": 2589,
"preview": "import requests\nfrom PIL import Image\nimport io\n# Endpoint URLs\nunderstand_image_url = \"http://localhost:8000/understand"
},
{
"path": "generation_inference.py",
"chars": 4515,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "inference.py",
"chars": 2642,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "interactivechat.py",
"chars": 5188,
"preview": "import os\nimport PIL.Image\nimport torch\nimport numpy as np\nfrom transformers import AutoModelForCausalLM\nfrom janus.mode"
},
{
"path": "janus/__init__.py",
"chars": 1458,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/__init__.py",
"chars": 1458,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/__init__.py",
"chars": 1328,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/clip_encoder.py",
"chars": 4457,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/image_processing_vlm.py",
"chars": 6796,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/modeling_vlm.py",
"chars": 8206,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/processing_vlm.py",
"chars": 15715,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/siglip_vit.py",
"chars": 24335,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/janusflow/models/uvit.py",
"chars": 22456,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/__init__.py",
"chars": 1328,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/clip_encoder.py",
"chars": 4447,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/image_processing_vlm.py",
"chars": 6796,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/modeling_vlm.py",
"chars": 8927,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/processing_vlm.py",
"chars": 14045,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/projector.py",
"chars": 3625,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/siglip_vit.py",
"chars": 24088,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/models/vq_model.py",
"chars": 17663,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/utils/__init__.py",
"chars": 1091,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/utils/conversation.py",
"chars": 12278,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "janus/utils/io.py",
"chars": 3169,
"preview": "# Copyright (c) 2023-2024 DeepSeek.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a copy of\n"
},
{
"path": "pyproject.toml",
"chars": 1111,
"preview": "[build-system]\nrequires = [\"setuptools>=40.6.0\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"jan"
},
{
"path": "requirements.txt",
"chars": 278,
"preview": "torch==2.0.1\ntransformers>=4.38.2\ntimm>=0.9.16\naccelerate\nsentencepiece\nattrdict\neinops\n\n# for gradio demo\ngradio==3.48."
}
]
About this extraction
This page contains the full source code of the deepseek-ai/Janus GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 37 files (391.0 KB), approximately 98.8k tokens, and a symbol index with 274 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.