Repository: apple/ml-fastvlm
Branch: main
Commit: 592b4add3c1c
Files: 74
Total size: 506.0 KB
Directory structure:
gitextract_4fmnuh7_/
├── .gitignore
├── ACKNOWLEDGEMENTS
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── LICENSE_MODEL
├── README.md
├── app/
│   ├── Configuration/
│   │   └── Build.xcconfig
│   ├── FastVLM/
│   │   ├── FastVLM.h
│   │   ├── FastVLM.swift
│   │   └── MediaProcessingExtensions.swift
│   ├── FastVLM App/
│   │   ├── Assets.xcassets/
│   │   │   ├── AccentColor.colorset/
│   │   │   │   └── Contents.json
│   │   │   ├── AppIcon.appiconset/
│   │   │   │   └── Contents.json
│   │   │   └── Contents.json
│   │   ├── ContentView.swift
│   │   ├── FastVLM.entitlements
│   │   ├── FastVLMApp.swift
│   │   ├── FastVLMModel.swift
│   │   ├── Info.plist
│   │   ├── InfoView.swift
│   │   └── Preview Content/
│   │       └── Preview Assets.xcassets/
│   │           └── Contents.json
│   ├── FastVLM.xcodeproj/
│   │   ├── project.pbxproj
│   │   └── xcshareddata/
│   │       └── xcschemes/
│   │           └── FastVLM App.xcscheme
│   ├── README.md
│   ├── Video/
│   │   ├── CameraController.swift
│   │   ├── CameraControlsView.swift
│   │   ├── CameraType.swift
│   │   ├── Video.h
│   │   └── VideoFrameView.swift
│   └── get_pretrained_mlx_model.sh
├── get_models.sh
├── llava/
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── mm_utils.py
│   ├── model/
│   │   ├── __init__.py
│   │   ├── apply_delta.py
│   │   ├── builder.py
│   │   ├── consolidate.py
│   │   ├── language_model/
│   │   │   ├── llava_llama.py
│   │   │   ├── llava_mistral.py
│   │   │   ├── llava_mpt.py
│   │   │   └── llava_qwen.py
│   │   ├── llava_arch.py
│   │   ├── make_delta.py
│   │   ├── multimodal_encoder/
│   │   │   ├── builder.py
│   │   │   ├── clip_encoder.py
│   │   │   ├── mobileclip/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── configs/
│   │   │   │   │   └── mobileclip_l.json
│   │   │   │   └── mci.py
│   │   │   └── mobileclip_encoder.py
│   │   ├── multimodal_projector/
│   │   │   └── builder.py
│   │   └── utils.py
│   ├── serve/
│   │   ├── __init__.py
│   │   ├── cli.py
│   │   ├── controller.py
│   │   ├── gradio_web_server.py
│   │   ├── model_worker.py
│   │   ├── register_worker.py
│   │   ├── sglang_worker.py
│   │   └── test_message.py
│   ├── train/
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── llama_xformers_attn_monkey_patch.py
│   │   ├── llava_trainer.py
│   │   ├── train.py
│   │   ├── train_mem.py
│   │   ├── train_qwen.py
│   │   └── train_xformers.py
│   └── utils.py
├── model_export/
│   ├── README.md
│   ├── export_vision_encoder.py
│   └── fastvlm_mlx-vlm.patch
├── predict.py
└── pyproject.toml
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# macOS
**/.DS_Store
# PyCharm project settings
.idea/
# Xcode
*.xcworkspace
# FastVLM models
app/FastVLM/model
================================================
FILE: ACKNOWLEDGEMENTS
================================================
Acknowledgements
Portions of this Software may utilize the following copyrighted
material, the use of which is hereby acknowledged.
---------------------------------------------------------------------------------
LLaVA: Large Language and Vision Assistant (LLaVA)
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---------------------------------------------------------------------------------
FastViT (ml-fastvit)
Copyright (C) 2023 Apple Inc. All Rights Reserved.
IMPORTANT: This Apple software is supplied to you by Apple
Inc. ("Apple") in consideration of your agreement to the following
terms, and your use, installation, modification or redistribution of
this Apple software constitutes acceptance of these terms. If you do
not agree with these terms, please do not use, install, modify or
redistribute this Apple software.
In consideration of your agreement to abide by the following terms, and
subject to these terms, Apple grants you a personal, non-exclusive
license, under Apple's copyrights in this original Apple software (the
"Apple Software"), to use, reproduce, modify and redistribute the Apple
Software, with or without modifications, in source and/or binary forms;
provided that if you redistribute the Apple Software in its entirety and
without modifications, you must retain this notice and the following
text and disclaimers in all such redistributions of the Apple Software.
Neither the name, trademarks, service marks or logos of Apple Inc. may
be used to endorse or promote products derived from the Apple Software
without specific prior written permission from Apple. Except as
expressly stated in this notice, no other rights or licenses, express or
implied, are granted by Apple herein, including but not limited to any
patent rights that may be infringed by your derivative works or by other
works in which the Apple Software may be incorporated.
The Apple Software is provided by Apple on an "AS IS" basis. APPLE
MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND
OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS.
IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION,
MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED
AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE),
STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
---------------------------------------------------------------------------------
mlx-vlm
MIT License
Copyright © 2023 Apple Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
---------------------------------------------------------------------------------
MobileCLIP (ml-mobileclip)
MIT License
Copyright © 2024 Apple Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-------------------------------------------------------------------------------
SOFTWARE DISTRIBUTED WITH ML-MobileCLIP:
The ML-MobileCLIP model weights and data copyright and license terms can be
found in LICENSE_weights_data.
The ML-MobileCLIP software includes a number of subcomponents with separate
copyright notices and license terms - please see the file ACKNOWLEDGEMENTS.
---------------------------------------------------------------------------------
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the open source team at [opensource-conduct@group.apple.com](mailto:opensource-conduct@group.apple.com). All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4,
available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html)
================================================
FILE: CONTRIBUTING.md
================================================
# Contribution Guide
Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository.
While we welcome new pull requests and issues, please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged.
## Before you get started
By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the [LICENSE](LICENSE).
We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md).
================================================
FILE: LICENSE
================================================
Copyright (C) 2025 Apple Inc. All Rights Reserved.
IMPORTANT: This Apple software is supplied to you by Apple
Inc. ("Apple") in consideration of your agreement to the following
terms, and your use, installation, modification or redistribution of
this Apple software constitutes acceptance of these terms. If you do
not agree with these terms, please do not use, install, modify or
redistribute this Apple software.
In consideration of your agreement to abide by the following terms, and
subject to these terms, Apple grants you a personal, non-exclusive
license, under Apple's copyrights in this original Apple software (the
"Apple Software"), to use, reproduce, modify and redistribute the Apple
Software, with or without modifications, in source and/or binary forms;
provided that if you redistribute the Apple Software in its entirety and
without modifications, you must retain this notice and the following
text and disclaimers in all such redistributions of the Apple Software.
Neither the name, trademarks, service marks or logos of Apple Inc. may
be used to endorse or promote products derived from the Apple Software
without specific prior written permission from Apple. Except as
expressly stated in this notice, no other rights or licenses, express or
implied, are granted by Apple herein, including but not limited to any
patent rights that may be infringed by your derivative works or by other
works in which the Apple Software may be incorporated.
The Apple Software is provided by Apple on an "AS IS" basis. APPLE
MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND
OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS.
IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION,
MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED
AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE),
STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
-------------------------------------------------------------------------------
SOFTWARE DISTRIBUTED WITH ML-FASTVLM:
The ml-fastvlm software includes a number of subcomponents with separate
copyright notices and license terms - please see the file ACKNOWLEDGEMENTS.
The ml-fastvlm model weights copyright and license terms can be
found in LICENSE_MODEL file.
-------------------------------------------------------------------------------
================================================
FILE: LICENSE_MODEL
================================================
Disclaimer: IMPORTANT: This Apple Machine Learning Research Model is
specifically developed and released by Apple Inc. ("Apple") for the sole purpose
of scientific research of artificial intelligence and machine-learning
technology. “Apple Machine Learning Research Model” means the model, including
but not limited to algorithms, formulas, trained model weights, parameters,
configurations, checkpoints, and any related materials (including
documentation).
This Apple Machine Learning Research Model is provided to You by
Apple in consideration of your agreement to the following terms, and your use,
modification, creation of Model Derivatives, and or redistribution of the Apple
Machine Learning Research Model constitutes acceptance of this Agreement. If You
do not agree with these terms, please do not use, modify, create Model
Derivatives of, or distribute this Apple Machine Learning Research Model or
Model Derivatives.
* License Scope: In consideration of your agreement to abide by the following
terms, and subject to these terms, Apple hereby grants you a personal,
non-exclusive, worldwide, non-transferable, royalty-free, revocable, and
limited license, to use, copy, modify, distribute, and create Model
Derivatives (defined below) of the Apple Machine Learning Research Model
exclusively for Research Purposes. You agree that any Model Derivatives You
may create or that may be created for You will be limited to Research Purposes
as well. “Research Purposes” means non-commercial scientific research and
academic development activities, such as experimentation, analysis, testing
conducted by You with the sole intent to advance scientific knowledge and
research. “Research Purposes” does not include any commercial exploitation,
product development or use in any commercial product or service.
* Distribution of Apple Machine Learning Research Model and Model Derivatives:
If you choose to redistribute Apple Machine Learning Research Model or its
Model Derivatives, you must provide a copy of this Agreement to such third
party, and ensure that the following attribution notice be provided: “Apple
Machine Learning Research Model is licensed under the Apple Machine Learning
Research Model License Agreement.” Additionally, all Model Derivatives must
clearly be identified as such, including disclosure of modifications and
changes made to the Apple Machine Learning Research Model. The name,
trademarks, service marks or logos of Apple may not be used to endorse or
promote Model Derivatives or the relationship between You and Apple. “Model
Derivatives” means any models or any other artifacts created by modifications,
improvements, adaptations, alterations to the architecture, algorithm or
training processes of the Apple Machine Learning Research Model, or by any
retraining, fine-tuning of the Apple Machine Learning Research Model.
* No Other License: Except as expressly stated in this notice, no other rights
or licenses, express or implied, are granted by Apple herein, including but
not limited to any patent, trademark, and similar intellectual property rights
worldwide that may be infringed by the Apple Machine Learning Research Model,
the Model Derivatives or by other works in which the Apple Machine Learning
Research Model may be incorporated.
* Compliance with Laws: Your use of Apple Machine Learning Research Model must
be in compliance with all applicable laws and regulations.
* Term and Termination: The term of this Agreement will begin upon your
acceptance of this Agreement or use of the Apple Machine Learning Research
Model and will continue until terminated in accordance with the following
terms. Apple may terminate this Agreement at any time if You are in breach of
any term or condition of this Agreement. Upon termination of this Agreement,
You must cease to use all Apple Machine Learning Research Models and Model
Derivatives and permanently delete any copy thereof. Sections 3, 6 and 7 will
survive termination.
* Disclaimer and Limitation of Liability: This Apple Machine Learning Research
Model and any outputs generated by the Apple Machine Learning Research Model
are provided on an “AS IS” basis. APPLE MAKES NO WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE,
REGARDING THE APPLE MACHINE LEARNING RESEARCH MODEL OR OUTPUTS GENERATED BY
THE APPLE MACHINE LEARNING RESEARCH MODEL. You are solely responsible for
determining the appropriateness of using or redistributing the Apple Machine
Learning Research Model and any outputs of the Apple Machine Learning Research
Model and assume any risks associated with Your use of the Apple Machine
Learning Research Model and any output and results. IN NO EVENT SHALL APPLE BE
LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
IN ANY WAY OUT OF THE USE, REPRODUCTION, MODIFICATION AND/OR DISTRIBUTION OF
THE APPLE MACHINE LEARNING RESEARCH MODEL AND ANY OUTPUTS OF THE APPLE MACHINE
LEARNING RESEARCH MODEL, HOWEVER CAUSED AND WHETHER UNDER THEORY OF CONTRACT,
TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS
BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
* Governing Law: This Agreement will be governed by and construed under the laws
of the State of California without regard to its choice of law principles. The
Convention on Contracts for the International Sale of Goods shall not apply to
the Agreement except that the arbitration clause and any arbitration hereunder
shall be governed by the Federal Arbitration Act, Chapters 1 and 2.
Copyright (C) 2025 Apple Inc. All Rights Reserved.
================================================
FILE: README.md
================================================
# FastVLM: Efficient Vision Encoding for Vision Language Models
This is the official repository of
**[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303). (CVPR 2025)**
[//]: # ()
<p align="center">
<img src="docs/acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/>
</p>
### Highlights
* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
* Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and 3.4x smaller vision encoder.
* Our larger variants using Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.
* Demo iOS app to demonstrate the performance of our model on a mobile device.
<table>
<tr>
<td><img src="docs/fastvlm-counting.gif" alt="FastVLM - Counting"></td>
<td><img src="docs/fastvlm-handwriting.gif" alt="FastVLM - Handwriting"></td>
<td><img src="docs/fastvlm-emoji.gif" alt="FastVLM - Emoji"></td>
</tr>
</table>
## Getting Started
We use the LLaVA codebase to train FastVLM variants. To train or fine-tune your own variants,
please follow the instructions provided in the [LLaVA](https://github.com/haotian-liu/LLaVA) codebase.
We provide instructions for running inference with our models.
### Setup
```bash
conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .
```
### Model Zoo
For detailed information on various evaluations, please refer to our [paper](https://www.arxiv.org/abs/2412.13303).
| Model | Stage | PyTorch Checkpoint (URL) |
|:-------------|:-----:|:---------------------------------------------------------------------------------------------------------------:|
| FastVLM-0.5B | 2 | [fastvlm_0.5b_stage2](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage2.zip) |
| | 3 | [fastvlm_0.5b_stage3](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage3.zip) |
| FastVLM-1.5B | 2 | [fastvlm_1.5b_stage2](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage2.zip) |
| | 3 | [fastvlm_1.5b_stage3](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage3.zip) |
| FastVLM-7B | 2 | [fastvlm_7b_stage2](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage2.zip) |
| | 3 | [fastvlm_7b_stage3](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage3.zip) |
To download all the pretrained checkpoints, run the command below. The download may take some time depending on your connection, so it might be a good moment to grab a ☕️ while you wait.
```bash
bash get_models.sh # Files will be downloaded to `checkpoints` directory.
```
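If only one checkpoint is needed, note that the URLs in the Model Zoo table follow a regular naming pattern. The sketch below builds a checkpoint URL from a model size and stage. The helper name `checkpoint_url` and its validation are illustrative assumptions and not part of this repository; `get_models.sh` remains the documented way to fetch the checkpoints.

```python
# Hypothetical helper (not part of this repository): builds a download URL
# that follows the naming pattern of the Model Zoo table above.
BASE_URL = "https://ml-site.cdn-apple.com/datasets/fastvlm"
SIZES = ("0.5b", "1.5b", "7b")  # model sizes listed in the table
STAGES = (2, 3)                 # training stages listed in the table


def checkpoint_url(size: str, stage: int) -> str:
    """Return the PyTorch checkpoint URL for a given model size and stage."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return f"{BASE_URL}/llava-fastvithd_{size}_stage{stage}.zip"


if __name__ == "__main__":
    # Print the URL of the smallest stage-3 checkpoint.
    print(checkpoint_url("0.5b", 3))
```

The resulting URL can be passed to any download tool; the archive still has to be unpacked before use.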
### Usage Example
To run inference with a PyTorch checkpoint, use the command below:
```bash
python predict.py --model-path /path/to/checkpoint-dir \
--image-file /path/to/image.png \
--prompt "Describe the image."
```
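To caption several images, the same command can be assembled programmatically. The following is a minimal sketch that only uses the three flags shown above (`--model-path`, `--image-file`, `--prompt`); the function name `build_predict_command` and the example paths are hypothetical, and the script is assumed to be run from the repository root where `predict.py` lives.

```python
import shlex
import sys


def build_predict_command(model_path, image_file, prompt="Describe the image."):
    """Return the argument list for one predict.py invocation.

    Hypothetical helper: it only mirrors the flags from the usage example above.
    """
    return [
        sys.executable, "predict.py",
        "--model-path", str(model_path),
        "--image-file", str(image_file),
        "--prompt", prompt,
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with a real checkpoint directory and images.
    for image in ["/path/to/image1.png", "/path/to/image2.png"]:
        cmd = build_predict_command("/path/to/checkpoint-dir", image)
        # Print the shell-quoted command instead of executing it.
        print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute the commands instead of printing them. Note that every invocation reloads the model, so this approach favors simplicity over throughput.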
### Inference on Apple Silicon
To run inference on Apple Silicon, PyTorch checkpoints must first be exported to a format
suitable for running on Apple Silicon. Detailed instructions and code can be found in the [`model_export`](model_export/) subfolder.
Please see the README there for more details.
For convenience, we provide 3 models in an Apple Silicon compatible format: [fastvlm_0.5b_stage3](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage3_llm.fp16.zip),
[fastvlm_1.5b_stage3](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage3_llm.int8.zip),
[fastvlm_7b_stage3](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage3_llm.int4.zip).
We encourage developers to export the model of their choice with the appropriate quantization levels following
the instructions in [`model_export`](model_export/).
### Inference on Apple Devices
To run inference on Apple devices like iPhone, iPad or Mac, see [`app`](app/) subfolder for more details.
## Citation
If you found this code useful, please cite the following paper:
```
@InProceedings{fastvlm2025,
author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
}
```
## Acknowledgements
Our codebase is built using multiple open-source contributions; please see [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS) for more details.
## License
Please check out the repository [LICENSE](LICENSE) before using the provided code and
[LICENSE_MODEL](LICENSE_MODEL) for the released models.
================================================
FILE: app/Configuration/Build.xcconfig
================================================
// The `DISAMBIGUATOR` configuration is to make it easier to build
// and run a sample code project. Once you set your project's development team,
// you'll have a unique bundle identifier. This is because the bundle identifier
// is derived based on the 'DISAMBIGUATOR' value. Do not use this
// approach in your own projects—it's only useful for example projects because
// they are frequently downloaded and don't have a development team set.
DISAMBIGUATOR=${DEVELOPMENT_TEAM}
================================================
FILE: app/FastVLM/FastVLM.h
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
#ifndef FastVLM_h
#define FastVLM_h
#endif /* FastVLM_h */
================================================
FILE: app/FastVLM/FastVLM.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import CoreImage
import CoreML
import Foundation
import MLX
import MLXFast
import MLXLMCommon
import MLXNN
import MLXVLM
import Tokenizers
// FastVLM is Qwen2VL with a custom vision tower.
// MARK: - Common
/// Rotates half the hidden dims of the input
private func rotateHalf(_ x: MLXArray) -> MLXArray {
let index = x.dim(-1) / 2
let x1 = x[.ellipsis, 0 ..< index]
let x2 = x[.ellipsis, index...]
return concatenated([-x2, x1], axis: -1)
}
// MARK: - Language
private enum Language {
/// Applies Rotary Position Embedding with Multimodal Sections to the query and key tensors
static private func applyMultimodalRotaryPositionEmbedding(
q: MLXArray, k: MLXArray, cos: MLXArray, sin: MLXArray,
positionIds: MLXArray, mropeSection: [Int]
) -> (MLXArray, MLXArray) {
var cos = cos[positionIds]
var sin = sin[positionIds]
cos =
concatenated(
// [m[i % 3] for i, m in enumerate(mx.split(cos, mrope_section, axis=-1))]
split(cos, indices: mropeSection, axis: -1).enumerated().map { i, m in m[i % 3] },
axis: -1
)[0..., .newAxis, 0..., 0...]
sin =
concatenated(
split(sin, indices: mropeSection, axis: -1).enumerated().map { i, m in m[i % 3] },
axis: -1
)[0..., .newAxis, 0..., 0...]
// Apply rotary embedding
let qEmbed = (q * cos) + (rotateHalf(q) * sin)
let kEmbed = (k * cos) + (rotateHalf(k) * sin)
return (qEmbed, kEmbed)
}
fileprivate class Attention: Module {
let heads: Int
let kvHeads: Int
let headDim: Int
let scale: Float
let mropeSection: [Int]
@ModuleInfo(key: "q_proj") var wq: Linear
@ModuleInfo(key: "k_proj") var wk: Linear
@ModuleInfo(key: "v_proj") var wv: Linear
@ModuleInfo(key: "o_proj") var wo: Linear
@ModuleInfo(key: "rotary_emb") var rotaryEmbedding: RoPE
public init(_ args: FastVLMConfiguration.TextConfiguration) {
let dim = args.hiddenSize
self.heads = args.attentionHeads
self.kvHeads = args.kvHeads
self.headDim = dim / heads
self.scale = pow(Float(headDim), -0.5)
self._wq.wrappedValue = Linear(dim, heads * headDim, bias: true)
self._wk.wrappedValue = Linear(dim, kvHeads * headDim, bias: true)
self._wv.wrappedValue = Linear(dim, kvHeads * headDim, bias: true)
self._wo.wrappedValue = Linear(heads * headDim, dim, bias: false)
if let v = args.ropeScaling?["mrope_section"], let array = v.asInts() {
// mrope_section = np.cumsum(mrope_section * 2)[:-1].tolist()
self.mropeSection = sequence(state: (0, array.makeIterator())) { state in
if let v = state.1.next() {
// note the *2
state.0 += v * 2
return state.0
} else {
return nil
}
}.dropLast()
} else {
fatalError("rope_scaling['mrope_section'] must be an array of integers")
}
self._rotaryEmbedding.wrappedValue = RoPE(
dimensions: headDim, traditional: args.ropeTraditional, base: args.ropeTheta)
}
public func callAsFunction(
_ x: MLXArray, mask: MLXArray? = nil, cache: KVCache?
) -> MLXArray {
let (B, L) = (x.dim(0), x.dim(1))
var queries = wq(x)
var keys = wk(x)
var values = wv(x)
// prepare the queries, keys and values for the attention computation
queries = queries.reshaped(B, L, heads, headDim).transposed(0, 2, 1, 3)
keys = keys.reshaped(B, L, kvHeads, headDim).transposed(0, 2, 1, 3)
values = values.reshaped(B, L, kvHeads, headDim).transposed(0, 2, 1, 3)
let offset = cache?.offset ?? 0
let mask = mask?[0..., 0 ..< keys.dim(-2)]
queries = rotaryEmbedding(queries, offset: offset)
keys = rotaryEmbedding(keys, offset: offset)
if let cache {
(keys, values) = cache.update(keys: keys, values: values)
}
let output = MLXFast.scaledDotProductAttention(
queries: queries, keys: keys, values: values, scale: scale, mask: mask
)
.transposed(0, 2, 1, 3)
.reshaped(B, L, -1)
return wo(output)
}
}
fileprivate class MLP: Module, UnaryLayer {
@ModuleInfo(key: "gate_proj") var gate: Linear
@ModuleInfo(key: "down_proj") var down: Linear
@ModuleInfo(key: "up_proj") var up: Linear
public init(dimensions: Int, hiddenDimensions: Int) {
self._gate.wrappedValue = Linear(dimensions, hiddenDimensions, bias: false)
self._down.wrappedValue = Linear(hiddenDimensions, dimensions, bias: false)
self._up.wrappedValue = Linear(dimensions, hiddenDimensions, bias: false)
}
public func callAsFunction(_ x: MLXArray) -> MLXArray {
down(silu(gate(x)) * up(x))
}
}
fileprivate class FastVLMDecoderLayer: Module {
@ModuleInfo(key: "self_attn") var attention: Attention
let mlp: MLP
@ModuleInfo(key: "input_layernorm") var inputLayerNorm: RMSNorm
@ModuleInfo(key: "post_attention_layernorm") var postAttentionLayerNorm: RMSNorm
public init(_ args: FastVLMConfiguration.TextConfiguration) {
self._attention.wrappedValue = Attention(args)
self.mlp = MLP(dimensions: args.hiddenSize, hiddenDimensions: args.intermediateSize)
self._inputLayerNorm.wrappedValue = RMSNorm(
dimensions: args.hiddenSize, eps: args.rmsNormEps)
self._postAttentionLayerNorm.wrappedValue = RMSNorm(
dimensions: args.hiddenSize, eps: args.rmsNormEps)
}
public func callAsFunction(
_ x: MLXArray, mask: MLXArray? = nil, cache: KVCache?
) -> MLXArray {
var r = attention(inputLayerNorm(x), mask: mask, cache: cache)
let h = x + r
r = mlp(postAttentionLayerNorm(h))
let out = h + r
return out
}
}
fileprivate class Qwen2Model: Module {
@ModuleInfo(key: "embed_tokens") var embedTokens: Embedding
fileprivate let layers: [FastVLMDecoderLayer]
fileprivate let norm: RMSNorm
public init(_ args: FastVLMConfiguration.TextConfiguration) {
precondition(args.vocabularySize > 0)
self._embedTokens.wrappedValue = Embedding(
embeddingCount: args.vocabularySize, dimensions: args.hiddenSize)
self.layers = (0 ..< args.hiddenLayers)
.map { _ in
FastVLMDecoderLayer(args)
}
self.norm = RMSNorm(dimensions: args.hiddenSize, eps: args.rmsNormEps)
}
public func callAsFunction(
_ inputs: MLXArray?, cache: [KVCache]? = nil, inputEmbedding: MLXArray? = nil
) -> MLXArray {
var h: MLXArray
if let inputEmbedding {
h = inputEmbedding
} else if let inputs {
h = embedTokens(inputs)
} else {
fatalError("one of inputs or inputEmbedding must be non-nil")
}
let mask = createAttentionMask(h: h, cache: cache)
for (i, layer) in layers.enumerated() {
h = layer(h, mask: mask, cache: cache?[i])
}
return norm(h)
}
}
fileprivate class LanguageModel: Module, KVCacheDimensionProvider {
@ModuleInfo var model: Qwen2Model
@ModuleInfo(key: "lm_head") var lmHead: Linear?
var kvHeads: [Int]
public init(_ args: FastVLMConfiguration.TextConfiguration) {
self.model = Qwen2Model(args)
if !args.tieWordEmbeddings {
_lmHead.wrappedValue = Linear(args.hiddenSize, args.vocabularySize, bias: false)
}
self.kvHeads = (0 ..< args.hiddenLayers).map { _ in args.kvHeads }
}
public func callAsFunction(
_ inputs: MLXArray?, cache: [KVCache]? = nil, inputEmbedding: MLXArray? = nil
) -> LMOutput {
var out = model(inputs, cache: cache, inputEmbedding: inputEmbedding)
if let lmHead {
out = lmHead(out)
} else {
out = model.embedTokens.asLinear(out)
}
return LMOutput(logits: out)
}
}
}
// MARK: - Vision
private enum Vision {
fileprivate class VisionModelCoreML {
let lock = NSLock()
var _model: fastvithd?
init() {
}
func load() throws -> fastvithd {
try lock.withLock {
if let model = _model { return model }
let model = try fastvithd()
_model = model
return model
}
}
public func model() -> fastvithd {
try! load()
}
public func encode(_ image: MLXArray) -> MLXArray {
// MLMultiArray requires mutable input data
var (data, strides) = {
let arrayData = image.asType(.float32).asData(access: .noCopyIfContiguous)
return (arrayData.data, arrayData.strides)
}()
precondition(image.ndim == 4)
precondition(image.dim(0) == 1)
precondition(image.dim(1) == 3)
let h = NSNumber(value: image.dim(2))
let w = NSNumber(value: image.dim(3))
return data.withUnsafeMutableBytes { (ptr: UnsafeMutableRawBufferPointer) in
// wrap the backing of the MLXArray
let array = try! MLMultiArray(
dataPointer: ptr.baseAddress!, shape: [1, 3, h, w], dataType: .float32,
strides: strides.map { .init(value: $0) })
// inference
let output = try! model().prediction(images: array)
precondition(output.image_features.shape == [1, 256, 3072])
precondition(output.image_features.dataType == .float32)
return output.image_features.withUnsafeBytes { ptr in
MLXArray(ptr, [1, 256, 3072], type: Float32.self)
}
}
}
}
fileprivate class VisionModel: Module {
let model = VisionModelCoreML()
public override init() {}
public func callAsFunction(_ hiddenStates: MLXArray, gridThw: [THW]) -> MLXArray {
model.encode(hiddenStates)
}
}
}
// MARK: - Processor
/// FastVLM `UserInputProcessor`.
///
/// This is meant to be used with ``FastVLM`` and is typically created by ``VLMModelFactory``.
public class FastVLMProcessor: UserInputProcessor {
private let config: FastVLMProcessorConfiguration
private let imageProcessingConfig: FastVLMPreProcessorConfiguration
private let tokenizer: any Tokenizer
public init(_ config: FastVLMPreProcessorConfiguration, tokenizer: any Tokenizer) {
self.config = FastVLMProcessorConfiguration()
self.imageProcessingConfig = config
self.tokenizer = tokenizer
}
public func preprocess(image: CIImage, processing: UserInput.Processing?) throws -> (
MLXArray, THW
) {
// first apply the user requested resizing, etc. if any
var image = MediaProcessingExtensions.apply(image, processing: processing)
// image_processing_clip.py
let size = MediaProcessingExtensions.fitIn(
image.extent.size, shortestEdge: imageProcessingConfig.size.shortestEdge)
image = MediaProcessingExtensions.resampleBicubic(image, to: size)
image = MediaProcessingExtensions.centerCrop(
image, size: imageProcessingConfig.cropSize.size)
image = MediaProcessing.normalize(
image, mean: imageProcessingConfig.imageMeanTuple,
std: imageProcessingConfig.imageStdTuple)
let array = MediaProcessingExtensions.asPlanarMLXArray(image)
return (array, .init(0, array.dim(2), array.dim(3)))
}
public func prepare(prompt: UserInput.Prompt, imageTHW: THW?) -> String {
var messages = prompt.asMessages()
if messages[0]["role"] != "system" {
messages.insert(["role": "system", "content": "You are a helpful assistant."], at: 0)
}
let lastIndex = messages.count - 1
var lastMessage = messages[lastIndex]["content"] ?? ""
// processing_llava.py
if let imageTHW {
let height = imageTHW.h
let width = imageTHW.w
let patchSize = config.patchSize
var numImageTokens =
(height / patchSize) * (width / patchSize) + config.numAdditionalImageTokens
if config.visionFeatureSelectStrategy == .default {
numImageTokens -= 1
}
lastMessage += Array(repeating: config.imageToken, count: numImageTokens)
.joined()
}
messages[lastIndex]["content"] = lastMessage
return
messages
.map {
"<|im_start|>\($0["role"] ?? "user")\n\($0["content"] ?? "")<|im_end|>"
}
.joined(separator: "\n")
+ "\n<|im_start|>assistant\n"
}
public func prepare(input: UserInput) throws -> LMInput {
if input.images.isEmpty {
// just a straight text prompt
let prompt = prepare(prompt: input.prompt, imageTHW: nil)
let promptTokens = tokenizer.encode(text: prompt)
return LMInput(tokens: MLXArray(promptTokens))
}
if input.images.count > 1 {
throw VLMError.singleImageAllowed
}
let (pixels, thw) = try preprocess(
image: input.images[0].asCIImage(), processing: input.processing)
let image = LMInput.ProcessedImage(pixels: pixels, imageGridThw: [thw])
let prompt = prepare(prompt: input.prompt, imageTHW: thw)
let promptTokens = tokenizer.encode(text: prompt)
let promptArray = MLXArray(promptTokens).expandedDimensions(axis: 0)
let mask = ones(like: promptArray).asType(.int8)
return LMInput(text: .init(tokens: promptArray, mask: mask), image: image)
}
}
// MARK: - Model
private class FastVLMMultiModalProjector: Module, UnaryLayer {
@ModuleInfo(key: "linear_0") var linear0: Linear
@ModuleInfo(key: "gelu") var gelu: GELU
@ModuleInfo(key: "linear_2") var linear2: Linear
public init(_ config: FastVLMConfiguration) {
self._linear0.wrappedValue = Linear(
config.visionConfiguration.hiddenSize,
config.textConfiguration.hiddenSize,
bias: true)
self._gelu.wrappedValue = GELU()
self._linear2.wrappedValue = Linear(
config.textConfiguration.hiddenSize,
config.textConfiguration.hiddenSize,
bias: true)
}
public func callAsFunction(_ x: MLXArray) -> MLXArray {
var x = linear0(x)
x = gelu(x)
x = linear2(x)
return x
}
}
/// FastVLM
///
/// This is typically created by ``VLMModelFactory``.
public class FastVLM: Module, VLMModel, KVCacheDimensionProvider {
static public var modelConfiguration: ModelConfiguration {
let bundle = Bundle(for: FastVLM.self)
let url = bundle.url(forResource: "config", withExtension: "json")!
.resolvingSymlinksInPath()
.deletingLastPathComponent()
return ModelConfiguration(directory: url)
}
static public func register(modelFactory: VLMModelFactory) {
modelFactory.typeRegistry.registerModelType("llava_qwen2") { url in
let configuration = try JSONDecoder().decode(
FastVLMConfiguration.self, from: Data(contentsOf: url))
return FastVLM(configuration)
}
modelFactory.processorRegistry.registerProcessorType("LlavaProcessor") { url, tokenizer in
let configuration = try JSONDecoder().decode(
FastVLMPreProcessorConfiguration.self, from: Data(contentsOf: url))
return FastVLMProcessor(configuration, tokenizer: tokenizer)
}
}
@ModuleInfo(key: "vision_tower") private var visionModel: Vision.VisionModel
@ModuleInfo(key: "language_model") private var languageModel: Language.LanguageModel
@ModuleInfo(key: "multi_modal_projector") private var multiModalProjector:
FastVLMMultiModalProjector
public let config: FastVLMConfiguration
public var vocabularySize: Int { config.baseConfiguration.vocabularySize }
public var kvHeads: [Int] { languageModel.kvHeads }
public func loraLinearLayers() -> MLXLMCommon.LoRALinearLayers {
languageModel.model.layers.map { ($0.attention, ["q_proj", "v_proj"]) }
}
public init(_ config: FastVLMConfiguration) {
self.config = config
self._visionModel.wrappedValue = Vision.VisionModel()
self._languageModel.wrappedValue = Language.LanguageModel(config.textConfiguration)
self._multiModalProjector.wrappedValue = FastVLMMultiModalProjector(config)
}
private func inputEmbeddings(inputIds: MLXArray, pixelValues: MLXArray?, gridThw: [THW]?)
-> MLXArray
{
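guard let pixelValues, let gridThw else {
// Text-only input: return token embeddings (not logits), adding a batch axis if needed
let ids = inputIds.ndim == 1 ? inputIds[.newAxis, .ellipsis] : inputIds
return languageModel.model.embedTokens(ids)
}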
// Get the input embeddings from the language model
let inputEmbeds = languageModel.model.embedTokens(inputIds)
// Get the output hidden states from the vision model
let imageFeaturesCoreML = self.visionModel(pixelValues, gridThw: gridThw)
let imageFeatures = multiModalProjector(imageFeaturesCoreML)
// Insert special image tokens in the input_ids
return mergeInputIdsWithImageFeatures(
inputIds: inputIds, inputEmbeds: inputEmbeds, imageFeatures: imageFeatures)
}
private func mergeInputIdsWithImageFeatures(
inputIds: MLXArray, inputEmbeds: MLXArray, imageFeatures: MLXArray
) -> MLXArray {
let imageTokenIndex = config.baseConfiguration.imageTokenId
var imageIndices = [Int]()
for (i, v) in inputIds.asArray(Int.self).enumerated() {
if v == imageTokenIndex {
imageIndices.append(i)
}
}
inputEmbeds[0..., MLXArray(imageIndices), 0...] = imageFeatures
return inputEmbeds
}
public func prepare(_ input: LMInput, cache: [any KVCache], windowSize: Int?) throws
-> PrepareResult
{
let gridThw = input.image?.imageGridThw
let dtype = DType.float32
let pixels = input.image?.pixels.asType(dtype)
let inputEmbeddings = self.inputEmbeddings(
inputIds: input.text.tokens, pixelValues: pixels, gridThw: gridThw)
let result = languageModel(nil, cache: cache, inputEmbedding: inputEmbeddings)
return .logits(result)
}
public func callAsFunction(_ inputs: MLXArray, cache: [any KVCache]?) -> MLXArray {
languageModel(inputs, cache: cache).logits
}
public func sanitize(weights: [String: MLXArray]) -> [String: MLXArray] {
_ = try? visionModel.model.load()
return weights
}
}
// MARK: - Configuration
/// Configuration for ``FastVLM``
public struct FastVLMConfiguration: Codable, Sendable {
public struct VisionConfiguration: Codable, Sendable {
public let hiddenSize: Int
enum CodingKeys: String, CodingKey {
case hiddenSize = "mm_hidden_size"
}
}
public struct TextConfiguration: Codable, Sendable {
public let modelType: String
public let hiddenSize: Int
public let hiddenLayers: Int
public let intermediateSize: Int
public let attentionHeads: Int
private let _rmsNormEps: Float?
public var rmsNormEps: Float { _rmsNormEps ?? 1e-6 }
public let vocabularySize: Int
public let kvHeads: Int
private let _maxPositionEmbeddings: Int?
public var maxpPositionEmbeddings: Int { _maxPositionEmbeddings ?? 32768 }
private let _ropeTheta: Float?
public var ropeTheta: Float { _ropeTheta ?? 1_000_000 }
private let _ropeTraditional: Bool?
public var ropeTraditional: Bool { _ropeTraditional ?? false }
public let _ropeScaling: [String: StringOrNumber]?
public var ropeScaling: [String: StringOrNumber]? {
_ropeScaling ?? ["mrope_section": .ints([2, 1, 1])]
}
private let _tieWordEmbeddings: Bool?
public var tieWordEmbeddings: Bool { _tieWordEmbeddings ?? true }
enum CodingKeys: String, CodingKey {
case modelType = "model_type"
case hiddenSize = "hidden_size"
case hiddenLayers = "num_hidden_layers"
case intermediateSize = "intermediate_size"
case attentionHeads = "num_attention_heads"
case _rmsNormEps = "rms_norm_eps"
case vocabularySize = "vocab_size"
case kvHeads = "num_key_value_heads"
case _maxPositionEmbeddings = "max_position_embeddings"
case _ropeTheta = "rope_theta"
case _ropeTraditional = "rope_traditional"
case _ropeScaling = "rope_scaling"
case _tieWordEmbeddings = "tie_word_embeddings"
}
}
public struct BaseConfiguration: Codable, Sendable {
public let modelType: String
public let vocabularySize: Int
public let imageTokenId: Int
public let hiddenSize: Int
enum CodingKeys: String, CodingKey {
case modelType = "model_type"
case vocabularySize = "vocab_size"
case imageTokenId = "image_token_index"
case hiddenSize = "hidden_size"
}
}
public let visionConfiguration: VisionConfiguration
public let textConfiguration: TextConfiguration
public let baseConfiguration: BaseConfiguration
public init(from decoder: any Swift.Decoder) throws {
// these are overlaid in the top level
self.visionConfiguration = try VisionConfiguration(from: decoder)
self.textConfiguration = try TextConfiguration(from: decoder)
self.baseConfiguration = try BaseConfiguration(from: decoder)
}
}
/// Configuration for ``FastVLMProcessor``
public struct FastVLMPreProcessorConfiguration: Codable, Sendable {
public struct CropSize: Codable, Sendable {
let width: Int
let height: Int
var size: CGSize { .init(width: CGFloat(width), height: CGFloat(height)) }
}
public struct Size: Codable, Sendable {
let shortestEdge: Int
enum CodingKeys: String, CodingKey {
case shortestEdge = "shortest_edge"
}
}
public var imageMean: [CGFloat]
public var imageStd: [CGFloat]
public var size: Size
public var cropSize: CropSize
public var imageMeanTuple: (CGFloat, CGFloat, CGFloat) {
(imageMean[0], imageMean[1], imageMean[2])
}
public var imageStdTuple: (CGFloat, CGFloat, CGFloat) {
(imageStd[0], imageStd[1], imageStd[2])
}
enum CodingKeys: String, CodingKey {
case imageMean = "image_mean"
case imageStd = "image_std"
case size
case cropSize = "crop_size"
}
}
public struct FastVLMProcessorConfiguration: Codable, Sendable {
public enum Strategy: Codable, Sendable {
case `default`
}
public var imageToken = "<image>"
public var numAdditionalImageTokens = 0
public var patchSize = 64
public var visionFeatureSelectStrategy: Strategy?
}
================================================
FILE: app/FastVLM/MediaProcessingExtensions.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import Accelerate
import CoreImage
import MLX
import MLXLMCommon
import MLXVLM
/// Additions to MediaProcessing -- not currently present in mlx-libraries
enum MediaProcessingExtensions {
// this function is not exported in current mlx-swift-examples -- local copy until it is exposed
// properly
public static func apply(_ image: CIImage, processing: UserInput.Processing?) -> CIImage {
var image = image
if let resize = processing?.resize {
let scale = MediaProcessing.bestFitScale(image.extent.size, in: resize)
image = image.transformed(by: CGAffineTransform(scaleX: scale, y: scale))
}
return image
}
public static func rectSmallerOrEqual(_ extent: CGRect, size: CGSize) -> Bool {
return extent.width <= size.width && extent.height <= size.height
}
public static func centerCrop(_ extent: CGRect, size: CGSize) -> CGRect {
let targetWidth = min(extent.width, size.width)
let targetHeight = min(extent.height, size.height)
return CGRect(
x: (extent.maxX - targetWidth) / 2,
y: (extent.maxY - targetHeight) / 2,
width: targetWidth, height: targetHeight
)
}
public static func centerCrop(_ image: CIImage, size: CGSize) -> CIImage {
let extent = image.extent
if rectSmallerOrEqual(extent, size: size) {
return image
}
let crop = centerCrop(extent, size: size)
return
image
.cropped(to: crop)
.transformed(by: CGAffineTransform(translationX: -crop.minX, y: -crop.minY))
}
public static func fitIn(_ size: CGSize, shortestEdge: Int) -> CGSize {
let floatShortestEdge = CGFloat(shortestEdge)
let (short, long) =
size.width <= size.height ? (size.width, size.height) : (size.height, size.width)
let newShort = floatShortestEdge
let newLong = floatShortestEdge * long / short
return size.width <= size.height
? CGSize(width: newShort, height: newLong) : CGSize(width: newLong, height: newShort)
}
public static func fitIn(_ size: CGSize, longestEdge: Int) -> CGSize {
let floatLongestEdge = CGFloat(longestEdge)
var (newShort, newLong) =
size.width <= size.height ? (size.width, size.height) : (size.height, size.width)
if newLong > floatLongestEdge {
// scale the short edge first, while newLong still holds the original length
newShort = floatLongestEdge * newShort / newLong
newLong = floatLongestEdge
}
return size.width <= size.height
? CGSize(width: newShort, height: newLong) : CGSize(width: newLong, height: newShort)
}
// version of function from https://github.com/ml-explore/mlx-swift-examples/pull/222
public static func resampleBicubic(_ image: CIImage, to size: CGSize) -> CIImage {
// Create a bicubic scale filter
let yScale = size.height / image.extent.height
let xScale = size.width / image.extent.width
let filter = CIFilter.bicubicScaleTransform()
filter.inputImage = image
filter.scale = Float(yScale)
filter.aspectRatio = Float(xScale / yScale)
let scaledImage = filter.outputImage!
// Create a rect with the exact dimensions we want
let exactRect = CGRect(
x: 0,
y: 0,
width: size.width,
height: size.height
)
// Crop to ensure exact dimensions
return scaledImage.cropped(to: exactRect)
}
static let context = CIContext()
/// Convert the CIImage into a planar 3 channel MLXArray `[1, C, H, W]`.
///
/// This physically moves the channels into a planar configuration. The planar
/// layout is required for feeding into the CoreML model, and using dedicated
/// vImage functions is faster than transforming into contiguous memory
/// on readout.
static public func asPlanarMLXArray(_ image: CIImage, colorSpace: CGColorSpace? = nil)
-> MLXArray
{
let size = image.extent.size
let w = Int(size.width.rounded())
let h = Int(size.height.rounded())
// probably not strictly necessary, but this is what happens in
// e.g. image_processing_siglip in transformers (float32)
let format = CIFormat.RGBAf
let componentsPerPixel = 4
let bytesPerComponent: Int = MemoryLayout<Float32>.size
let bytesPerPixel = componentsPerPixel * bytesPerComponent
let bytesPerRow = w * bytesPerPixel
var data = Data(count: w * h * bytesPerPixel)
var planarData = Data(count: 3 * w * h * bytesPerComponent)
data.withUnsafeMutableBytes { ptr in
context.render(
image, toBitmap: ptr.baseAddress!, rowBytes: bytesPerRow, bounds: image.extent,
format: format, colorSpace: colorSpace)
context.clearCaches()
let vh = vImagePixelCount(h)
let vw = vImagePixelCount(w)
// convert from RGBAf -> RGBf in place
let rgbBytesPerRow = w * 3 * bytesPerComponent
var rgbaSrc = vImage_Buffer(
data: ptr.baseAddress!, height: vh, width: vw, rowBytes: bytesPerRow)
var rgbDest = vImage_Buffer(
data: ptr.baseAddress!, height: vh, width: vw, rowBytes: rgbBytesPerRow)
vImageConvert_RGBAFFFFtoRGBFFF(&rgbaSrc, &rgbDest, vImage_Flags(kvImageNoFlags))
// and convert to planar data in a second buffer
planarData.withUnsafeMutableBytes { planarPtr in
let planeBytesPerRow = w * bytesPerComponent
var rDest = vImage_Buffer(
data: planarPtr.baseAddress!.advanced(by: 0 * planeBytesPerRow * h), height: vh,
width: vw, rowBytes: planeBytesPerRow)
var gDest = vImage_Buffer(
data: planarPtr.baseAddress!.advanced(by: 1 * planeBytesPerRow * h), height: vh,
width: vw, rowBytes: planeBytesPerRow)
var bDest = vImage_Buffer(
data: planarPtr.baseAddress!.advanced(by: 2 * planeBytesPerRow * h), height: vh,
width: vw, rowBytes: planeBytesPerRow)
vImageConvert_RGBFFFtoPlanarF(
&rgbDest, &rDest, &gDest, &bDest, vImage_Flags(kvImageNoFlags))
}
}
return MLXArray(planarData, [1, 3, h, w], type: Float32.self)
}
}
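// MARK: - Illustrative sketch (not part of the original source)
//
// asPlanarMLXArray above renders RGBA float pixels, drops the alpha channel,
// and then splits the interleaved RGB values into three consecutive planes
// (all R, then all G, then all B). That is the [1, 3, h, w] layout handed to
// MLXArray. The helper below is a hypothetical, dependency-free restatement of
// that index arithmetic on a plain [Float] array. The name `sketchPlanarRGB`
// is an assumption made for illustration; this is a sketch for reasoning about
// the memory layout, not a description of how vImage converts internally.
func sketchPlanarRGB(fromInterleavedRGBA rgba: [Float], width: Int, height: Int) -> [Float] {
    let pixelCount = width * height
    precondition(rgba.count == pixelCount * 4, "expected 4 components per pixel")
    var planar = [Float](repeating: 0, count: pixelCount * 3)
    for pixel in 0 ..< pixelCount {
        // copy R, G and B; the alpha component at offset 3 is skipped
        for channel in 0 ..< 3 {
            planar[channel * pixelCount + pixel] = rgba[pixel * 4 + channel]
        }
    }
    return planar
}
// Usage: a 2x1 image with pixels (1, 2, 3, alpha) and (4, 5, 6, alpha)
//   sketchPlanarRGB(fromInterleavedRGBA: [1, 2, 3, 9, 4, 5, 6, 9], width: 2, height: 1)
// returns [1, 4, 2, 5, 3, 6]: the R plane, then the G plane, then the B plane.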
================================================
FILE: app/FastVLM App/Assets.xcassets/AccentColor.colorset/Contents.json
================================================
{
"colors" : [
{
"idiom" : "universal"
}
],
"info" : {
"author" : "xcode",
"version" : 1
}
}
================================================
FILE: app/FastVLM App/Assets.xcassets/AppIcon.appiconset/Contents.json
================================================
{
"images" : [
{
"filename" : "FastVLM - 150 Blue - Light@2x.png",
"idiom" : "universal",
"platform" : "ios",
"size" : "1024x1024"
},
{
"appearances" : [
{
"appearance" : "luminosity",
"value" : "dark"
}
],
"filename" : "FastVLM - 150 Blue - Dark@2x.png",
"idiom" : "universal",
"platform" : "ios",
"size" : "1024x1024"
},
{
"appearances" : [
{
"appearance" : "luminosity",
"value" : "tinted"
}
],
"filename" : "FastVLM - 150 White - Tinted@2x.png",
"idiom" : "universal",
"platform" : "ios",
"size" : "1024x1024"
},
{
"idiom" : "mac",
"scale" : "1x",
"size" : "16x16"
},
{
"idiom" : "mac",
"scale" : "2x",
"size" : "16x16"
},
{
"idiom" : "mac",
"scale" : "1x",
"size" : "32x32"
},
{
"idiom" : "mac",
"scale" : "2x",
"size" : "32x32"
},
{
"idiom" : "mac",
"scale" : "1x",
"size" : "128x128"
},
{
"idiom" : "mac",
"scale" : "2x",
"size" : "128x128"
},
{
"idiom" : "mac",
"scale" : "1x",
"size" : "256x256"
},
{
"idiom" : "mac",
"scale" : "2x",
"size" : "256x256"
},
{
"filename" : "FastVLM - MacOS - Dark@1x.png",
"idiom" : "mac",
"scale" : "1x",
"size" : "512x512"
},
{
"filename" : "FastVLM - MacOS - Dark@2x.png",
"idiom" : "mac",
"scale" : "2x",
"size" : "512x512"
}
],
"info" : {
"author" : "xcode",
"version" : 1
}
}
================================================
FILE: app/FastVLM App/Assets.xcassets/Contents.json
================================================
{
"info" : {
"author" : "xcode",
"version" : 1
}
}
================================================
FILE: app/FastVLM App/ContentView.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import AVFoundation
import MLXLMCommon
import SwiftUI
import Video
// support swift 6
extension CVImageBuffer: @unchecked @retroactive Sendable {}
extension CMSampleBuffer: @unchecked @retroactive Sendable {}
// delay between frames -- controls the frame rate of the updates
let FRAME_DELAY = Duration.milliseconds(1)
struct ContentView: View {
@State private var camera = CameraController()
@State private var model = FastVLMModel()
/// stream of frames -> VideoFrameView, see distributeVideoFrames
@State private var framesToDisplay: AsyncStream<CVImageBuffer>?
@State private var prompt = "Describe the image in English."
@State private var promptSuffix = "Output should be brief, about 15 words or less."
@State private var isShowingInfo: Bool = false
@State private var selectedCameraType: CameraType = .continuous
@State private var isEditingPrompt: Bool = false
var toolbarItemPlacement: ToolbarItemPlacement {
var placement: ToolbarItemPlacement = .navigation
#if os(iOS)
placement = .topBarLeading
#endif
return placement
}
var statusTextColor : Color {
return model.evaluationState == .processingPrompt ? .black : .white
}
var statusBackgroundColor : Color {
switch model.evaluationState {
case .idle:
return .gray
case .generatingResponse:
return .green
case .processingPrompt:
return .yellow
}
}
var body: some View {
NavigationStack {
Form {
Section {
VStack(alignment: .leading, spacing: 10.0) {
Picker("Camera Type", selection: $selectedCameraType) {
ForEach(CameraType.allCases, id: \.self) { cameraType in
Text(cameraType.rawValue.capitalized).tag(cameraType)
}
}
// Prevent macOS from adding a text label for the picker
.labelsHidden()
.pickerStyle(.segmented)
.onChange(of: selectedCameraType) { _, _ in
// Cancel any in-flight requests when switching modes
model.cancel()
}
if let framesToDisplay {
VideoFrameView(
frames: framesToDisplay,
cameraType: selectedCameraType,
action: { frame in
processSingleFrame(frame)
})
// Because we're using the AVCaptureSession preset
// `.vga640x480`, we can assume this aspect ratio
.aspectRatio(4/3, contentMode: .fit)
#if os(macOS)
.frame(maxWidth: 750)
#endif
.overlay(alignment: .top) {
if !model.promptTime.isEmpty {
Text("TTFT \(model.promptTime)")
.font(.caption)
.foregroundStyle(.white)
.monospaced()
.padding(.vertical, 4.0)
.padding(.horizontal, 6.0)
.background(alignment: .center) {
RoundedRectangle(cornerRadius: 8)
.fill(Color.black.opacity(0.6))
}
.padding(.top)
}
}
#if !os(macOS)
.overlay(alignment: .topTrailing) {
CameraControlsView(
backCamera: $camera.backCamera,
device: $camera.device,
devices: $camera.devices)
.padding()
}
#endif
.overlay(alignment: .bottom) {
if selectedCameraType == .continuous {
Group {
if model.evaluationState == .processingPrompt {
HStack {
ProgressView()
.tint(self.statusTextColor)
.controlSize(.small)
Text(model.evaluationState.rawValue)
}
} else if model.evaluationState == .idle {
HStack(spacing: 6.0) {
Image(systemName: "clock.fill")
.font(.caption)
Text(model.evaluationState.rawValue)
}
}
else {
// I'm manually tweaking the spacing to
// better match the spacing with ProgressView
HStack(spacing: 6.0) {
Image(systemName: "lightbulb.fill")
.font(.caption)
Text(model.evaluationState.rawValue)
}
}
}
.foregroundStyle(self.statusTextColor)
.font(.caption)
.bold()
.padding(.vertical, 6.0)
.padding(.horizontal, 8.0)
.background(self.statusBackgroundColor)
.clipShape(.capsule)
.padding(.bottom)
}
}
#if os(macOS)
.frame(maxWidth: .infinity)
.frame(minWidth: 500)
.frame(minHeight: 375)
#endif
}
}
}
.listRowInsets(EdgeInsets())
.listRowBackground(Color.clear)
.listRowSeparator(.hidden)
promptSections
Section {
if model.output.isEmpty && model.running {
ProgressView()
.controlSize(.large)
.frame(maxWidth: .infinity)
} else {
ScrollView {
Text(model.output)
.foregroundStyle(isEditingPrompt ? .secondary : .primary)
.textSelection(.enabled)
#if os(macOS)
.font(.headline)
.fontWeight(.regular)
#endif
}
.frame(minHeight: 50.0, maxHeight: 200.0)
}
} header: {
Text("Response")
#if os(macOS)
.font(.headline)
.padding(.bottom, 2.0)
#endif
}
#if os(macOS)
Spacer()
#endif
}
#if os(iOS)
.listSectionSpacing(0)
#elseif os(macOS)
.padding()
#endif
.task {
camera.start()
}
.task {
await model.load()
}
#if !os(macOS)
.onAppear {
// Prevent the screen from dimming or sleeping due to inactivity
UIApplication.shared.isIdleTimerDisabled = true
}
.onDisappear {
// Resumes normal idle timer behavior
UIApplication.shared.isIdleTimerDisabled = false
}
#endif
// task to distribute video frames -- this will cancel
// and restart when the view is on/off screen. note: it is
// important that this is here (attached to the VideoFrameView)
// rather than the outer view because this has the correct lifecycle
.task {
if Task.isCancelled {
return
}
await distributeVideoFrames()
}
.navigationTitle("FastVLM")
#if os(iOS)
.navigationBarTitleDisplayMode(.inline)
#endif
.toolbar {
ToolbarItem(placement: toolbarItemPlacement) {
Button {
isShowingInfo.toggle()
}
label: {
Image(systemName: "info.circle")
}
}
ToolbarItem(placement: .primaryAction) {
if isEditingPrompt {
Button {
isEditingPrompt.toggle()
}
label: {
Text("Done")
.fontWeight(.bold)
}
}
else {
Menu {
Button("Describe image") {
prompt = "Describe the image in English."
promptSuffix = "Output should be brief, about 15 words or less."
}
Button("Facial expression") {
prompt = "What is this person's facial expression?"
promptSuffix = "Output only one or two words."
}
Button("Read text") {
prompt = "What is written in this image?"
promptSuffix = "Output only the text in the image."
}
#if !os(macOS)
Button("Customize...") {
isEditingPrompt.toggle()
}
#endif
} label: { Text("Prompts") }
}
}
}
.sheet(isPresented: $isShowingInfo) {
InfoView()
}
}
}
var promptSummary: some View {
Section("Prompt") {
VStack(alignment: .leading, spacing: 4.0) {
let trimmedPrompt = prompt.trimmingCharacters(in: .whitespacesAndNewlines)
if !trimmedPrompt.isEmpty {
Text(trimmedPrompt)
.foregroundStyle(.secondary)
}
let trimmedSuffix = promptSuffix.trimmingCharacters(in: .whitespacesAndNewlines)
if !trimmedSuffix.isEmpty {
Text(trimmedSuffix)
.font(.caption)
.foregroundStyle(.tertiary)
}
}
}
}
var promptForm: some View {
Group {
#if os(iOS)
Section("Prompt") {
TextEditor(text: $prompt)
.frame(minHeight: 38)
}
Section("Prompt Suffix") {
TextEditor(text: $promptSuffix)
.frame(minHeight: 38)
}
#elseif os(macOS)
Section {
HStack(alignment: .top) {
VStack(alignment: .leading) {
Text("Prompt")
.font(.headline)
TextEditor(text: $prompt)
.frame(height: 38)
.padding(.horizontal, 8.0)
.padding(.vertical, 10.0)
.background(Color(.textBackgroundColor))
.cornerRadius(10.0)
}
VStack(alignment: .leading) {
Text("Prompt Suffix")
.font(.headline)
TextEditor(text: $promptSuffix)
.frame(height: 38)
.padding(.horizontal, 8.0)
.padding(.vertical, 10.0)
.background(Color(.textBackgroundColor))
.cornerRadius(10.0)
}
}
}
.padding(.vertical)
#endif
}
}
var promptSections: some View {
Group {
#if os(iOS)
if isEditingPrompt {
promptForm
}
else {
promptSummary
}
#elseif os(macOS)
promptForm
#endif
}
}
func analyzeVideoFrames(_ frames: AsyncStream<CVImageBuffer>) async {
for await frame in frames {
let userInput = UserInput(
prompt: .text("\(prompt) \(promptSuffix)"),
images: [.ciImage(CIImage(cvPixelBuffer: frame))]
)
// generate output for a frame and wait for generation to complete
let t = await model.generate(userInput)
_ = await t.result
do {
try await Task.sleep(for: FRAME_DELAY)
} catch { return }
}
}
func distributeVideoFrames() async {
// attach a stream to the camera; the loop in distributeFrames below reads from it
let frames = AsyncStream<CMSampleBuffer>(bufferingPolicy: .bufferingNewest(1)) {
camera.attach(continuation: $0)
}
let (framesToDisplay, framesToDisplayContinuation) = AsyncStream.makeStream(
of: CVImageBuffer.self,
bufferingPolicy: .bufferingNewest(1)
)
self.framesToDisplay = framesToDisplay
// The analysis stream is always created; frames are only yielded to it in continuous mode
let (framesToAnalyze, framesToAnalyzeContinuation) = AsyncStream.makeStream(
of: CVImageBuffer.self,
bufferingPolicy: .bufferingNewest(1)
)
// set up structured tasks (important -- this means the child tasks
// are cancelled when the parent is cancelled)
async let distributeFrames: () = {
for await sampleBuffer in frames {
if let frame = sampleBuffer.imageBuffer {
framesToDisplayContinuation.yield(frame)
// Only send frames for analysis in continuous mode
if await selectedCameraType == .continuous {
framesToAnalyzeContinuation.yield(frame)
}
}
}
// the camera stream ended: clear the display stream and detach from the camera controller
await MainActor.run {
self.framesToDisplay = nil
self.camera.detatch()
}
framesToDisplayContinuation.finish()
framesToAnalyzeContinuation.finish()
}()
// Only analyze frames if in continuous mode
if selectedCameraType == .continuous {
async let analyze: () = analyzeVideoFrames(framesToAnalyze)
await distributeFrames
await analyze
} else {
await distributeFrames
}
}
/// Perform FastVLM inference on a single frame.
/// - Parameter frame: The frame to analyze.
func processSingleFrame(_ frame: CVImageBuffer) {
// Reset Response UI (show spinner)
Task { @MainActor in
model.output = ""
}
// Construct request to model
let userInput = UserInput(
prompt: .text("\(prompt) \(promptSuffix)"),
images: [.ciImage(CIImage(cvPixelBuffer: frame))]
)
// Post request to FastVLM
Task {
await model.generate(userInput)
}
}
}
#Preview {
ContentView()
}
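// MARK: - Illustrative sketch (not part of the original source)
//
// distributeVideoFrames() creates all of its streams with `.bufferingNewest(1)`,
// so a consumer that is slower than the camera sees the most recent frame
// rather than a growing backlog. The hypothetical helper below (the name
// `sketchNewestOnly` is an assumption made for illustration) shows that
// policy on plain integers: five values are yielded before anything is read,
// and only the last one is still buffered when the consumer starts.
func sketchNewestOnly() async -> [Int] {
    let (stream, continuation) = AsyncStream.makeStream(
        of: Int.self,
        bufferingPolicy: .bufferingNewest(1)
    )
    // the producer runs ahead of the consumer; older values are dropped
    for value in 1 ... 5 {
        continuation.yield(value)
    }
    continuation.finish()
    var received: [Int] = []
    for await value in stream {
        received.append(value)
    }
    return received
}
// Usage: `await sketchNewestOnly()` returns [5]. In the app this same policy
// means a frame that arrives while the model is still generating replaces the
// frame that was waiting for analysis.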
================================================
FILE: app/FastVLM App/FastVLM.entitlements
================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>com.apple.developer.kernel.increased-memory-limit</key>
<true/>
<key>com.apple.security.app-sandbox</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.files.user-selected.read-only</key>
<true/>
<key>com.apple.security.network.client</key>
<true/>
</dict>
</plist>
================================================
FILE: app/FastVLM App/FastVLMApp.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import SwiftUI
@main
struct FastVLMApp: App {
var body: some Scene {
WindowGroup {
ContentView()
}
}
}
================================================
FILE: app/FastVLM App/FastVLMModel.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import CoreImage
import FastVLM
import Foundation
import MLX
import MLXLMCommon
import MLXRandom
import MLXVLM
@Observable
@MainActor
class FastVLMModel {
public var running = false
public var modelInfo = ""
public var output = ""
public var promptTime: String = ""
enum LoadState {
case idle
case loaded(ModelContainer)
}
private let modelConfiguration = FastVLM.modelConfiguration
/// parameters controlling the output
let generateParameters = GenerateParameters(temperature: 0.0)
let maxTokens = 240
/// update the display every N tokens -- 4 looks like it updates continuously
/// and is low overhead. observed ~15% reduction in tokens/s when updating
/// on every token
let displayEveryNTokens = 4
private var loadState = LoadState.idle
private var currentTask: Task<Void, Never>?
enum EvaluationState: String, CaseIterable {
case idle = "Idle"
case processingPrompt = "Processing Prompt"
case generatingResponse = "Generating Response"
}
public var evaluationState = EvaluationState.idle
public init() {
FastVLM.register(modelFactory: VLMModelFactory.shared)
}
private func _load() async throws -> ModelContainer {
switch loadState {
case .idle:
// limit the buffer cache
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)
let modelContainer = try await VLMModelFactory.shared.loadContainer(
configuration: modelConfiguration
) {
[modelConfiguration] progress in
Task { @MainActor in
self.modelInfo =
"Downloading \(modelConfiguration.name): \(Int(progress.fractionCompleted * 100))%"
}
}
self.modelInfo = "Loaded"
loadState = .loaded(modelContainer)
return modelContainer
case .loaded(let modelContainer):
return modelContainer
}
}
public func load() async {
do {
_ = try await _load()
} catch {
self.modelInfo = "Error loading model: \(error)"
}
}
public func generate(_ userInput: UserInput) async -> Task<Void, Never> {
if let currentTask, running {
return currentTask
}
running = true
// Cancel any existing task
currentTask?.cancel()
// Create new task and store reference
let task = Task {
do {
let modelContainer = try await _load()
// each time you generate you will get something new
MLXRandom.seed(UInt64(Date.timeIntervalSinceReferenceDate * 1000))
// Check if task was cancelled
if Task.isCancelled { return }
let result = try await modelContainer.perform { context in
// Measure the time it takes to prepare the input
Task { @MainActor in
evaluationState = .processingPrompt
}
let llmStart = Date()
let input = try await context.processor.prepare(input: userInput)
var seenFirstToken = false
// FastVLM generates the output
let result = try MLXLMCommon.generate(
input: input, parameters: generateParameters, context: context
) { tokens in
// Check if task was cancelled
if Task.isCancelled {
return .stop
}
if !seenFirstToken {
seenFirstToken = true
// produced first token, update the time to first token,
// the processing state and start displaying the text
let llmDuration = Date().timeIntervalSince(llmStart)
let text = context.tokenizer.decode(tokens: tokens)
Task { @MainActor in
evaluationState = .generatingResponse
self.output = text
self.promptTime = "\(Int(llmDuration * 1000)) ms"
}
}
// Show the text in the view as it generates
if tokens.count % displayEveryNTokens == 0 {
let text = context.tokenizer.decode(tokens: tokens)
Task { @MainActor in
self.output = text
}
}
if tokens.count >= maxTokens {
return .stop
} else {
return .more
}
}
// Return the generation result
return result
}
// Check if task was cancelled before updating UI
if !Task.isCancelled {
self.output = result.output
}
} catch {
if !Task.isCancelled {
output = "Failed: \(error)"
}
}
if evaluationState == .generatingResponse {
evaluationState = .idle
}
running = false
}
currentTask = task
return task
}
public func cancel() {
currentTask?.cancel()
currentTask = nil
running = false
output = ""
promptTime = ""
}
}
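// MARK: - Illustrative sketch (not part of the original source)
//
// The token callback in generate(_:) makes two independent decisions on each
// call: whether to refresh the displayed text (every `displayEveryNTokens`
// tokens) and whether to keep generating (until `maxTokens` is reached). The
// hypothetical types and function below restate only that arithmetic. The
// names `SketchDisposition` and `sketchTokenStep` are assumptions made for
// illustration, and the cancellation check and first-token timing from the
// real callback are deliberately left out.
enum SketchDisposition {
    case more
    case stop
}

func sketchTokenStep(tokenCount: Int, displayEvery: Int, maxTokens: Int)
    -> (refreshDisplay: Bool, disposition: SketchDisposition)
{
    // refresh the UI only on every Nth token to keep the update overhead low
    let refreshDisplay = displayEvery > 0 && tokenCount % displayEvery == 0
    // stop once the token budget is used up
    let disposition: SketchDisposition = tokenCount >= maxTokens ? .stop : .more
    return (refreshDisplay, disposition)
}
// Usage with the values used above (displayEvery: 4, maxTokens: 240):
//   tokenCount 4   -> (refreshDisplay: true,  disposition: .more)
//   tokenCount 5   -> (refreshDisplay: false, disposition: .more)
//   tokenCount 240 -> (refreshDisplay: true,  disposition: .stop)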
================================================
FILE: app/FastVLM App/Info.plist
================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict/>
</plist>
================================================
FILE: app/FastVLM App/InfoView.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import Foundation
import SwiftUI
struct InfoView: View {
@Environment(\.dismiss) var dismiss
let paragraph1 = "**FastVLM¹** is a new family of Vision-Language models that makes use of **FastViTHD**, a hierarchical hybrid vision encoder that produces a small number of high-quality tokens at low latencies, resulting in significantly faster time-to-first-token (TTFT)."
let paragraph2 = "This app showcases the **FastVLM** model in action, allowing users to freely customize the prompt. FastVLM utilizes Qwen2-Instruct LLMs without additional safety tuning, so please exercise caution when modifying the prompt."
let footer = "1. **FastVLM: Efficient Vision Encoding for Vision Language Models.** (CVPR 2025) Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari"
var body: some View {
NavigationStack {
VStack(alignment: .leading, spacing: 20.0) {
// I'm not going to lie, this doesn't make sense...
// Wrapping `String`s with `.init()` turns them into `LocalizedStringKey`s
// which gives us all of the fun Markdown formatting while retaining the
// ability to use `String` variables. ¯\_(ツ)_/¯
Text("\(.init(paragraph1))\n\n\(.init(paragraph2))\n\n")
.font(.body)
Spacer()
Text(.init(footer))
.font(.caption)
.foregroundStyle(.secondary)
}
.padding()
.frame(maxWidth: .infinity, maxHeight: .infinity, alignment: .top)
.textSelection(.enabled)
.navigationTitle("Information")
#if os(iOS)
.navigationBarTitleDisplayMode(.inline)
#endif
.toolbar {
#if os(iOS)
ToolbarItem(placement: .navigationBarLeading) {
Button {
dismiss()
} label: {
Image(systemName: "xmark.circle")
.resizable()
.frame(width: 25, height: 25)
.foregroundStyle(.secondary)
}
.buttonStyle(.plain)
}
#elseif os(macOS)
ToolbarItem(placement: .cancellationAction) {
Button("Done") {
dismiss()
}
.buttonStyle(.bordered)
}
#endif
}
}
}
}
#Preview {
InfoView()
}
================================================
FILE: app/FastVLM App/Preview Content/Preview Assets.xcassets/Contents.json
================================================
{
"info" : {
"author" : "xcode",
"version" : 1
}
}
================================================
FILE: app/FastVLM.xcodeproj/project.pbxproj
================================================
// !$*UTF8*$!
{
archiveVersion = 1;
classes = {
};
objectVersion = 77;
objects = {
/* Begin PBXBuildFile section */
019A3E1A2D78E7370055F93B /* MLX in Frameworks */ = {isa = PBXBuildFile; productRef = 019A3E192D78E7370055F93B /* MLX */; };
019A3E1C2D78E73E0055F93B /* MLXLMCommon in Frameworks */ = {isa = PBXBuildFile; productRef = 019A3E1B2D78E73E0055F93B /* MLXLMCommon */; };
019A3E1E2D78E7470055F93B /* MLXRandom in Frameworks */ = {isa = PBXBuildFile; productRef = 019A3E1D2D78E7470055F93B /* MLXRandom */; };
019A3E202D78E74C0055F93B /* MLXVLM in Frameworks */ = {isa = PBXBuildFile; productRef = 019A3E1F2D78E74C0055F93B /* MLXVLM */; };
019A3E212D78E7530055F93B /* Video.framework in Frameworks */ = {isa = PBXBuildFile; fileRef = C35372AE2D08C32D00474D34 /* Video.framework */; };
019A3E222D78E7530055F93B /* Video.framework in Embed Frameworks */ = {isa = PBXBuildFile; fileRef = C35372AE2D08C32D00474D34 /* Video.framework */; settings = {ATTRIBUTES = (CodeSignOnCopy, RemoveHeadersOnCopy, ); }; };
C3ED544E2D790860005E20B3 /* MLXLMCommon in Frameworks */ = {isa = PBXBuildFile; productRef = C3ED544D2D790860005E20B3 /* MLXLMCommon */; };
C3ED54502D790860005E20B3 /* MLXVLM in Frameworks */ = {isa = PBXBuildFile; productRef = C3ED544F2D790860005E20B3 /* MLXVLM */; };
C3ED54522D790860005E20B3 /* MLX in Frameworks */ = {isa = PBXBuildFile; productRef = C3ED54512D790860005E20B3 /* MLX */; };
C3ED54542D790860005E20B3 /* MLXNN in Frameworks */ = {isa = PBXBuildFile; productRef = C3ED54532D790860005E20B3 /* MLXNN */; };
C3ED54582D790A68005E20B3 /* MLXFast in Frameworks */ = {isa = PBXBuildFile; productRef = C3ED54572D790A68005E20B3 /* MLXFast */; };
C3ED545B2D790AD6005E20B3 /* Transformers in Frameworks */ = {isa = PBXBuildFile; productRef = C3ED545A2D790AD6005E20B3 /* Transformers */; };
C3ED55012D7A0A7A005E20B3 /* FastVLM.framework in Frameworks */ = {isa = PBXBuildFile; fileRef = C39BB3E62D79082A005DB8FB /* FastVLM.framework */; };
C3ED55022D7A0A7A005E20B3 /* FastVLM.framework in Embed Frameworks */ = {isa = PBXBuildFile; fileRef = C39BB3E62D79082A005DB8FB /* FastVLM.framework */; settings = {ATTRIBUTES = (CodeSignOnCopy, RemoveHeadersOnCopy, ); }; };
/* End PBXBuildFile section */
/* Begin PBXContainerItemProxy section */
019A3E232D78E7530055F93B /* PBXContainerItemProxy */ = {
isa = PBXContainerItemProxy;
containerPortal = C35EDB642D07699400757E80 /* Project object */;
proxyType = 1;
remoteGlobalIDString = C35372AD2D08C32D00474D34;
remoteInfo = Video;
};
C3ED55032D7A0A7A005E20B3 /* PBXContainerItemProxy */ = {
isa = PBXContainerItemProxy;
containerPortal = C35EDB642D07699400757E80 /* Project object */;
proxyType = 1;
remoteGlobalIDString = C39BB3E52D79082A005DB8FB;
remoteInfo = FastVLM;
};
/* End PBXContainerItemProxy section */
/* Begin PBXCopyFilesBuildPhase section */
019A3E252D78E7530055F93B /* Embed Frameworks */ = {
isa = PBXCopyFilesBuildPhase;
buildActionMask = 2147483647;
dstPath = "";
dstSubfolderSpec = 10;
files = (
C3ED55022D7A0A7A005E20B3 /* FastVLM.framework in Embed Frameworks */,
019A3E222D78E7530055F93B /* Video.framework in Embed Frameworks */,
);
name = "Embed Frameworks";
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXCopyFilesBuildPhase section */
/* Begin PBXFileReference section */
019A3E0A2D78E6A00055F93B /* FastVLM App.app */ = {isa = PBXFileReference; explicitFileType = wrapper.application; includeInIndex = 0; path = "FastVLM App.app"; sourceTree = BUILT_PRODUCTS_DIR; };
12FFAF3D2DC93583009C4EFA /* get_pretrained_mlx_model.sh */ = {isa = PBXFileReference; lastKnownFileType = text.script.sh; path = get_pretrained_mlx_model.sh; sourceTree = "<group>"; };
12FFAF3E2DC93583009C4EFA /* README.md */ = {isa = PBXFileReference; lastKnownFileType = net.daringfireball.markdown; path = README.md; sourceTree = "<group>"; };
C35372AE2D08C32D00474D34 /* Video.framework */ = {isa = PBXFileReference; explicitFileType = wrapper.framework; includeInIndex = 0; path = Video.framework; sourceTree = BUILT_PRODUCTS_DIR; };
C39BB3E62D79082A005DB8FB /* FastVLM.framework */ = {isa = PBXFileReference; explicitFileType = wrapper.framework; includeInIndex = 0; path = FastVLM.framework; sourceTree = BUILT_PRODUCTS_DIR; };
/* End PBXFileReference section */
/* Begin PBXFileSystemSynchronizedBuildFileExceptionSet section */
120A44852D9B05A900E244A3 /* Exceptions for "FastVLM App" folder in "FastVLM App" target */ = {
isa = PBXFileSystemSynchronizedBuildFileExceptionSet;
membershipExceptions = (
Info.plist,
);
target = 019A3E092D78E6A00055F93B /* FastVLM App */;
};
C35372B92D08C32D00474D34 /* Exceptions for "Video" folder in "Video" target */ = {
isa = PBXFileSystemSynchronizedBuildFileExceptionSet;
publicHeaders = (
Video.h,
);
target = C35372AD2D08C32D00474D34 /* Video */;
};
C3ED54BB2D791BEA005E20B3 /* Exceptions for "FastVLM" folder in "FastVLM" target */ = {
isa = PBXFileSystemSynchronizedBuildFileExceptionSet;
publicHeaders = (
FastVLM.h,
);
target = C39BB3E52D79082A005DB8FB /* FastVLM */;
};
/* End PBXFileSystemSynchronizedBuildFileExceptionSet section */
/* Begin PBXFileSystemSynchronizedRootGroup section */
019A3E0B2D78E6A00055F93B /* FastVLM App */ = {
isa = PBXFileSystemSynchronizedRootGroup;
exceptions = (
120A44852D9B05A900E244A3 /* Exceptions for "FastVLM App" folder in "FastVLM App" target */,
);
path = "FastVLM App";
sourceTree = "<group>";
};
C32B4A802DA4805400EF663D /* Configuration */ = {
isa = PBXFileSystemSynchronizedRootGroup;
path = Configuration;
sourceTree = "<group>";
};
C35372AF2D08C32D00474D34 /* Video */ = {
isa = PBXFileSystemSynchronizedRootGroup;
exceptions = (
C35372B92D08C32D00474D34 /* Exceptions for "Video" folder in "Video" target */,
);
path = Video;
sourceTree = "<group>";
};
C39BB3E72D79082A005DB8FB /* FastVLM */ = {
isa = PBXFileSystemSynchronizedRootGroup;
exceptions = (
C3ED54BB2D791BEA005E20B3 /* Exceptions for "FastVLM" folder in "FastVLM" target */,
);
path = FastVLM;
sourceTree = "<group>";
};
/* End PBXFileSystemSynchronizedRootGroup section */
/* Begin PBXFrameworksBuildPhase section */
019A3E072D78E6A00055F93B /* Frameworks */ = {
isa = PBXFrameworksBuildPhase;
buildActionMask = 2147483647;
files = (
019A3E1C2D78E73E0055F93B /* MLXLMCommon in Frameworks */,
019A3E212D78E7530055F93B /* Video.framework in Frameworks */,
C3ED55012D7A0A7A005E20B3 /* FastVLM.framework in Frameworks */,
019A3E1E2D78E7470055F93B /* MLXRandom in Frameworks */,
019A3E202D78E74C0055F93B /* MLXVLM in Frameworks */,
019A3E1A2D78E7370055F93B /* MLX in Frameworks */,
);
runOnlyForDeploymentPostprocessing = 0;
};
C35372AB2D08C32D00474D34 /* Frameworks */ = {
isa = PBXFrameworksBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
C39BB3E32D79082A005DB8FB /* Frameworks */ = {
isa = PBXFrameworksBuildPhase;
buildActionMask = 2147483647;
files = (
C3ED545B2D790AD6005E20B3 /* Transformers in Frameworks */,
C3ED54522D790860005E20B3 /* MLX in Frameworks */,
C3ED54502D790860005E20B3 /* MLXVLM in Frameworks */,
C3ED54542D790860005E20B3 /* MLXNN in Frameworks */,
C3ED54582D790A68005E20B3 /* MLXFast in Frameworks */,
C3ED544E2D790860005E20B3 /* MLXLMCommon in Frameworks */,
);
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXFrameworksBuildPhase section */
/* Begin PBXGroup section */
C35EDB632D07699400757E80 = {
isa = PBXGroup;
children = (
12FFAF3E2DC93583009C4EFA /* README.md */,
12FFAF3D2DC93583009C4EFA /* get_pretrained_mlx_model.sh */,
C32B4A802DA4805400EF663D /* Configuration */,
C35372AF2D08C32D00474D34 /* Video */,
019A3E0B2D78E6A00055F93B /* FastVLM App */,
C39BB3E72D79082A005DB8FB /* FastVLM */,
C35EDB7F2D076C3C00757E80 /* Frameworks */,
C35EDB702D076A5A00757E80 /* Products */,
);
sourceTree = "<group>";
};
C35EDB702D076A5A00757E80 /* Products */ = {
isa = PBXGroup;
children = (
C35372AE2D08C32D00474D34 /* Video.framework */,
019A3E0A2D78E6A00055F93B /* FastVLM App.app */,
C39BB3E62D79082A005DB8FB /* FastVLM.framework */,
);
name = Products;
sourceTree = "<group>";
};
C35EDB7F2D076C3C00757E80 /* Frameworks */ = {
isa = PBXGroup;
children = (
);
name = Frameworks;
sourceTree = "<group>";
};
/* End PBXGroup section */
/* Begin PBXHeadersBuildPhase section */
C35372A92D08C32D00474D34 /* Headers */ = {
isa = PBXHeadersBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
C39BB3E12D79082A005DB8FB /* Headers */ = {
isa = PBXHeadersBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXHeadersBuildPhase section */
/* Begin PBXNativeTarget section */
019A3E092D78E6A00055F93B /* FastVLM App */ = {
isa = PBXNativeTarget;
buildConfigurationList = 019A3E182D78E6A20055F93B /* Build configuration list for PBXNativeTarget "FastVLM App" */;
buildPhases = (
019A3E062D78E6A00055F93B /* Sources */,
019A3E072D78E6A00055F93B /* Frameworks */,
019A3E082D78E6A00055F93B /* Resources */,
019A3E252D78E7530055F93B /* Embed Frameworks */,
);
buildRules = (
);
dependencies = (
019A3E242D78E7530055F93B /* PBXTargetDependency */,
C3ED55042D7A0A7A005E20B3 /* PBXTargetDependency */,
);
fileSystemSynchronizedGroups = (
019A3E0B2D78E6A00055F93B /* FastVLM App */,
);
name = "FastVLM App";
packageProductDependencies = (
019A3E192D78E7370055F93B /* MLX */,
019A3E1B2D78E73E0055F93B /* MLXLMCommon */,
019A3E1D2D78E7470055F93B /* MLXRandom */,
019A3E1F2D78E74C0055F93B /* MLXVLM */,
);
productName = FastVLMCameraExample;
productReference = 019A3E0A2D78E6A00055F93B /* FastVLM App.app */;
productType = "com.apple.product-type.application";
};
C35372AD2D08C32D00474D34 /* Video */ = {
isa = PBXNativeTarget;
buildConfigurationList = C35372BA2D08C32D00474D34 /* Build configuration list for PBXNativeTarget "Video" */;
buildPhases = (
C35372A92D08C32D00474D34 /* Headers */,
C35372AA2D08C32D00474D34 /* Sources */,
C35372AB2D08C32D00474D34 /* Frameworks */,
C35372AC2D08C32D00474D34 /* Resources */,
);
buildRules = (
);
dependencies = (
);
fileSystemSynchronizedGroups = (
C35372AF2D08C32D00474D34 /* Video */,
);
name = Video;
packageProductDependencies = (
);
productName = Video;
productReference = C35372AE2D08C32D00474D34 /* Video.framework */;
productType = "com.apple.product-type.framework";
};
C39BB3E52D79082A005DB8FB /* FastVLM */ = {
isa = PBXNativeTarget;
buildConfigurationList = C39BB3FF2D79082A005DB8FB /* Build configuration list for PBXNativeTarget "FastVLM" */;
buildPhases = (
C39BB3E12D79082A005DB8FB /* Headers */,
C39BB3E22D79082A005DB8FB /* Sources */,
C39BB3E32D79082A005DB8FB /* Frameworks */,
C39BB3E42D79082A005DB8FB /* Resources */,
);
buildRules = (
);
dependencies = (
);
fileSystemSynchronizedGroups = (
C39BB3E72D79082A005DB8FB /* FastVLM */,
);
name = FastVLM;
packageProductDependencies = (
C3ED544D2D790860005E20B3 /* MLXLMCommon */,
C3ED544F2D790860005E20B3 /* MLXVLM */,
C3ED54512D790860005E20B3 /* MLX */,
C3ED54532D790860005E20B3 /* MLXNN */,
C3ED54572D790A68005E20B3 /* MLXFast */,
C3ED545A2D790AD6005E20B3 /* Transformers */,
);
productName = FastVLM;
productReference = C39BB3E62D79082A005DB8FB /* FastVLM.framework */;
productType = "com.apple.product-type.framework";
};
/* End PBXNativeTarget section */
/* Begin PBXProject section */
C35EDB642D07699400757E80 /* Project object */ = {
isa = PBXProject;
attributes = {
BuildIndependentTargetsInParallel = 1;
LastSwiftUpdateCheck = 1620;
LastUpgradeCheck = 1630;
TargetAttributes = {
019A3E092D78E6A00055F93B = {
CreatedOnToolsVersion = 16.2;
};
C35372AD2D08C32D00474D34 = {
CreatedOnToolsVersion = 16.0;
};
C39BB3E52D79082A005DB8FB = {
CreatedOnToolsVersion = 16.0;
LastSwiftMigration = 1600;
};
};
};
buildConfigurationList = C35EDB672D07699400757E80 /* Build configuration list for PBXProject "FastVLM" */;
developmentRegion = en;
hasScannedForEncodings = 0;
knownRegions = (
en,
Base,
);
mainGroup = C35EDB632D07699400757E80;
minimizedProjectReferenceProxies = 1;
packageReferences = (
C35EDB6A2D076A3900757E80 /* XCRemoteSwiftPackageReference "mlx-swift-examples" */,
C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */,
C3ED54592D790AC6005E20B3 /* XCRemoteSwiftPackageReference "swift-transformers" */,
);
preferredProjectObjectVersion = 77;
productRefGroup = C35EDB702D076A5A00757E80 /* Products */;
projectDirPath = "";
projectRoot = "";
targets = (
C35372AD2D08C32D00474D34 /* Video */,
019A3E092D78E6A00055F93B /* FastVLM App */,
C39BB3E52D79082A005DB8FB /* FastVLM */,
);
};
/* End PBXProject section */
/* Begin PBXResourcesBuildPhase section */
019A3E082D78E6A00055F93B /* Resources */ = {
isa = PBXResourcesBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
C35372AC2D08C32D00474D34 /* Resources */ = {
isa = PBXResourcesBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
C39BB3E42D79082A005DB8FB /* Resources */ = {
isa = PBXResourcesBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXResourcesBuildPhase section */
/* Begin PBXSourcesBuildPhase section */
019A3E062D78E6A00055F93B /* Sources */ = {
isa = PBXSourcesBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
C35372AA2D08C32D00474D34 /* Sources */ = {
isa = PBXSourcesBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
C39BB3E22D79082A005DB8FB /* Sources */ = {
isa = PBXSourcesBuildPhase;
buildActionMask = 2147483647;
files = (
);
runOnlyForDeploymentPostprocessing = 0;
};
/* End PBXSourcesBuildPhase section */
/* Begin PBXTargetDependency section */
019A3E242D78E7530055F93B /* PBXTargetDependency */ = {
isa = PBXTargetDependency;
target = C35372AD2D08C32D00474D34 /* Video */;
targetProxy = 019A3E232D78E7530055F93B /* PBXContainerItemProxy */;
};
C3ED55042D7A0A7A005E20B3 /* PBXTargetDependency */ = {
isa = PBXTargetDependency;
target = C39BB3E52D79082A005DB8FB /* FastVLM */;
targetProxy = C3ED55032D7A0A7A005E20B3 /* PBXContainerItemProxy */;
};
/* End PBXTargetDependency section */
/* Begin XCBuildConfiguration section */
019A3E162D78E6A20055F93B /* Debug */ = {
isa = XCBuildConfiguration;
buildSettings = {
ALWAYS_SEARCH_USER_PATHS = NO;
ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon;
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor;
CLANG_ANALYZER_NONNULL = YES;
CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE;
CLANG_CXX_LANGUAGE_STANDARD = "gnu++20";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_ENABLE_OBJC_WEAK = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
CODE_SIGN_ENTITLEMENTS = "FastVLM App/FastVLM.entitlements";
CODE_SIGN_STYLE = Automatic;
COPY_PHASE_STRIP = NO;
CURRENT_PROJECT_VERSION = 0.1.0;
DEAD_CODE_STRIPPING = YES;
DEBUG_INFORMATION_FORMAT = dwarf;
DEVELOPMENT_ASSET_PATHS = "\"FastVLM App/Preview Content\"";
DEVELOPMENT_TEAM = "";
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_PREVIEWS = YES;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_TESTABILITY = YES;
ENABLE_USER_SCRIPT_SANDBOXING = YES;
GCC_C_LANGUAGE_STANDARD = gnu17;
GCC_DYNAMIC_NO_PIC = NO;
GCC_NO_COMMON_BLOCKS = YES;
GCC_OPTIMIZATION_LEVEL = 0;
GCC_PREPROCESSOR_DEFINITIONS = (
"DEBUG=1",
"$(inherited)",
);
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
GENERATE_INFOPLIST_FILE = YES;
INFOPLIST_FILE = "FastVLM App/Info.plist";
INFOPLIST_KEY_CFBundleDisplayName = FastVLM;
INFOPLIST_KEY_NSCameraUsageDescription = "Use camera to get live feed of images";
"INFOPLIST_KEY_UIApplicationSceneManifest_Generation[sdk=iphoneos*]" = YES;
"INFOPLIST_KEY_UIApplicationSceneManifest_Generation[sdk=iphonesimulator*]" = YES;
"INFOPLIST_KEY_UIApplicationSupportsIndirectInputEvents[sdk=iphoneos*]" = YES;
"INFOPLIST_KEY_UIApplicationSupportsIndirectInputEvents[sdk=iphonesimulator*]" = YES;
"INFOPLIST_KEY_UILaunchScreen_Generation[sdk=iphoneos*]" = YES;
"INFOPLIST_KEY_UILaunchScreen_Generation[sdk=iphonesimulator*]" = YES;
INFOPLIST_KEY_UIRequiresFullScreen = YES;
"INFOPLIST_KEY_UIStatusBarStyle[sdk=iphoneos*]" = UIStatusBarStyleDefault;
"INFOPLIST_KEY_UIStatusBarStyle[sdk=iphonesimulator*]" = UIStatusBarStyleDefault;
INFOPLIST_KEY_UISupportedInterfaceOrientations = UIInterfaceOrientationPortrait;
IPHONEOS_DEPLOYMENT_TARGET = 18.2;
LD_RUNPATH_SEARCH_PATHS = "@executable_path/Frameworks";
"LD_RUNPATH_SEARCH_PATHS[sdk=macosx*]" = "@executable_path/../Frameworks";
LOCALIZATION_PREFERS_STRING_CATALOGS = YES;
MACOSX_DEPLOYMENT_TARGET = 15.2;
MARKETING_VERSION = 1.0;
MTL_ENABLE_DEBUG_INFO = INCLUDE_SOURCE;
MTL_FAST_MATH = YES;
ONLY_ACTIVE_ARCH = YES;
PRODUCT_BUNDLE_IDENTIFIER = com.apple.ml.FastVLM;
PRODUCT_NAME = "$(TARGET_NAME)";
SDKROOT = auto;
SUPPORTED_PLATFORMS = "iphoneos iphonesimulator macosx xros xrsimulator";
SWIFT_ACTIVE_COMPILATION_CONDITIONS = "DEBUG $(inherited)";
SWIFT_EMIT_LOC_STRINGS = YES;
SWIFT_OPTIMIZATION_LEVEL = "-Onone";
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = "1,2,7";
XROS_DEPLOYMENT_TARGET = 2.2;
};
name = Debug;
};
019A3E172D78E6A20055F93B /* Release */ = {
isa = XCBuildConfiguration;
buildSettings = {
ALWAYS_SEARCH_USER_PATHS = NO;
ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon;
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor;
CLANG_ANALYZER_NONNULL = YES;
CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE;
CLANG_CXX_LANGUAGE_STANDARD = "gnu++20";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_ENABLE_OBJC_WEAK = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
CODE_SIGN_ENTITLEMENTS = "FastVLM App/FastVLM.entitlements";
CODE_SIGN_STYLE = Automatic;
COPY_PHASE_STRIP = NO;
CURRENT_PROJECT_VERSION = 0.1.0;
DEAD_CODE_STRIPPING = YES;
DEBUG_INFORMATION_FORMAT = "dwarf-with-dsym";
DEVELOPMENT_ASSET_PATHS = "\"FastVLM App/Preview Content\"";
DEVELOPMENT_TEAM = "";
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_NS_ASSERTIONS = NO;
ENABLE_PREVIEWS = YES;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_USER_SCRIPT_SANDBOXING = YES;
GCC_C_LANGUAGE_STANDARD = gnu17;
GCC_NO_COMMON_BLOCKS = YES;
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
GENERATE_INFOPLIST_FILE = YES;
INFOPLIST_FILE = "FastVLM App/Info.plist";
INFOPLIST_KEY_CFBundleDisplayName = FastVLM;
INFOPLIST_KEY_NSCameraUsageDescription = "Use camera to get live feed of images";
"INFOPLIST_KEY_UIApplicationSceneManifest_Generation[sdk=iphoneos*]" = YES;
"INFOPLIST_KEY_UIApplicationSceneManifest_Generation[sdk=iphonesimulator*]" = YES;
"INFOPLIST_KEY_UIApplicationSupportsIndirectInputEvents[sdk=iphoneos*]" = YES;
"INFOPLIST_KEY_UIApplicationSupportsIndirectInputEvents[sdk=iphonesimulator*]" = YES;
"INFOPLIST_KEY_UILaunchScreen_Generation[sdk=iphoneos*]" = YES;
"INFOPLIST_KEY_UILaunchScreen_Generation[sdk=iphonesimulator*]" = YES;
INFOPLIST_KEY_UIRequiresFullScreen = YES;
"INFOPLIST_KEY_UIStatusBarStyle[sdk=iphoneos*]" = UIStatusBarStyleDefault;
"INFOPLIST_KEY_UIStatusBarStyle[sdk=iphonesimulator*]" = UIStatusBarStyleDefault;
INFOPLIST_KEY_UISupportedInterfaceOrientations = UIInterfaceOrientationPortrait;
IPHONEOS_DEPLOYMENT_TARGET = 18.2;
LD_RUNPATH_SEARCH_PATHS = "@executable_path/Frameworks";
"LD_RUNPATH_SEARCH_PATHS[sdk=macosx*]" = "@executable_path/../Frameworks";
LOCALIZATION_PREFERS_STRING_CATALOGS = YES;
MACOSX_DEPLOYMENT_TARGET = 15.2;
MARKETING_VERSION = 1.0;
MTL_ENABLE_DEBUG_INFO = NO;
MTL_FAST_MATH = YES;
PRODUCT_BUNDLE_IDENTIFIER = com.apple.ml.FastVLM;
PRODUCT_NAME = "$(TARGET_NAME)";
SDKROOT = auto;
SUPPORTED_PLATFORMS = "iphoneos iphonesimulator macosx xros xrsimulator";
SWIFT_COMPILATION_MODE = wholemodule;
SWIFT_EMIT_LOC_STRINGS = YES;
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = "1,2,7";
XROS_DEPLOYMENT_TARGET = 2.2;
};
name = Release;
};
C35372B72D08C32D00474D34 /* Debug */ = {
isa = XCBuildConfiguration;
buildSettings = {
ALLOW_TARGET_PLATFORM_SPECIALIZATION = YES;
ALWAYS_SEARCH_USER_PATHS = NO;
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
BUILD_LIBRARY_FOR_DISTRIBUTION = YES;
CLANG_ANALYZER_NONNULL = YES;
CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE;
CLANG_CXX_LANGUAGE_STANDARD = "gnu++20";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_ENABLE_OBJC_WEAK = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
CODE_SIGN_IDENTITY = "";
CODE_SIGN_STYLE = Automatic;
COPY_PHASE_STRIP = NO;
CURRENT_PROJECT_VERSION = 1;
DEAD_CODE_STRIPPING = YES;
DEBUG_INFORMATION_FORMAT = dwarf;
DEFINES_MODULE = YES;
DYLIB_COMPATIBILITY_VERSION = 1;
DYLIB_CURRENT_VERSION = 1;
DYLIB_INSTALL_NAME_BASE = "@rpath";
ENABLE_MODULE_VERIFIER = YES;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_TESTABILITY = YES;
ENABLE_USER_SCRIPT_SANDBOXING = YES;
GCC_C_LANGUAGE_STANDARD = gnu17;
GCC_DYNAMIC_NO_PIC = NO;
GCC_NO_COMMON_BLOCKS = YES;
GCC_OPTIMIZATION_LEVEL = 0;
GCC_PREPROCESSOR_DEFINITIONS = (
"DEBUG=1",
"$(inherited)",
);
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
GENERATE_INFOPLIST_FILE = YES;
INFOPLIST_KEY_NSHumanReadableCopyright = "";
INSTALL_PATH = "$(LOCAL_LIBRARY_DIR)/Frameworks";
IPHONEOS_DEPLOYMENT_TARGET = 18.0;
LD_RUNPATH_SEARCH_PATHS = (
"@executable_path/Frameworks",
"@loader_path/Frameworks",
);
"LD_RUNPATH_SEARCH_PATHS[sdk=macosx*]" = (
"@executable_path/../Frameworks",
"@loader_path/Frameworks",
);
LOCALIZATION_PREFERS_STRING_CATALOGS = YES;
MACOSX_DEPLOYMENT_TARGET = 15.0;
MARKETING_VERSION = 1.0;
MODULE_VERIFIER_SUPPORTED_LANGUAGES = "objective-c objective-c++";
MODULE_VERIFIER_SUPPORTED_LANGUAGE_STANDARDS = "gnu17 gnu++20";
MTL_ENABLE_DEBUG_INFO = INCLUDE_SOURCE;
MTL_FAST_MATH = YES;
ONLY_ACTIVE_ARCH = YES;
PRODUCT_BUNDLE_IDENTIFIER = mlx.Video;
PRODUCT_NAME = "$(TARGET_NAME:c99extidentifier)";
SDKROOT = auto;
SKIP_INSTALL = YES;
SUPPORTED_PLATFORMS = "iphoneos iphonesimulator macosx xros xrsimulator";
SWIFT_ACTIVE_COMPILATION_CONDITIONS = "DEBUG $(inherited)";
SWIFT_EMIT_LOC_STRINGS = YES;
SWIFT_INSTALL_OBJC_HEADER = NO;
SWIFT_OPTIMIZATION_LEVEL = "-Onone";
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = "1,2,7";
VERSIONING_SYSTEM = "apple-generic";
VERSION_INFO_PREFIX = "";
XROS_DEPLOYMENT_TARGET = 2.0;
};
name = Debug;
};
C35372B82D08C32D00474D34 /* Release */ = {
isa = XCBuildConfiguration;
buildSettings = {
ALLOW_TARGET_PLATFORM_SPECIALIZATION = YES;
ALWAYS_SEARCH_USER_PATHS = NO;
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
BUILD_LIBRARY_FOR_DISTRIBUTION = YES;
CLANG_ANALYZER_NONNULL = YES;
CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE;
CLANG_CXX_LANGUAGE_STANDARD = "gnu++20";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_ENABLE_OBJC_WEAK = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
CODE_SIGN_IDENTITY = "";
CODE_SIGN_STYLE = Automatic;
COPY_PHASE_STRIP = NO;
CURRENT_PROJECT_VERSION = 1;
DEAD_CODE_STRIPPING = YES;
DEBUG_INFORMATION_FORMAT = "dwarf-with-dsym";
DEFINES_MODULE = YES;
DYLIB_COMPATIBILITY_VERSION = 1;
DYLIB_CURRENT_VERSION = 1;
DYLIB_INSTALL_NAME_BASE = "@rpath";
ENABLE_MODULE_VERIFIER = YES;
ENABLE_NS_ASSERTIONS = NO;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_USER_SCRIPT_SANDBOXING = YES;
GCC_C_LANGUAGE_STANDARD = gnu17;
GCC_NO_COMMON_BLOCKS = YES;
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
GENERATE_INFOPLIST_FILE = YES;
INFOPLIST_KEY_NSHumanReadableCopyright = "";
INSTALL_PATH = "$(LOCAL_LIBRARY_DIR)/Frameworks";
IPHONEOS_DEPLOYMENT_TARGET = 18.0;
LD_RUNPATH_SEARCH_PATHS = (
"@executable_path/Frameworks",
"@loader_path/Frameworks",
);
"LD_RUNPATH_SEARCH_PATHS[sdk=macosx*]" = (
"@executable_path/../Frameworks",
"@loader_path/Frameworks",
);
LOCALIZATION_PREFERS_STRING_CATALOGS = YES;
MACOSX_DEPLOYMENT_TARGET = 15.0;
MARKETING_VERSION = 1.0;
MODULE_VERIFIER_SUPPORTED_LANGUAGES = "objective-c objective-c++";
MODULE_VERIFIER_SUPPORTED_LANGUAGE_STANDARDS = "gnu17 gnu++20";
MTL_ENABLE_DEBUG_INFO = NO;
MTL_FAST_MATH = YES;
PRODUCT_BUNDLE_IDENTIFIER = mlx.Video;
PRODUCT_NAME = "$(TARGET_NAME:c99extidentifier)";
SDKROOT = auto;
SKIP_INSTALL = YES;
SUPPORTED_PLATFORMS = "iphoneos iphonesimulator macosx xros xrsimulator";
SWIFT_COMPILATION_MODE = wholemodule;
SWIFT_EMIT_LOC_STRINGS = YES;
SWIFT_INSTALL_OBJC_HEADER = NO;
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = "1,2,7";
VERSIONING_SYSTEM = "apple-generic";
VERSION_INFO_PREFIX = "";
XROS_DEPLOYMENT_TARGET = 2.0;
};
name = Release;
};
C35EDB682D07699400757E80 /* Debug */ = {
isa = XCBuildConfiguration;
baseConfigurationReferenceAnchor = C32B4A802DA4805400EF663D /* Configuration */;
baseConfigurationReferenceRelativePath = Build.xcconfig;
buildSettings = {
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
DEAD_CODE_STRIPPING = YES;
DEVELOPMENT_TEAM = 565ARCVNXV;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_TESTABILITY = YES;
GCC_NO_COMMON_BLOCKS = YES;
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
ONLY_ACTIVE_ARCH = YES;
};
name = Debug;
};
C35EDB692D07699400757E80 /* Release */ = {
isa = XCBuildConfiguration;
buildSettings = {
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
DEAD_CODE_STRIPPING = YES;
DEVELOPMENT_TEAM = 565ARCVNXV;
ENABLE_STRICT_OBJC_MSGSEND = YES;
GCC_NO_COMMON_BLOCKS = YES;
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
};
name = Release;
};
C39BB3FB2D79082A005DB8FB /* Debug */ = {
isa = XCBuildConfiguration;
buildSettings = {
ALLOW_TARGET_PLATFORM_SPECIALIZATION = YES;
ALWAYS_SEARCH_USER_PATHS = NO;
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
BUILD_LIBRARY_FOR_DISTRIBUTION = NO;
CLANG_ANALYZER_NONNULL = YES;
CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE;
CLANG_CXX_LANGUAGE_STANDARD = "gnu++20";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_ENABLE_OBJC_WEAK = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
CODE_SIGN_IDENTITY = "";
CODE_SIGN_STYLE = Automatic;
COPY_PHASE_STRIP = NO;
CURRENT_PROJECT_VERSION = 1;
DEAD_CODE_STRIPPING = YES;
DEBUG_INFORMATION_FORMAT = dwarf;
DEFINES_MODULE = NO;
DYLIB_COMPATIBILITY_VERSION = 1;
DYLIB_CURRENT_VERSION = 1;
DYLIB_INSTALL_NAME_BASE = "@rpath";
ENABLE_MODULE_VERIFIER = YES;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_TESTABILITY = YES;
ENABLE_USER_SCRIPT_SANDBOXING = YES;
GCC_C_LANGUAGE_STANDARD = gnu17;
GCC_DYNAMIC_NO_PIC = NO;
GCC_NO_COMMON_BLOCKS = YES;
GCC_OPTIMIZATION_LEVEL = 0;
GCC_PREPROCESSOR_DEFINITIONS = (
"DEBUG=1",
"$(inherited)",
);
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
GENERATE_INFOPLIST_FILE = YES;
INFOPLIST_KEY_NSHumanReadableCopyright = "";
INSTALL_PATH = "$(LOCAL_LIBRARY_DIR)/Frameworks";
IPHONEOS_DEPLOYMENT_TARGET = 18.0;
LD_RUNPATH_SEARCH_PATHS = (
"@executable_path/Frameworks",
"@loader_path/Frameworks",
);
"LD_RUNPATH_SEARCH_PATHS[sdk=macosx*]" = (
"@executable_path/../Frameworks",
"@loader_path/Frameworks",
);
LOCALIZATION_PREFERS_STRING_CATALOGS = YES;
MACOSX_DEPLOYMENT_TARGET = 15.0;
MARKETING_VERSION = 1.0;
MODULE_VERIFIER_SUPPORTED_LANGUAGES = "objective-c objective-c++";
MODULE_VERIFIER_SUPPORTED_LANGUAGE_STANDARDS = "gnu17 gnu++20";
MTL_ENABLE_DEBUG_INFO = INCLUDE_SOURCE;
MTL_FAST_MATH = YES;
ONLY_ACTIVE_ARCH = YES;
PRODUCT_BUNDLE_IDENTIFIER = mlx.FastVLM;
PRODUCT_NAME = "$(TARGET_NAME:c99extidentifier)";
SDKROOT = auto;
SKIP_INSTALL = YES;
SUPPORTED_PLATFORMS = "iphoneos iphonesimulator macosx xros xrsimulator";
SWIFT_ACTIVE_COMPILATION_CONDITIONS = "DEBUG $(inherited)";
SWIFT_EMIT_LOC_STRINGS = YES;
SWIFT_INSTALL_OBJC_HEADER = NO;
SWIFT_OPTIMIZATION_LEVEL = "-Onone";
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = "1,2,7";
VERSIONING_SYSTEM = "apple-generic";
VERSION_INFO_PREFIX = "";
XROS_DEPLOYMENT_TARGET = 2.0;
};
name = Debug;
};
C39BB3FC2D79082A005DB8FB /* Release */ = {
isa = XCBuildConfiguration;
buildSettings = {
ALLOW_TARGET_PLATFORM_SPECIALIZATION = YES;
ALWAYS_SEARCH_USER_PATHS = NO;
ASSETCATALOG_COMPILER_GENERATE_SWIFT_ASSET_SYMBOL_EXTENSIONS = YES;
BUILD_LIBRARY_FOR_DISTRIBUTION = NO;
CLANG_ANALYZER_NONNULL = YES;
CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE;
CLANG_CXX_LANGUAGE_STANDARD = "gnu++20";
CLANG_ENABLE_MODULES = YES;
CLANG_ENABLE_OBJC_ARC = YES;
CLANG_ENABLE_OBJC_WEAK = YES;
CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES;
CLANG_WARN_BOOL_CONVERSION = YES;
CLANG_WARN_COMMA = YES;
CLANG_WARN_CONSTANT_CONVERSION = YES;
CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES;
CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
CLANG_WARN_EMPTY_BODY = YES;
CLANG_WARN_ENUM_CONVERSION = YES;
CLANG_WARN_INFINITE_RECURSION = YES;
CLANG_WARN_INT_CONVERSION = YES;
CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES;
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES;
CLANG_WARN_OBJC_LITERAL_CONVERSION = YES;
CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES;
CLANG_WARN_RANGE_LOOP_ANALYSIS = YES;
CLANG_WARN_STRICT_PROTOTYPES = YES;
CLANG_WARN_SUSPICIOUS_MOVE = YES;
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE;
CLANG_WARN_UNREACHABLE_CODE = YES;
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
CODE_SIGN_IDENTITY = "";
CODE_SIGN_STYLE = Automatic;
COPY_PHASE_STRIP = NO;
CURRENT_PROJECT_VERSION = 1;
DEAD_CODE_STRIPPING = YES;
DEBUG_INFORMATION_FORMAT = "dwarf-with-dsym";
DEFINES_MODULE = NO;
DYLIB_COMPATIBILITY_VERSION = 1;
DYLIB_CURRENT_VERSION = 1;
DYLIB_INSTALL_NAME_BASE = "@rpath";
ENABLE_MODULE_VERIFIER = YES;
ENABLE_NS_ASSERTIONS = NO;
ENABLE_STRICT_OBJC_MSGSEND = YES;
ENABLE_USER_SCRIPT_SANDBOXING = YES;
GCC_C_LANGUAGE_STANDARD = gnu17;
GCC_NO_COMMON_BLOCKS = YES;
GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
GCC_WARN_UNDECLARED_SELECTOR = YES;
GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
GCC_WARN_UNUSED_FUNCTION = YES;
GCC_WARN_UNUSED_VARIABLE = YES;
GENERATE_INFOPLIST_FILE = YES;
INFOPLIST_KEY_NSHumanReadableCopyright = "";
INSTALL_PATH = "$(LOCAL_LIBRARY_DIR)/Frameworks";
IPHONEOS_DEPLOYMENT_TARGET = 18.0;
LD_RUNPATH_SEARCH_PATHS = (
"@executable_path/Frameworks",
"@loader_path/Frameworks",
);
"LD_RUNPATH_SEARCH_PATHS[sdk=macosx*]" = (
"@executable_path/../Frameworks",
"@loader_path/Frameworks",
);
LOCALIZATION_PREFERS_STRING_CATALOGS = YES;
MACOSX_DEPLOYMENT_TARGET = 15.0;
MARKETING_VERSION = 1.0;
MODULE_VERIFIER_SUPPORTED_LANGUAGES = "objective-c objective-c++";
MODULE_VERIFIER_SUPPORTED_LANGUAGE_STANDARDS = "gnu17 gnu++20";
MTL_ENABLE_DEBUG_INFO = NO;
MTL_FAST_MATH = YES;
PRODUCT_BUNDLE_IDENTIFIER = mlx.FastVLM;
PRODUCT_NAME = "$(TARGET_NAME:c99extidentifier)";
SDKROOT = auto;
SKIP_INSTALL = YES;
SUPPORTED_PLATFORMS = "iphoneos iphonesimulator macosx xros xrsimulator";
SWIFT_COMPILATION_MODE = wholemodule;
SWIFT_EMIT_LOC_STRINGS = YES;
SWIFT_INSTALL_OBJC_HEADER = NO;
SWIFT_VERSION = 5.0;
TARGETED_DEVICE_FAMILY = "1,2,7";
VERSIONING_SYSTEM = "apple-generic";
VERSION_INFO_PREFIX = "";
XROS_DEPLOYMENT_TARGET = 2.0;
};
name = Release;
};
/* End XCBuildConfiguration section */
/* Begin XCConfigurationList section */
019A3E182D78E6A20055F93B /* Build configuration list for PBXNativeTarget "FastVLM App" */ = {
isa = XCConfigurationList;
buildConfigurations = (
019A3E162D78E6A20055F93B /* Debug */,
019A3E172D78E6A20055F93B /* Release */,
);
defaultConfigurationIsVisible = 0;
defaultConfigurationName = Release;
};
C35372BA2D08C32D00474D34 /* Build configuration list for PBXNativeTarget "Video" */ = {
isa = XCConfigurationList;
buildConfigurations = (
C35372B72D08C32D00474D34 /* Debug */,
C35372B82D08C32D00474D34 /* Release */,
);
defaultConfigurationIsVisible = 0;
defaultConfigurationName = Release;
};
C35EDB672D07699400757E80 /* Build configuration list for PBXProject "FastVLM" */ = {
isa = XCConfigurationList;
buildConfigurations = (
C35EDB682D07699400757E80 /* Debug */,
C35EDB692D07699400757E80 /* Release */,
);
defaultConfigurationIsVisible = 0;
defaultConfigurationName = Release;
};
C39BB3FF2D79082A005DB8FB /* Build configuration list for PBXNativeTarget "FastVLM" */ = {
isa = XCConfigurationList;
buildConfigurations = (
C39BB3FB2D79082A005DB8FB /* Debug */,
C39BB3FC2D79082A005DB8FB /* Release */,
);
defaultConfigurationIsVisible = 0;
defaultConfigurationName = Release;
};
/* End XCConfigurationList section */
/* Begin XCRemoteSwiftPackageReference section */
C35EDB6A2D076A3900757E80 /* XCRemoteSwiftPackageReference "mlx-swift-examples" */ = {
isa = XCRemoteSwiftPackageReference;
repositoryURL = "https://github.com/ml-explore/mlx-swift-examples";
requirement = {
kind = upToNextMajorVersion;
minimumVersion = 2.21.2;
};
};
C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */ = {
isa = XCRemoteSwiftPackageReference;
repositoryURL = "https://github.com/ml-explore/mlx-swift";
requirement = {
kind = upToNextMajorVersion;
minimumVersion = 0.21.2;
};
};
C3ED54592D790AC6005E20B3 /* XCRemoteSwiftPackageReference "swift-transformers" */ = {
isa = XCRemoteSwiftPackageReference;
repositoryURL = "https://github.com/huggingface/swift-transformers";
requirement = {
kind = upToNextMajorVersion;
minimumVersion = 0.1.18;
};
};
/* End XCRemoteSwiftPackageReference section */
/* Begin XCSwiftPackageProductDependency section */
019A3E192D78E7370055F93B /* MLX */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */;
productName = MLX;
};
019A3E1B2D78E73E0055F93B /* MLXLMCommon */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB6A2D076A3900757E80 /* XCRemoteSwiftPackageReference "mlx-swift-examples" */;
productName = MLXLMCommon;
};
019A3E1D2D78E7470055F93B /* MLXRandom */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */;
productName = MLXRandom;
};
019A3E1F2D78E74C0055F93B /* MLXVLM */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB6A2D076A3900757E80 /* XCRemoteSwiftPackageReference "mlx-swift-examples" */;
productName = MLXVLM;
};
C3ED544D2D790860005E20B3 /* MLXLMCommon */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB6A2D076A3900757E80 /* XCRemoteSwiftPackageReference "mlx-swift-examples" */;
productName = MLXLMCommon;
};
C3ED544F2D790860005E20B3 /* MLXVLM */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB6A2D076A3900757E80 /* XCRemoteSwiftPackageReference "mlx-swift-examples" */;
productName = MLXVLM;
};
C3ED54512D790860005E20B3 /* MLX */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */;
productName = MLX;
};
C3ED54532D790860005E20B3 /* MLXNN */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */;
productName = MLXNN;
};
C3ED54572D790A68005E20B3 /* MLXFast */ = {
isa = XCSwiftPackageProductDependency;
package = C35EDB8A2D07777E00757E80 /* XCRemoteSwiftPackageReference "mlx-swift" */;
productName = MLXFast;
};
C3ED545A2D790AD6005E20B3 /* Transformers */ = {
isa = XCSwiftPackageProductDependency;
package = C3ED54592D790AC6005E20B3 /* XCRemoteSwiftPackageReference "swift-transformers" */;
productName = Transformers;
};
/* End XCSwiftPackageProductDependency section */
};
rootObject = C35EDB642D07699400757E80 /* Project object */;
}
================================================
FILE: app/FastVLM.xcodeproj/xcshareddata/xcschemes/FastVLM App.xcscheme
================================================
<?xml version="1.0" encoding="UTF-8"?>
<Scheme
LastUpgradeVersion = "1630"
version = "1.7">
<BuildAction
parallelizeBuildables = "YES"
buildImplicitDependencies = "YES"
buildArchitectures = "Automatic">
<BuildActionEntries>
<BuildActionEntry
buildForTesting = "YES"
buildForRunning = "YES"
buildForProfiling = "YES"
buildForArchiving = "YES"
buildForAnalyzing = "YES">
<BuildableReference
BuildableIdentifier = "primary"
BlueprintIdentifier = "019A3E092D78E6A00055F93B"
BuildableName = "FastVLM App.app"
BlueprintName = "FastVLM App"
ReferencedContainer = "container:FastVLM.xcodeproj">
</BuildableReference>
</BuildActionEntry>
</BuildActionEntries>
</BuildAction>
<TestAction
buildConfiguration = "Debug"
selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
shouldUseLaunchSchemeArgsEnv = "YES"
shouldAutocreateTestPlan = "YES">
</TestAction>
<LaunchAction
buildConfiguration = "Release"
selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
launchStyle = "0"
useCustomWorkingDirectory = "NO"
ignoresPersistentStateOnLaunch = "NO"
debugDocumentVersioning = "YES"
debugServiceExtension = "internal"
allowLocationSimulation = "YES">
<BuildableProductRunnable
runnableDebuggingMode = "0">
<BuildableReference
BuildableIdentifier = "primary"
BlueprintIdentifier = "019A3E092D78E6A00055F93B"
BuildableName = "FastVLM App.app"
BlueprintName = "FastVLM App"
ReferencedContainer = "container:FastVLM.xcodeproj">
</BuildableReference>
</BuildableProductRunnable>
</LaunchAction>
<ProfileAction
buildConfiguration = "Release"
shouldUseLaunchSchemeArgsEnv = "YES"
savedToolIdentifier = ""
useCustomWorkingDirectory = "NO"
debugDocumentVersioning = "YES">
<BuildableProductRunnable
runnableDebuggingMode = "0">
<BuildableReference
BuildableIdentifier = "primary"
BlueprintIdentifier = "019A3E092D78E6A00055F93B"
BuildableName = "FastVLM App.app"
BlueprintName = "FastVLM App"
ReferencedContainer = "container:FastVLM.xcodeproj">
</BuildableReference>
</BuildableProductRunnable>
</ProfileAction>
<AnalyzeAction
buildConfiguration = "Debug">
</AnalyzeAction>
<ArchiveAction
buildConfiguration = "Release"
revealArchiveInOrganizer = "YES">
</ArchiveAction>
</Scheme>
================================================
FILE: app/README.md
================================================
# FastVLM
Demonstrates the performance of **FastVLM** models for on-device visual question answering.
<table>
<tr>
<td><img src="../docs/fastvlm-counting.gif" alt="FastVLM - Counting"></td>
<td><img src="../docs/fastvlm-handwriting.gif" alt="FastVLM - Handwriting"></td>
<td><img src="../docs/fastvlm-emoji.gif" alt="FastVLM - Emoji"></td>
</tr>
</table>
## Features
- FastVLM runs on iOS (18.2+) and macOS (15.2+).
- View Time-To-First-Token (TTFT) with every inference.
- All predictions are processed privately and securely using on-device models.
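As a rough illustration of what the TTFT figure measures, the sketch below times the gap between starting generation and receiving the first token. It is written in Python for brevity and is not taken from the app: `measure_ttft` and `fake_token_stream` are hypothetical names, and the app's own measurement may differ in detail.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(start_generation: Callable[[], Iterable[str]]) -> Tuple[float, Iterator[str]]:
    """Return (seconds until the first token, iterator over all tokens)."""
    started = time.perf_counter()
    stream = iter(start_generation())
    first = next(stream)  # blocks until the first token is produced
    ttft = time.perf_counter() - started

    def all_tokens() -> Iterator[str]:
        # Re-emit the first token so the caller still sees the full output.
        yield first
        yield from stream

    return ttft, all_tokens()


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model: a slow first token, then fast ones."""
    time.sleep(0.05)  # pretend image and prompt processing takes about 50 ms
    yield "A"
    for token in [" cat", " on", " a", " mat"]:
        yield token


ttft, tokens = measure_ttft(fake_token_stream)
text = "".join(tokens)
print(f"TTFT: {ttft * 1000:.0f} ms, output: {text!r}")
```

The timer stops at the first token only, so the number reflects image and prompt processing rather than total generation time.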
### Flexible Prompting
<img src="../docs/fastvlm-flexible_prompts.png" alt="Flexible prompting" style="width:66%;">
The app includes a set of built-in prompts to help you get started quickly. Tap the **Prompts** button in the top-right corner to explore them. Selecting a prompt will immediately update the active input. To create new prompts or edit existing ones, choose **Customize…** from the **Prompts** menu.
## Pretrained Model Options
There are 3 pretrained sizes of FastVLM to choose from:
- **FastVLM 0.5B**: Small and fast - great for mobile devices where speed matters.
- **FastVLM 1.5B**: Well balanced - great for larger devices where speed and accuracy matter.
- **FastVLM 7B**: Fast and accurate - ideal for situations where accuracy matters over speed.
To download any of the FastVLM models listed above, use the [get_pretrained_mlx_model.sh](get_pretrained_mlx_model.sh) script. The script downloads the model from the web and places it in the appropriate location. Once a model has been downloaded using the steps below, no additional steps are needed to build the app in Xcode.
To explore how the other models work for your use case, re-run `get_pretrained_mlx_model.sh` with the new model selected, follow the prompts, and rebuild the app in Xcode.
### Download Instructions
1. Make the script executable
```shell
chmod +x app/get_pretrained_mlx_model.sh
```
2. Download FastVLM
```shell
app/get_pretrained_mlx_model.sh --model 0.5b --dest app/FastVLM/model
```
3. Open the app in Xcode, Build, and Run.
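For reference, the `--model` value selects one of three archives. The Python sketch below mirrors the size-to-archive mapping and base URL that appear in `get_pretrained_mlx_model.sh`; the helper name `archive_url` is hypothetical and only illustrates how the download URL is composed.

```python
BASE_URL = "https://ml-site.cdn-apple.com/datasets/fastvlm"

# Size -> archive name, copied from the case statement in get_pretrained_mlx_model.sh.
MODEL_ARCHIVES = {
    "0.5b": "llava-fastvithd_0.5b_stage3_llm.fp16",
    "1.5b": "llava-fastvithd_1.5b_stage3_llm.int8",
    "7b": "llava-fastvithd_7b_stage3_llm.int4",
}


def archive_url(model_size: str) -> str:
    """Return the zip URL the script would request for a given --model value."""
    try:
        name = MODEL_ARCHIVES[model_size]
    except KeyError:
        raise ValueError(f"Invalid model size {model_size!r}") from None
    return f"{BASE_URL}/{name}.zip"


print(archive_url("0.5b"))
```

Note that the three sizes ship at different precisions (FP16, INT8, and INT4 respectively), as listed in the script's help text.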
### Custom Model
In addition to the pretrained sizes of FastVLM, you can further quantize or fine-tune FastVLM to best fit your needs. To learn more, check out our documentation on how to [`export the model`](../model_export#export-vlm).
Clear the existing model in `app/FastVLM/model` before downloading or copying a new one.
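If you copy a custom model by hand, one way to empty the model directory first is sketched below in Python. This is only an illustration: `clear_directory` is a hypothetical helper, and the demo runs against a temporary directory rather than `app/FastVLM/model`. Unlike the download script's `rm -rf "$dest_dir"/*`, it also removes hidden files.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(path: Path) -> int:
    """Delete everything inside `path` (but not `path` itself); return the entry count."""
    removed = 0
    for entry in path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real subdirectory: remove recursively
        else:
            entry.unlink()  # regular file or symlink
        removed += 1
    return removed


# Demonstrate on a throwaway directory rather than a real model folder.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "config.json").write_text("{}")
    (model_dir / "weights").mkdir()
    removed = clear_directory(model_dir)
    remaining = list(model_dir.iterdir())
    print(f"removed {removed} entries, {len(remaining)} left")
```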
================================================
FILE: app/Video/CameraController.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import AVFoundation
import CoreImage
#if os(iOS)
import UIKit
#endif
@Observable
public class CameraController: NSObject {
private var framesContinuation: AsyncStream<CMSampleBuffer>.Continuation?
public var backCamera = true {
didSet {
stop()
start()
}
}
public var devices = [AVCaptureDevice]()
public var device: AVCaptureDevice = AVCaptureDevice.default(for: .video)! {
didSet {
stop()
start()
}
}
private var permissionGranted = true
private var captureSession: AVCaptureSession?
private let sessionQueue = DispatchQueue(label: "sessionQueue")
@objc dynamic private var rotationCoordinator : AVCaptureDevice.RotationCoordinator?
private var rotationObservation: NSKeyValueObservation?
public func attach(continuation: AsyncStream<CMSampleBuffer>.Continuation) {
sessionQueue.async {
self.framesContinuation = continuation
}
}
public func detatch() {
sessionQueue.async {
self.framesContinuation = nil
}
}
public func stop() {
sessionQueue.sync { [self] in
captureSession?.stopRunning()
captureSession = nil
}
}
public func start() {
sessionQueue.async { [self] in
let captureSession = AVCaptureSession()
self.captureSession = captureSession
self.checkPermission()
self.setupCaptureSession(position: backCamera ? .back : .front)
captureSession.startRunning()
}
}
#if os(iOS)
private func setOrientation(_ orientation: UIDeviceOrientation) {
guard let captureSession else { return }
let angle: Double?
switch orientation {
case .unknown, .faceDown:
angle = nil
case .portrait, .faceUp:
angle = 90
case .portraitUpsideDown:
angle = 270
case .landscapeLeft:
angle = 0
case .landscapeRight:
angle = 180
@unknown default:
angle = nil
}
if let angle {
for output in captureSession.outputs {
output.connection(with: .video)?.videoRotationAngle = angle
}
}
}
private func updateRotation(rotation : CGFloat) {
guard let captureSession else { return }
for output in captureSession.outputs {
output.connection(with: .video)?.videoRotationAngle = rotation
}
}
#endif
func checkPermission() {
switch AVCaptureDevice.authorizationStatus(for: .video) {
case .authorized:
// The user has previously granted access to the camera.
self.permissionGranted = true
case .notDetermined:
// The user has not yet been asked for camera access.
self.requestPermission()
// Combine the two other cases into the default case
default:
self.permissionGranted = false
}
}
func requestPermission() {
// Strong reference not a problem here but might become one in the future.
AVCaptureDevice.requestAccess(for: .video) { [unowned self] granted in
self.permissionGranted = granted
}
}
func setupCaptureSession(position: AVCaptureDevice.Position) {
guard let captureSession else { return }
let videoOutput = AVCaptureVideoDataOutput()
guard permissionGranted else {
print("No permission for camera")
return
}
let deviceTypes: [AVCaptureDevice.DeviceType]
#if os(iOS)
deviceTypes = [.builtInDualCamera, .builtInWideAngleCamera]
#else
deviceTypes = [.external, .continuityCamera, .builtInWideAngleCamera]
#endif
let videoDeviceDiscoverySession = AVCaptureDevice.DiscoverySession(
deviceTypes: deviceTypes,
mediaType: .video,
position: position)
let videoDevice: AVCaptureDevice?
if videoDeviceDiscoverySession.devices.contains(self.device) {
videoDevice = self.device
} else {
videoDevice = videoDeviceDiscoverySession.devices.first
}
if devices.isEmpty {
self.devices = videoDeviceDiscoverySession.devices
}
guard
let videoDevice
else {
print("Unable to find video device")
return
}
guard let videoDeviceInput = try? AVCaptureDeviceInput(device: videoDevice) else {
print("Unable to create AVCaptureDeviceInput")
return
}
guard captureSession.canAddInput(videoDeviceInput) else {
print("Unable to add input")
return
}
captureSession.addInput(videoDeviceInput)
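videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sampleBufferQueue"))
// Mirror the input check above: adding an unsupported output raises an exception.
guard captureSession.canAddOutput(videoOutput) else {
print("Unable to add output")
return
}
captureSession.addOutput(videoOutput)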
captureSession.sessionPreset = AVCaptureSession.Preset.hd1920x1080
#if os(iOS)
rotationCoordinator = AVCaptureDevice.RotationCoordinator(device: videoDevice, previewLayer: nil)
rotationObservation = observe(\.rotationCoordinator!.videoRotationAngleForHorizonLevelCapture, options: [.initial, .new]) { [weak self] _, change in
if let nv = change.newValue {
self?.updateRotation(rotation: nv)
}
}
#endif
}
}
extension CameraController: AVCaptureVideoDataOutputSampleBufferDelegate {
public func captureOutput(
_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
from connection: AVCaptureConnection
) {
if sampleBuffer.isValid && sampleBuffer.imageBuffer != nil {
framesContinuation?.yield(sampleBuffer)
}
}
}
================================================
FILE: app/Video/CameraControlsView.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import AVFoundation
import SwiftUI
public struct CameraControlsView: View {
@Binding public var backCamera: Bool
@Binding public var device: AVCaptureDevice
@Binding public var devices: [AVCaptureDevice]
public init(
backCamera: Binding<Bool>,
device: Binding<AVCaptureDevice>,
devices: Binding<[AVCaptureDevice]>
) {
self._backCamera = backCamera
self._device = device
self._devices = devices
}
public var body: some View {
Button {
backCamera.toggle()
} label: {
RoundedRectangle(cornerRadius: 8.0)
.fill(.regularMaterial)
.frame(width: 32.0, height: 32.0)
.overlay(alignment: .center) {
// Switch cameras image
Image(systemName: "arrow.triangle.2.circlepath.camera.fill")
.foregroundStyle(.primary)
.padding(6.0)
}
}
}
}
================================================
FILE: app/Video/CameraType.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import Foundation
public enum CameraType: String, CaseIterable {
case continuous
case single
}
================================================
FILE: app/Video/Video.h
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
#import <Foundation/Foundation.h>
//! Project version number for Video.
FOUNDATION_EXPORT double VideoVersionNumber;
//! Project version string for Video.
FOUNDATION_EXPORT const unsigned char VideoVersionString[];
================================================
FILE: app/Video/VideoFrameView.swift
================================================
//
// For licensing see accompanying LICENSE file.
// Copyright (C) 2025 Apple Inc. All Rights Reserved.
//
import AVFoundation
import CoreImage
import Foundation
import SwiftUI
/// Displays a stream of video frames
public struct VideoFrameView: View {
@Environment(\.colorScheme) private var colorScheme
public let frames: AsyncStream<CVImageBuffer>
public let cameraType: CameraType
public let action: ((CVImageBuffer) -> Void)?
@State private var hold: Bool = false
@State private var videoFrame: CVImageBuffer?
private var backgroundColor: Color {
#if os(iOS)
return Color(.secondarySystemBackground)
#elseif os(macOS)
return Color(.secondarySystemFill)
#else
// When in doubt, use these values that I captured to match iOS' secondarySystemBackground
if colorScheme == .dark {
return Color(red: 0.11, green: 0.11, blue: 0.12)
} else {
return Color(red: 0.95, green: 0.95, blue: 0.97)
}
#endif
}
public init(
frames: AsyncStream<CVImageBuffer>,
cameraType: CameraType,
action: ((CVImageBuffer) -> Void)?
) {
self.frames = frames
self.cameraType = cameraType
self.action = action
}
public var body: some View {
Group {
if let videoFrame {
_ImageView(image: videoFrame)
.overlay(alignment: .bottom) {
if cameraType == .single {
Button {
tap()
} label: {
if hold {
Label("Resume", systemImage: "play.fill")
} else {
Label("Capture Photo", systemImage: "camera.fill")
}
}
.clipShape(.capsule)
.buttonStyle(.borderedProminent)
.tint(hold ? .gray : .accentColor)
.foregroundColor(.white)
.padding()
}
}
} else {
// spinner before the camera comes up
ProgressView()
.controlSize(.large)
}
}
// This ensures that we take up the full 4/3 aspect ratio
// even if we don't have an image to display
.frame(maxWidth: .infinity, maxHeight: .infinity)
.background(backgroundColor)
.clipShape(RoundedRectangle(cornerRadius: 10.0))
.task {
// feed frames to the _ImageView
if Task.isCancelled {
return
}
for await frame in frames {
if !hold {
videoFrame = frame
}
}
}
.onChange(of: cameraType) { _, newType in
// No matter what, when the user switches to .continuous,
// we need to continue showing updated frames
if newType == .continuous {
hold = false
}
}
}
private func tap() {
if hold {
// resume
hold = false
} else if let videoFrame {
hold = true
if let action {
action(videoFrame)
}
}
}
}
#if os(iOS)
/// Internal view to display a CVImageBuffer
private struct _ImageView: UIViewRepresentable {
let image: Any
var gravity = CALayerContentsGravity.resizeAspectFill
func makeUIView(context: Context) -> UIView {
let view = UIView()
view.layer.contentsGravity = gravity
return view
}
func updateUIView(_ uiView: UIView, context: Context) {
uiView.layer.contents = image
}
}
#else
private struct _ImageView: NSViewRepresentable {
let image: Any
var gravity = CALayerContentsGravity.resizeAspectFill
func makeNSView(context: Context) -> NSView {
let view = NSView()
view.wantsLayer = true
view.layer?.contentsGravity = gravity
return view
}
func updateNSView(_ nsView: NSView, context: Context) {
nsView.layer?.contents = image
}
}
#endif
================================================
FILE: app/get_pretrained_mlx_model.sh
================================================
#!/usr/bin/env bash
#
# For licensing see accompanying LICENSE_MODEL file.
# Copyright (C) 2025 Apple Inc. All Rights Reserved.
#
set -e
# Help function
show_help() {
local is_error=${1:-true} # Default to error mode if no argument provided
echo "Usage: $0 --model <model_size> --dest <destination_directory>"
echo
echo "Required arguments:"
echo " --model <model_size> Size of the model to download"
echo " --dest <directory> Directory where the model will be downloaded"
echo
echo "Available model sizes:"
echo " 0.5b - 0.5B parameter model (FP16)"
echo " 1.5b - 1.5B parameter model (INT8)"
echo " 7b - 7B parameter model (INT4)"
echo
echo "Options:"
echo " --help Show help message"
# Exit with success (0) for help flag, error (1) for usage errors
if [ "$is_error" = "false" ]; then
exit 0
else
exit 1
fi
}
# Parse command line arguments
while [[ "$#" -gt 0 ]]; do
case $1 in
--model) model_size="$2"; shift ;;
--dest) dest_dir="$2"; shift ;;
--help) show_help false ;; # Explicit help request
*) echo -e "Unknown parameter: $1\n"; show_help true ;; # Error case
esac
shift
done
# Validate required parameters
if [ -z "$model_size" ]; then
echo -e "Error: --model parameter is required\n"
show_help true
fi
if [ -z "$dest_dir" ]; then
echo -e "Error: --dest parameter is required\n"
show_help true
fi
# Map model size to full model name
case "$model_size" in
"0.5b") model="llava-fastvithd_0.5b_stage3_llm.fp16" ;;
"1.5b") model="llava-fastvithd_1.5b_stage3_llm.int8" ;;
"7b") model="llava-fastvithd_7b_stage3_llm.int4" ;;
*)
echo -e "Error: Invalid model size '$model_size'\n"
show_help true
;;
esac
cleanup() {
rm -rf "$tmp_dir"
}
download_model() {
# Download directory
tmp_dir=$(mktemp -d)
# Model paths
base_url="https://ml-site.cdn-apple.com/datasets/fastvlm"
# Create destination directory if it doesn't exist
if [ ! -d "$dest_dir" ]; then
echo "Creating destination directory: $dest_dir"
mkdir -p "$dest_dir"
elif [ "$(ls -A "$dest_dir")" ]; then
echo -e "Destination directory '$dest_dir' exists and is not empty.\n"
read -r -p "Do you want to clear it and continue? [y/N]: " confirm
if [[ ! "$confirm" =~ ^[Yy]$ ]]; then
echo -e "\nStopping."
exit 1
fi
echo -e "\nClearing existing contents in '$dest_dir'"
rm -rf "${dest_dir:?}"/*
fi
# Create temp variables
tmp_zip_file="${tmp_dir}/${model}.zip"
tmp_extract_dir="${tmp_dir}/${model}"
# Create temp extract directory
mkdir -p "$tmp_extract_dir"
# Download model
echo -e "\nDownloading '${model}' model ...\n"
wget -q --progress=bar:noscroll --show-progress -O "$tmp_zip_file" "$base_url/$model.zip"
# Unzip model
echo -e "\nUnzipping model..."
unzip -q "$tmp_zip_file" -d "$tmp_extract_dir"
# Copy model files to destination directory
echo -e "\nCopying model files to destination directory..."
cp -r "$tmp_extract_dir/$model"/* "$dest_dir"
# Verify destination directory exists and is not empty
if [ ! -d "$dest_dir" ] || [ -z "$(ls -A "$dest_dir")" ]; then
echo -e "\nModel extraction failed. Destination directory '$dest_dir' is missing or empty."
exit 1
fi
echo -e "\nModel downloaded and extracted to '$dest_dir'"
}
# Cleanup download directory on exit
trap cleanup EXIT INT TERM
# Download models
download_model
================================================
FILE: get_models.sh
================================================
#!/usr/bin/env bash
#
# For licensing see accompanying LICENSE_MODEL file.
# Copyright (C) 2025 Apple Inc. All Rights Reserved.
#
mkdir -p checkpoints
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage2.zip -P checkpoints
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage3.zip -P checkpoints
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage2.zip -P checkpoints
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage3.zip -P checkpoints
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage2.zip -P checkpoints
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage3.zip -P checkpoints
# Extract models
cd checkpoints || exit 1
unzip -qq llava-fastvithd_0.5b_stage2.zip
unzip -qq llava-fastvithd_0.5b_stage3.zip
unzip -qq llava-fastvithd_1.5b_stage2.zip
unzip -qq llava-fastvithd_1.5b_stage3.zip
unzip -qq llava-fastvithd_7b_stage2.zip
unzip -qq llava-fastvithd_7b_stage3.zip
# Clean up
rm llava-fastvithd_0.5b_stage2.zip
rm llava-fastvithd_0.5b_stage3.zip
rm llava-fastvithd_1.5b_stage2.zip
rm llava-fastvithd_1.5b_stage3.zip
rm llava-fastvithd_7b_stage2.zip
rm llava-fastvithd_7b_stage3.zip
cd -
================================================
FILE: llava/__init__.py
================================================
from .model import LlavaLlamaForCausalLM, LlavaQwen2ForCausalLM
================================================
FILE: llava/constants.py
================================================
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
# Model Constants
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
IMAGE_PLACEHOLDER = "<image-placeholder>"
================================================
FILE: llava/conversation.py
================================================
import dataclasses
from enum import auto, Enum
from typing import List, Tuple
import base64
from io import BytesIO
from PIL import Image
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
TWO = auto()
MPT = auto()
PLAIN = auto()
LLAMA_2 = auto()
QWEN_2 = auto() # fix: add qwen2
CHATML = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
version: str = "Unknown"
skip_next: bool = False
def get_prompt(self):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0].replace("<image>", "").strip()
if 'mmtag' in self.version:
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
else:
messages[0] = (init_role, "<image>\n" + init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
# elif self.sep_style == SeparatorStyle.QWEN_2: # fix: add qwen2
# seps = [self.sep, self.sep2]
# ret = self.system + seps[0]
# ret = ""
# for i, (role, message) in enumerate(messages):
# if message:
# if type(message) is tuple:
# message, _, _ = message
# ret += role + ": " + message + seps[i % 2]
# else:
# ret += role + ":"
elif self.sep_style == SeparatorStyle.QWEN_2: # fix: add qwen2
ret = self.system + self.sep
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.CHATML:
ret = "" if self.system == "" else self.system + self.sep + "\n"
for role, message in messages:
if message:
if type(message) is tuple:
message, images = message
message = "<image>" * len(images) + message
ret += role + "\n" + message + self.sep + "\n"
else:
ret += role + "\n"
return ret
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.LLAMA_2:
def wrap_sys(msg): return f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg
def wrap_inst(msg): return f"[INST] {msg} [/INST]"
ret = ""
for i, (role, message) in enumerate(messages):
if i == 0:
assert message, "first message should not be none"
assert role == self.roles[0], "first message should come from user"
if message:
if type(message) is tuple:
message, _, _ = message
if i == 0:
message = wrap_sys(self.system) + message
if i % 2 == 0:
message = wrap_inst(message)
ret += self.sep + message
else:
ret += " " + message + " " + self.sep2
else:
ret += ""
ret = ret.lstrip(self.sep)
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def process_image(self, image, image_process_mode, return_pil=False, image_format='PNG', max_len=1344, min_len=672):
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if max(image.size) > max_len:
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
if return_pil:
return image
else:
buffered = BytesIO()
image.save(buffered, format=image_format)
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
return img_b64_str
def get_images(self, return_pil=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
image = self.process_image(image, image_process_mode, return_pil=return_pil)
images.append(image)
return images
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
img_b64_str = self.process_image(
image, "Default", return_pil=False,
image_format='JPEG')
img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}" alt="user upload image" />'
msg = img_str + msg.replace('<image>', '').strip()
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(
system=self.system,
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
offset=self.offset,
sep_style=self.sep_style,
sep=self.sep,
sep2=self.sep2,
version=self.version)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
conv_vicuna_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=(
("Human", "What are the key differences between renewable and non-renewable energy sources?"),
("Assistant",
"Renewable energy sources are those that can be replenished naturally in a relatively "
"short amount of time, such as solar, wind, hydro, geothermal, and biomass. "
"Non-renewable energy sources, on the other hand, are finite and will eventually be "
"depleted, such as coal, oil, and natural gas. Here are some key differences between "
"renewable and non-renewable energy sources:\n"
"1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable "
"energy sources are finite and will eventually run out.\n"
"2. Environmental impact: Renewable energy sources have a much lower environmental impact "
"than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, "
"and other negative effects.\n"
"3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically "
"have lower operational costs than non-renewable sources.\n"
"4. Reliability: Renewable energy sources are often more reliable and can be used in more remote "
"locations than non-renewable sources.\n"
"5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different "
"situations and needs, while non-renewable sources are more rigid and inflexible.\n"
"6. Sustainability: Renewable energy sources are more sustainable over the long term, while "
"non-renewable sources are not, and their depletion can lead to economic and social instability.\n")
),
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_vicuna_v1 = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llama_2 = Conversation(
system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=(),
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_llava_llama_2 = Conversation(
system="You are a helpful language and vision assistant. "
"You are able to understand the visual content that the user provides, "
"and assist the user with a variety of tasks using natural language.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=(),
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_mpt = Conversation(
system="""<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=(),
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_llava_plain = Conversation(
system="",
roles=("", ""),
messages=(
),
offset=0,
sep_style=SeparatorStyle.PLAIN,
sep="\n",
)
conv_llava_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=(
),
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_llava_v0_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("Human", "Assistant"),
messages=(
),
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
version="v0_mmtag",
)
conv_llava_v1 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llava_v1_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
version="v1_mmtag",
)
conv_mistral_instruct = Conversation(
system="",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=(),
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="</s>",
)
conv_chatml_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=(),
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_qwen_2 = Conversation(
system="<|im_start|>system\nYou are a helpful assistant.",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="qwen_v2",
messages=(),
offset=0,
sep_style=SeparatorStyle.QWEN_2,
sep="<|im_end|>\n",
)
# conv_qwen_2 = Conversation(
# system="",
# roles=("user", "assistant"),
# version="qwen_v2",
# messages=(),
# offset=0,
# sep_style=SeparatorStyle.QWEN_2,
# sep=" ",
# sep2="<|im_end|>",
# )
# fix: add qwen2
# conv_qwen_2 = Conversation(
# system="A chat between a curious user and an artificial intelligence assistant. "
# "The assistant gives helpful, detailed, and polite answers to the user's questions.",
# roles=("USER", "ASSISTANT"),
# version="qwen_v2",
# messages=(),
# offset=0,
# sep_style=SeparatorStyle.QWEN_2,
# sep=" ",
# sep2="<|endoftext|>",
# )
# conv_qwen_2 = Conversation(
# system="""<|im_start|>system
# You are a helpful assistant.""",
# roles=("<|im_start|>user", "<|im_start|>assistant"),
# version="qwen_v2",
# messages=[],
# offset=0,
# sep_style=SeparatorStyle.QWEN_2,
# sep="<|im_end|>",
# sep2="<|im_end|>",
# )
default_conversation = conv_qwen_2
conv_templates = {
"default": conv_qwen_2,
"v0": conv_vicuna_v0,
"v1": conv_vicuna_v1,
"vicuna_v1": conv_vicuna_v1,
"qwen_2": conv_qwen_2,
"llama_2": conv_llama_2,
"mistral_instruct": conv_mistral_instruct,
"chatml_direct": conv_chatml_direct,
"mistral_direct": conv_chatml_direct,
"plain": conv_llava_plain,
"v0_plain": conv_llava_plain,
"llava_v0": conv_llava_v0,
"v0_mmtag": conv_llava_v0_mmtag,
"llava_v1": conv_llava_v1,
"v1_mmtag": conv_llava_v1_mmtag,
"llava_llama_2": conv_llava_llama_2,
"mpt": conv_mpt,
}
if __name__ == "__main__":
print("conversation:", default_conversation.get_prompt())
================================================
FILE: llava/mm_utils.py
================================================
import PIL
from PIL import Image
PIL.Image.MAX_IMAGE_PIXELS=500000000
from io import BytesIO
import base64
import torch
import math
import ast
from transformers import StoppingCriteria
from llava.constants import IMAGE_TOKEN_INDEX
def select_best_resolution(original_size, possible_resolutions):
    """
    Selects the best resolution from a list of possible resolutions based on the original size.
    Args:
        original_size (tuple): The original size of the image in the format (width, height).
        possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
    Returns:
        tuple: The best fit resolution in the format (width, height).
    """
    original_width, original_height = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float('inf')
    for width, height in possible_resolutions:
        scale = min(width / original_width, height / original_height)
        downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
        effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
        wasted_resolution = (width * height) - effective_resolution
        if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (width, height)
    return best_fit
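# Worked example (a sketch, not part of the module): the selection rule in
# select_best_resolution can be read as a lexicographic key that maximizes the
# effective resolution first and minimizes the wasted area second. The helper
# name `resolution_key` and the sample sizes are assumptions chosen for
# illustration; only the standard library is used.

```python
def resolution_key(original_size, candidate):
    """Return (-effective, wasted) so that min() prefers the better candidate."""
    original_width, original_height = original_size
    width, height = candidate
    # Largest uniform scale that still fits inside the candidate resolution.
    scale = min(width / original_width, height / original_height)
    scaled_w, scaled_h = int(original_width * scale), int(original_height * scale)
    # Upscaling adds no information, so cap at the original pixel count.
    effective = min(scaled_w * scaled_h, original_width * original_height)
    wasted = width * height - effective
    return (-effective, wasted)


original = (512, 256)
candidates = [(256, 256), (512, 512), (1024, 512), (256, 512)]
# (512, 512) and (1024, 512) both keep every original pixel;
# the smaller canvas wins because it wastes less area.
best = min(candidates, key=lambda c: resolution_key(original, c))
print(best)
```

# Because min() returns the first minimal element, ties are broken in list
# order, which matches the strict comparisons used in the loop above.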
def resize_and_pad_image(image, target_resolution):
    """
    Resize and pad an image to a target resolution while maintaining aspect ratio.
    Args:
        image (PIL.Image.Image): The input image.
        target_resolution (tuple): The target resolution (width, height) of the image.
    Returns:
        PIL.Image.Image: The resized and padded image.
    """
    original_width, original_height = image.size
    target_width, target_height = target_resolution
    scale_w = target_width / original_width
    scale_h = target_height / original_height
    if scale_w < scale_h:
        new_width = target_width
        new_height = min(math.ceil(original_height * scale_w), target_height)
    else:
        new_height = target_height
        new_width = min(math.ceil(original_width * scale_h), target_width)
    # Resize the image
    resized_image = image.resize((new_width, new_height))
    new_image = Image.new('RGB', (target_width, target_height), (0, 0, 0))
    paste_x = (target_width - new_width) // 2
    paste_y = (target_height - new_height) // 2
    new_image.paste(resized_image, (paste_x, paste_y))
    return new_image
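# Worked example (a sketch, not part of the module): the geometry used by
# resize_and_pad_image, reproduced without PIL so the numbers can be checked
# by hand. `resize_geometry` is a hypothetical helper that returns only the
# resized size and the paste offset; it does not touch any pixels.

```python
import math


def resize_geometry(original_size, target_resolution):
    """Return ((new_width, new_height), (paste_x, paste_y)) for a letterboxed resize."""
    original_width, original_height = original_size
    target_width, target_height = target_resolution
    scale_w = target_width / original_width
    scale_h = target_height / original_height
    if scale_w < scale_h:
        # Width is the limiting side: fill it and scale the height to match.
        new_width = target_width
        new_height = min(math.ceil(original_height * scale_w), target_height)
    else:
        # Height is the limiting side (or both scales are equal).
        new_height = target_height
        new_width = min(math.ceil(original_width * scale_h), target_width)
    # Center the resized image on the target canvas.
    paste = ((target_width - new_width) // 2, (target_height - new_height) // 2)
    return (new_width, new_height), paste


wide = resize_geometry((512, 256), (512, 512))
tall = resize_geometry((256, 512), (512, 512))
print(wide, tall)
```

# A wide image ends up with padding bars above and below, a tall image with
# bars on the left and right.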
def divide_to_patches(image, patch_size):
    """
    Divides an image into patches of a specified size.
    Args:
        image (PIL.Image.Image): The input image.
        patch_size (int): The size of each patch.
    Returns:
        list: A list of PIL.Image.Image objects representing the patches.
    """
    patches = []
    width, height = image.size
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            box = (j, i, j + patch_size, i + patch_size)
            patch = image.crop(box)
            patches.append(patch)
    return patches
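# Worked example (a sketch, not part of the module): the crop boxes that
# divide_to_patches visits, listed in the same row-major order. `patch_boxes`
# is a hypothetical helper that works on sizes only, so it needs no PIL; the
# 336-pixel patch size is an assumption chosen for illustration.

```python
def patch_boxes(image_size, patch_size):
    """Return (left, upper, right, lower) crop boxes, row by row."""
    width, height = image_size
    return [
        (left, top, left + patch_size, top + patch_size)
        for top in range(0, height, patch_size)
        for left in range(0, width, patch_size)
    ]


boxes = patch_boxes((672, 336), 336)
grid = patch_boxes((672, 672), 336)
print(boxes)
print(len(grid))
```

# When the image size is not a multiple of patch_size, the last box in a row
# or column extends past the image edge.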
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    """
    Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
    Args:
        image_size (tuple): The size of the input image in the format (width, height).
        grid_pinpoints (list or str): A list of possible resolutions, or a string representation of such a list.
        patch_size (int): The size of each image patch.
    Returns:
        tuple: The shape of the image patch grid in the format (width, height).
    """
    if isinstance(grid_pinpoints, list):
        possible_resolutions = grid_pinpoints
    else:
        possible_resolutions = ast.literal_eval(grid_pinpoints)
    width, height = select_best_resolution(image_size, possible_resolutions)
    return width // patch_size, height // patch_size
def process_anyres_image(image, processor, grid_pinpoints):
    """
    Process an image with variable resolutions.
    Args:
        image (PIL.Image.Image): The input image to be processed.
        processor: The image processor object.
        grid_pinpoints (list or str): A list of possible resolutions, or a string representation of such a list.
    Returns:
        torch.Tensor: A tensor containing the processed image patches.
    """
    if isinstance(grid_pinpoints, list):
        possible_resolutions = grid_pinpoints
    else:
        possible_resolutions = ast.literal_eval(grid_pinpoints)
    best_resolution = select_best_resolution(image.size, possible_resolutions)
    image_padded = resize_and_pad_image(image, best_resolution)
    patches = divide_to_patches(image_padded, processor.crop_size['height'])
    # The first entry is a downscaled view of the whole image; the rest are the high-resolution patches.
    image_original_resize = image.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
    image_patches = [image_original_resize] + patches
    image_patches = [processor.preprocess(image_patch, return_tensors='pt')['pixel_values'][0]
                     for image_patch in image_patches]
    return torch.stack(image_patches, dim=0)
def load_image_from_base64(image):
    return Image.open(BytesIO(base64.b64decode(image)))
def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
def process_images(images, image_processor, model_cfg):
    image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None)
    new_images = []
    if image_aspect_ratio == 'pad':
        for image in images:
            image = expand2square(image, tuple(int(x*255) for x in image_processor.image_mean))
            image = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
            new_images.append(image)
    elif image_aspect_ratio == "anyres":
        for image in images:
            image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
            new_images.append(image)
    else:
        return image_processor(images, return_tensors='pt')['pixel_values']
    if all(x.shape == new_images[0].shape for x in new_images):
        new_images = torch.stack(new_images, dim=0)
    return new_images
def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
    # Tokenize the text on either side of every '<image>' placeholder separately.
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
    def insert_separator(X, sep):
        # Interleave sep between the elements of X: [x0, sep, x1, sep, ..., xn].
        return [ele for sublist in zip(X, [sep]*len(X)) for ele in sublist][:-1]
    input_ids = []
    offset = 0
    # If the tokenizer prepends BOS to every chunk, keep it once and skip it in later chunks.
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    # The separator holds offset + 1 copies so that exactly one image token survives x[offset:].
    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids
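# Worked example (a simplified sketch, not part of the module): how a prompt
# containing '<image>' becomes a token list with a placeholder id spliced in.
# Everything here is hypothetical: `StubTokenizer`, its vocabulary, the BOS id,
# and the placeholder value stand in for a real tokenizer and for
# IMAGE_TOKEN_INDEX. The sketch assumes the tokenizer always prepends BOS; the
# function above also handles tokenizers that do not.

```python
IMAGE_PLACEHOLDER_ID = -200  # hypothetical stand-in for IMAGE_TOKEN_INDEX
BOS_ID = 1                   # hypothetical BOS id used by the stub tokenizer


class StubTokenizer:
    """Toy tokenizer: BOS followed by one made-up id per whitespace-separated word."""
    bos_token_id = BOS_ID
    vocab = {"describe": 10, "this": 11, "briefly": 12}

    def encode(self, text):
        return [self.bos_token_id] + [self.vocab[word] for word in text.split()]


def splice_image_token(prompt, tokenizer, image_id=IMAGE_PLACEHOLDER_ID):
    """Tokenize around each '<image>' marker and insert image_id in its place."""
    chunks = [tokenizer.encode(chunk) for chunk in prompt.split("<image>")]
    input_ids = [tokenizer.bos_token_id]   # keep a single leading BOS
    for index, chunk in enumerate(chunks):
        if index > 0:
            input_ids.append(image_id)     # one placeholder per '<image>'
        input_ids.extend(chunk[1:])        # drop each chunk's own BOS
    return input_ids


ids = splice_image_token("describe <image> this briefly", StubTokenizer())
print(ids)
```

# The placeholder id is outside the vocabulary on purpose: it marks where image
# features are substituted later, rather than indexing the embedding table.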
def get_model_name_from_path(model_path):
    model_path = model_path.strip("/")
    model_paths = model_path.split("/")
    if model_paths[-1].startswith('checkpoint-'):
        return model_paths[-2] + "_" + model_paths[-1]
    else:
        return model_paths[-1]
class KeywordsStoppingCriteria(StoppingCriteria):
    def __init__(self, keywords, tokenizer, input_ids):
        self.keywords = keywords
        self.keyword_ids = []
        self.max_keyword_len = 0
        for keyword in keywords:
            cur_keyword_ids = tokenizer(keyword).input_ids
            if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
                cur_keyword_ids = cur_keyword_ids[1:]
            if len(cur_keyword_ids) > self.max_keyword_len:
                self.max_keyword_len = len(cur_keyword_ids)
            self.keyword_ids.append(torch.tensor(cur_keyword_ids))
        self.tokenizer = tokenizer
        self.start_len = input_ids.shape[1]
    def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        offset = min(output_ids.shape[1] - self.start_len, self.max_keyword_len)
        self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
        for keyword_id in self.keyword_ids:
            truncated_output_ids = output_ids[0, -keyword_id.shape[0]:]
            if torch.equal(truncated_output_ids, keyword_id):
                return True
        outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
        for keyword in self.keywords:
            if keyword in outputs:
                return True
        return False
    def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        outputs = []
        for i in range(output_ids.shape[0]):
            outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
        return all(outputs)
================================================
FILE: llava/model/__init__.py
================================================
# try:
from .language_model.llava_llama import LlavaLlamaForCausalLM, LlavaConfig
from .language_model.llava_mpt import LlavaMptForCausalLM, LlavaMptConfig
from .language_model.llava_mistral import LlavaMistralForCausalLM, LlavaMistralConfig
from .language_model.llava_qwen import LlavaQwen2ForCausalLM, LlavaConfig
# except:
# pass
================================================
FILE: llava/model/apply_delta.py
================================================
"""
Usage:
python3 -m llava.model.apply_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/vicuna-7b --delta-path lmsys/vicuna-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava import LlavaLlamaForCausalLM
def apply_delta(base_model_path, target_model_path, delta_path):
    print("Loading base model")
    base = AutoModelForCausalLM.from_pretrained(
        base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    print("Loading delta")
    delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
    print("Applying delta")
    for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
        if name not in base.state_dict():
            assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model'
            continue
        if param.data.shape == base.state_dict()[name].shape:
            param.data += base.state_dict()[name]
        else:
            assert name in ['model.embed_tokens.weight', 'lm_head.weight'], \
                f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}'
            bparam = base.state_dict()[name]
            param.data[:bparam.shape[0], :bparam.shape[1]] += bparam
    print("Saving target model")
    delta.save_pretrained(target_model_path)
    delta_tokenizer.save_pretrained(target_model_path)
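# Worked example (a sketch, not part of the module): the shape-mismatch branch
# of apply_delta, shown on plain nested lists instead of tensors. When the
# delta matrix has extra rows (for example, embeddings for added tokens), the
# base weights are added only into the overlapping top-left block and the
# extra rows keep their delta values. `add_base_into_delta` and the numbers
# are hypothetical.

```python
def add_base_into_delta(delta, base):
    """Add base into the top-left block of delta, in place, and return delta."""
    for row_index, base_row in enumerate(base):
        for col_index, value in enumerate(base_row):
            delta[row_index][col_index] += value
    return delta


delta = [[1, 1], [1, 1], [5, 5]]   # 3 x 2: one extra row compared with the base
base = [[10, 20], [30, 40]]        # 2 x 2
merged = add_base_into_delta(delta, base)
print(merged)
```

# This mirrors the slice assignment
# param.data[:bparam.shape[0], :bparam.shape[1]] += bparam in apply_delta.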
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-model-path", type=str, required=True)
    parser.add_argument("--target-model-path", type=str, required=True)
    parser.add_argument("--delta-path", type=str, required=True)
    args = parser.parse_args()
    apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
================================================
FILE: llava/model/builder.
SYMBOL INDEX (335 symbols across 32 files)
FILE: llava/conversation.py
class SeparatorStyle (line 9) | class SeparatorStyle(Enum):
class Conversation (line 21) | class Conversation:
method get_prompt (line 34) | def get_prompt(self):
method append_message (line 143) | def append_message(self, role, message):
method process_image (line 146) | def process_image(self, image, image_process_mode, return_pil=False, i...
method get_images (line 186) | def get_images(self, return_pil=False):
method to_gradio_chatbot (line 196) | def to_gradio_chatbot(self):
method copy (line 214) | def copy(self):
method dict (line 225) | def dict(self):
FILE: llava/mm_utils.py
function select_best_resolution (line 14) | def select_best_resolution(original_size, possible_resolutions):
function resize_and_pad_image (line 44) | def resize_and_pad_image(image, target_resolution):
function divide_to_patches (line 79) | def divide_to_patches(image, patch_size):
function get_anyres_image_grid_shape (line 101) | def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
function process_anyres_image (line 121) | def process_anyres_image(image, processor, grid_pinpoints):
function load_image_from_base64 (line 150) | def load_image_from_base64(image):
function expand2square (line 154) | def expand2square(pil_img, background_color):
function process_images (line 168) | def process_images(images, image_processor, model_cfg):
function tokenizer_image_token (line 187) | def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOK...
function get_model_name_from_path (line 209) | def get_model_name_from_path(model_path):
class KeywordsStoppingCriteria (line 218) | class KeywordsStoppingCriteria(StoppingCriteria):
method __init__ (line 219) | def __init__(self, keywords, tokenizer, input_ids):
method call_for_batch (line 233) | def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.F...
method __call__ (line 246) | def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTe...
FILE: llava/model/apply_delta.py
function apply_delta (line 13) | def apply_delta(base_model_path, target_model_path, delta_path):
FILE: llava/model/builder.py
function load_pretrained_model (line 26) | def load_pretrained_model(model_path, model_base, model_name, load_8bit=...
FILE: llava/model/consolidate.py
function consolidate_ckpt (line 13) | def consolidate_ckpt(src_path, dst_path):
FILE: llava/model/language_model/llava_llama.py
class LlavaConfig (line 30) | class LlavaConfig(LlamaConfig):
class LlavaLlamaModel (line 34) | class LlavaLlamaModel(LlavaMetaModel, LlamaModel):
method __init__ (line 37) | def __init__(self, config: LlamaConfig):
class LlavaLlamaForCausalLM (line 41) | class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
method __init__ (line 44) | def __init__(self, config):
method get_model (line 54) | def get_model(self):
method forward (line 57) | def forward(
method generate (line 106) | def generate(
method prepare_inputs_for_generation (line 145) | def prepare_inputs_for_generation(self, input_ids, past_key_values=None,
FILE: llava/model/language_model/llava_mistral.py
class LlavaMistralConfig (line 31) | class LlavaMistralConfig(MistralConfig):
class LlavaMistralModel (line 35) | class LlavaMistralModel(LlavaMetaModel, MistralModel):
method __init__ (line 38) | def __init__(self, config: MistralConfig):
class LlavaMistralForCausalLM (line 42) | class LlavaMistralForCausalLM(MistralForCausalLM, LlavaMetaForCausalLM):
method __init__ (line 45) | def __init__(self, config):
method get_model (line 54) | def get_model(self):
method forward (line 57) | def forward(
method generate (line 105) | def generate(
method prepare_inputs_for_generation (line 144) | def prepare_inputs_for_generation(self, input_ids, past_key_values=None,
FILE: llava/model/language_model/llava_mpt.py
class LlavaMptConfig (line 25) | class LlavaMptConfig(MptConfig):
class LlavaMptModel (line 29) | class LlavaMptModel(LlavaMetaModel, MptModel):
method __init__ (line 32) | def __init__(self, config: MptConfig):
method embed_tokens (line 36) | def embed_tokens(self, x):
class LlavaMptForCausalLM (line 40) | class LlavaMptForCausalLM(MptForCausalLM, LlavaMetaForCausalLM):
method __init__ (line 44) | def __init__(self, config):
method get_model (line 53) | def get_model(self):
method _set_gradient_checkpointing (line 56) | def _set_gradient_checkpointing(self, module, value=False):
method forward (line 60) | def forward(
method prepare_inputs_for_generation (line 87) | def prepare_inputs_for_generation(self, input_ids, past_key_values=Non...
FILE: llava/model/language_model/llava_qwen.py
class LlavaConfig (line 30) | class LlavaConfig(Qwen2Config):
class LlavaQwen2Model (line 34) | class LlavaQwen2Model(LlavaMetaModel, Qwen2Model):
method __init__ (line 37) | def __init__(self, config: Qwen2Config):
class LlavaQwen2ForCausalLM (line 41) | class LlavaQwen2ForCausalLM(Qwen2ForCausalLM, LlavaMetaForCausalLM):
method __init__ (line 44) | def __init__(self, config):
method get_model (line 54) | def get_model(self):
method forward (line 57) | def forward(
method generate (line 106) | def generate(
method prepare_inputs_for_generation (line 145) | def prepare_inputs_for_generation(self, input_ids, past_key_values=None,
FILE: llava/model/llava_arch.py
class LlavaMetaModel (line 29) | class LlavaMetaModel:
method __init__ (line 31) | def __init__(self, config):
method get_vision_tower (line 43) | def get_vision_tower(self):
method initialize_vision_modules (line 49) | def initialize_vision_modules(self, model_args, fsdp=None):
function unpad_image (line 101) | def unpad_image(tensor, original_size):
class LlavaMetaForCausalLM (line 132) | class LlavaMetaForCausalLM(ABC):
method get_model (line 135) | def get_model(self):
method get_vision_tower (line 138) | def get_vision_tower(self):
method encode_images (line 141) | def encode_images(self, images):
method prepare_inputs_labels_for_multimodal (line 146) | def prepare_inputs_labels_for_multimodal(
method initialize_vision_tokenizer (line 334) | def initialize_vision_tokenizer(self, model_args, tokenizer):
FILE: llava/model/make_delta.py
function make_delta (line 13) | def make_delta(base_model_path, target_model_path, delta_path, hub_repo_...
FILE: llava/model/multimodal_encoder/builder.py
function build_vision_tower (line 6) | def build_vision_tower(vision_tower_cfg, **kwargs):
FILE: llava/model/multimodal_encoder/clip_encoder.py
class CLIPVisionTower (line 7) | class CLIPVisionTower(nn.Module):
method __init__ (line 8) | def __init__(self, vision_tower, args, delay_load=False):
method load_model (line 31) | def load_model(self, device_map=None):
method feature_select (line 48) | def feature_select(self, image_forward_outs):
method forward (line 58) | def forward(self, images):
method forward_images (line 65) | def forward_images(self, images):
method dummy_feature (line 79) | def dummy_feature(self):
method dtype (line 83) | def dtype(self):
method device (line 87) | def device(self):
method config (line 91) | def config(self):
method hidden_size (line 98) | def hidden_size(self):
method num_patches_per_side (line 102) | def num_patches_per_side(self):
method num_patches (line 106) | def num_patches(self):
class CLIPVisionTowerS2 (line 111) | class CLIPVisionTowerS2(CLIPVisionTower):
method __init__ (line 112) | def __init__(self, vision_tower, args, delay_load=False):
method load_model (line 132) | def load_model(self, device_map=None):
method forward_feature (line 147) | def forward_feature(self, images):
method forward (line 153) | def forward(self, images):
method hidden_size (line 165) | def hidden_size(self):
FILE: llava/model/multimodal_encoder/mobileclip/__init__.py
function load_model_config (line 15) | def load_model_config(
class MCi (line 34) | class MCi(nn.Module):
method __init__ (line 39) | def __init__(self, model_name: str, *args, **kwargs) -> None:
method forward (line 55) | def forward(self, x: Any, *args, **kwargs) -> Any:
method _get_in_feature_dimension (line 61) | def _get_in_feature_dimension(image_classifier: nn.Module) -> int:
method _update_image_classifier (line 82) | def _update_image_classifier(
FILE: llava/model/multimodal_encoder/mobileclip/mci.py
function _cfg (line 20) | def _cfg(url="", **kwargs):
class SEBlock (line 42) | class SEBlock(nn.Module):
method __init__ (line 49) | def __init__(self, in_channels: int, rd_ratio: float = 0.0625) -> None:
method forward (line 72) | def forward(self, inputs: torch.Tensor) -> torch.Tensor:
class MobileOneBlock (line 84) | class MobileOneBlock(nn.Module):
method __init__ (line 94) | def __init__(
method forward (line 194) | def forward(self, x: torch.Tensor) -> torch.Tensor:
method reparameterize (line 219) | def reparameterize(self):
method _get_kernel_bias (line 249) | def _get_kernel_bias(self) -> Tuple[torch.Tensor, torch.Tensor]:
method _fuse_bn_tensor (line 284) | def _fuse_bn_tensor(
method _conv_bn (line 332) | def _conv_bn(self, kernel_size: int, padding: int) -> nn.Sequential:
class ReparamLargeKernelConv (line 368) | class ReparamLargeKernelConv(nn.Module):
method __init__ (line 377) | def __init__(
method forward (line 442) | def forward(self, x: torch.Tensor) -> torch.Tensor:
method get_kernel_bias (line 453) | def get_kernel_bias(self) -> Tuple[torch.Tensor, torch.Tensor]:
method reparameterize (line 469) | def reparameterize(self) -> None:
method _fuse_bn (line 495) | def _fuse_bn(
method _conv_bn (line 517) | def _conv_bn(self, kernel_size: int, padding: int = 0) -> nn.Sequential:
function convolutional_stem (line 553) | def convolutional_stem(
class LayerNormChannel (line 606) | class LayerNormChannel(nn.Module):
method __init__ (line 611) | def __init__(self, num_features, eps=1e-05) -> None:
method forward (line 617) | def forward(self, x) -> torch.Tensor:
class MHSA (line 626) | class MHSA(nn.Module):
method __init__ (line 633) | def __init__(
method forward (line 661) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class PatchEmbed (line 688) | class PatchEmbed(nn.Module):
method __init__ (line 691) | def __init__(
method forward (line 739) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class RepMixer (line 744) | class RepMixer(nn.Module):
method __init__ (line 751) | def __init__(
method forward (line 808) | def forward(self, x: torch.Tensor) -> torch.Tensor:
method reparameterize (line 819) | def reparameterize(self) -> None:
class ConvFFN (line 862) | class ConvFFN(nn.Module):
method __init__ (line 865) | def __init__(
method _init_weights (line 914) | def _init_weights(self, m: nn.Module) -> None:
method forward (line 920) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class RepCPE (line 930) | class RepCPE(nn.Module):
method __init__ (line 939) | def __init__(
method forward (line 992) | def forward(self, x: torch.Tensor) -> torch.Tensor:
method reparameterize (line 1000) | def reparameterize(self) -> None:
class RepMixerBlock (line 1042) | class RepMixerBlock(nn.Module):
method __init__ (line 1049) | def __init__(
method forward (line 1106) | def forward(self, x):
class AttentionBlock (line 1116) | class AttentionBlock(nn.Module):
method __init__ (line 1123) | def __init__(
method forward (line 1185) | def forward(self, x):
function basic_blocks (line 1195) | def basic_blocks(
class GlobalPool2D (line 1272) | class GlobalPool2D(nn.Module):
method __init__ (line 1275) | def __init__(self, in_dim: int, out_dim: int, *args, **kwargs) -> None:
method pool (line 1282) | def pool(self, x) -> Tensor:
method forward (line 1290) | def forward(self, x: Tensor, *args, **kwargs) -> Tensor:
class FastViT (line 1305) | class FastViT(nn.Module):
method __init__ (line 1310) | def __init__(
method cls_init_weights (line 1420) | def cls_init_weights(self, m: nn.Module) -> None:
method forward_embeddings (line 1427) | def forward_embeddings(self, x: torch.Tensor) -> torch.Tensor:
method forward_tokens (line 1431) | def forward_tokens(self, x: torch.Tensor, *args, **kwargs) -> torch.Te...
method forward (line 1436) | def forward(self, x: torch.Tensor, *args, **kwargs) -> Union[Tensor, D...
function fastvithd (line 1455) | def fastvithd(pretrained=False, **kwargs):
FILE: llava/model/multimodal_encoder/mobileclip_encoder.py
class MobileCLIPVisionTower (line 13) | class MobileCLIPVisionTower(nn.Module):
method __init__ (line 14) | def __init__(self, vision_tower, args, delay_load=False):
method load_model (line 31) | def load_model(self, device_map=None):
method feature_select (line 60) | def feature_select(self, image_forward_outs):
method forward (line 70) | def forward(self, images):
method forward_images (line 77) | def forward_images(self, images):
method dummy_feature (line 91) | def dummy_feature(self):
method dtype (line 95) | def dtype(self):
method device (line 99) | def device(self):
method config (line 103) | def config(self):
method hidden_size (line 107) | def hidden_size(self):
method num_patches_per_side (line 111) | def num_patches_per_side(self):
method num_patches (line 115) | def num_patches(self):
FILE: llava/model/multimodal_projector/builder.py
class IdentityMap (line 5) | class IdentityMap(nn.Module):
method __init__ (line 6) | def __init__(self):
method forward (line 9) | def forward(self, x, *args, **kwargs):
method config (line 13) | def config(self):
function build_vision_projector (line 17) | def build_vision_projector(config, delay_load=False, **kwargs):
FILE: llava/model/utils.py
function auto_upgrade (line 4) | def auto_upgrade(config):
FILE: llava/serve/cli.py
function load_image (line 18) | def load_image(image_file):
function main (line 27) | def main(args):
FILE: llava/serve/controller.py
class DispatchMethod (line 28) | class DispatchMethod(Enum):
method from_str (line 33) | def from_str(cls, name):
class WorkerInfo (line 43) | class WorkerInfo:
function heart_beat_controller (line 51) | def heart_beat_controller(controller):
class Controller (line 57) | class Controller:
method __init__ (line 58) | def __init__(self, dispatch_method: str):
method register_worker (line 69) | def register_worker(self, worker_name: str, check_heart_beat: bool,
method get_worker_status (line 88) | def get_worker_status(self, worker_name: str):
method remove_worker (line 101) | def remove_worker(self, worker_name: str):
method refresh_all_workers (line 104) | def refresh_all_workers(self):
method list_models (line 112) | def list_models(self):
method get_worker_address (line 120) | def get_worker_address(self, model_name: str):
method receive_heart_beat (line 173) | def receive_heart_beat(self, worker_name: str, queue_length: int):
method remove_stable_workers_by_expiration (line 183) | def remove_stable_workers_by_expiration(self):
method worker_api_generate_stream (line 193) | def worker_api_generate_stream(self, params):
method worker_api_get_status (line 220) | def worker_api_get_status(self):
function register_worker (line 243) | async def register_worker(request: Request):
function refresh_all_workers (line 251) | async def refresh_all_workers():
function list_models (line 256) | async def list_models():
function get_worker_address (line 262) | async def get_worker_address(request: Request):
function receive_heart_beat (line 269) | async def receive_heart_beat(request: Request):
function worker_api_generate_stream (line 277) | async def worker_api_generate_stream(request: Request):
function worker_api_get_status (line 284) | async def worker_api_get_status(request: Request):
FILE: llava/serve/gradio_web_server.py
function get_conv_log_filename (line 32) | def get_conv_log_filename():
function get_model_list (line 38) | def get_model_list():
function load_demo (line 58) | def load_demo(url_params, request: gr.Request):
function load_demo_refresh_model_list (line 71) | def load_demo_refresh_model_list(request: gr.Request):
function vote_last_response (line 82) | def vote_last_response(state, vote_type, model_selector, request: gr.Req...
function upvote_last_response (line 94) | def upvote_last_response(state, model_selector, request: gr.Request):
function downvote_last_response (line 100) | def downvote_last_response(state, model_selector, request: gr.Request):
function flag_last_response (line 106) | def flag_last_response(state, model_selector, request: gr.Request):
function regenerate (line 112) | def regenerate(state, image_process_mode, request: gr.Request):
function clear_history (line 122) | def clear_history(request: gr.Request):
function add_text (line 128) | def add_text(state, text, image, image_process_mode, request: gr.Request):
function http_bot (line 154) | def http_bot(state, model_selector, temperature, top_p, max_new_tokens, ...
function build_demo (line 317) | def build_demo(embed_mode, cur_dir=None, concurrency_count=10):
FILE: llava/serve/model_worker.py
function heart_beat_worker (line 37) | def heart_beat_worker(controller):
class ModelWorker (line 44) | class ModelWorker:
method __init__ (line 45) | def __init__(self, controller_addr, worker_addr,
method register_to_controller (line 75) | def register_to_controller(self):
method send_heart_beat (line 87) | def send_heart_beat(self):
method get_queue_length (line 108) | def get_queue_length(self):
method get_status (line 115) | def get_status(self):
method generate_stream (line 123) | def generate_stream(self, params):
method generate_stream_gate (line 195) | def generate_stream_gate(self, params):
function release_model_semaphore (line 225) | def release_model_semaphore(fn=None):
function generate_stream (line 232) | async def generate_stream(request: Request):
function get_status (line 248) | async def get_status(request: Request):
FILE: llava/serve/sglang_worker.py
function heart_beat_worker (line 38) | def heart_beat_worker(controller):
function pipeline (line 45) | def pipeline(s, prompt, max_tokens):
class ModelWorker (line 54) | class ModelWorker:
method __init__ (line 55) | def __init__(self, controller_addr, worker_addr, sgl_endpoint,
method register_to_controller (line 85) | def register_to_controller(self):
method send_heart_beat (line 97) | def send_heart_beat(self):
method get_queue_length (line 118) | def get_queue_length(self):
method get_status (line 125) | def get_status(self):
method generate_stream (line 132) | async def generate_stream(self, params):
method generate_stream_gate (line 172) | async def generate_stream_gate(self, params):
function release_model_semaphore (line 195) | def release_model_semaphore(fn=None):
function generate_stream (line 202) | async def generate_stream(request: Request):
function get_status (line 218) | async def get_status(request: Request):
FILE: llava/serve/test_message.py
function main (line 9) | def main():
FILE: llava/train/llama_flash_attn_monkey_patch.py
function forward (line 16) | def forward(
function _prepare_decoder_attention_mask (line 98) | def _prepare_decoder_attention_mask(
function replace_llama_attn_with_flash_attn (line 105) | def replace_llama_attn_with_flash_attn():
FILE: llava/train/llama_xformers_attn_monkey_patch.py
function replace_llama_attn_with_xformers_attn (line 19) | def replace_llama_attn_with_xformers_attn():
function xformers_forward (line 23) | def xformers_forward(
FILE: llava/train/llava_trainer.py
function maybe_zero_3 (line 22) | def maybe_zero_3(param, ignore_status=False, name=None):
function get_mm_adapter_state_maybe_zero_3 (line 36) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
function split_to_even_chunks (line 42) | def split_to_even_chunks(indices, lengths, num_chunks):
function get_modality_length_grouped_indices (line 64) | def get_modality_length_grouped_indices(lengths, batch_size, world_size,...
function get_length_grouped_indices (line 92) | def get_length_grouped_indices(lengths, batch_size, world_size, generato...
class LengthGroupedSampler (line 103) | class LengthGroupedSampler(Sampler):
method __init__ (line 109) | def __init__(
method __len__ (line 126) | def __len__(self):
method __iter__ (line 129) | def __iter__(self):
class LLaVATrainer (line 137) | class LLaVATrainer(Trainer):
method _get_train_sampler (line 139) | def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
method create_optimizer (line 154) | def create_optimizer(self):
method _save_checkpoint (line 244) | def _save_checkpoint(self, model, trial, metrics=None):
method _save (line 267) | def _save(self, output_dir: Optional[str] = None, state_dict=None):
FILE: llava/train/train.py
function rank0_print (line 45) | def rank0_print(*args):
class ModelArguments (line 54) | class ModelArguments:
class DataArguments (line 74) | class DataArguments:
class TrainingArguments (line 87) | class TrainingArguments(transformers.TrainingArguments):
function maybe_zero_3 (line 123) | def maybe_zero_3(param, ignore_status=False, name=None):
function get_peft_state_maybe_zero_3 (line 138) | def get_peft_state_maybe_zero_3(named_params, bias):
function get_peft_state_non_lora_maybe_zero_3 (line 163) | def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only...
function get_mm_adapter_state_maybe_zero_3 (line 171) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
function find_all_linear_names (line 177) | def find_all_linear_names(model):
function safe_save_model_for_hf_trainer (line 193) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer,
function smart_tokenizer_and_embedding_resize (line 232) | def smart_tokenizer_and_embedding_resize(
function _tokenize_fn (line 257) | def _tokenize_fn(strings: Sequence[str],
function _mask_targets (line 284) | def _mask_targets(target, tokenized_lens, speakers):
function _add_speaker_and_signal (line 295) | def _add_speaker_and_signal(header, source, get_conversation=True):
function preprocess_multimodal (line 316) | def preprocess_multimodal(
function preprocess_llama_2 (line 340) | def preprocess_llama_2(
function preprocess_qwen_2 (line 423) | def preprocess_qwen_2(
function preprocess_v1 (line 519) | def preprocess_v1(
function preprocess_mpt (line 605) | def preprocess_mpt(
function preprocess_plain (line 693) | def preprocess_plain(
function preprocess (line 715) | def preprocess(
class LazySupervisedDataset (line 767) | class LazySupervisedDataset(Dataset):
method __init__ (line 770) | def __init__(self, data_path: List[str],
method __len__ (line 785) | def __len__(self):
method lengths (line 789) | def lengths(self):
method modality_lengths (line 797) | def modality_lengths(self):
method __getitem__ (line 805) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
class DataCollatorForSupervisedDataset (line 863) | class DataCollatorForSupervisedDataset(object):
method __call__ (line 868) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
function make_supervised_data_module (line 897) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
function train (line 909) | def train(attn_implementation=None):
FILE: llava/train/train_qwen.py
function rank0_print (line 45) | def rank0_print(*args):
class ModelArguments (line 54) | class ModelArguments:
class DataArguments (line 74) | class DataArguments:
class TrainingArguments (line 87) | class TrainingArguments(transformers.TrainingArguments):
function maybe_zero_3 (line 123) | def maybe_zero_3(param, ignore_status=False, name=None):
function get_peft_state_maybe_zero_3 (line 138) | def get_peft_state_maybe_zero_3(named_params, bias):
function get_peft_state_non_lora_maybe_zero_3 (line 163) | def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only...
function get_mm_adapter_state_maybe_zero_3 (line 171) | def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
function find_all_linear_names (line 177) | def find_all_linear_names(model):
function safe_save_model_for_hf_trainer (line 193) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer,
function smart_tokenizer_and_embedding_resize (line 232) | def smart_tokenizer_and_embedding_resize(
function _tokenize_fn (line 257) | def _tokenize_fn(strings: Sequence[str],
function _mask_targets (line 284) | def _mask_targets(target, tokenized_lens, speakers):
function _add_speaker_and_signal (line 295) | def _add_speaker_and_signal(header, source, get_conversation=True):
function preprocess_multimodal (line 316) | def preprocess_multimodal(
function preprocess_llama_2 (line 340) | def preprocess_llama_2(
function preprocess_qwen_2 (line 527) | def preprocess_qwen_2(
function preprocess_v1 (line 637) | def preprocess_v1(
function preprocess_mpt (line 723) | def preprocess_mpt(
function preprocess_plain (line 811) | def preprocess_plain(
function preprocess (line 833) | def preprocess(
class LazySupervisedDataset (line 891) | class LazySupervisedDataset(Dataset):
method __init__ (line 894) | def __init__(self, data_path: List[str],
method __len__ (line 909) | def __len__(self):
method lengths (line 913) | def lengths(self):
method modality_lengths (line 921) | def modality_lengths(self):
method get_sample (line 929) | def get_sample(self, i) -> Dict[str, torch.Tensor]:
method __getitem__ (line 985) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
class DataCollatorForSupervisedDataset (line 995) | class DataCollatorForSupervisedDataset(object):
method __call__ (line 1000) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
function make_supervised_data_module (line 1029) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
function train (line 1041) | def train(attn_implementation=None):
FILE: llava/utils.py
function build_logger (line 17) | def build_logger(logger_name, logger_filename):
class StreamToLogger (line 60) | class StreamToLogger(object):
method __init__ (line 65) | def __init__(self, logger, log_level=logging.INFO):
method __getattr__ (line 71) | def __getattr__(self, attr):
method write (line 74) | def write(self, buf):
method flush (line 88) | def flush(self):
function disable_torch_init (line 94) | def disable_torch_init():
function violates_moderation (line 103) | def violates_moderation(text):
function pretty_print_semaphore (line 124) | def pretty_print_semaphore(semaphore):
FILE: model_export/export_vision_encoder.py
function export (line 19) | def export(args):
FILE: predict.py
function predict (line 18) | def predict(args):
Condensed preview: 74 files, each showing path, character count, and a content snippet (full structured content: 545K chars).
[
{
"path": ".gitignore",
"chars": 1914,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": "ACKNOWLEDGEMENTS",
"chars": 16193,
"preview": "Acknowledgements\nPortions of this Software may utilize the following copyrighted\nmaterial, the use of which is hereby ac"
},
{
"path": "CODE_OF_CONDUCT.md",
"chars": 3357,
"preview": "# Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, we as\ncontributors and"
},
{
"path": "CONTRIBUTING.md",
"chars": 748,
"preview": "# Contribution Guide\n\nThanks for your interest in contributing. This project was released to accompany a research paper "
},
{
"path": "LICENSE",
"chars": 2760,
"preview": "Copyright (C) 2025 Apple Inc. All Rights Reserved.\n\nIMPORTANT: This Apple software is supplied to you by Apple\nInc. (\"A"
},
{
"path": "LICENSE_MODEL",
"chars": 5791,
"preview": "Disclaimer: IMPORTANT: This Apple Machine Learning Research Model is\nspecifically developed and released by Apple Inc. ("
},
{
"path": "README.md",
"chars": 5107,
"preview": "# FastVLM: Efficient Vision Encoding for Vision Language Models\n\nThis is the official repository of\n**[FastVLM: Efficien"
},
{
"path": "app/Configuration/Build.xcconfig",
"chars": 480,
"preview": "// The `DISAMBIGUATOR` configuration is to make it easier to build\n// and run a sample code project. Once you set your p"
},
{
"path": "app/FastVLM/FastVLM.h",
"chars": 170,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\n#ifndef Fas"
},
{
"path": "app/FastVLM/FastVLM.swift",
"chars": 24185,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport Core"
},
{
"path": "app/FastVLM/MediaProcessingExtensions.swift",
"chars": 6600,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport Acce"
},
{
"path": "app/FastVLM App/Assets.xcassets/AccentColor.colorset/Contents.json",
"chars": 123,
"preview": "{\n \"colors\" : [\n {\n \"idiom\" : \"universal\"\n }\n ],\n \"info\" : {\n \"author\" : \"xcode\",\n \"version\" : 1\n }"
},
{
"path": "app/FastVLM App/Assets.xcassets/AppIcon.appiconset/Contents.json",
"chars": 1702,
"preview": "{\n \"images\" : [\n {\n \"filename\" : \"FastVLM - 150 Blue - Light@2x.png\",\n \"idiom\" : \"universal\",\n \"platf"
},
{
"path": "app/FastVLM App/Assets.xcassets/Contents.json",
"chars": 63,
"preview": "{\n \"info\" : {\n \"author\" : \"xcode\",\n \"version\" : 1\n }\n}\n"
},
{
"path": "app/FastVLM App/ContentView.swift",
"chars": 17593,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport AVFo"
},
{
"path": "app/FastVLM App/FastVLM.entitlements",
"chars": 490,
"preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/P"
},
{
"path": "app/FastVLM App/FastVLMApp.swift",
"chars": 249,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport Swif"
},
{
"path": "app/FastVLM App/FastVLMModel.swift",
"chars": 6073,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport Core"
},
{
"path": "app/FastVLM App/Info.plist",
"chars": 181,
"preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/P"
},
{
"path": "app/FastVLM App/InfoView.swift",
"chars": 2810,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport Foun"
},
{
"path": "app/FastVLM App/Preview Content/Preview Assets.xcassets/Contents.json",
"chars": 63,
"preview": "{\n \"info\" : {\n \"author\" : \"xcode\",\n \"version\" : 1\n }\n}\n"
},
{
"path": "app/FastVLM.xcodeproj/project.pbxproj",
"chars": 45682,
"preview": "// !$*UTF8*$!\n{\n\tarchiveVersion = 1;\n\tclasses = {\n\t};\n\tobjectVersion = 77;\n\tobjects = {\n\n/* Begin PBXBuildFile section *"
},
{
"path": "app/FastVLM.xcodeproj/xcshareddata/xcschemes/FastVLM App.xcscheme",
"chars": 2907,
"preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Scheme\n LastUpgradeVersion = \"1630\"\n version = \"1.7\">\n <BuildAction\n "
},
{
"path": "app/README.md",
"chars": 2439,
"preview": "# FastVLM\n\nDemonstrates the performance of **FastVLM** models for on-device, visual question answering. \n\n<table>\n<tr>\n "
},
{
"path": "app/Video/CameraController.swift",
"chars": 6133,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport AVFo"
},
{
"path": "app/Video/CameraControlsView.swift",
"chars": 1117,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport AVFo"
},
{
"path": "app/Video/CameraType.swift",
"chars": 213,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport Foun"
},
{
"path": "app/Video/Video.h",
"chars": 326,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\n#import <Fo"
},
{
"path": "app/Video/VideoFrameView.swift",
"chars": 4476,
"preview": "//\n// For licensing see accompanying LICENSE file.\n// Copyright (C) 2025 Apple Inc. All Rights Reserved.\n//\n\nimport AVFo"
},
{
"path": "app/get_pretrained_mlx_model.sh",
"chars": 3634,
"preview": "#!/usr/bin/env bash\n#\n# For licensing see accompanying LICENSE_MODEL file.\n# Copyright (C) 2025 Apple Inc. All Rights Re"
},
{
"path": "get_models.sh",
"chars": 1246,
"preview": "#!/usr/bin/env bash\n#\n# For licensing see accompanying LICENSE_MODEL file.\n# Copyright (C) 2025 Apple Inc. All Rights Re"
},
{
"path": "llava/__init__.py",
"chars": 64,
"preview": "from .model import LlavaLlamaForCausalLM, LlavaQwen2ForCausalLM\n"
},
{
"path": "llava/constants.py",
"chars": 335,
"preview": "CONTROLLER_HEART_BEAT_EXPIRATION = 30\nWORKER_HEART_BEAT_INTERVAL = 15\n\nLOGDIR = \".\"\n\n# Model Constants\nIGNORE_INDEX = -1"
},
{
"path": "llava/conversation.py",
"chars": 17788,
"preview": "import dataclasses\nfrom enum import auto, Enum\nfrom typing import List, Tuple\nimport base64\nfrom io import BytesIO\nfrom "
},
{
"path": "llava/mm_utils.py",
"chars": 9596,
"preview": "import PIL\nfrom PIL import Image\nPIL.Image.MAX_IMAGE_PIXELS=500000000\nfrom io import BytesIO\nimport base64\nimport torch\n"
},
{
"path": "llava/model/__init__.py",
"chars": 338,
"preview": "# try:\nfrom .language_model.llava_llama import LlavaLlamaForCausalLM, LlavaConfig\nfrom .language_model.llava_mpt import "
},
{
"path": "llava/model/apply_delta.py",
"chars": 1956,
"preview": "\"\"\"\nUsage:\npython3 -m fastchat.model.apply_delta --base ~/model_weights/llama-7b --target ~/model_weights/vicuna-7b --de"
},
{
"path": "llava/model/builder.py",
"chars": 9395,
"preview": "# Copyright 2023 Haotian Liu\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "llava/model/consolidate.py",
"chars": 914,
"preview": "\"\"\"\nUsage:\npython3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate\n"
},
{
"path": "llava/model/language_model/llava_llama.py",
"chars": 5407,
"preview": "# Copyright 2023 Haotian Liu\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "llava/model/language_model/llava_mistral.py",
"chars": 5386,
"preview": "# Copyright 2023 Haotian Liu\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "llava/model/language_model/llava_mpt.py",
"chars": 3498,
"preview": "# Copyright 2023 Haotian Liu\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "llava/model/language_model/llava_qwen.py",
"chars": 5384,
"preview": "\n# Copyright 2023 Haotian Liu\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not"
},
{
"path": "llava/model/llava_arch.py",
"chars": 18545,
"preview": "# Copyright 2023 Haotian Liu\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "llava/model/make_delta.py",
"chars": 2257,
"preview": "\"\"\"\nUsage:\npython3 -m llava.model.make_delta --base ~/model_weights/llama-7b --target ~/model_weights/llava-7b --delta ~"
},
{
"path": "llava/model/multimodal_encoder/builder.py",
"chars": 934,
"preview": "import os\nfrom .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2\nfrom .mobileclip_encoder import MobileCLIPVisionT"
},
{
"path": "llava/model/multimodal_encoder/clip_encoder.py",
"chars": 6702,
"preview": "import torch\nimport torch.nn as nn\n\nfrom transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig\n\n\ncla"
},
{
"path": "llava/model/multimodal_encoder/mobileclip/__init__.py",
"chars": 2967,
"preview": "#\n# For licensing see accompanying LICENSE file.\n# Copyright (C) 2025 Apple Inc. All Rights Reserved.\n#\nimport os\nimport"
},
{
"path": "llava/model/multimodal_encoder/mobileclip/configs/mobileclip_l.json",
"chars": 483,
"preview": "{\n \"embed_dim\": 768,\n \"image_cfg\": {\n \"image_size\": 1024,\n \"model_name\": \"fastvithd\",\n \"embed"
},
{
"path": "llava/model/multimodal_encoder/mobileclip/mci.py",
"chars": 49913,
"preview": "#\n# For licensing see accompanying LICENSE file.\n# Copyright (C) 2025 Apple Inc. All Rights Reserved.\n#\nimport copy\nfrom"
},
{
"path": "llava/model/multimodal_encoder/mobileclip_encoder.py",
"chars": 4345,
"preview": "#\n# For licensing see accompanying LICENSE file.\n# Copyright (C) 2025 Apple Inc. All Rights Reserved.\n#\nimport torch\nimp"
},
{
"path": "llava/model/multimodal_projector/builder.py",
"chars": 1040,
"preview": "import torch.nn as nn\nimport re\n\n\nclass IdentityMap(nn.Module):\n def __init__(self):\n super().__init__()\n\n "
},
{
"path": "llava/model/utils.py",
"chars": 927,
"preview": "from transformers import AutoConfig\n\n\ndef auto_upgrade(config):\n cfg = AutoConfig.from_pretrained(config)\n if 'lla"
},
{
"path": "llava/serve/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "llava/serve/cli.py",
"chars": 4710,
"preview": "import argparse\nimport torch\n\nfrom llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN"
},
{
"path": "llava/serve/controller.py",
"chars": 10006,
"preview": "\"\"\"\nA controller manages distributed workers.\nIt sends worker addresses to clients.\n\"\"\"\nimport argparse\nimport asyncio\ni"
},
{
"path": "llava/serve/gradio_web_server.py",
"chars": 18911,
"preview": "import argparse\nimport datetime\nimport json\nimport os\nimport time\n\nimport gradio as gr\nimport requests\n\nfrom llava.conve"
},
{
"path": "llava/serve/model_worker.py",
"chars": 11229,
"preview": "\"\"\"\nA model worker executes the model.\n\"\"\"\nimport argparse\nimport asyncio\nimport json\nimport time\nimport threading\nimpor"
},
{
"path": "llava/serve/register_worker.py",
"chars": 734,
"preview": "\"\"\"\nManually register workers.\n\nUsage:\npython3 -m fastchat.serve.register_worker --controller http://localhost:21001 --w"
},
{
"path": "llava/serve/sglang_worker.py",
"chars": 8731,
"preview": "\"\"\"\nA model worker executes the model.\n\"\"\"\nimport argparse\nimport asyncio\nfrom concurrent.futures import ThreadPoolExecu"
},
{
"path": "llava/serve/test_message.py",
"chars": 2046,
"preview": "import argparse\nimport json\n\nimport requests\n\nfrom llava.conversation import default_conversation\n\n\ndef main():\n if a"
},
{
"path": "llava/train/llama_flash_attn_monkey_patch.py",
"chars": 4404,
"preview": "from typing import Optional, Tuple\nimport warnings\n\nimport torch\n\nimport transformers\nfrom transformers.models.llama.mod"
},
{
"path": "llava/train/llama_xformers_attn_monkey_patch.py",
"chars": 4916,
"preview": "\"\"\"\nDirectly copied the code from https://raw.githubusercontent.com/oobabooga/text-generation-webui/main/modules/llama_a"
},
{
"path": "llava/train/llava_trainer.py",
"chars": 12187,
"preview": "import os\nimport torch\nimport torch.nn as nn\n\nfrom torch.utils.data import Sampler\n\nimport transformers\nfrom transformer"
},
{
"path": "llava/train/train.py",
"chars": 44188,
"preview": "# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:\n# Adopted from tatsu-lab@stanford_al"
},
{
"path": "llava/train/train_mem.py",
"chars": 120,
"preview": "from llava.train.train_qwen import train\n\nif __name__ == \"__main__\":\n train(attn_implementation=\"flash_attention_2\")\n"
},
{
"path": "llava/train/train_qwen.py",
"chars": 48491,
"preview": "# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:\n# Adopted from tatsu-lab@stanford_al"
},
{
"path": "llava/train/train_xformers.py",
"chars": 366,
"preview": "# Make it more memory efficient by monkey patching the LLaMA model with xformers attention.\n\n# Need to call this before "
},
{
"path": "llava/utils.py",
"chars": 4004,
"preview": "import datetime\nimport logging\nimport logging.handlers\nimport os\nimport sys\n\nimport requests\n\nfrom llava.constants impor"
},
{
"path": "model_export/README.md",
"chars": 2278,
"preview": "# Model Export for inference on Apple Silicon\nDisclaimer: this is not an official recommendation, just research and expl"
},
{
"path": "model_export/export_vision_encoder.py",
"chars": 4209,
"preview": "#\n# For licensing see accompanying LICENSE file.\n# Copyright (C) 2025 Apple Inc. All Rights Reserved.\n#\nimport os\nimport"
},
{
"path": "model_export/fastvlm_mlx-vlm.patch",
"chars": 19014,
"preview": "diff --git a/mlx_vlm/convert.py b/mlx_vlm/convert.py\nindex 5952a88..335e9db 100644\n--- a/mlx_vlm/convert.py\n+++ b/mlx_vl"
},
{
"path": "predict.py",
"chars": 3445,
"preview": "#\n# Modified from LLaVA/predict.py\n# Please see ACKNOWLEDGEMENTS for details about LICENSE\n#\nimport os\nimport argparse\n\n"
},
{
"path": "pyproject.toml",
"chars": 1137,
"preview": "[build-system]\nrequires = [\"setuptools>=61.0\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"llava\"\nversion"
}
]
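The preview array above is plain JSON, so it can be read with any JSON parser. The following is a minimal sketch, assuming only the three keys visible in the entries (`path`, `chars`, `preview`). The helper names `load_entries`, `total_chars`, and `truncated_paths` are hypothetical and not part of the repository or of the extraction tool. `SAMPLE` holds three entries copied from the array, not the full 74.

```python
import json

# Three entries copied from the preview array above (the full array has 74).
# A raw string keeps the JSON escapes (\n, \") intact for json.loads.
SAMPLE = r'''
[
  {"path": "llava/serve/__init__.py", "chars": 0, "preview": ""},
  {"path": "llava/__init__.py", "chars": 64,
   "preview": "from .model import LlavaLlamaForCausalLM, LlavaQwen2ForCausalLM\n"},
  {"path": "llava/constants.py", "chars": 335,
   "preview": "CONTROLLER_HEART_BEAT_EXPIRATION = 30\nWORKER_HEART_BEAT_INTERVAL = 15\n\nLOGDIR = \".\"\n\n# Model Constants\nIGNORE_INDEX = -1"}
]
'''


def load_entries(text):
    """Parse the preview array into a list of dicts (hypothetical helper)."""
    return json.loads(text)


def total_chars(entries):
    """Sum the recorded character counts of all listed files."""
    return sum(entry["chars"] for entry in entries)


def truncated_paths(entries):
    """Return paths whose snippet is shorter than the recorded file length."""
    return [e["path"] for e in entries if len(e["preview"]) < e["chars"]]


entries = load_entries(SAMPLE)
print(total_chars(entries))
print(truncated_paths(entries))
```

Comparing `len(preview)` with `chars` is one way to tell which files the preview shows in full and which are cut off. It relies on the assumption that `chars` counts the characters of the original file, which holds for the short files sampled here.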
About this extraction
This page contains the full source code of the apple/ml-fastvlm GitHub repository, extracted and formatted as plain text. The extraction includes 74 files (506.0 KB), approximately 122.8k tokens, and a symbol index with 335 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.