Full Code of rednote-hilab/dots.vlm1 for AI

Repository: rednote-hilab/dots.vlm1
Branch: main
Commit: 528d7b5eb400
Files: 6
Total size: 86.9 KB

Directory structure:
gitextract_k7195w42/
├── LICENSE
├── NOTICE
├── README.md
├── assets/
│   └── blog.md
└── docker/
    ├── Dockerfile
    └── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
dots.vlm1 LICENSE AGREEMENT

Effective Date: [August 16, 2025]

Copyright Holder: [Xingyin Information Technology (Shanghai) Co., Ltd]

This License Agreement (“Agreement”) governs Your use, reproduction, modification, and distribution of dots.vlm1 (the "Model Materials"). This Agreement is designed to maximize the openness and use of the Model Materials while addressing the unique legal, ethical, and technical challenges posed by visual-language models.

WHEREAS, Licensor has developed the dots.vlm1 visual-Language Models and intends to distribute the Model Materials under an open‑source framework;
WHEREAS, traditional open-source licenses (e.g., the MIT License) may not fully address the complexities of Visual-Language Models, namely their multiple components (code, weights, training data), potential ethical risks, data‑governance issues, and intellectual‑property and liability questions regarding AI‑generated content;
WHEREAS, Licensor seeks to provide a legal framework that ensures maximum access to and use of the Model Materials while clearly defining the rights, obligations, and liabilities of Licensee;

THEREFORE, the parties agree that, subject to the MIT License, they shall be bound by the following terms and conditions:

1. Definitions and Interpretation
Purpose: To define key terms used in this Agreement, particularly "Model Materials," ensuring clarity of the license scope beyond traditional software code. To clarify the order of precedence between this Agreement and the MIT License to avoid conflict.

1.1 “Licensor” shall mean the entity providing the Model Materials under this Agreement, namely [Xingyin Information Technology (Shanghai) Co., Ltd].

1.2 “Licensee” or "You" shall mean any individual or entity exercising permissions granted by this Agreement.

1.3 “Model Materials” shall mean all materials provided by Licensor under this Agreement, including but not limited to:
        (a) one or more machine‑learning models, including architecture and trained parameters (i.e., model weights);
        (b) all associated preprocessing, training, inference, and fine‑tuning code;
        (c) training datasets and evaluation scripts (or their detailed descriptions and access mechanisms); and
        (d) any accompanying documentation, metadata, and tools.
The above Model Materials shall be subject to the content published on the Licensor’s website or GitHub repository at https://github.com/rednote-hilab/dots.vlm1.

1.4 “Outputs” shall mean any content generated through the use of the Model Materials, including but not limited to text descriptions, image labels, image-text matching results, etc.

1.5 “MIT License” shall mean The MIT Open Source License published by the Massachusetts Institute of Technology.

1.6   Priority of Agreement. In the event of any conflict or inconsistency between this Agreement and the MIT License, the terms of the MIT License shall prevail. However, if the terms of the MIT License are ambiguous or silent on a particular matter, the provisions of this Agreement shall apply and supplement the MIT License.

2. Grant of Rights and Scope of Use
Purpose: To grant broad, permissive rights to the Licensee for the Model Materials—including code, weights, data, and documentation—to ensure maximum openness and flexibility while clarifying the free use of model-generated content. Additionally, it clarifies the feasibility of transitioning from open-source to commercial‑use.

2.1   Grant of Copyright License. Subject to Licensee's compliance with this Agreement, Licensor hereby grants Licensee a perpetual, worldwide, non‑exclusive, no-charge, royalty‑free copyright license to use (run or test), reproduce, modify, create derivative works of, merge, publish, and distribute the Model Materials, copies of the Model Materials, or any derivative works thereof; and to incorporate the unmodified or modified Model Materials into proprietary products or services, including for commercial purposes.

2.2   Fundamental Capabilities. The Model Materials only provide the fundamental model’s capabilities. Licensees may develop derivative AI applications or undertake task‑specific training thereon.

2.3   From Open Source to Commercial Use. The open-source release does not preclude Licensor’s commercial exploitation of the Model Materials, in whole or in part. Any such commercial use shall, at that time, be subject to license agreements between Licensor and applicable users.

3. Acceptable Use Policy and Prohibited Uses

3.1   Responsible Use. Licensee must use the Model Materials in a responsible, ethical, and lawful manner, in compliance with all applicable laws, regulations, industry standards, and best practices.

3.2   Enterprise On‑Premises Deployment. The Licensee may deploy the Model Materials in closed‑source, on‑premises enterprise environments.

3.3   Prohibited Uses. Any breach of the prohibitions below will result in the automatic termination of all licenses granted under this Agreement. Licensee agrees not to use the Model Materials or any derivative works thereof, in connection with:
(a) Prohibition of Illegal Content Identification and Description: It is strictly prohibited to use image recognition technology to analyze images of fraudulent or forged documents, tickets, seals, etc., and generate text descriptions that make them appear legitimate or actionable (e.g., outputting misleading conclusions such as "the anti-counterfeit features of this ticket are intact"). It is also prohibited to generate technical guidance, beautification, or censorship-evading text descriptions for images related to violent crimes, child exploitation, or terrorist activities.
(b) Prohibition of Privacy Invasion and Discrimination: Without legal authorization and explicit consent from the concerned parties, image recognition must not be used to extract personal biometric features (such as facial features, fingerprints), identification numbers, medical records, or sensitive location information, and output full details (e.g., "the person in the image resides at XX street") or associative conclusions (e.g., "this face matches a wanted criminal in the database"). It is also prohibited to generate text reports for automated discriminatory decisions based on characteristics like race or gender observed in images (e.g., "Based on clothing, credit rating is likely low").
(c) Prohibition of Undermining Information Authenticity and Social Trust: It is strictly forbidden to generate text descriptions that cover up traces of image forgery. It is also prohibited to generate inflammatory text labels for violent or insulting images (e.g., describing a fight image as "just punishment"). In news, judicial, or similar fields, it is prohibited to output definitive conclusions without factual basis for suspicious images.
(d) Prohibition of Intellectual Property Infringement: It is prohibited to generate text descriptions that assist in the distribution of infringing images (e.g., "highly detailed reproduction of artwork techniques") for uploaded infringing images.

4. Intellectual Property Ownership and Contributions
4.1   Licensor's Copyright Reservation. Licensor reserves all right, title, and interest in and to the Model Materials (including the model architecture, parameters, code, and original training data), except as expressly licensed herein. The original copyright of the Model Materials belongs to the Licensor.

4.2   Patent License. Subject to the terms and conditions of this Agreement, Licensor hereby grants Licensee a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model Materials, where such license applies only to those patent claims licensable by the Licensor that are necessarily infringed by its contribution(s). 
If Licensee institutes patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model Materials constitute direct or contributory patent infringement, then any patent licenses granted under this License for the Model Materials shall terminate as of the date such litigation is asserted or filed.

4.3   Outputs. The intellectual property rights and related rights of any images uploaded by the user always remain with the original rights holder. This agreement does not transfer or grant any rights or licenses in this regard. The text output generated by the model is a machine interpretation of the input image (including object recognition, scene analysis, attribute description, etc.). As it is essentially an automated expression of objective facts, it does not constitute an original work protected by copyright law and therefore does not create new intellectual property ownership issues. The licensee and end users are free to use such descriptive text.
The licensor makes no guarantees regarding whether the output text infringes third-party rights (including copyrights, privacy, trademarks). The licensee should independently assess legal risks, especially when user-uploaded content contains material with a high potential for infringement (e.g., movie screenshots, artwork).
4.4   Trademarks. Nothing in this License permits Licensee to make use of Licensor’s trademarks, trade names, logos (e.g., “rednote,” “Xiaohongshu,” “dots.vlm1,” “hilab”) or to otherwise suggest endorsement or misrepresent the relationship between the parties, unless Licensor’s prior written approval is granted.

5. Data Governance, Privacy, and Security
5.1   Training‑Data Usage. Licensee acknowledges and agrees that user inputs and Outputs may be used by the Licensor in further training to improve the Model Materials and enhance user experience. Licensor assumes no confidentiality obligation with respect to such data.

5.2   Data Quality and Bias. Licensee shall use training data from lawful sources and is encouraged to conduct due diligence before deploying the Model Materials and to take reasonable steps to mitigate any known biases in its training data or applications.

5.3   Privacy Protection.
        (a) Sensitive‑Data Restrictions. It is prohibited to use the Model Materials to process, generate, or infer sensitive personal data protected under specific laws (such as GDPR or HIPAA), particularly when dealing with documents containing personally identifiable information (such as ID numbers, health data, financial information, etc.), unless Licensee has obtained all necessary consents, lawful basis, or authorizations, and has implemented adequate anonymization, pseudonymization, or other privacy-enhancing technologies.
        (b) Data Minimization and Purpose Limitation. The Licensee shall follow the principle of data minimization when using the Visual-Language Model, processing only the user data necessary for specific, explicit, and lawful purposes. 
        (c) Transparency. Licensee shall provide clear and transparent privacy policies and terms of use when processing user data, clearly disclose that the images uploaded by the user will be analyzed (including objects, text, and biometric features) and generate text descriptions, ensuring that end users are informed about how their data is used, stored, and protected.

5.4   Security Measures. Licensee shall implement appropriate technical and administrative safeguards to protect the Model Materials and any associated data against unauthorized access, disclosure, alteration, or destruction. Such measures may include, but are not limited to, encryption, access controls, logging, and audit trails.

5.5   Further Training. Licensee may only use user‑provided input or Outputs for training, fine-tuning, or improving other AI models if it has obtained the specific and informed consent of data subjects.

6. Disclaimer of Warranty and Limitation of Liability

6.1 “AS IS” Basis. Unless required by applicable law, the Model Materials are provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Licensee is solely responsible for determining the appropriateness of using or redistributing the Model Materials and assumes any risks associated with the exercise of permissions under this License. Licensor does not provide any warranty of non-infringement but represents that no infringing code has been knowingly included.

6.2   Outputs Disclaimer. As a neutral technology, Licensor disclaims all liability for the accuracy, completeness, reliability, safety, legality, or suitability of any Outputs. The Licensee is solely responsible for verifying the accuracy and appropriateness of AI-generated content and shall provide appropriate disclosures when publishing or relying upon such content.

6.3   Limitation of Liability and Recourse. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall Licensor or contributors be liable for any claims, damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model Materials (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Licensor has been advised of the possibility of such damages. If such losses are incurred, recourse may be sought against the Licensee responsible for causing the loss.

6.4   Content‑Filtering Disclaimer. Although the Model Materials may include content‑filtering mechanisms, Licensor makes no warranties of any kind regarding the stability, quality, accuracy, completeness, or any specific outcome of Outputs. Licensee is solely responsible for reviewing, verifying, and performing quality control on Outputs and assumes all associated risks and liabilities.

7. Attribution and License Reservation

7.1   License. When distributing or redistributing the Model Materials, Licensee must give any other recipients of the Model Materials a copy of this Agreement.

7.2   Copyright and Notices. When distributing any part of the Model Materials, Licensee must retain all copyright, patent, trademark, and attribution notices included in the Model Materials.

7.3   Attribution. Licensee is encouraged to prominently display the name of Licensor and the Model Materials in any public statements, products, or services that contain the Model Materials (or any derivative works thereof), to promote transparency and community trust. If Licensee distributes modified weights or fine‑tuned models based on the Model Materials, Licensee must prominently display the following statement in the related website or documentation: “Built with dots.vlm1.”

8. Governing Law and Dispute Resolution

8.1   Governing Law. This Agreement shall be governed by and construed in accordance with the laws of the People’s Republic of China, without regard to its conflict of laws principles.

8.2   Dispute Resolution. Any dispute, claim, or disagreement arising out of or relating to this Agreement shall first be resolved through amicable consultation. If such consultation fails, the dispute shall be submitted to the Hangzhou Arbitration Commission for arbitration. The arbitration shall be conducted in accordance with the laws of China, and the place of arbitration shall be [Hangzhou, China]. The arbitral award shall be final and binding upon both parties.

9. Regulatory Compliance Amendments
In the event that any part of this Agreement becomes invalid or requires adjustment due to changes in applicable laws or regulations, Licensor reserves the right to issue a revised version of this Agreement. Licensee shall migrate to the new version within [e.g., ninety (90)] days of its release; otherwise, all rights granted under this Agreement shall automatically terminate.

10. Security Reporting
Licensee discovering any security vulnerability in the Model Materials may report it to Licensor via: dots_feedback@xiaohongshu.com. Licensee shall not disclose vulnerability details until Licensor issues an official remediation, unless otherwise required by law.


================================================
FILE: NOTICE
================================================
Copyright Notice and License Texts
===============================================================================
Project: gradio
Copyright notice: No copyright info provided

License: Apache 2.0

Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

    "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

    "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

    "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

    "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

    "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

    "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

    "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

    "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

   "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

    "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

    (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

    (b) You must cause any modified files to carry prominent notices stating that You changed the files; and

    (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

    (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

    You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

    To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

=============================================================================
Project: DeepSeek-V3
Copyright notice: Copyright (c) 2023 DeepSeek
License: DEEPSEEK LICENSE AGREEMENT
DEEPSEEK LICENSE AGREEMENT

Version 1.0, 23 October 2023

Copyright (c) 2023 DeepSeek

Section I: PREAMBLE 

Large generative models are being widely adopted and used, and have the potential to transform the way individuals conceive and benefit from AI or ML technologies. 

Notwithstanding the current and potential benefits that these artifacts can bring to society at large, there are also concerns about potential misuses of them, either due to their technical limitations or ethical considerations. 

In short, this license strives for both the open and responsible downstream use of the accompanying model. When it comes to the open character, we took inspiration from open source permissive licenses regarding the grant of IP rights. Referring to the downstream responsible use, we added use-based restrictions not permitting the use of the model in very specific scenarios, in order for the licensor to be able to enforce the license in case potential misuses of the Model may occur. At the same time, we strive to promote open and responsible research on generative models for content generation. 

Even though downstream derivative versions of the model could be released under different licensing terms, the latter will always have to include - at minimum - the same use-based restrictions as the ones in the original license (this license). We believe in the intersection between open and responsible AI development; thus, this agreement aims to strike a balance between both in order to enable responsible open-science in the field of AI. 

This License governs the use of the model (and its derivatives) and is informed by the model card associated with the model. 

NOW THEREFORE, You and DeepSeek agree as follows: 

1. Definitions 
"License" means the terms and conditions for use, reproduction, and Distribution as defined in this document. 
"Data" means a collection of information and/or content extracted from the dataset used with the Model, including to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under this License.
"Output" means the results of operating a Model as embodied in informational content resulting therefrom. 
"Model" means any accompanying machine-learning based assemblies (including checkpoints), consisting of learnt weights, parameters (including optimizer states), corresponding to the model architecture as embodied in the Complementary Material, that have been trained or tuned, in whole or in part on the Data, using the Complementary Material. 
"Derivatives of the Model" means all modifications to the Model, works based on the Model, or any other model which is created or initialized by transfer of patterns of the weights, parameters, activations or output of the Model, to the other model, in order to cause the other model to perform similarly to the Model, including - but not limited to - distillation methods entailing the use of intermediate data representations or methods based on the generation of synthetic data by the Model for training the other model. 
"Complementary Material" means the accompanying source code and scripts used to define, run, load, benchmark or evaluate the Model, and used to prepare data for training or evaluation, if any. This includes any accompanying documentation, tutorials, examples, etc, if any. 
"Distribution" means any transmission, reproduction, publication or other sharing of the Model or Derivatives of the Model to a third party, including providing the Model as a hosted service made available by electronic or other remote means - e.g. API-based or web access. 
"DeepSeek" (or "we") means Beijing DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd., Hangzhou DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd. and/or any of their affiliates.
"You" (or "Your") means an individual or Legal Entity exercising permissions granted by this License and/or making use of the Model for whichever purpose and in any field of use, including usage of the Model in an end-use application - e.g. chatbot, translator, etc. 
"Third Parties" means individuals or legal entities that are not under common control with DeepSeek or You. 

Section II: INTELLECTUAL PROPERTY RIGHTS 

Both copyright and patent grants apply to the Model, Derivatives of the Model and Complementary Material. The Model and Derivatives of the Model are subject to additional terms as described in Section III. 

2. Grant of Copyright License. Subject to the terms and conditions of this License, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model. 

3. Grant of Patent License. Subject to the terms and conditions of this License and where and as applicable, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model and the Complementary Material, where such license applies only to those patent claims licensable by DeepSeek that are necessarily infringed by its contribution(s). If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model and/or Complementary Material constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for the Model and/or works shall terminate as of the date such litigation is asserted or filed. 


Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION

4. Distribution and Redistribution. You may host for Third Party remote access purposes (e.g. software-as-a-service), reproduce and distribute copies of the Model or Derivatives of the Model thereof in any medium, with or without modifications, provided that You meet the following conditions: 
a. Use-based restrictions as referenced in paragraph 5 MUST be included as an enforceable provision by You in any type of legal agreement (e.g. a license) governing the use and/or distribution of the Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to, that the Model or Derivatives of the Model are subject to paragraph 5. This provision does not apply to the use of Complementary Material. 
b. You must give any Third Party recipients of the Model or Derivatives of the Model a copy of this License; 
c. You must cause any modified files to carry prominent notices stating that You changed the files; 
d. You must retain all copyright, patent, trademark, and attribution notices excluding those notices that do not pertain to any part of the Model, Derivatives of the Model. 
e. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions - respecting paragraph 4.a. – for use, reproduction, or Distribution of Your modifications, or for any such Derivatives of the Model as a whole, provided Your use, reproduction, and Distribution of the Model otherwise complies with the conditions stated in this License. 

5. Use-based restrictions. The restrictions set forth in Attachment A are considered Use-based restrictions. Therefore You cannot use the Model and the Derivatives of the Model for the specified restricted uses. You may use the Model subject to this License, including only for lawful purposes and in accordance with the License. Use may include creating any content with, finetuning, updating, running, training, evaluating and/or reparametrizing the Model. You shall require all of Your users who use the Model or a Derivative of the Model to comply with the terms of this paragraph (paragraph 5). 

6. The Output You Generate. Except as set forth herein, DeepSeek claims no rights in the Output You generate using the Model. You are accountable for the Output you generate and its subsequent uses. No use of the output can contravene any provision as stated in the License.

Section IV: OTHER PROVISIONS 

7. Updates and Runtime Restrictions. To the maximum extent permitted by law, DeepSeek reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License. 

8. Trademarks and related. Nothing in this License permits You to make use of DeepSeek’ trademarks, trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between the parties; and any rights not expressly granted herein are reserved by DeepSeek. 

9. Personal information, IP rights and related. This Model may contain personal information and works with IP rights. You commit to complying with applicable laws and regulations in the handling of personal information and the use of such works. Please note that DeepSeek's license granted to you to use the Model does not imply that you have obtained a legitimate basis for processing the related information or works. As an independent personal information processor and IP rights user, you need to ensure full compliance with relevant legal and regulatory requirements when handling personal information and works with IP rights that may be contained in the Model, and are willing to assume solely any risks and consequences that may arise from that.

10. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, DeepSeek provides the Model and the Complementary Material on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Model, Derivatives of the Model, and the Complementary Material and assume any risks associated with Your exercise of permissions under this License. 

11. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall DeepSeek be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model and the Complementary Material (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if DeepSeek has been advised of the possibility of such damages. 

12. Accepting Warranty or Additional Liability. While redistributing the Model, Derivatives of the Model and the Complementary Material thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of DeepSeek, and only if You agree to indemnify, defend, and hold DeepSeek harmless for any liability incurred by, or claims asserted against, DeepSeek by reason of your accepting any such warranty or additional liability. 

13. If any provision of this License is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.

14. Governing Law and Jurisdiction. This agreement will be governed and construed under PRC laws without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this agreement. The courts located in the domicile of Hangzhou DeepSeek Artificial Intelligence Fundamental Technology Research Co., Ltd. shall have exclusive jurisdiction of any dispute arising out of this agreement.

END OF TERMS AND CONDITIONS

Attachment A

Use Restrictions

You agree not to use the Model or Derivatives of the Model:

-	In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party; 
-	For military use in any way;
-	For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; 
-	To generate or disseminate verifiably false information and/or content with the purpose of harming others; 
-	To generate or disseminate inappropriate content subject to applicable regulatory requirements;
-	To generate or disseminate personal identifiable information without due authorization or for unreasonable use; 
-	To defame, disparage or otherwise harass others; 
-	For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation; 
-	For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics; 
-	To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm; 
-	For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

=================================================================
Project: Qwen2.5-VL
Copyright notice: No copyright info provided

License: Apache 2.0

Please see above.

=================================================================
Project: aim
Copyright notice: Copyright 2019-2022 Aimhub, Inc.

License: Apache 2.0

Please see above.

=================================================================
Project: sglang
Copyright notice: Copyright 2023-2024 SGLang Team

License: Apache 2.0

Please see above.



================================================
FILE: README.md
================================================
<div align="center">
<p align="center">
    <img src="/assets/logo.png" width="300"/>
</p>



[![Blog](https://img.shields.io/badge/Blog-View_on_GitHub-333.svg?logo=github)](https://github.com/rednote-hilab/dots.vlm1/blob/main/assets/blog.md)
[![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.vlm1.inst)

<div align="center">
  <a href="https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> | 
  <a href="/assets/wechat.png" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> | 
  <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
</div>
</div>

## 1. Introduction

We are excited to introduce **dots.vlm1**, the first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), **dots.vlm1** demonstrates strong multimodal understanding and reasoning capabilities.  

**Model Highlights**:
- **NaViT Vision Encoder**: Trained entirely from scratch rather than fine-tuning an existing vision backbone. It natively supports dynamic resolution and incorporates pure visual supervision in addition to traditional text supervision, thereby enhancing the upper bound of perceptual capacity. Beyond image captioning datasets, a large amount of structured image data was introduced during pretraining to improve the model’s perceptual capabilities—particularly for tasks such as OCR.  
- **Multimodal Training Data**: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web page data with interleaved text and images, significantly improving the quality of the training corpus.


Through large-scale pretraining and carefully tuned post-training, **dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning**, setting a new performance ceiling for open-source vision-language models—while still maintaining competitive capabilities in pure-text tasks.

*Special thanks to the DeepSeek team for the excellent DeepSeek V3 model.*

## 2. Performance

| Category | Benchmark | Qwen2.5VL-72B | **Gemini2.5 Pro** | **Seed-VL1.5 thinking** | dots.vlm1 |
|------|--------|----------------|--------------------|--------------------------|-----------|
| **STEM/Reasoning** | MMMU | 69.3 | **84.22** | 79.89 | <ins>80.11</ins> |
|  | MMMU_pro | 51.91 | **76.5** | 68.9 | <ins>70.11</ins> |
|  | MathVision | 39.4 | **72.34** | 68.77 | <ins>69.64</ins> |
|  | MathVista | 74.6 | 83.5 | **86.1** | <ins>85.0</ins> |
|  | ZeroBench | 2 | **5** | 2 | <ins>4</ins> |
|  | ZeroBench-sub | 20 | **30.24** | 25.75 | <ins>26.65</ins> |
|  | VisuLogic | 25.6 | 29.8 | **35.9** | <ins>32.2</ins> |
| **General Visual** | MMbench-CN | 88.2 | <ins>89</ins> | **89.78** | 88.24 |
|  | MMbench-EN | 89.2 | **89.55** | <ins>89.47</ins> | 89.32 |
|  | MMStar | 71.13 | **78.73** | <ins>78.33</ins> | 76.67 |
|  | RealWorldQA | 75.9 | 78.43 | <ins>78.69</ins> | **79.08** |
|  | Vibe(GPT4o) | 60.13 | **76.39** | 68.59 | <ins>69.24</ins> |
|  | m3gia(cn) | 88.24 | 89.54 | **91.2** | <ins>90.85</ins> |
|  | SimpleVQA_ds | 52.19 | <ins>57.09</ins> | **61.34** | 55.8 |
|  | MMVP | 66 | 67.33 | **73.33** | <ins>72</ins> |
|  | HallusionBench | 56.5 | 63.07 | <ins>63.49</ins> | **64.83** |
|  | CVBench | 84.15 | 85.36 | **89.68** | <ins>85.65</ins> |
|  | Blink | 61.7 | <ins>71.86</ins> | **72.38** | 66.33 |
| **OCR/Doc/Chart** | charxiv(dq) | 88.2 | <ins>90.3</ins> | 89.6 | **92.1** |
|  | charxiv(rq) | 48.5 | **68.3** | 63.4 | <ins>64.4</ins> |
|  | OCRReasoning | 38.02 | **70.81** | 63.42 | <ins>66.23</ins> |
|  | DOCVQA | <ins>96.23</ins> | 95.42 | 93.65 | **96.52** |
|  | ChartQA | 86.1 | 86.16 | <ins>86.88</ins> | **87.68** |
|  | OCRBenchV1 | **87.1** | 86.6 | <ins>86.7</ins> | 82.3 |
|  | AI2D | 88.3 | **91.03** | <ins>89.05</ins> | 88.37 |
| **Grounding/Counting** | RefCOCO | 90.3 | 74.6 | **91.3** | <ins>90.45</ins> |
|  | CountBench | **92.4** | 91.79 | 89 | <ins>91.99</ins> |
| **Multi Image** | muir | 69.38 | 70.5 | **79.77** | <ins>78.58</ins> |
|  | mantis | 79.26 | <ins>84.33</ins> | 82.3 | **86.18** |
| |  | **Deepseek-R1-0528** | **Qwen3-235B-A22B** | **Qwen3-235B-A22B-think-2507**|**dots.vlm1**|
| **Text** | LiveCodeBench | <ins>73.3</ins> | 70.7 | **78.4** | 72.94 |
|  | AIME 2025 | <ins>87.5</ins> | 82.6 | **92.3** | 85.83 |
|  | GPQA | <ins>81</ins> | 70.7 | **81.1** | 72.78 |

## 3. Usage

### Environment Setup

You have two options to set up the environment:

#### Option 1: Using Base Image + Manual Installation
```bash
# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126

# Clone and install our custom SGLang branch
# IMPORTANT: Only our specific SGLang version supports dots.vlm1 models
# NOTE: This installation must be done on EVERY node in your cluster
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
```

#### Option 2: Using Pre-built Image (Recommended)
```bash
# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
```

### Multi-Node Deployment

Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:

**Prerequisites:**
- Model: `rednote-hilab/dots.vlm1.inst`
- Node 1 IP: `10.0.0.1` (master node)
- Node 2 IP: `10.0.0.2` (worker node)

#### Node 1 (Master - rank 0):
```bash
# Recommended: download the model locally first to avoid timeouts during startup
# Use: huggingface-cli download rednote-hilab/dots.vlm1.inst --local-dir ./dots.vlm1.inst
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"  # or local path like ./dots.vlm1.inst
# Get actual IP address: hostname -I | awk '{print $1}' or ip route get 1 | awk '{print $7}'
export MASTER_IP="10.0.0.1"  # Replace with actual master node IP
export API_PORT=15553

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr $MASTER_IP:23456 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port $API_PORT \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
```

#### Node 2 (Worker - rank 1):
```bash
# Use the same variables as defined in Node 1
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"
export MASTER_IP="10.0.0.1"  # Must match Node 1
export API_PORT=15553

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr $MASTER_IP:23456 \
    --nnodes 2 \
    --node-rank 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port $API_PORT \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
```
### Configuration Parameters

Key parameters:
- `--tp 16`: Tensor parallelism across 16 GPUs in total (8 per node in this 2-node setup)
- `--nnodes 2`: Total number of nodes in the cluster
- `--node-rank`: Node identifier (0 for master, 1+ for workers)
- `--context-length 65536`: Maximum context length
- `--quantization fp8`: Use FP8 quantization for efficiency
- `--chat-template dots-vlm`: Use the custom chat template for the dots.vlm1 model
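
As a rough illustration of how these flags fit together, the sketch below assembles the launch command for each node in Python. The helper `build_launch_args` and the constant `COMMON_FLAGS` are hypothetical names, not part of SGLang; the flag values are copied from the two launch commands above, and the only per-node difference is `--node-rank`.

```python
import shlex

# Flags shared by both nodes, copied from the launch commands above.
COMMON_FLAGS = [
    "--tp", "16",
    "--nnodes", "2",
    "--trust-remote-code",
    "--host", "0.0.0.0",
    "--context-length", "65536",
    "--max-running-requests", "64",
    "--disable-radix-cache",
    "--mem-fraction-static", "0.8",
    "--chunked-prefill-size", "-1",
    "--chat-template", "dots-vlm",
    "--cuda-graph-max-bs", "64",
    "--quantization", "fp8",
]


def build_launch_args(node_rank, model_path, master_ip,
                      api_port=15553, init_port=23456):
    """Return the argv list for one node (hypothetical helper)."""
    if node_rank not in (0, 1):
        raise ValueError("this sketch only covers the 2-node example")
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--dist-init-addr", f"{master_ip}:{init_port}",
        "--node-rank", str(node_rank),
        "--port", str(api_port),
        *COMMON_FLAGS,
    ]


if __name__ == "__main__":
    # Print one shell-ready command per node.
    for rank in (0, 1):
        args = build_launch_args(rank, "rednote-hilab/dots.vlm1.inst", "10.0.0.1")
        print(shlex.join(args))
```

Generating both commands from one list keeps the shared flags identical across nodes, which is easy to get wrong when editing two copies by hand.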

### API Usage

Once the servers are launched, you can access the model through the OpenAI-compatible API:
```bash
# Use the same MASTER_IP and API_PORT as defined above
curl -X POST http://$MASTER_IP:$API_PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Please briefly describe this image"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                        }
                    }
                ]
            }
        ],
        "temperature": 0.1,
        "top_p": 0.9,
        "max_tokens": 55000
    }'
```
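
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the function names `build_chat_request` and `send_chat_request` are hypothetical, and the request body simply mirrors the curl example above.

```python
import json
import urllib.request


def build_chat_request(image_url, prompt, max_tokens=55000):
    """Build the JSON body used by the curl example above."""
    return {
        "model": "model",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "temperature": 0.1,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }


def send_chat_request(master_ip, api_port, body, timeout=600):
    """POST the body to /v1/chat/completions; needs a running deployment."""
    request = urllib.request.Request(
        f"http://{master_ip}:{api_port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Assuming the server follows the usual OpenAI chat-completions response shape, the generated text would be available at `result["choices"][0]["message"]["content"]`.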


================================================
FILE: assets/blog.md
================================================
<div align="center">
<p align="center">
    <img src="/assets/logo.png" width="300"/>
</p>



[![Blog](https://img.shields.io/badge/Blog-View_on_GitHub-333.svg?logo=github)](https://github.com/rednote-hilab/dots.vlm1/blob/main/assets/blog.md)
[![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.vlm1.inst)

<div align="center">
  <a href="https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> | 
  <a href="/assets/wechat.png" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> | 
  <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
</div>
</div>

## 1. Introduction

We are excited to introduce **dots.vlm1**, the first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), **dots.vlm1** demonstrates strong multimodal understanding and reasoning capabilities.  

**Model Highlights**:
- **NaViT Vision Encoder**: Trained entirely from scratch rather than fine-tuning an existing vision backbone. It natively supports dynamic resolution and incorporates pure visual supervision in addition to traditional text supervision, thereby enhancing the upper bound of perceptual capacity. Beyond image captioning datasets, a large amount of structured image data was introduced during pretraining to improve the model’s perceptual capabilities—particularly for tasks such as OCR.  
- **Multimodal Training Data**: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web page data with interleaved text and images, significantly improving the quality of the training corpus.


Through large-scale pretraining and carefully tuned post-training, **dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning**, setting a new performance ceiling for open-source vision-language models—while still maintaining competitive capabilities in pure-text tasks.

*Special thanks to the DeepSeek team for the excellent DeepSeek V3 model.*

## 2. Performance

### 2.1 Benchmark Results
| Category | Benchmark | Qwen2.5VL-72B | **Gemini2.5 Pro** | **Seed-VL1.5 thinking** | dots.vlm1 |
|------|--------|----------------|--------------------|--------------------------|-----------|
| **STEM/Reasoning** | MMMU | 69.3 | **84.22** | 79.89 | <ins>80.11</ins> |
|  | MMMU_pro | 51.91 | **76.5** | 68.9 | <ins>70.11</ins> |
|  | MathVision | 39.4 | **72.34** | 68.77 | <ins>69.64</ins> |
|  | MathVista | 74.6 | 83.5 | **86.1** | <ins>85.0</ins> |
|  | ZeroBench | 2 | **5** | 2 | <ins>4</ins> |
|  | ZeroBench-sub | 20 | **30.24** | 25.75 | <ins>26.65</ins> |
|  | VisuLogic | 25.6 | 29.8 | **35.9** | <ins>32.2</ins> |
| **General Visual** | MMbench-CN | 88.2 | <ins>89</ins> | **89.78** | 88.24 |
|  | MMbench-EN | 89.2 | **89.55** | <ins>89.47</ins> | 89.32 |
|  | MMStar | 71.13 | **78.73** | <ins>78.33</ins> | 76.67 |
|  | RealWorldQA | 75.9 | 78.43 | <ins>78.69</ins> | **79.08** |
|  | Vibe(GPT4o) | 60.13 | **76.39** | 68.59 | <ins>69.24</ins> |
|  | m3gia(cn) | 88.24 | 89.54 | **91.2** | <ins>90.85</ins> |
|  | SimpleVQA_ds | 52.19 | <ins>57.09</ins> | **61.34** | 55.8 |
|  | MMVP | 66 | 67.33 | **73.33** | <ins>72</ins> |
|  | HallusionBench | 56.5 | 63.07 | <ins>63.49</ins> | **64.83** |
|  | CVBench | 84.15 | 85.36 | **89.68** | <ins>85.65</ins> |
|  | Blink | 61.7 | <ins>71.86</ins> | **72.38** | 66.33 |
| **OCR/Doc/Chart** | charxiv(dq) | 88.2 | <ins>90.3</ins> | 89.6 | **92.1** |
|  | charxiv(rq) | 48.5 | **68.3** | 63.4 | <ins>64.4</ins> |
|  | OCRReasoning | 38.02 | **70.81** | 63.42 | <ins>66.23</ins> |
|  | DOCVQA | <ins>96.23</ins> | 95.42 | 93.65 | **96.52** |
|  | ChartQA | 86.1 | 86.16 | <ins>86.88</ins> | **87.68** |
|  | OCRBenchV1 | **87.1** | 86.6 | <ins>86.7</ins> | 82.3 |
|  | AI2D | 88.3 | **91.03** | <ins>89.05</ins> | 88.37 |
| **Grounding/Counting** | RefCOCO | 90.3 | 74.6 | **91.3** | <ins>90.45</ins> |
|  | CountBench | **92.4** | 91.79 | 89 | <ins>91.99</ins> |
| **Multi Image** | muir | 69.38 | 70.5 | **79.77** | <ins>78.58</ins> |
|  | mantis | 79.26 | <ins>84.33</ins> | 82.3 | **86.18** |
| |  | **Deepseek-R1-0528** | **Qwen3-235B-A22B** | **Qwen3-235B-A22B-think-2507**|**dots.vlm1**|
| **Text** | LiveCodeBench | <ins>73.3</ins> | 70.7 | **78.4** | 72.94 |
|  | AIME 2025 | <ins>87.5</ins> | 82.6 | **92.3** | 85.83 |
|  | GPQA | <ins>81</ins> | 70.7 | **81.1** | 72.78 |

> Due to the inability to reliably reproduce reported results, we re-evaluated all models using our evaluation pipeline. Our evaluation configuration is as follows:  
> For multiple-choice datasets (including MMMU, MMMU_Pro, and related benchmarks), we adopted the default inference and evaluation protocols established by VLMEvalKit. To minimize errors during the response-to-option mapping stage, we disabled rule-based extraction in favor of a model-based solution (see the corresponding [pull request](https://github.com/open-compass/VLMEvalKit/pull/1175) for details). For the MMVP dataset, we report pair-accuracy metrics following the SeedVL evaluation framework.  
> For structured reasoning and document understanding tasks—specifically OCRReasoning, DOCVQA, ChartQA, MUIR, and Mantis—we utilized xVerify-9B-C as the discriminative scoring model. Our ChartQA evaluation was conducted on the "human_test" subset to ensure consistency with established benchmarks. For subjective evaluation on the Vibe-Eval dataset, we employed GPT-4o as the adjudicator model to assess response quality and appropriateness.  
> All evaluations of dots.vlm1 were conducted with a sequence length of 64k.  
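
Pair accuracy, as it is commonly defined for MMVP, counts a pair as solved only when both of its questions are answered correctly. The sketch below assumes that definition; the SeedVL implementation itself is not shown here, and the function name `pair_accuracy` and the `(pair_id, is_correct)` record layout are illustrative choices.

```python
from collections import defaultdict


def pair_accuracy(records):
    """Fraction of pairs whose questions were all answered correctly.

    records: iterable of (pair_id, is_correct) tuples, typically two per pair.
    """
    answers_by_pair = defaultdict(list)
    for pair_id, is_correct in records:
        answers_by_pair[pair_id].append(bool(is_correct))
    if not answers_by_pair:
        return 0.0
    solved = sum(1 for answers in answers_by_pair.values() if all(answers))
    return solved / len(answers_by_pair)
```

Under this definition a model that gets one question right in every pair scores 0, even though its per-question accuracy would be 50%.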

On major visual benchmarks, dots.vlm1 achieves overall performance comparable to leading models such as **Gemini 2.5 Pro** and **Seed-VL1.5 thinking**. In particular, it demonstrates strong visual-text understanding and reasoning capabilities on datasets like **MMMU**, **MathVision**, and **OCRReasoning**, where it delivers competitive results.  

For text-based reasoning tasks such as **AIME** and **LiveCodeBench**, **dots.vlm1** performs roughly on par with **DeepSeek-R1-0528**, showing reasonable general capability in mathematics and coding. However, a noticeable performance gap remains on more diverse reasoning challenges such as **GPQA**. 

Overall, **dots.vlm1** approaches state-of-the-art levels in multimodal visual understanding and achieves mainstream performance in text reasoning. That said, a measurable gap remains on some specialized tasks, which calls for further optimization in both architecture design and training data. These tasks will be key areas of focus in our next phase of improvements.  

### 2.2 Demo Highlights

<details>
<summary> Examples for Complex Doc/Chart Reasoning </summary>
<img src="/assets/cases/1040025031km07evmi00ddb8r1c.png" width="300"/>   


**Prompt**: What is the average amount of goods where the second letter in the name is 'e'?

**Response**: `<think>`…`</think>`   
To determine the average amount of goods where the second letter in the name is 'e', first identify the goods from the list and check the second letter of their names.

The goods and their export values are:

- Petroleum products: $69.6B
- Pearls, precious & semiprecious stones: $39.3B
- Machinery and appliances: $25.4B
- Iron and steel: $22.9B
- Organic chemicals: $22.1B

Now, check the second letter of the first word in each name (as is standard for such queries):

- Petroleum products: First word "Petroleum" – second letter is 'e' (P-e-t-r-o-l-e-u-m). Qualifies.
- Pearls, precious & semiprecious stones: First word "Pearls" – second letter is 'e' (P-e-a-r-l-s). Qualifies.
- Machinery and appliances: First word "Machinery" – second letter is 'a' (M-a-c-h-i-n-e-r-y). Does not qualify.
- Iron and steel: First word "Iron" – second letter is 'r' (I-r-o-n). Does not qualify.
- Organic chemicals: First word "Organic" – second letter is 'r' (O-r-g-a-n-i-c). Does not qualify.

Only two goods qualify: Petroleum products (69.6B) and Pearls, precious & semiprecious stones (39.3B).

Calculate the average:

- Sum of values: 69.6B + 39.3B = $108.9B
- Number of goods: 2
- Average: 108.9B / 2 = 54.45B

The average amount is $54.45 billion.  
\boxed{54.45}

**Answer**: 54.45
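
The arithmetic in this response can be checked with a short script. This is only a sketch: the export values are the ones listed in the response, and the function name `average_where_second_letter` is an illustrative choice.

```python
from decimal import Decimal

# Export values in billions of USD, as listed in the response above.
exports = {
    "Petroleum products": Decimal("69.6"),
    "Pearls, precious & semiprecious stones": Decimal("39.3"),
    "Machinery and appliances": Decimal("25.4"),
    "Iron and steel": Decimal("22.9"),
    "Organic chemicals": Decimal("22.1"),
}


def average_where_second_letter(values, letter):
    """Average the values whose name has `letter` as its second character."""
    picked = [v for name, v in values.items() if name[1].lower() == letter]
    return sum(picked) / len(picked)


print(average_where_second_letter(exports, "e"))  # 54.45
```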

<img src="/assets/cases/1040025031km08ed5200979plfk.png" width="300"/>   

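**Prompt**: If a group of 12 people, including 6 children who are 1.3 m tall, 2 adults who are 1.8 m tall, and 4 currently enrolled undergraduate students, visits Huangyao Ancient Town, how much will they need to spend in total?

**Response**: `<think>`…`</think>`   
Based on the Huangyao Ancient Town scenic area price list, the ticket cost for the 12-person group is calculated as follows:

6 children who are 1.3 m tall: their height is between 1.2 m (exclusive) and 1.4 m (inclusive), so the half-price ticket applies at 50 yuan per person. Cost: 6 × 50 = 300 yuan.

2 adults who are 1.8 m tall: their height exceeds 1.4 m, so the full-price ticket applies at 100 yuan per person. Cost: 2 × 100 = 200 yuan.

4 currently enrolled undergraduate students: students with valid ID (enrolled at undergraduate level or below) qualify for the half-price ticket at 50 yuan per person. Cost: 4 × 50 = 200 yuan.

Total ticket cost: 300 + 200 + 200 = 700 yuan.

The group has 12 people and therefore counts as a group of more than 10, but the price list does not clearly specify a group ticket price, and no travel agency itinerary or settlement statement is provided, so individual ticket prices are used. The sightseeing bus ticket costs 0 yuan (free), and the guide fee is 80 yuan per group per tour, but the question does not mention a guide service, so it is not included.

The total cost is the ticket cost of 700 yuan.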

\boxed{700}

**Answer**: 700
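
The pricing rules quoted in this response can be written down as a small function. This is a sketch under stated assumptions: the tiers and prices are taken from the response rather than from the price list image, heights of 1.2 m or less are not covered by the quoted rules, and the students' height (1.75 m) is a placeholder that does not affect the result.

```python
FULL_PRICE = 100  # yuan per person, as quoted in the response
HALF_PRICE = 50   # yuan per person, as quoted in the response


def ticket_price(height_m, is_student=False):
    """Return the ticket price in yuan for one visitor."""
    if is_student:
        return HALF_PRICE          # students with valid ID pay half price
    if height_m > 1.4:
        return FULL_PRICE          # taller than 1.4 m: full price
    if height_m > 1.2:
        return HALF_PRICE          # above 1.2 m and up to 1.4 m: half price
    raise ValueError("heights of 1.2 m or less are not covered by the quoted rules")


group = (
    [(1.3, False)] * 6     # six children, 1.3 m tall
    + [(1.8, False)] * 2   # two adults, 1.8 m tall
    + [(1.75, True)] * 4   # four undergraduates (height is a placeholder)
)
total = sum(ticket_price(height, student) for height, student in group)
print(total)  # 700
```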

<img src="/assets/cases/1040025031km09b4fi005gg197s.png" width="300"/>   

**Prompt**: What is the difference in protein content between the black-packaged product and the blue-packaged product if you eat 100 calories?  

**Response**: `<think>`…`</think>`   
To determine the difference in protein content between the black-packaged product and the blue-packaged product when consuming 100 calories, first calculate the protein content per 100 calories for each product.  

For the blue-packaged product (left nutrition label):  

- Calories per serving: 130
- Protein per serving: 2 g
- Protein per 100 calories: $\frac{2}{130} \times 100 = \frac{200}{130} = \frac{20}{13} \approx 1.538$ g

For the black-packaged product (right nutrition label):

- Calories per serving: 150
- Protein per serving: 3 g
- Protein per 100 calories: $\frac{3}{150} \times 100 = \frac{300}{150} = 2$ g

The difference in protein content per 100 calories is calculated as the absolute difference between the black-packaged and blue-packaged products:  

$\left|2 - \frac{20}{13}\right| = \left|\frac{26}{13} - \frac{20}{13}\right| = \left|\frac{6}{13}\right| = \frac{6}{13}$ g  

Thus, the difference is $\frac{6}{13}$ grams per 100 calories.  

\boxed{\dfrac{6}{13}}  

**Answer**: 0.46g
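
The per-100-calorie normalization can be reproduced with exact fractions. In this sketch the serving values are the ones read out in the response, and the helper name `protein_per_100_kcal` is illustrative.

```python
from fractions import Fraction


def protein_per_100_kcal(protein_g, calories):
    """Grams of protein obtained from eating 100 calories of the product."""
    return Fraction(protein_g, calories) * 100


blue = protein_per_100_kcal(2, 130)    # 20/13 g
black = protein_per_100_kcal(3, 150)   # 2 g
difference = abs(black - blue)

print(difference)                      # 6/13
print(round(float(difference), 2))     # 0.46
```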

<img src="/assets/cases/1040025031km0k8hh2006updpps.png" width="300"/>   

**Prompt**: How many our bid/ask actions categories will result in exactly 9 competitor's bid/ask actions' categories to have a value above 21? * Your final answer must be an exact integer.

**Response**: `<think>`…`</think>`   
To determine how many of our bid/ask actions categories result in exactly 9 competitor's bid/ask actions' categories having a value above 21, each column (corresponding to our bid/ask actions) is examined. The values in each column are checked to count how many are greater than 21.

Column 1 (our action 1): Values above 21 are in rows 1, 2, 3, 5, 6, 7, 9, 10, 11. Count = 9.

Column 2 (our action 2): Values above 21 are in rows 1, 2, 3, 5, 6, 7, 9, 10, 11. Count = 9.

Column 3 (our action 3): All values are below 21. Count = 0.

Column 4 (our action 4): All values are below 21. Count = 0.

Column 5 (our action 5): Values above 21 are in rows 1, 2, 3, 5, 6, 7, 9, 10, 11. Count = 9.

Column 6 (our action 6): Values above 21 are in rows 1, 2, 3, 5, 6, 7, 9, 10, 11. Count = 9.

Column 7 (our action 7): All values are below 21. Count = 0.

Column 8 (our action 8): All values are below 21. Count = 0.

Column 9 (our action 9): All values are below 21. Count = 0.

Column 10 (our action 10): All values are below 21. Count = 0.

Column 11 (our action 11): All values are below 21. Count = 0.

Column 12 (our action 12): All values are below 21. Count = 0.

Column 13 (our action 13): All values are below 21. Count = 0.

Column 14 (our action 14): All values are below 21. Count = 0.

Column 15 (our action 15): All values are below 21. Count = 0.

Column 16 (our action 16): All values are below 21. Count = 0.

The columns with exactly 9 values above 21 are columns 1, 2, 5, and 6, corresponding to our bid/ask actions 1, 2, 5, and 6. Thus, there are 4 such categories.

**Answer**: 4



    
</details>


<details>
<summary> Examples for STEM Problem </summary>

<img src="/assets/cases/1040025031km175a3i0030p4k24.png" width="300"/>   

**Prompt**: Hint: Please answer the question and provide the final answer at the end. Question: Triangle ABC has a right angle at C, AC=3 and BC=4. Triangle ABD has a right angle at A and AD=12. Points C and D are on opposite sides of AB. The line through D parallel to AC meets CB extended at E. If DE/DB=m/n, where m and n are relatively prime positive integers, then m+n=

**Response**: `<think>`…`</think>`   
In triangle ABC with a right angle at C, AC = 3 and BC = 4. Using the Pythagorean theorem, AB = \sqrt{AC^2 + BC^2} = \sqrt{3^2 + 4^2} = \sqrt{25} = 5.  

Triangle ABD has a right angle at A and AD = 12. Since \angle DAB = 90^\circ, BD = \sqrt{AB^2 + AD^2} = \sqrt{5^2 + 12^2} = \sqrt{169} = 13.  

Place C at (0,0), A at (0,3), and B at (4,0). The line AB has the equation y = -\frac{3}{4}x + 3. Point C(0,0) is below this line. Since C and D are on opposite sides of AB, D must be above AB.  

The vector \overrightarrow{AB} = (4, -3). A vector perpendicular to \overrightarrow{AB} is (3, 4), with length \sqrt{3^2 + 4^2} = 5. Since AD = 12, the vector \overrightarrow{AD} = 12 \cdot \frac{(3, 4)}{5} = \left(\frac{36}{5}, \frac{48}{5}\right). Thus, D is at \left(0 + \frac{36}{5}, 3 + \frac{48}{5}\right) = \left(\frac{36}{5}, \frac{63}{5}\right).  

The line AC is vertical (from (0,3) to (0,0)). The line through D parallel to AC is vertical, so it has the equation x = \frac{36}{5}. This line intersects the extension of CB (which is the x-axis, y = 0) at E\left(\frac{36}{5}, 0\right).  

The distance DE is the vertical distance between D\left(\frac{36}{5}, \frac{63}{5}\right) and E\left(\frac{36}{5}, 0\right), so DE = \left|\frac{63}{5} - 0\right| = \frac{63}{5}. The distance DB is between D\left(\frac{36}{5}, \frac{63}{5}\right) and B(4, 0) = \left(\frac{20}{5}, 0\right):  

DB = \sqrt{\left(\frac{36}{5} - \frac{20}{5}\right)^2 + \left(\frac{63}{5} - 0\right)^2} = \sqrt{\left(\frac{16}{5}\right)^2 + \left(\frac{63}{5}\right)^2} = \sqrt{\frac{256}{25} + \frac{3969}{25}} = \sqrt{\frac{4225}{25}} = \frac{65}{5} = 13.  

The ratio \frac{DE}{DB} = \frac{\frac{63}{5}}{13} = \frac{63}{65}. The fraction \frac{63}{65} is in simplest terms since \gcd(63, 65) = 1. Thus, m = 63, n = 65, and m + n = 128.  

\boxed{128}

**Answer**: 128
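
The coordinate computation in this response can be re-checked with exact rational arithmetic. The sketch below is not part of the model's output. It reuses the coordinates chosen in the response (C at the origin, A at (0, 3), B at (4, 0)), and the variable names are illustrative.

```python
from fractions import Fraction
from math import isqrt

# Coordinates chosen in the response above.
A = (Fraction(0), Fraction(3))
B = (Fraction(4), Fraction(0))

# AD is perpendicular to AB and has length 12. The vector (3, 4) is
# perpendicular to AB = (4, -3) and has length 5, so scale it by 12/5.
D = (A[0] + Fraction(12, 5) * 3, A[1] + Fraction(12, 5) * 4)

# C = (0, 0) and D must lie on opposite sides of line AB: 3x + 4y - 12 = 0.
side = lambda p: 3 * p[0] + 4 * p[1] - 12
assert side((0, 0)) < 0 < side(D)

# The vertical line through D (parallel to AC) meets line CB (the x-axis) at E.
E = (D[0], Fraction(0))

DE = abs(D[1] - E[1])
DB_squared = (D[0] - B[0]) ** 2 + (D[1] - B[1]) ** 2

# DB^2 is a perfect square here, so DB can be recovered exactly.
DB = Fraction(isqrt(DB_squared.numerator), isqrt(DB_squared.denominator))
assert DB * DB == DB_squared

ratio = DE / DB  # Fraction reduces to lowest terms automatically
m, n = ratio.numerator, ratio.denominator
print(f"DE/DB = {ratio}, m + n = {m + n}")  # DE/DB = 63/65, m + n = 128
```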

<img src="/assets/cases/1040025031km1bh7ii00f4vhe68.png" width="300"/>   

**Prompt**: Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end. Question: If ABCD is a 2 X 2 square, E is the midpoint of AB, F is the midpoint of BC, AF and DE intersect at I, and BD and AF intersect at H, then the area of quadrilateral BEIH is Choices: (A) 1/3 (B) 2/5 (C) 7/15 (D) 8/15 (E) 3/5

**Response**: `<think>`…`</think>`   
The area of quadrilateral BEIH in the 2x2 square ABCD is calculated using coordinate geometry. Assign coordinates to the square: A(0,2), B(0,0), C(2,0), D(2,2). E is the midpoint of AB, so E(0,1). F is the midpoint of BC, so F(1,0).

The line AF passes through A(0,2) and F(1,0), with equation y = -2x + 2. The line DE passes through D(2,2) and E(0,1), with equation y = (1/2)x + 1. The intersection of AF and DE is I, found by solving -2x + 2 = (1/2)x + 1, giving x = 2/5 and y = 6/5, so I(2/5, 6/5).

The diagonal BD passes through B(0,0) and D(2,2), with equation y = x. The intersection of BD and AF is H, found by solving x = -2x + 2, giving x = 2/3 and y = 2/3, so H(2/3, 2/3).

The vertices of quadrilateral BEIH are B(0,0), E(0,1), I(2/5, 6/5), and H(2/3, 2/3). Using the shoelace formula with the order B-E-I-H-B:

Sum = (0·1 - 0·0) + (0·6/5 - 2/5·1) + (2/5·2/3 - 2/3·6/5) + (2/3·0 - 0·2/3) = 0 + (-2/5) + (-8/15) + 0 = -14/15

Absolute value = 14/15

Area = (1/2) · 14/15 = 7/15

Alternatively, split BEIH into triangles BEI and BIH:

Area of ΔBEI: Base BE = 1 (vertical), height = x-coordinate of I = 2/5, area = (1/2)·1·(2/5) = 1/5

Area of ΔBIH: Shoelace formula gives sum = -8/15, absolute value 8/15, area = (1/2)·8/15 = 4/15

Total area = 1/5 + 4/15 = 3/15 + 4/15 = 7/15

Both methods confirm the area is 7/15, which corresponds to option (C).

\boxed{\text{C}}

**Answer**: C
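
As an independent check (not part of the model's output), the shoelace computation can be reproduced with exact fractions. The helper name `shoelace_area` is illustrative, and the coordinates are the ones assigned in the response.

```python
from fractions import Fraction as F

def shoelace_area(points):
    """Area of a simple polygon given its vertices in order."""
    total = F(0)
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

B, E = (F(0), F(0)), (F(0), F(1))
I = (F(2, 5), F(6, 5))  # intersection of AF and DE
H = (F(2, 3), F(2, 3))  # intersection of AF and BD

# Both intersection points must satisfy the line equations from the response.
assert I[1] == -2 * I[0] + 2 and I[1] == F(1, 2) * I[0] + 1
assert H[1] == -2 * H[0] + 2 and H[1] == H[0]

area = shoelace_area([B, E, I, H])
print(area)  # 7/15
```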

<img src="/assets/cases/1040025031km1dqmm2005ou9re0.png" width="300"/>   

**Prompt**: Solve this problem

**Response**: `<think>`…`</think>`   
The acceleration vs. time graph for both cars starts at (0,0) and ends at (T,0), with both reaching a maximum acceleration of 10 m/s². The area under the acceleration-time graph represents the change in velocity. Since both cars start from rest, the velocity at time T is given by the area under their respective acceleration curves from 0 to T.

Both acceleration curves are triangular, with a base of T and a height of 10 m/s². The area of a triangle is \frac{1}{2} \times \text{base} \times \text{height}, so the area for each car is \frac{1}{2} \times T \times 10 = 5T m/s. Thus, the change in velocity is the same for both cars, and since they start from rest, v_P = v_Q at time T.

The position at time T is given by the integral of velocity from 0 to T, which is equivalent to the double integral of acceleration. Specifically, x(T) = \int_0^T a(s) (T - s) ds, where (T - s) is a weighting factor that decreases linearly from T to 0 as s increases from 0 to T. This weighting factor emphasizes acceleration at earlier times more than at later times.

Car P has a higher acceleration than car Q for early times (s < s_{\text{cross}}), while car Q has a higher acceleration for later times (s > s_{\text{cross}}). Since the weighting factor (T - s) is larger for smaller s, the region where a_P(s) > a_Q(s) is weighted more heavily. Therefore, the integral for x_P(T) is greater than for x_Q(T), so x_P > x_Q at time T.

Thus, the speeds are equal (v_P = v_Q) and the position of car P is greater than that of car Q (x_P > x_Q), which corresponds to option (A).

\boxed{\text{A}}

**Answer**: A
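
The weighting argument in this response can be illustrated numerically. The figure is not reproduced here, so the sketch below assumes two hypothetical triangular acceleration profiles with the same base T and the same peak of 10 m/s², one peaking early (car P) and one peaking late (car Q). The peak times and T = 4 s are assumptions chosen for illustration, not values read from the problem.

```python
def triangular_accel(peak_time, total_time, peak_value=10.0):
    """Return a(s): rises linearly to peak_value at peak_time, then falls to 0 at total_time."""
    def a(s):
        if s <= peak_time:
            return peak_value * s / peak_time
        return peak_value * (total_time - s) / (total_time - peak_time)
    return a

def velocity_and_position(a, total_time, steps=100_000):
    """Midpoint-rule estimates of v(T) = ∫ a(s) ds and x(T) = ∫ a(s) (T - s) ds, starting from rest."""
    ds = total_time / steps
    v = x = 0.0
    for i in range(steps):
        s = (i + 0.5) * ds
        v += a(s) * ds
        x += a(s) * (total_time - s) * ds
    return v, x

T = 4.0  # assumed total time in seconds
v_p, x_p = velocity_and_position(triangular_accel(T / 4, T), T)      # P peaks early (assumed)
v_q, x_q = velocity_and_position(triangular_accel(3 * T / 4, T), T)  # Q peaks late (assumed)
print(f"v_P = {v_p:.3f}, v_Q = {v_q:.3f}, x_P = {x_p:.3f}, x_Q = {x_q:.3f}")
```

Under these assumptions both cars reach the same speed, because the areas under the two triangles are equal, while the car that accelerates earlier covers more distance.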

<img src="/assets/cases/1040025031km3c2iri00b99tbfg.png" width="300"/>   

**Prompt**: The square is filled in with a pattern of triangles that become smaller and smaller until they are infinitely small. what fraction of the square is shaded blue?

**Response**: `<think>`…`</think>`   
The square has an area of 1. The pattern consists of a sequence of blue right triangles, each similar to the previous one but scaled down by a factor of \frac{1}{2} in linear dimensions, and only one triangle is added at each level.

The largest blue triangle (level 0) has legs of length 1, so its area is \frac{1 \times 1}{2} = \frac{1}{2} .

The next blue triangle (level 1) has legs of length \frac{1}{2} , so its area is \frac{\frac{1}{2} \times \frac{1}{2}}{2} = \frac{1}{8} .

The subsequent blue triangle (level 2) has legs of length \frac{1}{4} , so its area is \frac{\frac{1}{4} \times \frac{1}{4}}{2} = \frac{1}{32} .

This pattern continues indefinitely, with the area of each triangle scaled by a factor of \frac{1}{4} relative to the previous triangle (since area scales with the square of the linear scale factor).

The areas of the blue triangles form a geometric series:

\frac{1}{2} + \frac{1}{8} + \frac{1}{32} + \frac{1}{128} + \cdots

The first term a is \frac{1}{2} , and the common ratio r is \frac{1}{4} . The sum of an infinite geometric series is given by \frac{a}{1 - r} :

\text{Total blue area} = \frac{\frac{1}{2}}{1 - \frac{1}{4}} = \frac{\frac{1}{2}}{\frac{3}{4}} = \frac{1}{2} \times \frac{4}{3} = \frac{4}{6} = \frac{2}{3}.

Thus, the fraction of the square shaded blue is \frac{2}{3} .

\boxed{\dfrac{2}{3}}

**Answer**: 2/3
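
The closed-form sum used in this response can be checked against partial sums. This is a small illustrative sketch, not part of the model's output. The first term and common ratio are the ones stated in the response.

```python
from fractions import Fraction

first_term = Fraction(1, 2)  # area of the largest blue triangle
ratio = Fraction(1, 4)       # each triangle has 1/4 the area of the previous one

closed_form = first_term / (1 - ratio)

# The partial sums approach the closed form from below.
partial = sum(first_term * ratio**k for k in range(20))
gap = closed_form - partial

print(closed_form)  # 2/3
print(float(gap))   # tiny positive remainder after 20 terms
```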

<img src="/assets/cases/1040025031km3cg062003b8stno.png" width="300"/>   

**Prompt**: the area of the blue triangle is 12 which is a right triangle and the area of the gray triangle is 9 which is a right triangle. what is the area of the orange rectangle?

**Response**: `<think>`…`</think>`   
The blue triangle has a base of 3 and an area of 12. The area of a triangle is given by \frac{1}{2} \times \text{base} \times \text{height}. Solving for the height:

\frac{1}{2} \times 3 \times h_b = 12 \implies \frac{3h_b}{2} = 12 \implies 3h_b = 24 \implies h_b = 8.

Thus, the height from the apex to the base of the blue triangle is 8, which is also the height of the large triangle.

The gray triangle is above the orange rectangle, and the rectangle has a height of 2. The base of the gray triangle is the top side of the rectangle, which is parallel to the base of the large triangle. The vertical distance from the base of the large triangle to the base of the gray triangle is 2, so the distance from the apex to the base of the gray triangle is 8 - 2 = 6.

The area of the gray triangle is 9. Using the area formula for a triangle:

\frac{1}{2} \times w \times 6 = 9 \implies 3w = 9 \implies w = 3,

where w is the width of the rectangle (and the base of the gray triangle).

The orange rectangle has a width of w = 3 and a height of 2, so its area is:

3 \times 2 = 6.

Thus, the area of the orange rectangle is 6.

\boxed{6}

**Answer**:  6

    
</details>

<details>
<summary> Examples for Long Tail Recognition </summary>

<img src="/assets/cases/1040025031km1sneri00ce3n0bg.png" width="300"/>   

**Prompt**: What is the name of the location in this photo?

**Response**: `<think>`…`</think>`   

Butlers Wharf

**Answer**: Butler's Wharf

<img src="/assets/cases/1040025031km226usi003565j3k.png" width="300"/>   

**Prompt**: What is the title of this artwork?

**Response**: `<think>`…`</think>`   

\boxed{Christ Walking on Water}

**Answer**: christ walking on the water

<img src="/assets/cases/1040025031km1uuj3i00apbgr10.png" width="300"/>   

**Prompt**: What is the title of this artwork?

**Response**: `<think>`…`</think>`   
To determine the title of the artwork, we analyze key elements:

Artist and Context: This fresco is by Giotto di Bondone, part of his cycle in the Scrovegni Chapel (Padua, Italy), which depicts scenes from the lives of the Virgin Mary and Christ.

Subject Matter: The scene shows a group of figures in a ceremonial setting, with architectural details and halos (indicating holy figures). This matches the Marriage of the Virgin—a traditional episode where Mary is wed to Joseph.

The title of the artwork is The Marriage of the Virgin.

**Answer**: Marriage of the Virgin

<img src="/assets/cases/1040025031km1psp6i003mbo95c.png" width="300"/>   

**Prompt**: 图中文物的名称是什么？
A: 金腰带饰
B: 凤首人身形玉佩
C: 四联璜玉组佩
D: 跽坐人形玉佩

**Response**: `<think>`…`</think>`   
\boxed{B}

**Answer**: B

<img src="/assets/cases/1040025031km1rmsr2007ajpqu0.png" width="300"/>   

**Prompt**: 图中文物的名称是什么？
A: 亚长方斝
B: 王子午鼎
C: 亚址鼎
D: 齐侯盂

**Response**: `<think>`…`</think>`   
\boxed{D}

**Answer**: D
    
</details>

<details>
<summary> Examples for Visual Reasoning </summary>

<img src="/assets/cases/1040025031kolrtqji000efpsjg.png" width="300"/>   

**Prompt**: can you solve this Rebus puzzle?

**Response**: `<think>`…`</think>`   
The Rebus puzzle features the word "LOAD" written vertically, with each letter stacked from top to bottom (L, O, A, D). This vertical arrangement suggests the direction "down," as reading from top to bottom implies a downward motion. Combined with "load," this forms the common term "download," which refers to the process of transferring data from a remote system (like the internet) to a local device.

\boxed{download}

**Answer**:  download

<img src="/assets/cases/1040025031kolsok7i000i9pnps.png" width="300"/>   

**Prompt**: Which novel do these emojis remind you of?

**Response**: `<think>`…`</think>`   
The emojis 🎶🐦&🐍 most likely represent the novel "The Ballad of Songbirds and Snakes" by Suzanne Collins, which is a prequel to "The Hunger Games" series. Here's why:

- 🎶 Musical notes: Represents the "ballad" in the title, as a ballad is a type of song or poem that tells a story, often with musical elements. The novel features themes of music and performance, including the "Hanging Tree" song, which is central to the story.

- 🐦 Blue bird: Symbolizes the "songbirds" in the title. Songbirds are often depicted as blue or colorful in illustrations and are a key motif in the book, representing innocence, hope, and the characters' connections to nature.

- & Ampersand: Directly corresponds to the "and" in the title, linking "songbirds" and "snakes" as dual symbols of the story's conflict.

- 🐍 Green snake: Represents the "snakes" in the title, which symbolize danger, corruption, and the darker aspects of the story, particularly through the character of Coriolanus Snow and the serpentine themes.

This novel explores the origins of the Hunger Games and the moral complexities of its characters, making the emojis a fitting visual shorthand for the title and its themes. If you're thinking of something else, it could be a reference to another work, but this is the most direct match based on common interpretations! 📚

**Answer**: The Ballad of Songbirds and Snakes

<img src="/assets/cases/1040025031km25jkdi006vrceig.png" width="300"/>   

**Prompt**: These seemingly random phrases hide the names of eight capital cities around the world. Each phrase contains the name of two cities. Can you unscramble them and figure out what they are?

**Response**: `<think>`…`</think>`   
The phrases are anagrams that hide pairs of capital cities. By rearranging the letters in each phrase, the following capital cities are revealed:

LION JOB ENDING unscrambles to Beijing (capital of China) and London (capital of the United Kingdom).

Letters: L, I, O, N, J, O, B, E, N, D, I, N, G → Beijing (B, E, I, J, I, N, G) and London (L, O, N, D, O, N).

TOOK MY ORE unscrambles to Tokyo (capital of Japan) and Rome (capital of Italy).

Letters: T, O, O, K, M, Y, O, R, E → Tokyo (T, O, K, Y, O) and Rome (R, O, M, E).

CRIMSON ELBOW unscrambles to Moscow (capital of Russia) and Berlin (capital of Germany).

Letters: C, R, I, M, S, O, N, E, L, B, O, W → Moscow (M, O, S, C, O, W) and Berlin (B, E, R, L, I, N).

RAPID DISARM unscrambles to Paris (capital of France) and Madrid (capital of Spain).

Letters: R, A, P, I, D, D, I, S, A, R, M → Paris (P, A, R, I, S) and Madrid (M, A, D, R, I, D).

The eight capital cities are: Beijing, London, Tokyo, Rome, Moscow, Berlin, Paris, and Madrid.

\boxed{\text{Beijing, London, Tokyo, Rome, Moscow, Berlin, Paris, Madrid}}

**Answer**: London, Beijing, Tokyo, Rome, Berlin, Moscow, Paris, Madrid
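
Each claimed anagram can be verified by comparing letter counts. The sketch below is an independent check, not part of the model's output. The phrase-to-city pairs are copied from the response, and the helper name `letters` is illustrative.

```python
from collections import Counter

def letters(text):
    """Multiset of letters in text, ignoring case and spaces."""
    return Counter(ch for ch in text.upper() if ch.isalpha())

claims = {
    "LION JOB ENDING": ("Beijing", "London"),
    "TOOK MY ORE": ("Tokyo", "Rome"),
    "CRIMSON ELBOW": ("Moscow", "Berlin"),
    "RAPID DISARM": ("Paris", "Madrid"),
}

# A phrase is a valid anagram if it uses exactly the letters of both cities.
results = {
    phrase: letters(phrase) == letters("".join(cities))
    for phrase, cities in claims.items()
}
for phrase, ok in results.items():
    print(f"{phrase}: {'valid' if ok else 'INVALID'}")
```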


<img src="/assets/cases/1040025031km265kr2003hs0ap4.png" width="300"/>   

**Prompt**: Alice and Betty are playing “guess my word”. Betty thinks of a five-letter word (all distinct letters), and Alice guesses. For each guess, Betty tells Alice how many letter from her guess are right. What is Betty’s word?

**Response**: `<think>`…`</think>`   
Betty's word is a five-letter word with all distinct letters. The guesses and the number of correct letters are as follows:

- HUNKY: 2 correct letters
- JUNKS: 0 correct letters
- WHISK: 2 correct letters
- CENTS: 2 correct letters
- STINK: 0 correct letters

From JUNKS having 0 correct letters, the letters J, U, N, K, and S are not in Betty's word.

From HUNKY having 2 correct letters and U, N, K excluded by JUNKS, the correct letters must be H and Y. Thus, H and Y are in Betty's word.

From STINK having 0 correct letters and S, N, K excluded by JUNKS, the letters T and I are not in Betty's word.

From WHISK having 2 correct letters, and S, K excluded by JUNKS and I excluded by STINK, the only possible correct letters are W and H. Since H is already in the word, W must also be in the word.

From CENTS having 2 correct letters, and N, S excluded by JUNKS and T excluded by STINK, the only possible correct letters are C and E. Thus, both C and E are in the word.

The letters in Betty's word are C, E, H, W, and Y. These form the word CHEWY, which has all distinct letters and satisfies all the guesses:

- HUNKY: H and Y are correct (2 letters).
- JUNKS: No letters match (0 letters).
- WHISK: W and H are correct (2 letters).
- CENTS: C and E are correct (2 letters).
- STINK: No letters match (0 letters).

Thus, Betty's word is CHEWY.

\boxed{CHEWY}

**Answer**: CHEWY
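
The final verification step in this response can be expressed as a short consistency check. The sketch assumes, as the response does, that Betty's feedback counts letters of the guess that appear anywhere in her word, regardless of position. The function names are illustrative.

```python
# Guesses and feedback as listed in the response.
guesses = {"HUNKY": 2, "JUNKS": 0, "WHISK": 2, "CENTS": 2, "STINK": 0}

def matches(candidate, guess):
    """Number of distinct letters of `guess` that occur anywhere in `candidate`."""
    return sum(1 for letter in set(guess) if letter in candidate)

def consistent(candidate):
    """True if candidate has five distinct letters and agrees with every piece of feedback."""
    return (
        len(candidate) == 5
        and len(set(candidate)) == 5
        and all(matches(candidate, g) == n for g, n in guesses.items())
    )

print(consistent("CHEWY"))  # True
print(consistent("CHUNK"))  # False: U, N, K are excluded by JUNKS
```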

    
</details>

## 3. Methods
### 3.1 Architecture Overview
**dots.vlm1** integrates a 1.2B NaViT vision encoder, a lightweight MLP adapter and a DeepSeek V3 MoE language model. These components are trained through a three-stage process.  

- **Stage 1: Vision Encoder Pretraining** — The NaViT vision encoder is trained entirely from scratch to maximize perceptual capability across diverse visual data.
- **Stage 2: VLM Pretraining** — The vision encoder and DeepSeek V3 LLM are jointly pretrained on large-scale, diverse multimodal datasets.
- **Stage 3: VLM Post-training** — A final phase of supervised fine-tuning (SFT) is performed using task-varied data to enhance generalization.  

*Training note: Post-training only includes supervised fine-tuning (SFT); reinforcement learning methods are planned for future work.*  

### 3.2 NaViT Vision Encoder
We developed a two-phase training strategy for the NaViT encoder. Its overall structure and training method are similar to AIMv2 [1], but our model is trained entirely from scratch at native resolution. The 1.2B-parameter vision encoder comprises 42 transformer layers and incorporates RMSNorm, SwiGLU, and 2D RoPE.
#### Phase 1: Pretrain
We start from random initialization and train at 224×224 image resolution with a dual supervision strategy. The first objective is Next Token Prediction (NTP), which uses a large number of image-text pairs to enhance the model's perceptual capability. The second is Next Patch Generation (NPG), which uses vision-only data and a diffusion model to predict image patches, further improving the vision encoder's spatial and semantic perception.
#### Phase 2: Resolution Scaling Pretrain
We gradually increase the image resolution: training first uses 1M-pixel inputs over a large number of tokens, then moves to 10M-pixel inputs. To further improve generalization, we also incorporate more diverse visual sources, including OCR-rich images, grounding annotations, and video frames.
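
To give a rough sense of how the visual sequence length grows across these resolutions, the sketch below counts the patches needed to tile an image at native resolution. The patch size of 14 and the round-up rule at the image edges are assumptions for illustration only. The source does not state the patch size or how partial patches are handled.

```python
import math

def num_patches(height, width, patch_size=14):
    """Patches needed to tile an image, rounding partial patches up.

    patch_size=14 is an assumed value; the source does not specify it.
    """
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)

examples = {
    "224x224 (Phase 1)": (224, 224),
    "about 1M pixels": (1024, 1024),
    "about 10M pixels": (3162, 3162),
}
for label, (h, w) in examples.items():
    print(f"{label}: {num_patches(h, w)} patches")
```

Under these assumptions, moving from 224×224 to 1M-pixel inputs increases the patch count by roughly a factor of 20, and 10M-pixel inputs add roughly another factor of 10.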
### 3.3 VLM Pretraining Data
To enhance dots.vlm1’s multimodal capabilities, we organize the pretraining data into two major categories:
#### Cross-Modal Translation Data
Teaches the model to describe, summarize, or reinterpret image content in text:
- General images with alt-text or dense captions.
- Complex charts, tables, formulas, and graphics (real or synthesized) with structured annotations or captions.
- OCR data: multilingual text, scene grounding, plain text, document parsing, etc.
- Video frames with temporal captions.
- Grounding supervision data, including bounding boxes and keypoints.

It is difficult to exhaustively enumerate all types of cross-modal translation data, as they involve a wide range of image and video types along with corresponding perceptual outputs. Broadly speaking, our objective is to construct a data spectrum that captures all forms of human-interpretable information that can be expressed as discrete token sequences aligned with visual content.
#### Cross-Modal Fusion Data
Cross-modal fusion data refers to interleaved visual and textual content that trains the model to perform next-token prediction in complex multimodal contexts, helping mitigate over-reliance on any single modality. We designed careful preprocessing pipelines for various types of interleaved data, with two particularly effective categories:
- **Web Data**: We found that image-text data from web pages offers rich diversity but often suffers from weak visual-textual alignment. Rather than relying on traditional CLIP-score based filtering, we used an in-house VLM to rewrite and clean the data, removing low-quality images and irrelevant or weakly grounded text content.
- **PDF Data**: We observed that PDF data is often of high quality. To fully leverage this, we developed a dedicated parsing model, [dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr), which converts PDF documents into interleaved image-text representations. Additionally, we render full PDF pages as images and selectively mask text blocks, prompting the model to reconstruct the masked content from layout and contextual cues. This enhances its ability to read and understand visually formatted documents.  
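
The masked-reconstruction idea for PDF pages can be sketched as follows. This is a hypothetical illustration only: the `TextBlock` structure, the mask ratio, the random selection rule, and the output format are all assumptions, and the source does not describe how the actual pipeline selects or masks blocks.

```python
import random
from dataclasses import dataclass

@dataclass
class TextBlock:
    bbox: tuple  # (x0, y0, x1, y1) in page pixels (hypothetical layout output)
    text: str

def make_masked_sample(blocks, mask_ratio=0.3, seed=0):
    """Choose text blocks to hide and return (regions_to_mask, target_text).

    Assumed behavior for illustration: a fixed fraction of blocks is picked
    at random, and the hidden text becomes the reconstruction target.
    """
    rng = random.Random(seed)
    k = max(1, round(len(blocks) * mask_ratio))
    chosen = sorted(rng.sample(range(len(blocks)), k))
    regions = [blocks[i].bbox for i in chosen]          # areas to paint over in the rendered page image
    target = "\n".join(blocks[i].text for i in chosen)  # text the model is asked to reconstruct
    return regions, target

page = [
    TextBlock((50, 40, 550, 80), "1. Introduction"),
    TextBlock((50, 100, 550, 300), "First paragraph of body text."),
    TextBlock((50, 320, 550, 520), "Second paragraph of body text."),
    TextBlock((50, 540, 550, 600), "Figure 1: Overview."),
]
regions, target = make_masked_sample(page)
print(regions)
print(target)
```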


## 4. Future Work
Despite the progress achieved with dots.vlm1, our evaluation reveals notable limitations in both visual perception and reasoning capabilities.  

For visual perception, we plan to significantly expand the scale and diversity of cross-modal translation data. In parallel, we will continue to improve the vision encoder by exploring new neural network architectures and loss function designs that improve training efficiency.  

For visual reasoning, we prioritize advancing reinforcement learning approaches to reduce the test-time scaling gap between text-only and multimodal prompts. Furthermore, we are investigating the feasibility of shifting more reasoning ability into the pretraining stage to improve generalization and efficiency.  

> Hiring!!!  
> We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of multimodal intelligence, we would love to hear from you.  
> Please reach out to us via email at: martin [at] xiaohongshu [dot] com

## 5. Author List
### Core Contributors
ADu, Baiheng, Chaohui Yu, Guang Yang, Guangwei Zhao, Guofu, Guohai Xu, Hao Liu, Jack Hong, Jiayu Chen, Jie Lou, Laite, Liang Xin, Qianwu, Rui Yang, Shengdong Chen, Weijian Luo, Wenliang, Wenze, Xiaozhi, Xing Wu, Yimeng Ye, Yiwei Ma, Yumeng Li, Yunkai, Yuqiu Ji, Zhenhang Huang

### Contributors
All members of rednote hi lab

### Project Lead
Colin Zhang

### Advisors
Debing Zhang, Lei Zhang


[1] Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal Autoregressive Pre-training of Large Vision Encoders. CVPR 2025.


================================================
FILE: docker/Dockerfile
================================================
FROM lmsysorg/sglang:v0.4.9.post1-cu126

RUN git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
RUN pip install -e sglang/python

================================================
FILE: docker/README.md
================================================
# dots.vlm1
The official repository of the dots.vlm1 instruct models proposed by rednote-hilab.

## Usage

### Environment Setup

You have two options to set up the environment:

#### Option 1: Using Base Image + Manual Installation
```bash
# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126

# Clone and install our custom SGLang branch
# IMPORTANT: Only our specific SGLang version supports dots.vlm1 models
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
```

#### Option 2: Using Pre-built Image (Recommended)
```bash
# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
```

### Multi-Node Deployment

Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:

**Prerequisites:**
- Model: `rednote-hilab/dots.vlm1.inst`
- Node 1 IP: `10.0.0.1` (master node)
- Node 2 IP: `10.0.0.2` (worker node)

#### Node 1 (Master - rank 0):
```bash
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr 10.0.0.1:23456 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 15553 \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
```

#### Node 2 (Worker - rank 1):
```bash
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr 10.0.0.1:23456 \
    --nnodes 2 \
    --node-rank 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 15553 \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
```
### Configuration Parameters

Key parameters:
- `--tp 16`: Tensor-parallel size of 16 GPUs in total across the cluster (8 GPUs per node with `--nnodes 2`)
- `--nnodes 2`: Total number of nodes in the cluster
- `--node-rank`: Node identifier (0 for master, 1+ for workers)
- `--context-length 65536`: Maximum context length
- `--quantization fp8`: Use FP8 quantization for efficiency
- `--chat-template dots-vlm`: Use the custom chat template for dots.vlm1 models
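
The two launch commands above are identical except for `--node-rank`. A small helper can generate them from one template so the per-node commands cannot drift apart. This is a convenience sketch, not part of the repository. The function name `launch_args` is hypothetical, and the flag values are copied from the commands above.

```python
MODEL = "rednote-hilab/dots.vlm1.inst"
DIST_INIT_ADDR = "10.0.0.1:23456"  # master node address from the prerequisites

def launch_args(node_rank, nnodes=2):
    """Build the sglang.launch_server argument list for one node."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL,
        "--tp", "16",
        "--dist-init-addr", DIST_INIT_ADDR,
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--trust-remote-code",
        "--host", "0.0.0.0",
        "--port", "15553",
        "--context-length", "65536",
        "--max-running-requests", "64",
        "--disable-radix-cache",
        "--mem-fraction-static", "0.8",
        "--chunked-prefill-size", "-1",
        "--chat-template", "dots-vlm",
        "--cuda-graph-max-bs", "64",
        "--quantization", "fp8",
    ]

# Print one command per node; run each on the corresponding machine.
for rank in range(2):
    print(" ".join(launch_args(rank)))
```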

### API Usage

Once the servers are launched, you can access the model through the OpenAI-compatible API:
```bash
curl -X POST http://10.0.0.1:15553/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Please briefly describe this image"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                        }
                    }
                ]
            }
        ],
        "temperature": 0.1,
        "top_p": 0.9,
        "max_tokens": 55000
    }'
```
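
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above. The function names `build_chat_request` and `ask` are illustrative, and the response parsing assumes the usual OpenAI-style chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

def build_chat_request(base_url, prompt, image_url, max_tokens=55000):
    """Build a POST request for /v1/chat/completions with one text part and one image part."""
    payload = {
        "model": "model",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "temperature": 0.1,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(base_url, prompt, image_url):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, image_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `ask("http://10.0.0.1:15553", "Please briefly describe this image", image_url)` would send the same payload as the `curl` command, with `image_url` set to the image address used there.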
