Full Code of opendatalab/WanJuan1.0 for AI

Repository: opendatalab/WanJuan1.0
Branch: main
Commit: 17269a7dadc5
Files: 3
Total size: 37.4 KB

Directory structure:
gitextract_fby9gy9t/

├── License
├── README.md
└── WanJuan1.0-CN.md

================================================
FILE CONTENTS
================================================

================================================
FILE: License
================================================
Attribution 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
    wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More considerations
     for the public:
    wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution 4.0 International Public License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution 4.0 International Public License ("Public License"). To the
extent this Public License may be interpreted as a contract, You are
granted the Licensed Rights in consideration of Your acceptance of
these terms and conditions, and the Licensor grants You such rights in
consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  d. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  e. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  f. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  g. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  h. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  i. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  j. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  k. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

       4. If You Share Adapted Material You produce, the Adapter's
          License You apply must not prevent recipients of the Adapted
          Material from complying with this Public License.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material; and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.


=======================================================================

Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.

Creative Commons may be contacted at creativecommons.org.


================================================
FILE: README.md
================================================
# Intern · WanJuan Multimodal Corpus
**English**🌎|[简体中文](./WanJuan1.0-CN.md)🀄 

![Image](./images/01_宣传图.png)

## Intern · WanJuan 1.0

Intern · WanJuan 1.0 is the first open-source version of the Intern · WanJuan multimodal corpus. It consists of three parts (a text dataset, an image-text dataset, and a video dataset) with a total data volume exceeding 2TB. Building on the corpus assembled by the large model data alliance, the Shanghai AI Lab carried out fine-grained cleaning, deduplication, and value alignment on part of the data to form Intern · WanJuan 1.0, which has four main characteristics: multiple integration, fine processing, value alignment, and ease of use and efficiency.

- **In terms of multiple integration**, Intern · WanJuan 1.0 contains multimodal data such as text, image-text, and video, covering fields such as science and technology, literature, media, education, and law. Training on it has a significant effect on a model's knowledge content, logical reasoning, and generalization ability.

- **In terms of fine processing**, Intern · WanJuan 1.0 has gone through refined data processing steps such as language screening, text extraction, format standardization, rule-based and model-based data filtering and cleaning, multi-scale deduplication, and data quality assessment. It can therefore better meet the needs of subsequent model training.

- **In terms of value alignment**, during the construction of Intern · WanJuan 1.0, the researchers focused on aligning the content with mainstream Chinese values, and improved the purity of the corpus through a combination of algorithms and manual evaluation.

- **In terms of ease of use and efficiency**, the researchers adopted a unified format in Intern · WanJuan 1.0 and provided detailed field descriptions and tool guidance, so that it can be quickly applied to the training of large language models (LLMs) and multimodal large language models (MLLMs).

Currently, Intern · WanJuan 1.0 has been applied to the training of large models such as Intern Multimodal and Intern Puyu. By "digesting" this high-quality corpus, the Intern series models have shown excellent performance in various generative tasks such as semantic understanding, knowledge question answering, visual understanding, and visual question answering.

Paper: [https://arxiv.org/pdf/2308.10755.pdf](https://arxiv.org/pdf/2308.10755.pdf)

<br>

## Intern · WanJuan 1.0 - text dataset

- Introduction

Intern · WanJuan 1.0 Text Dataset is composed of cleaned pre-training corpora from different sources such as web pages, encyclopedias, books, patents, textbooks, and exam questions. It contains more than 500 million documents, and the data size exceeds 1TB. Data in various formats such as html, text, pdf, and epub is converted into a jsonl format with unified fields. After fine-grained cleaning, deduplication, and value alignment, it forms a safe, reliable, and high-quality pre-training corpus.

- Composition

![Image](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySwiaaegnHFwsq4cg1uX3MCNegNkC9CiaCXkHHUicvR951QNT5AdU8V86qg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

- Sample
  ![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySsnhSxvOicUt6sZPRa9S2Yld1Fjd1IibHfyZVicYxCVyP8uHm08niaZxvSg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)
  
```json
{
    "id": "BkORdv3xK7IA0HG7pccr",
    "content": "\\*诗作[222]\n录自索菲娅·马克思的笔记本\n#### 人生\n时光倏忽即逝,\n宛如滔滔流水;\n时光带走的一切,\n永远都不会返回。\n生就是死,\n生就是不断死亡的过程;\n人们奋斗不息,\n却难以摆脱困顿;\n人走完生命的路,\n最后化为乌有;\n他的事业和追求\n湮没于时光的潮流。\n对于人的事业,\n精灵们投以嘲讽的目光;\n因为人的渴望是那样强烈,\n而人生道路是那样狭窄迷茫;\n人在沾沾自喜之后,\n便感到无穷的懊丧;\n那绵绵不尽的悔恨\n深藏在自己的心房;\n人贪婪追求的目标\n其实十分渺小;\n人生内容局限于此,\n那便是空虚的游戏。\n有人自命不凡,\n其实并不伟大;\n这种人的命运,\n就是自我丑化。\n卡尔·马克思\n#### 查理大帝\n使一个高贵心灵深受感动的一切,\n使所有美好心灵欢欣鼓舞的一切,\n如今已蒙上漆黑的阴影,\n野蛮人的手亵渎了圣洁光明。\n巍巍格拉亚山的崇高诗人,\n曾满怀激情把那一切歌颂,\n激越的歌声使那一切永不磨灭,\n诗人自己也沉浸在幸福欢乐之中。\n高贵的狄摩西尼热情奔放,\n曾把那一切滔滔宣讲,\n面对人山人海的广场,\n演讲者大胆嘲讽高傲的菲力浦国王。\n那一切就是崇高和美,\n那一切笼罩着缪斯的神圣光辉,\n那一切使缪斯的子孙激动陶醉,\n如今却被野蛮人无情地摧毁。\n这时查理大帝挥动崇高魔杖,\n呼唤缪斯重见天光;\n他使美离开了幽深的墓穴,\n他让一切艺术重放光芒。\n他改变陈规陋习,\n他发挥教育的神奇力量;\n民众得以安居乐业,\n因为可靠的法律成了安全的保障。\n他进行过多次战争,\n杀得尸横遍野血染疆场;\n他雄才大略英勇顽强,\n但辉煌的胜利中也隐含祸殃;\n他为善良的人类赢得美丽花冠,\n这花冠比一切战功都更有分量;\n他战胜了那个时代的蒙昧,\n这就是他获得的崇高奖赏。\n在无穷无尽的世界历史上,\n他将永远不会被人遗忘,\n历史将为他编织一顶桂冠,\n这桂冠决不会淹没于时代的激浪。\n卡尔·马克思于1833年\n#### 莱茵河女神\n**叙事诗**\n(见本卷第885—889页)\n#### 盲女\n**叙事诗**\n(见本卷第852—858页)\n#### 两重天\n**乘马车赴柏林途中**\n(见本卷第475—478页)\n#### 父亲诞辰献诗。1836年\n**(见本卷第845—846页)**\n#### 席勒\n**十四行诗两首**\n(见本卷第846—847页)\n#### 歌德\n**十四行诗两首**\n(见本卷第848—849页)\n#### 女儿\n**叙事诗**\n(见本卷第838—841页)\n#### 凄惨的女郎\n**叙事诗**\n(见本卷第533—537页)\n卡·马克思写于1833年一大约1837年\n第一次用原文发表于《马克思恩格斯全集》1975年历史考证版第1部分第1卷\n并用俄文发表于《马克思恩格斯全集》1975年莫斯科版第40卷\n原文是德文\n中文根据《马克思恩格斯全集》1975年历史考证版第1部分第1卷翻译\n---\n**注释:**\n[222]马克思的这些诗作是他的姐姐索菲娅抄录在一个笔记本里的。除了马克思的诗作外,笔记本里还有其他人的诗作以及索菲娅自己和她的亲友的个人记事。马克思的这些诗作,除了《人生》和《查理大帝》外都在马克思的几本诗集和索菲娅的纪念册里出现过。《查理大帝》一诗注明写作日期是1833年,可见马克思早在中学时代就已开始写诗了。《盲女》注明写作日期是1835年。为祝贺父亲生日而献给亨利希·马克思的诗作的写作日期应该不晚于1836年初。——913。"
}
```

- Field

  - **id:** [string type] the unique ID of the document.
  - **content:** [string type] the content of the document, in plain text or Markdown format.
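
As a minimal sketch of how these two fields might be consumed, the following Python reads jsonl input line by line and yields `(id, content)` pairs. The function name `iter_documents` and the in-memory sample records are hypothetical; only the `id` and `content` field names come from the description above.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, content) pairs from jsonl lines, one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["id"], record["content"]


# In-memory stand-in for an open jsonl file (hypothetical records).
sample_lines = [
    '{"id": "doc-1", "content": "plain text"}',
    '{"id": "doc-2", "content": "#### heading\\nbody"}',
]
for doc_id, content in iter_documents(sample_lines):
    print(doc_id, len(content))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, which keeps memory use flat for large files.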

<br>

## Intern · WanJuan 1.0 - image-text dataset

- Introduction

The Intern · WanJuan 1.0 image-text dataset comes mainly from public web pages, which are processed into documents with interleaved images and text. It contains more than 22 million documents, and the data size exceeds 140GB (excluding images), covering news events, people, natural landscapes, social life, and other fields. The data is in a unified jsonl format, in which images are given as URLs. To fetch the image data, you can use the following script:
https://github.com/opendatalab/image-downloader
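
The linked script handles the download itself. As a rough illustration of the preparatory step only, the sketch below collects the unique image URLs referenced by a set of records, keyed by the `sha256` value stored next to each URL. The function name `collect_image_targets` is hypothetical, and the field names `img_list`, `url`, and `sha256` are taken from the sample record in this README; this is not the downloader's actual interface.

```python
import json
from typing import Dict, Iterable


def collect_image_targets(lines: Iterable[str]) -> Dict[str, str]:
    """Map each image's sha256 value to its URL, skipping repeated images."""
    targets: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for image in record.get("img_list", []):
            # setdefault keeps the first URL seen for a given sha256
            targets.setdefault(image["sha256"], image["url"])
    return targets
```

Keying by `sha256` is one possible design choice: it gives each image a stable identifier that could also serve as a local file name.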

- Composition
![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySTG634PTTIbmFIJlDZUfKGrXYibkgXCU3E58mrZIn0ibW0oia2mUOrv31Q/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

- Sample
![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySJWLdsY1qx1EAI8xAra8HnEunics0sqTQjNI6VhzM3SdINw3ojvtP9Uw/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

```json
{
    "id": "BkKuk1zxK3YAbgNSWYik",
    "img_list": [
        {
            "url": "http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg",
            "sha256": "019cca88f37ae5ffe59ad48ad5c392fe64e489f08e841b6ea50c79c18f5c6ec3",
            "caption": "",
            "width": "400",
            "height": "266"
        }
    ],
    "content": "![](http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg)\n奋斗百年路 启航新征程\n走进觉悟社当年社员开会的房间,桌子中间摆放的一盘纸条格外引人注目,周恩来“伍豪”和邓颖超“逸豪”的笔名就诞生于此。\n“为了斗争的需要,觉悟社社员们采取抓阄的办法,以号取名。”1月19日,天津觉悟社纪念馆助理馆员迟爱民讲述了102年前的情景:当时年纪最小的邓颖超抓到了最小数字1号,所以叫“逸豪”。周恩来抓到5号,就取名“伍豪”。\n时间回到1919年那个思潮澎湃的年代。在天津,以周恩来为代表的一批以天下为己任的先进分子,在众多新思潮中艰难地探索革命真理。通过觉悟社的锻炼和洗礼,其主要成员成长为我国早期的共产主义者。周恩来也在这个时期成为马克思主义的宣传者。\n诞生:冲破封建束缚探索革命真理\n觉悟社成立于“五四运动”在天津发展到最高潮的阶段。\n觉悟社纪念馆中的一张合影,记录下了这一张张充满青春朝气的脸庞。他们神色凝重,目光坚定,这些人就是觉悟社成立之初的部分社员。\n“这个比一般学生爱国团体更加严密的组织的成立,源于之前一次赴京请愿斗争。”迟爱民介绍,1919年9月2日,周恩来等天津各界联合会、学生联合会、女界爱国同志会的先进青年在返津途中,经过交流,一致认为,应该成立一个研究新思潮,探索革命真理,冲破封建习俗束缚,由男女同学共同组建的团体。\n1919年9月16日,在天津东南角草场庵天津学生联合会办公室里,革命青年团体觉悟社诞生了。出席成立会的男女各10名成员成为最初的社员,包括周恩来、邓颖超、马骏、刘清扬、郭隆真等。\n周恩来执笔起草了《觉悟的宣言》。觉悟社成立后,以“革心”和“革新”的精神组织演讲,出版刊物《觉悟》,探讨研究新思潮,很快就成为天津学生爱国运动的中坚力量。\n引领:觉悟社成立5天后李大钊应邀前来\n在波澜起伏的斗争中,周恩来和觉悟社社员们迫切感到,要用先进思想武装头脑。\n觉悟社社员谌小岑曾回忆道,在觉悟社成立后第5天,我国最早的马克思主义者、中国共产党先驱李大钊就应邀到觉悟社座谈。李大钊听完邓颖超对觉悟社的介绍后,对觉悟社深表赞许,他表示“觉悟社是男女平等、社交公开的先行”。\n在李大钊的启发下,觉悟社成员阅读了李大钊发表在《新青年》上的《庶民的胜利》《布尔什维主义的胜利》《我的马克思主义观》等文章。还邀请徐谦、包世杰、钱玄同、刘半农等来演讲,并召开讨论会。\n天津市委党校文史教研部副主任徐娜表示,觉悟社社员们学习、讨论中国最早的马列主义文献,并积极投身实践斗争,为他们选择信仰马克思主义、走上共产主义道路进行了最初的启蒙与引导。\n影响:觉悟社多人加入中国共产党\n1920年1月29日,在抵制日货的斗争中,周恩来、马骏等人被捕,成立仅4个月的觉悟社受到沉重打击。纪念馆展厅中的两本书《警厅拘留记》和《检厅日录》,记录了青年们斗争的艰难和残酷。身陷囹圄的周恩来先后用6个晚上,向狱友介绍马克思主义学说。出狱后,编写了3.5万字的《警厅拘留记》和《检厅日录》。在后来旅法期间,周恩来说“我的思想是颤动于狱中”,可以说这是周恩来马克思主义世界观形成的重要时期。\n1920年11月,随着周恩来、刘清扬、郭隆真等人赴法国勤工俭学,觉悟社的社员们开始星散,觉悟社的集体活动停止……\n觉悟社存在的时间虽然不长,但为一批年轻人树立马克思主义信仰奠定了坚实基础。徐娜表示,觉悟社作为“五四”运动爆发之后在天津影响最广泛、作用最突出的进步学生组织,其表现出的反对封建主义、憎恨一切剥削和压迫的进步思想,为接受马克思主义作好了准备。随后,远赴欧洲勤工俭学的周恩来加入中国共产党八个发起组之一的巴黎共产主义小组,成为中国共产党创建人之一,而其他的觉悟社主要社员如马骏、邓颖超、郭隆真等都加入了中国共产党,成为革命的骨干力量。"
}
```

- Field

  - **id:** [string type] the unique ID of the document.
  - **img_list:** [array type] the list of images contained in the document. Each image entry includes its network URL, the sha256 of the URL, a caption, and the width and height.
  - **content:** [string type] the content of the document, in plain text or Markdown format.
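
The sketch below shows one way to work with these fields in Python, assuming a record has already been parsed into a dict. It converts `img_list` entries into typed objects (the sample stores `width` and `height` as strings) and checks whether every Markdown image link in `content` also appears in `img_list`. The names `ImageInfo`, `parse_images`, and `unlisted_image_urls` are hypothetical, and the regular expression only covers the simple `![alt](url)` form seen in the sample.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches Markdown image syntax ![alt](url) and captures the URL.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


@dataclass
class ImageInfo:
    url: str
    sha256: str
    caption: str
    width: int
    height: int


def parse_images(record: dict) -> List[ImageInfo]:
    """Convert img_list entries to typed objects with integer sizes."""
    return [
        ImageInfo(
            url=item["url"],
            sha256=item["sha256"],
            caption=item.get("caption", ""),
            width=int(item["width"]),
            height=int(item["height"]),
        )
        for item in record.get("img_list", [])
    ]


def unlisted_image_urls(record: dict) -> List[str]:
    """Return image URLs embedded in content that are missing from img_list."""
    listed = {item["url"] for item in record.get("img_list", [])}
    found = MD_IMAGE.findall(record.get("content", ""))
    return [url for url in found if url not in listed]
```

An empty result from `unlisted_image_urls` suggests the inline links and the image list agree for that record.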

<br>

## Intern · WanJuan 1.0 - video dataset

- Introduction

Intern · WanJuan 1.0 Video Dataset comes mainly from China Media Group and Shanghai Media Group. It contains various types of program videos, with more than 1,000 video files and a data size of more than 900GB. The content covers military affairs, literature and art, sports, nature, society, knowledge, video art, media, food, historical documentaries, science and education, and more.

- Composition
  
![Image](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySQnSGLrzp6tUVn2P5kZ5RuERiaibf5vSFibJUZtFWhT8rZmaslBTjicBI4Q/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

- Sample
  ![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjyS9H6XnjNibfo5DJh7hscAGmeSvJ6ohVgnBAKk2blTSVIqNUKXicQ8984g/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

## Download link

To download the complete dataset, please go to: 
[https://opendatalab.org.cn/WanJuan1.0](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0/tree/main?source=R2l0aHVi)


## License

The whole Intern · WanJuan 1.0 adopts the CC BY 4.0 license agreement. You are free to share and adapt this dataset, subject to the following conditions:

- Attribution: You must give appropriate attribution to the author, provide a link to the license, and indicate whether modifications were made to the original dataset. You may do so in any reasonable way, but not in any way that implies that the licensor endorses you or your use.

- No Additional Restrictions: You may not use legal terms or technological measures to restrict others from doing anything the license permits.

For the complete content of the agreement, please visit [CC BY 4.0 Agreement Full Text](https://creativecommons.org/licenses/by/4.0/).


## Special attention items

Note that some subsets of this dataset may be subject to other license agreements. Before using a specific subset, please read the relevant agreement carefully to ensure compliant use. For more detailed license information, please check the relevant documents or metadata of the specific subset.

As a non-profit organization, OpenDataLab advocates a harmonious and friendly open-source communication environment. If you find any content in the open-source dataset that infringes your legal rights, you can send an email to OpenDataLab@pjlab.org.cn. In the email, please give a detailed description of the relevant infringement and provide us with the relevant proof of ownership. We will start the investigation and handling process within 3 working days and take the necessary measures. You must ensure that your complaint is truthful; otherwise, you will bear sole responsibility for any adverse consequences of the measures taken.

## Change Log
```
2023-10-20: Security upgrade: further cleaned the corpus to improve its purity; the total file size after the upgrade is 2047.6GB

2023-08-14: First release
```

## Citation

```
@misc{he2023wanjuan,
      title={WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models}, 
      author={Conghui He and Zhenjiang Jin and Chao Xu and Jiantao Qiu and Bin Wang and Wei Li and Hang Yan and Jiaqi Wang and Dahua Lin},
      year={2023},
      eprint={2308.10755},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


================================================
FILE: WanJuan1.0-CN.md
================================================
# Intern · WanJuan Multimodal Corpus
 [English🌎](./README.md)|**简体中文**🀄 

![Image](./images/01_宣传图.png)


## 书生·万卷1.0

书生·万卷1.0为书生·万卷多模态语料库的首个开源版本,包含文本数据集、图文数据集、视频数据集三部分,数据总量超过2TB。基于大模型数据联盟构建的语料库,上海AI实验室对其中部分数据进行细粒度清洗、去重以及价值对齐,形成了书生·万卷1.0,具备多元融合、精细处理、价值对齐、易用高效等四大特征。

**- 在多元融合方面**,书生·万卷1.0包含文本、图文、视频等多模态数据,范围覆盖科技、文学、媒体、教育、法律等多个领域,在训练提升模型知识含量、逻辑推理和泛化能力方面具有显著效果。  

**- 在精细处理方面**,书生·万卷1.0经历了语言甄别、正文抽取、格式标准化、基于规则及模型的数据过滤与清洗、多尺度去重、数据质量评估等精细化数据处理环节,因而能更好地适配后续的模型训练需求。  

**- 在价值对齐方面**,研究人员在书生·万卷1.0的构建过程中,着眼于内容与中文主流价值观的对齐,通过算法与人工评估结合的方式,提升了语料的纯净度。  

**- 在易用高效方面**,研究人员在书生·万卷1.0采用统一格式,并提供详细的字段说明和工具指导,使其兼顾了易用性和效率,可快速应用于语言、多模态等大模型训练。  


目前,书生·万卷1.0已被应用于书生·多模态、书生·浦语的训练。通过对高质量语料的“消化”,书生系列模型在语义理解、知识问答、视觉理解、视觉问答等各类生成式任务表现出的优异性能。

Paper link: [https://arxiv.org/pdf/2308.10755.pdf](https://arxiv.org/pdf/2308.10755.pdf)

<br>

## Intern · WanJuan Text Dataset 1.0

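- Introduction

Intern · WanJuan Text Dataset 1.0 consists of cleaned pre-training corpora from diverse sources such as web pages, encyclopedias, books, patents, textbooks, and exam questions. It contains more than 500 million documents and exceeds 1 TB in size. Data in formats such as html, text, pdf, and epub was converted into jsonl with a unified set of fields, then finely cleaned, deduplicated, and value-aligned to form a safe, trustworthy, high-quality pre-training corpus.

- Composition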

![Image](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySwiaaegnHFwsq4cg1uX3MCNegNkC9CiaCXkHHUicvR951QNT5AdU8V86qg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

- Sample
  ![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySsnhSxvOicUt6sZPRa9S2Yld1Fjd1IibHfyZVicYxCVyP8uHm08niaZxvSg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

```json
{
    "id": "BkORdv3xK7IA0HG7pccr",
    "content": "\\*诗作[222]\n录自索菲娅·马克思的笔记本\n#### 人生\n时光倏忽即逝,\n宛如滔滔流水;\n时光带走的一切,\n永远都不会返回。\n生就是死,\n生就是不断死亡的过程;\n人们奋斗不息,\n却难以摆脱困顿;\n人走完生命的路,\n最后化为乌有;\n他的事业和追求\n湮没于时光的潮流。\n对于人的事业,\n精灵们投以嘲讽的目光;\n因为人的渴望是那样强烈,\n而人生道路是那样狭窄迷茫;\n人在沾沾自喜之后,\n便感到无穷的懊丧;\n那绵绵不尽的悔恨\n深藏在自己的心房;\n人贪婪追求的目标\n其实十分渺小;\n人生内容局限于此,\n那便是空虚的游戏。\n有人自命不凡,\n其实并不伟大;\n这种人的命运,\n就是自我丑化。\n卡尔·马克思\n#### 查理大帝\n使一个高贵心灵深受感动的一切,\n使所有美好心灵欢欣鼓舞的一切,\n如今已蒙上漆黑的阴影,\n野蛮人的手亵渎了圣洁光明。\n巍巍格拉亚山的崇高诗人,\n曾满怀激情把那一切歌颂,\n激越的歌声使那一切永不磨灭,\n诗人自己也沉浸在幸福欢乐之中。\n高贵的狄摩西尼热情奔放,\n曾把那一切滔滔宣讲,\n面对人山人海的广场,\n演讲者大胆嘲讽高傲的菲力浦国王。\n那一切就是崇高和美,\n那一切笼罩着缪斯的神圣光辉,\n那一切使缪斯的子孙激动陶醉,\n如今却被野蛮人无情地摧毁。\n这时查理大帝挥动崇高魔杖,\n呼唤缪斯重见天光;\n他使美离开了幽深的墓穴,\n他让一切艺术重放光芒。\n他改变陈规陋习,\n他发挥教育的神奇力量;\n民众得以安居乐业,\n因为可靠的法律成了安全的保障。\n他进行过多次战争,\n杀得尸横遍野血染疆场;\n他雄才大略英勇顽强,\n但辉煌的胜利中也隐含祸殃;\n他为善良的人类赢得美丽花冠,\n这花冠比一切战功都更有分量;\n他战胜了那个时代的蒙昧,\n这就是他获得的崇高奖赏。\n在无穷无尽的世界历史上,\n他将永远不会被人遗忘,\n历史将为他编织一顶桂冠,\n这桂冠决不会淹没于时代的激浪。\n卡尔·马克思于1833年\n#### 莱茵河女神\n**叙事诗**\n(见本卷第885—889页)\n#### 盲女\n**叙事诗**\n(见本卷第852—858页)\n#### 两重天\n**乘马车赴柏林途中**\n(见本卷第475—478页)\n#### 父亲诞辰献诗。1836年\n**(见本卷第845—846页)**\n#### 席勒\n**十四行诗两首**\n(见本卷第846—847页)\n#### 歌德\n**十四行诗两首**\n(见本卷第848—849页)\n#### 女儿\n**叙事诗**\n(见本卷第838—841页)\n#### 凄惨的女郎\n**叙事诗**\n(见本卷第533—537页)\n卡·马克思写于1833年一大约1837年\n第一次用原文发表于《马克思恩格斯全集》1975年历史考证版第1部分第1卷\n并用俄文发表于《马克思恩格斯全集》1975年莫斯科版第40卷\n原文是德文\n中文根据《马克思恩格斯全集》1975年历史考证版第1部分第1卷翻译\n---\n**注释:**\n[222]马克思的这些诗作是他的姐姐索菲娅抄录在一个笔记本里的。除了马克思的诗作外,笔记本里还有其他人的诗作以及索菲娅自己和她的亲友的个人记事。马克思的这些诗作,除了《人生》和《查理大帝》外都在马克思的几本诗集和索菲娅的纪念册里出现过。《查理大帝》一诗注明写作日期是1833年,可见马克思早在中学时代就已开始写诗了。《盲女》注明写作日期是1835年。为祝贺父亲生日而献给亨利希·马克思的诗作的写作日期应该不晚于1836年初。——913。"
}
```

**Fields**  

**-id:** [string] The unique ID of the document.  

**- content:** [string] The content of the document, in plain text or Markdown format.  
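
Since each line of a jsonl file holds one JSON object with the two fields above, records can be streamed one at a time instead of loading a whole file into memory. The following is a minimal sketch rather than an official loader: the helper name `iter_text_records` is hypothetical, and it assumes an uncompressed, UTF-8 encoded jsonl file.

```python
import json
from typing import Iterator


def iter_text_records(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a jsonl file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Both documented fields are strings; fail loudly if a line deviates.
            if not isinstance(record.get("id"), str):
                raise ValueError(f"line {line_no}: 'id' is missing or not a string")
            if not isinstance(record.get("content"), str):
                raise ValueError(f"line {line_no}: 'content' is missing or not a string")
            yield record
```

A typical use would be `for rec in iter_text_records("part-000.jsonl"): ...`, where `part-000.jsonl` is a placeholder file name and not an actual file in the release.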

<br>

## Intern · WanJuan Image-Text Dataset 1.0

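- Introduction

The data in Intern · WanJuan Image-Text Dataset 1.0 comes mainly from public web pages and has been processed into interleaved image-text documents. It contains more than 22 million documents and exceeds 140 GB in size (excluding images), covering domains such as news events, people, natural landscapes, and social life. All data is in a unified jsonl format, and images are given as URLs. To fetch the image data, you can use the following script: https://github.com/opendatalab/image-downloader

- Composition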
  ![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySTG634PTTIbmFIJlDZUfKGrXYibkgXCU3E58mrZIn0ibW0oia2mUOrv31Q/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

- Sample
![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySJWLdsY1qx1EAI8xAra8HnEunics0sqTQjNI6VhzM3SdINw3ojvtP9Uw/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

```json
{
    "id": "BkKuk1zxK3YAbgNSWYik",
    "img_list": [
        {
            "url": "http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg",
            "sha256": "019cca88f37ae5ffe59ad48ad5c392fe64e489f08e841b6ea50c79c18f5c6ec3",
            "caption": "",
            "width": "400",
            "height": "266"
        }
    ],
    "content": "![](http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg)\n奋斗百年路 启航新征程\n走进觉悟社当年社员开会的房间,桌子中间摆放的一盘纸条格外引人注目,周恩来“伍豪”和邓颖超“逸豪”的笔名就诞生于此。\n“为了斗争的需要,觉悟社社员们采取抓阄的办法,以号取名。”1月19日,天津觉悟社纪念馆助理馆员迟爱民讲述了102年前的情景:当时年纪最小的邓颖超抓到了最小数字1号,所以叫“逸豪”。周恩来抓到5号,就取名“伍豪”。\n时间回到1919年那个思潮澎湃的年代。在天津,以周恩来为代表的一批以天下为己任的先进分子,在众多新思潮中艰难地探索革命真理。通过觉悟社的锻炼和洗礼,其主要成员成长为我国早期的共产主义者。周恩来也在这个时期成为马克思主义的宣传者。\n诞生:冲破封建束缚探索革命真理\n觉悟社成立于“五四运动”在天津发展到最高潮的阶段。\n觉悟社纪念馆中的一张合影,记录下了这一张张充满青春朝气的脸庞。他们神色凝重,目光坚定,这些人就是觉悟社成立之初的部分社员。\n“这个比一般学生爱国团体更加严密的组织的成立,源于之前一次赴京请愿斗争。”迟爱民介绍,1919年9月2日,周恩来等天津各界联合会、学生联合会、女界爱国同志会的先进青年在返津途中,经过交流,一致认为,应该成立一个研究新思潮,探索革命真理,冲破封建习俗束缚,由男女同学共同组建的团体。\n1919年9月16日,在天津东南角草场庵天津学生联合会办公室里,革命青年团体觉悟社诞生了。出席成立会的男女各10名成员成为最初的社员,包括周恩来、邓颖超、马骏、刘清扬、郭隆真等。\n周恩来执笔起草了《觉悟的宣言》。觉悟社成立后,以“革心”和“革新”的精神组织演讲,出版刊物《觉悟》,探讨研究新思潮,很快就成为天津学生爱国运动的中坚力量。\n引领:觉悟社成立5天后李大钊应邀前来\n在波澜起伏的斗争中,周恩来和觉悟社社员们迫切感到,要用先进思想武装头脑。\n觉悟社社员谌小岑曾回忆道,在觉悟社成立后第5天,我国最早的马克思主义者、中国共产党先驱李大钊就应邀到觉悟社座谈。李大钊听完邓颖超对觉悟社的介绍后,对觉悟社深表赞许,他表示“觉悟社是男女平等、社交公开的先行”。\n在李大钊的启发下,觉悟社成员阅读了李大钊发表在《新青年》上的《庶民的胜利》《布尔什维主义的胜利》《我的马克思主义观》等文章。还邀请徐谦、包世杰、钱玄同、刘半农等来演讲,并召开讨论会。\n天津市委党校文史教研部副主任徐娜表示,觉悟社社员们学习、讨论中国最早的马列主义文献,并积极投身实践斗争,为他们选择信仰马克思主义、走上共产主义道路进行了最初的启蒙与引导。\n影响:觉悟社多人加入中国共产党\n1920年1月29日,在抵制日货的斗争中,周恩来、马骏等人被捕,成立仅4个月的觉悟社受到沉重打击。纪念馆展厅中的两本书《警厅拘留记》和《检厅日录》,记录了青年们斗争的艰难和残酷。身陷囹圄的周恩来先后用6个晚上,向狱友介绍马克思主义学说。出狱后,编写了3.5万字的《警厅拘留记》和《检厅日录》。在后来旅法期间,周恩来说“我的思想是颤动于狱中”,可以说这是周恩来马克思主义世界观形成的重要时期。\n1920年11月,随着周恩来、刘清扬、郭隆真等人赴法国勤工俭学,觉悟社的社员们开始星散,觉悟社的集体活动停止……\n觉悟社存在的时间虽然不长,但为一批年轻人树立马克思主义信仰奠定了坚实基础。徐娜表示,觉悟社作为“五四”运动爆发之后在天津影响最广泛、作用最突出的进步学生组织,其表现出的反对封建主义、憎恨一切剥削和压迫的进步思想,为接受马克思主义作好了准备。随后,远赴欧洲勤工俭学的周恩来加入中国共产党八个发起组之一的巴黎共产主义小组,成为中国共产党创建人之一,而其他的觉悟社主要社员如马骏、邓颖超、郭隆真等都加入了中国共产党,成为革命的骨干力量。"
}
```

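**Fields**  

**-id:** [string] The unique ID of the document.  

**-img_list:** [array] The list of images contained in the document. Each image entry includes its web URL, the sha256 of the URL, and its width and height.  

**-content:** [string] The content of the document, in plain text or Markdown format. 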
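
In the sample above, the image URL appears both in `img_list` and as a Markdown image reference inside `content`, and `width` and `height` are stored as strings. Under the assumption that other records follow the same pattern, the sketch below parses one jsonl line, converts the dimensions to integers, and lists any `img_list` URLs that are not referenced in `content`. The function name, the output keys, and the regular expression are illustrative and not part of any official tooling.

```python
import json
import re

# Matches Markdown image syntax ![alt](url) and captures the url part.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def parse_image_text_record(line: str) -> dict:
    """Parse one image-text jsonl line (hypothetical helper)."""
    record = json.loads(line)
    images = []
    for img in record.get("img_list", []):
        images.append({
            "url": img["url"],
            "sha256": img["sha256"],
            "caption": img.get("caption", ""),
            # The sample stores dimensions as strings such as "400".
            "width": int(img["width"]),
            "height": int(img["height"]),
        })
    # URLs that appear as Markdown images in the document body.
    referenced = set(MD_IMAGE.findall(record["content"]))
    unreferenced = [i["url"] for i in images if i["url"] not in referenced]
    return {"id": record["id"], "images": images, "unreferenced": unreferenced}
```

If some records carry empty or non-numeric dimensions, the `int(...)` conversions would raise `ValueError`; whether that happens in the released data is not stated here, so callers may want to catch it.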

<br>

## Intern · WanJuan Video Dataset 1.0

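- Introduction

The data in Intern · WanJuan Video Dataset 1.0 comes mainly from China Media Group (CMG) and Shanghai Media Group (SMG), and contains many types of program footage. It includes more than 1,000 video files with a total size of over 900 GB. The content covers military, arts, sports, nature, the real world, knowledge, video art, media, food, historical documentaries, science and education, and other areas.

- Composition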

![Image](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjySQnSGLrzp6tUVn2P5kZ5RuERiaibf5vSFibJUZtFWhT8rZmaslBTjicBI4Q/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

- Sample
![](https://mmbiz.qpic.cn/sz_mmbiz_png/7yjDpC9UfD7vkz4XTP9dNyQZNeGmJjyS9H6XnjNibfo5DJh7hscAGmeSvJ6ohVgnBAKk2blTSVIqNUKXicQ8984g/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1)

<br>

## Download

To download the full dataset, visit: [https://opendatalab.org.cn/WanJuan1.0](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0/tree/main?source=R2l0aHVi)

<br>

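## License

Intern · WanJuan 1.0 as a whole is released under the CC BY 4.0 license. You are free to share and adapt the dataset, provided you comply with the following conditions:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate whether changes were made (to the original dataset). You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

- No additional restrictions: You may not apply legal terms or technological measures that restrict others from doing anything the license permits.

For the full license text, see [the CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).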

<br>

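### Special Notes

Please note that some subsets of this dataset may be subject to other license terms. Before using a particular subset, read the relevant license carefully to make sure your use is compliant. More detailed license information can be found in the documentation or metadata of the specific subset.

As a non-profit organization, OpenDataLab advocates a harmonious and friendly environment for open-source exchange. If you find content in the open-source dataset that infringes your legitimate rights and interests, you can send an email to (OpenDataLab@pjlab.org.cn). In the email, please describe the facts of the infringement in detail and provide us with the relevant proof of ownership. We will start the investigation and handling process within 3 working days and take the necessary measures (such as removing the relevant data). However, you must ensure that your complaint is truthful; otherwise, you alone will bear any adverse consequences of the measures taken.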

<br>

### Change Log
```
2023-10-20: Security upgrade: further cleaned the corpus and improved its purity; the total file size after the upgrade is 2047.6GB

2023-08-14: First release
```

<br>

## Citation

```
@misc{he2023wanjuan,
      title={WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models}, 
      author={Conghui He and Zhenjiang Jin and Chao Xu and Jiantao Qiu and Bin Wang and Wei Li and Hang Yan and Jiaqi Wang and Dahua Lin},
      year={2023},
      eprint={2308.10755},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
