[
  {
    "path": ".gitmodules",
    "content": "[submodule \"data-preparation\"]\n\tpath = pipeline_scripts/common_crawl/data-preparation\n\turl = https://github.com/bigscience-workshop/data-preparation.git\n[submodule \"deduplicate-text-datasets\"]\n\tpath = pipeline_scripts/common_crawl/deduplicate-text-datasets\n\turl = https://github.com/TristanThrush/deduplicate-text-datasets.git\n"
  },
  {
    "path": "LICENSE",
    "content": "Copyright 2022- The Hugging Face team. All rights reserved.\n\n                                Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative 
Works\" shall mean any work, whether in Source or Object\n      form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "# Online Language Modelling Dataset Pipeline\n\nThis repo enables you to pull a large and up-to-date text corpus from the web. It uses state-of-the-art processing methods to produce a clean text dataset that you can immediately use to pretrain a large language model, like BERT, GPT, or BLOOM. The main use-case for this repo is the Online Language Modelling Project, where we want to keep a language model up-to-date by pretraining it on the latest Common Crawl and Wikipedia dumps every month or so. You can see the models for the OLM project here: https://huggingface.co/olm. They actually get better performance than their original static counterparts.\n\nSpecifically, this repo has modular Python commands that enable you to:\n* Specify Common Crawl web snapshots, or just Wikipedia snapshots. Then pull the data.\n* Filter the data for a particular language, like English or French.\n* Run the OSCAR filters used by BigScience for the BLOOM language model. These filters ensure some level of text quality and reduce pornographic content.\n* Deduplicate the data.\n\nThis code is also fairly parallelized, although it can certianly be improved further. It can process over a terabyte from Common Crawl in a day or two, and all of English Wikipedia in less than an hour if you have:\n* A machine with a lot of CPUs and memory.\n* A fast internet connection.\n\n## Setup\n1. If you want to use this repo to generate a decent amount of data, get a machine with lots of CPUs and memory. We use an `n2d-standard-224` running `Ubuntu 20.04 LTS` on GCP. Add Terabytes of disk space too. You may need an even larger machine if you want to process close to 100% of a Common Crawl snapshot or several snapshots, particularly due to how much memory the deduplication process uses. Alternatively, you can specify in the deduplication arguments that you want to deduplicate the dataset in chunks so your memory doesn't explode.\n2. 
Clone with submodules: `git clone --recursive git@github.com:huggingface/olm-datasets.git`\n3. Install Cargo (the Rust package manager) with `curl https://sh.rustup.rs -sSf | sh`. Then install Ungoliant with `cargo install ungoliant@1.2.3`. You may need to install gcc and cmake first.\n4. Set up a Python 3.9 environment, and run `pip install -r requirements.txt`.\n5. Run `huggingface-cli login`. This CLI should have been installed from `requirements.txt`. To log in, you need to paste a token from your account at [https://huggingface.co](https://huggingface.co). This step is necessary for the pipeline to push the generated datasets to your Hugging Face account.\n\n## Getting a clean and up-to-date Common Crawl corpus\n\nFollow the instructions at [pipeline_scripts/common_crawl](pipeline_scripts/common_crawl).\n\nHere is the output dataset to expect from a 20% random segment sample of the August 2022 Common Crawl snapshot: [https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20](https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20)\n\n## Getting a clean and up-to-date Wikipedia corpus\n\nFollow the instructions at [pipeline_scripts/wikipedia](pipeline_scripts/wikipedia).\n\nHere is the output dataset to expect from a September 2022 snapshot of Wikipedia: [https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920](https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920)\n\n## Analyzing the corpora\n\nFollow the instructions at [analysis_scripts](analysis_scripts).\n\nHere is a tweet thread that uses these scripts: [https://twitter.com/TristanThrush/status/1582356055794733057](https://twitter.com/TristanThrush/status/1582356055794733057)\n\nHere is another tweet thread that dives a little deeper:\n[https://twitter.com/TristanThrush/status/1588156731909029889](https://twitter.com/TristanThrush/status/1588156731909029889)\n\nAnd here is a Colab where you can quickly run some of the analysis yourself! 
[https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing](https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing)\n\n## Citation\n\n```\n@misc{thrush2022pipeline,\n    title={Online Language Modelling Data Pipeline},\n    author={Tristan Thrush and Helen Ngo and Nathan Lambert and Douwe Kiela},\n    year={2022},\n    howpublished={\\url{https://github.com/huggingface/olm-datasets}}\n}\n```\n"
  },
  {
    "path": "analysis_scripts/README.md",
    "content": "# OLM Analysis\n\n## To analyze for term counts accross various datasets\n\nThis command reports the count of terms associated with events that happened over summer 2022, accross chronologically ordered summer 2022 OLM datasets. We would expect the counts to go up over the summer:\n\n```\npython term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 --input_dataset_pretty_names \"May\" \"June/July\" \"August\" --terms \"gentleminion\" \"monkeypox outbreak\" \"inflation reduction act of 2022\" \"quiet quitting\" \"jonas vingegaard\" --plot_title=\"Count of Terms in 2022 Summer CC OLM Datasets\" --analysis_column=text --split=train --num_proc=224 --output_filename=summer_2022_term_counts.png --load_from_hub_instead_of_disk --ylabel Count\n```\n\nHere is the resulting figure:\n\n![summer_2022_term_counts](https://user-images.githubusercontent.com/20826878/200715141-6ce73388-7d6a-4d05-bbf4-88e1f2a3c62c.png)\n\nThis command reports the count of words with the highest usage increase between the start of summer 2022 and the fall of 2022, out of all of the frequent (> mean + std) words in the dataset with only alphabetic characters, lowercased, and split by spaces:\n\n```\npython term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names \"May\" \"Jun/Jul\" \"Aug\" \"Sep/Oct\" --num_terms_to_find 5 --plot_title=\"Top 5 Words with Highest Usage Increase\" --analysis_column=text --split=train --num_proc=224 --output_filename=top_5_term_counts_heatmap.png --load_from_hub_instead_of_disk --ylabel \"Word\" --as_heatmap --heatmap_bar_label \"Percent Increase\" --xlabel \"Internet Snapshot\" 
--normalize_axis=1 --cache_dir=term_counts_cache_top_5 --percent_increase --annotation \"To avoid spurious results from words with small counts, we only considered frequent words. A word is considered frequent if the count is greater than a standard deviation above the mean count. Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets.\"\n```\n\nHere is the resulting figure:\n\n![top_5_term_counts_heatmap](https://user-images.githubusercontent.com/20826878/200715219-ce3b6fa4-e9f6-4dac-b594-caa052e759a0.png)\n\nThis command reports the count of date mentions in the text between summer 2022 and fall 2022:\n\n```\npython term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names \"May\" \"Jun/Jul\" \"Aug\" \"Sep/Oct\" --terms 2022/05 2022/06 2022/07 2022/08 2022/09 --plot_title=\"Relative Freq of Dates in Webpage Text\" --analysis_column=text --split=train --num_proc=224 --output_filename=date_term_counts_heatmap_text.png --load_from_hub_instead_of_disk --as_heatmap --ylabel \"Date (YYYY/MM)\" --term_pretty_names May Jun Jul Aug Sep --cache_dir term_counts_cache_date_text --xlabel \"Internet Snapshot\" --annotation \"Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets.\" --normalize\n```\n\nHere is the resulting figure:\n\n![date_term_counts_heatmap_text](https://user-images.githubusercontent.com/20826878/200715272-e5dab35b-211c-4344-b685-881e0ce46bb0.png)\n\nThis command reports the count of date mentions in the URLs between summer 2022 and fall 2022:\n\n```\npython term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 
Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names \"May\" \"Jun/Jul\" \"Aug\" \"Sep/Oct\" --terms 2022/05 2022/06 2022/07 2022/08 2022/09 --plot_title=\"Relative Freq of Dates in Webpage URLs\" --analysis_column=url --split=train --num_proc=224 --output_filename=date_term_counts_heatmap_url.png --load_from_hub_instead_of_disk --as_heatmap --ylabel \"Date (YYYY/MM)\" --term_pretty_names May Jun Jul Aug Sep --cache_dir term_counts_cache_date_urls --xlabel \"Internet Snapshot\" --annotation \"Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets.\" --normalize\n```\n\nHere is the resulting figure:\n\n![date_term_counts_heatmap_url](https://user-images.githubusercontent.com/20826878/200715307-b3110b88-191b-419f-91ff-1e45ecfc6361.png)\n\n## To analyze the timestamp distribution across and within various datasets\n\nThis command reports the last-modified timestamp distribution for the summer 2022 through fall 2022 OLM CC datasets:\n\n```\npython timestamp_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names Sep/Oct Aug Jun/Jul May --timestamp_column last_modified_timestamp --plot_title \"Last-Modified Timestamp Distributions from Webpages\" --num_proc=224 --output_filename last_modified_dist.png --load_from_hub_instead_of_disk --cache_dir timestamp_dist_cache_last_modified --split=train\n```\n\nHere is the resulting figure:\n\n![last_modified_dist](https://user-images.githubusercontent.com/20826878/200715332-203f5950-6d4d-4e3a-bfaa-ebbcf7603242.png)\n\nThis command reports the crawl timestamp distribution for the summer 2022 through fall 2022 OLM CC datasets:\n\n```\npython timestamp_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 
Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names Sep/Oct Aug Jun/Jul May --timestamp_column crawl_timestamp --plot_title \"Crawl Timestamp Distributions from Webpages\" --num_proc=224 --output_filename crawl_dist.png --load_from_hub_instead_of_disk --cache_dir timestamp_dist_cache_crawl --split=train\n```\n\nHere is the resulting figure:\n\n![crawl_dist](https://user-images.githubusercontent.com/20826878/200715349-562af902-8863-428a-8417-0975738164bf.png)\n\n## To analyze the URL domain distribution across and within various datasets\n\nThis command reports the domain distribution within the May 2022 OLM CC dataset:\n\n```\npython url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names May --url_column url --hist_plot_title \"URL Domain Distribution for May Internet Snapshot\" --corr_plot_title \"URL Domain Distribution Corr for May Internet Snapshot\" --num_proc=224 --output_corr_filename url_corr_may.png --output_hist_filename url_hist_may.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_may --no_hist_legend --annotation \"Only the top 25 domains are shown.\"\n```\n\nHere is the resulting figure:\n\n![url_hist_may](https://user-images.githubusercontent.com/20826878/200715359-7c7bc37a-5749-454a-9e38-77b1116de7f0.png)\n\nThis command reports the domain correlations between the summer 2022 through fall 2022 OLM CC datasets:\n\n```\npython url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names May Jun/Jul Aug Sep/Oct --url_column url --hist_plot_title \"URL Domain Distribution for Internet Snapshots\" --corr_plot_title \"URL 
Domain Distribution Corr for Internet Snapshots\" --num_proc=224 --output_corr_filename url_corr.png --output_hist_filename url_hist.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_all\n```\n\nHere is the resulting figure:\n\n![url_corr](https://user-images.githubusercontent.com/20826878/200715384-d4793781-9775-4884-bffe-698b16677284.png)\n\nDoes sampling about 15-20% of a Common Crawl Snapshot do anything surprising? How much correlation is there between the resulting OLM dataset from a Common Crawl sample from a random seed versus another random seed? This command reports the domain correlation between two Sep/Oct datasets where the only difference is the sampled segments based on different random seeds:\n\n```\npython url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --input_dataset_pretty_names \"Sep/Oct Seed 1\" \"Sep/Oct Seed 2\" --url_column url --hist_plot_title \"URL Domain Distribution for Sep/Oct Snapshots\" --corr_plot_title \"URL Domain Distribution Corr for Sep/Oct Snapshots\" --num_proc=224 --output_corr_filename url_corr_sep_oct_different_seeds.png --output_hist_filename url_hist_sep_oct_different_seeds.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_all --annotation=\"This plot shows two different OLM datasets. 
They were created with the same code from a 16% random sample of Sep/Oct Common Crawl WET files, but with different random seeds for the sampling.\"\n```\n\nHere is the resulting figure:\n\n![url_corr_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715404-5ccb3a1e-9e82-41be-82db-9e54e73785fe.png)\n\n## To analyze for duplicates across various datasets\n\nThis command reports the ratio of shared URLs between the August and June/July Common Crawl OLM Datasets:\n\n```\npython duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 --analysis_column=url --split=train --num_proc=224 --plot_title=\"URLs in the June/July plus the August CC OLM Datasets\" --output_filename=duplicate_urls_aug_jun_jul.png --load_from_hub_instead_of_disk\n```\n\nHere is the resulting figure:\n\n![duplicate_urls_aug_jun_jul](https://user-images.githubusercontent.com/20826878/200715427-79d0120b-fa48-4fdf-8410-8943a1325780.png)\n\nThis command reports the ratio of exactly duplicated text between the August and June/July Common Crawl OLM Datasets:\n\n```\npython duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 --analysis_column=text --split=train --num_proc=224 --plot_title=\"Text in the June/July plus the August CC OLM Datasets\" --output_filename=duplicate_text_aug_jun_jul.png --load_from_hub_instead_of_disk\n```\n\nHere is the resulting figure:\n\n![duplicate_text_aug_jun_jul](https://user-images.githubusercontent.com/20826878/200715436-4893263b-1fe9-4941-ae43-edd4732652c4.png)\n\nWhat about the duplicated URLs between two differently seeded OLM datasets from the same month?\n\n```\npython duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --analysis_column=url --split=train 
--num_proc=224 --plot_title=\"URLs in two Differently Seeded Sep/Oct CC OLM Datasets\" --output_filename=duplicate_urls_sep_oct_different_seeds.png --load_from_hub_instead_of_disk\n```\n\nHere is the resulting figure:\n\n![duplicate_urls_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715575-fae99dcb-cef5-411e-a786-a6e20e53a003.png)\n\nAnd what about the text?\n\n```\npython duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --analysis_column=text --split=train --num_proc=224 --plot_title=\"Text in two Differently Seeded Sep/Oct CC OLM Datasets\" --output_filename=duplicate_text_sep_oct_different_seeds.png --load_from_hub_instead_of_disk\n```\n\n![duplicate_text_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715583-1aa76245-14c5-4afe-88c5-539c8665d4d7.png)\n\n## Documentation\n\n```\npython term_counts.py --help\n```\n\n```\npython url_dist.py --help\n```\n\n```\npython timestamp_dist.py --help\n```\n\n```\npython duplicates.py --help\n```\n"
  },
  {
    "path": "analysis_scripts/duplicates.py",
    "content": "from datasets import load_dataset, load_from_disk, concatenate_datasets\nimport argparse\nimport matplotlib.pyplot as plt\n\nparser = argparse.ArgumentParser(description=\"This script takes a list of datasets, concatenates them, and saves a pie chart for duplicate versus unique items in the specified column.\")\nparser.add_argument(\"--input_dataset_names\", nargs=\"+\", required=True)\nparser.add_argument(\"--analysis_column\", required=True)\nparser.add_argument(\"--plot_title\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The dataset split to use. Some datasets don't have splits so this argument is optional.\")\nparser.add_argument(\"--num_proc\", type=int, required=True)\nparser.add_argument(\"--duplicate_label\", default=\"Duplicate\")\nparser.add_argument(\"--unique_label\", default=\"Unique\")\nparser.add_argument(\"--output_filename\", required=True)\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).\")\nargs = parser.parse_args()\n\ndatasets = []\nfor input_dataset_name in args.input_dataset_names:\n    if args.load_from_hub_instead_of_disk:\n        if args.split is None:\n            ds = load_dataset(input_dataset_name)\n        else:\n            ds = load_dataset(input_dataset_name, split=args.split)\n    else:\n        if args.split is None:\n            ds = load_from_disk(input_dataset_name)\n        else:\n            ds = load_from_disk(input_dataset_name)[args.split]\n    \n    datasets.append(ds)\n\nds = concatenate_datasets(datasets)\n\nds = ds.sort(args.analysis_column)\n\nmax_index = len(ds) - 1\ndef same_adjacent_entry(entry, index):\n    if index == max_index:\n        return ds[index - 1][args.analysis_column] == entry\n    elif index == 0:\n        return ds[index + 1][args.analysis_column] == entry\n    return ds[index - 1][args.analysis_column] == entry or 
ds[index + 1][args.analysis_column] == entry\n\nnum_examples = len(ds)\nds = ds.filter(lambda example, index: same_adjacent_entry(example[args.analysis_column], index), num_proc=args.num_proc, with_indices=True)\nnum_examples_only_duplicate_entries = len(ds)\n\n\nlabels = [args.duplicate_label, args.unique_label]\nsizes = [num_examples_only_duplicate_entries, num_examples - num_examples_only_duplicate_entries]\nplt.pie(sizes, labels=labels, autopct='%1.1f%%')\n\nplt.title(args.plot_title, fontweight=\"bold\")\nplt.rcParams[\"font.family\"] = \"Times New Roman\"\n\nplt.savefig(args.output_filename, dpi=300)\n"
  },
  {
    "path": "analysis_scripts/term_counts.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom collections import Counter\nfrom multiprocessing import Manager\nfrom tqdm import tqdm\nfrom os import path, mkdir\nimport pickle\nimport statistics\n\nparser = argparse.ArgumentParser(description=\"This script takes in an ordered list of datasets and counts terms in each of them, in the specified column. It then plots a graph or a heatmap for how the count changed accross datasets. \")\nparser.add_argument(\"--input_dataset_names\", nargs=\"+\", required=True)\nparser.add_argument(\"--input_dataset_pretty_names\", nargs=\"+\", required=True, help=\"The names of the datasets that you want to appear in the saved graph.\")\nparser.add_argument(\"--terms\", nargs=\"+\", default=None, help=\"The terms that you want to count. If left as None, then you must specify --num_terms_to_find, and then the script will return the top --num_terms_to_find with the greatest percent change from the first dataset to the last dataset, out of the terms which have count > the mean count plus the standard deviation (so we don't get spurious results from low-count words).\")\nparser.add_argument(\"--term_pretty_names\", nargs=\"+\", default=None)\nparser.add_argument(\"--analysis_column\", required=True)\nparser.add_argument(\"--plot_title\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The dataset split to use. 
Some datasets don't have splits so this argument is optional.\")\nparser.add_argument(\"--num_proc\", type=int, required=True)\nparser.add_argument(\"--output_filename\", required=True)\nparser.add_argument(\"--as_heatmap\", action=\"store_true\")\nparser.add_argument(\"--samples\", default=None, type=int)\nparser.add_argument(\"--num_terms_to_find\", default=None, type=int)\nparser.add_argument(\"--normalize\", action=\"store_true\")\nparser.add_argument(\"--ylabel\", required=True)\nparser.add_argument(\"--cache_dir\", default=\"term_count_cache\")\nparser.add_argument(\"--load_from_cache_dir\", action=\"store_true\")\nparser.add_argument(\"--heatmap_bar_label\", default=\"\")\nparser.add_argument(\"--annotation\", default=None)\nparser.add_argument(\"--xlabel\", default=\"Dataset\")\nparser.add_argument(\"--normalize_axis\", default=0, type=int)\nparser.add_argument(\"--percent_increase\", action=\"store_true\")\nparser.add_argument(\"--bottom\", default=0.25, type=float)\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).\")\nargs = parser.parse_args()\n\ndatasets = []\nterm_y_coords = None\ncount_dicts = []\n\nif args.load_from_cache_dir:\n    if args.terms is None:\n        count_dicts = pickle.load(open(path.join(args.cache_dir, \"count_dicts.pkl\"), \"rb\"))\n    else:\n        term_y_coords = pickle.load(open(path.join(args.cache_dir, \"term_y_coords.pkl\"), \"rb\"))\n    cached_args = pickle.load(open(path.join(args.cache_dir, \"args.pkl\"), \"rb\"))\n    if args != cached_args:\n        print(\"Warning: argument mismatch between cached args and current args\")\n        print(\"Cached args: \", cached_args)\n        print(\"Current args: \", args)\n\nif term_y_coords is None:\n    for input_dataset_name in args.input_dataset_names:\n        if args.load_from_hub_instead_of_disk:\n            if args.split is None:\n    
            ds = load_dataset(input_dataset_name)\n            else:\n                ds = load_dataset(input_dataset_name, split=args.split)\n        else:\n            if args.split is None:\n                ds = load_from_disk(input_dataset_name)\n            else:\n                ds = load_from_disk(input_dataset_name)[args.split]\n\n        if args.samples is not None:\n            ds = ds.shuffle(seed=42)\n            ds = ds.select(range(args.samples))\n\n        datasets.append(ds)\n\n        if args.terms is None and not args.load_from_cache_dir:\n            with Manager() as manager:\n                shared_list = manager.list()\n                def build_count_dict(examples):\n                    counts = None\n                    for text in examples[args.analysis_column]:\n                        if counts is None:\n                            counts = Counter(filter(lambda obj: obj.isalpha(), text.lower().split(\" \")))\n                        else:\n                            counts += Counter(filter(lambda obj: obj.isalpha(), text.lower().split(\" \")))\n                    shared_list.append(counts)\n\n                ds.map(build_count_dict, num_proc=args.num_proc, batched=True, batch_size=len(ds) // args.num_proc, remove_columns=ds.column_names)\n        \n                count_dict = shared_list[0]\n                for counts in tqdm(shared_list[1:]):\n                    count_dict += counts\n\n                count_dicts.append(count_dict)\n\n    if args.terms is None:\n        if not path.exists(args.cache_dir):\n            mkdir(args.cache_dir)\n        pickle.dump(args, open(path.join(args.cache_dir, \"args.pkl\"), \"wb\"))\n        pickle.dump(count_dicts, open(path.join(args.cache_dir, \"count_dicts.pkl\"), \"wb\"))\n\nif args.terms is None:\n\n    intersection_count_set = set(count_dicts[0].keys())\n    for count_dict in count_dicts[1:]:\n        intersection_count_set = 
intersection_count_set.intersection(set(count_dict.keys()))\n\n    words_with_occurrence_changes = []\n    counts = []\n    for word in intersection_count_set:\n        count_sum = 0\n        for count_dict in count_dicts:\n            count_sum += count_dict[word]\n        counts.append(count_sum)\n    mean_count = statistics.mean(counts)\n    std = statistics.stdev(counts)\n    for word in intersection_count_set:\n        count_sum = 0\n        for count_dict in count_dicts:\n            count_sum += count_dict[word]\n        if count_sum > mean_count + std:\n            change = count_dicts[-1][word]/count_dicts[0][word]\n            words_with_occurrence_changes.append((word, change))\n\n    words_with_occurrence_changes.sort(key=lambda word_and_change: word_and_change[1], reverse=True)\n    terms = [word_and_change[0] for word_and_change in words_with_occurrence_changes[:args.num_terms_to_find]]\n\nelse:\n    terms = args.terms\n\n\nif term_y_coords is None:\n\n    term_y_coords = {term: [] for term in terms}\n\n    for ds in datasets:\n        def term_counts(text):\n            return {term + \"_count\": text.lower().count(term.lower()) for term in terms}\n\n        ds = ds.map(lambda example: term_counts(example[args.analysis_column]), num_proc=args.num_proc)\n\n        for term in terms:\n            term_y_coords[term].append(sum(ds[term + \"_count\"]))\n\n    if not path.exists(args.cache_dir):\n        mkdir(args.cache_dir)\n    pickle.dump(args, open(path.join(args.cache_dir, \"args.pkl\"), \"wb\"))\n    pickle.dump(term_y_coords, open(path.join(args.cache_dir, \"term_y_coords.pkl\"), \"wb\"))\n\nplt.xticks(range(len(args.input_dataset_pretty_names)), args.input_dataset_pretty_names)\n\nif args.as_heatmap:\n    matrix = []\n    for term in terms:\n        matrix.append(term_y_coords[term])\n    matrix = np.array(matrix)\n    if args.percent_increase:\n        matrix = matrix.transpose()\n        matrix = (matrix - matrix[0])/matrix[0]\n        matrix = 
matrix.transpose() * 100\n\n    if args.normalize:\n        column_sums = matrix.sum(axis=args.normalize_axis)\n        if args.normalize_axis == 0:\n            normalized_matrix = matrix / column_sums\n        if args.normalize_axis == 1:\n            normalized_matrix = matrix.transpose() / column_sums\n            normalized_matrix = normalized_matrix.transpose()\n        plt.imshow(np.flipud(normalized_matrix), plt.cm.Blues)\n    else:\n        plt.imshow(np.flipud(matrix), plt.cm.Blues)\n    plt.yticks(range(len(terms)), reversed(terms if args.term_pretty_names is None else args.term_pretty_names))\n    cbar = plt.colorbar()\n    cbar.ax.set_ylabel(args.heatmap_bar_label, rotation=-90, va=\"bottom\")\n    plt.ylabel(args.ylabel, style='italic', fontweight=\"bold\")\nelse:\n    for term in terms:\n        if args.normalize:\n            term_y_coords[term] = np.array(term_y_coords[term])/sum(term_y_coords[term])\n        plt.plot(term_y_coords[term], label=term, marker=\".\")\n    plt.grid(linestyle=\":\")\n    plt.legend(loc=\"upper left\")\n    plt.ylabel(args.ylabel, style='italic', fontweight=\"bold\")\n\nif args.annotation is not None:\n    plt.figtext(0.6, 0.01, args.annotation, wrap=True, horizontalalignment='center', fontsize=8)\n    plt.subplots_adjust(bottom=args.bottom)\nplt.xlabel(args.xlabel, style='italic', fontweight=\"bold\")\nplt.title(args.plot_title, fontweight=\"bold\")\nplt.rcParams[\"font.family\"] = \"Times New Roman\"\nplt.savefig(args.output_filename, dpi=300)\n"
  },
  {
    "path": "analysis_scripts/timestamp_dist.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd\nfrom datetime import datetime\nfrom os import path, mkdir\nimport pickle\n\nparser = argparse.ArgumentParser(description=\"This script takes in an ordered list of datasets. It is assumed that each dataset has a timestamp column. The script plots the timestamp distribution histogram for each dataset.\")\nparser.add_argument(\"--input_dataset_names\", nargs=\"+\", required=True)\nparser.add_argument(\"--input_dataset_pretty_names\", nargs=\"+\", required=True, help=\"The names of the datasets that you want to appear in the saved graph.\")\nparser.add_argument(\"--timestamp_column\", required=True)\nparser.add_argument(\"--plot_title\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The dataset split to use. Some datasets don't have splits so this argument is optional.\")\nparser.add_argument(\"--num_proc\", type=int, required=True)\nparser.add_argument(\"--output_filename\", required=True)\nparser.add_argument(\"--samples\", default=None, type=int)\nparser.add_argument(\"--bins\", default=100, help=\"The number of histogram bins to plot\")\nparser.add_argument(\"--cache_dir\", default=\"timestamp_dist_cache\")\nparser.add_argument(\"--load_from_cache_dir\", action=\"store_true\")\nparser.add_argument(\"--annotation\", default=None)\nparser.add_argument(\"--legend_title\", default=\"Internet Snapshot\")\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).\")\nargs = parser.parse_args()\n\nif args.load_from_cache_dir:\n    data_array = np.load(open(path.join(args.cache_dir, \"data_array.npy\"), \"rb\"))\n    cached_args = pickle.load(open(path.join(args.cache_dir, \"args.pkl\"), \"rb\"))\n    if args != cached_args:\n        
print(\"Warning: argument mismatch between cached args and current args\")\n        print(\"Cached args: \", cached_args)\n        print(\"Current args: \", args)\nelse:\n\n    # Remove timestamp outliers more than 10 median deviations away from the median.\n    # This is important if the timestamp is the Last-Modified timestamp, which can sometimes be wrong\n    # because websites can report whatever they want. We don't want one website that says it was created\n    # a billion years ago to seriously affect the distribution.\n    def reject_outliers(data, m = 10.):\n        d = np.abs(data - np.median(data))\n        mdev = np.median(d)\n        s = d/mdev if mdev else 0.\n        return data[s<m]\n\n    data_list = []\n    shortest_len = None\n    for input_dataset_name in args.input_dataset_names:\n        if args.load_from_hub_instead_of_disk:\n            if args.split is None:\n                ds = load_dataset(input_dataset_name)\n            else:\n                ds = load_dataset(input_dataset_name, split=args.split)\n        else:\n            if args.split is None:\n                ds = load_from_disk(input_dataset_name)\n            else:\n                ds = load_from_disk(input_dataset_name)[args.split]\n\n        if args.samples is not None:\n            ds = ds.shuffle(seed=42)\n            ds = ds.select(range(args.samples))\n\n        ds = ds.filter(lambda example: example[args.timestamp_column] is not None, num_proc=args.num_proc)\n        \n        data = np.array(ds[args.timestamp_column])\n        data_no_outliers = reject_outliers(data)\n        data_list.append(data_no_outliers)\n        if shortest_len is None:\n            shortest_len = len(data_no_outliers)\n        else:\n            shortest_len = min(shortest_len, len(data_no_outliers))\n\n    truncated_data_list = []\n    for data in data_list:\n        truncated_data_list.append(data[:shortest_len])\n    data_array = np.array(truncated_data_list).transpose()\n    if not 
path.exists(args.cache_dir):\n        mkdir(args.cache_dir)\n    np.save(open(path.join(args.cache_dir, \"data_array.npy\"), \"wb\"), data_array)\n    pickle.dump(args, open(path.join(args.cache_dir, \"args.pkl\"), \"wb\"))\n\ndf = pd.DataFrame(data=data_array, columns=args.input_dataset_pretty_names)\ncolor_palette = sns.color_palette(\"viridis\")\ncolors = color_palette[:len(args.input_dataset_names)]\nplot = sns.displot(data=df, kde=True, palette=colors, bins=args.bins, height=5, aspect=1.5)\nmeans = np.mean(data_array, axis=0)\nxticks = np.concatenate((np.array([np.min(data_array)]), means, np.array([np.max(data_array)])))\nfor mean, color in zip(means, colors):\n    plt.axvline(x=mean, linestyle=\"--\", color=color)\n\nplot.set(xticks=xticks)\nplot.axes[0,0].set_title(args.plot_title, fontweight=\"bold\")\nplot.axes[0,0].set_xlabel(\"Timestamp\", style=\"italic\", fontweight=\"bold\")\nplot.axes[0,0].set_ylabel(\"Count\", style=\"italic\", fontweight=\"bold\")\nplot.set_xticklabels([datetime.fromtimestamp(timestamp).strftime('%b %d') for timestamp in xticks], rotation=45)\nif args.annotation is not None:\n    plot.figure.text(0.5, 0.01, args.annotation, wrap=True, horizontalalignment='center', fontsize=8)\n    plot.figure.subplots_adjust(bottom=0.20)\nplot._legend.set_title(args.legend_title)\nplot.fig.savefig(args.output_filename, dpi=300, bbox_inches='tight')\n\n"
  },
  {
    "path": "analysis_scripts/url_dist.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\nfrom collections import Counter\nfrom multiprocessing import Manager\nfrom tqdm import tqdm\nfrom urllib.parse import urlparse\nimport seaborn as sns\nimport pandas as pd\nfrom os import path, mkdir\nimport pickle\n\nparser = argparse.ArgumentParser(description=\"This script takes in an ordered list of datasets which each have a URL column. It extracts domain names from each URL and then plots a histogram of the URL counts per domain and a correlation matrix comparing each dataset's histogram.\")\nparser.add_argument(\"--input_dataset_names\", nargs=\"+\", required=True)\nparser.add_argument(\"--input_dataset_pretty_names\", nargs=\"+\", required=True, help=\"The names of the datasets that you want to appear in the saved graphs.\")\nparser.add_argument(\"--url_column\", required=True)\nparser.add_argument(\"--hist_plot_title\", required=True)\nparser.add_argument(\"--corr_plot_title\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The dataset split to use. 
Some datasets don't have splits so this argument is optional.\")\nparser.add_argument(\"--num_proc\", type=int, required=True)\nparser.add_argument(\"--output_corr_filename\", required=True)\nparser.add_argument(\"--output_hist_filename\", required=True)\nparser.add_argument(\"--samples\", default=None, type=int)\nparser.add_argument(\"--hist_bins\", default=25, type=int)\nparser.add_argument(\"--hist_bin_fontsize\", default=8, type=int)\nparser.add_argument(\"--cache_dir\", default=\"url_dist\")\nparser.add_argument(\"--no_hist_legend\", action=\"store_true\")\nparser.add_argument(\"--load_from_cache_dir\", action=\"store_true\", help=\"If you've already run this function and just want to change parameters for the graphs (like --hist_bins, for example), then specify this option to load the cached domain distribution so the computation isn't repeated.\")\nparser.add_argument(\"--annotation\", default=None)\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input datasets from the Hugging Face hub instead of the disk (default is the disk).\")\nargs = parser.parse_args()\n\nif args.load_from_cache_dir:\n    count_dicts = pickle.load(open(path.join(args.cache_dir, \"count_dicts.pkl\"), \"rb\"))\n    cached_args = pickle.load(open(path.join(args.cache_dir, \"args.pkl\"), \"rb\"))\n    if args != cached_args:\n        print(\"Warning: argument mismatch between cached args and current args\")\n        print(\"Cached args: \", cached_args)\n        print(\"Current args: \", args)\n\nelse:\n    count_dicts = []\n    for input_dataset_name in args.input_dataset_names:\n        if args.load_from_hub_instead_of_disk:\n            if args.split is None:\n                ds = load_dataset(input_dataset_name)\n            else:\n                ds = load_dataset(input_dataset_name, split=args.split)\n        else:\n            if args.split is None:\n                ds = load_from_disk(input_dataset_name)\n            
else:\n                ds = load_from_disk(input_dataset_name)[args.split]\n\n        if args.samples is not None:\n            ds = ds.shuffle(seed=42)\n            ds = ds.select(range(args.samples))\n\n        with Manager() as manager:\n            shared_list = manager.list()\n            def build_count_dict(examples):\n                counts = None\n                for url in examples[args.url_column]:\n                    domain = urlparse(url).netloc\n                    if counts is None:\n                        counts = Counter([domain])\n                    else:\n                        counts += Counter([domain])\n                shared_list.append(counts)\n\n            ds.map(build_count_dict, num_proc=args.num_proc, batched=True, batch_size=len(ds) // args.num_proc)\n        \n            count_dict = shared_list[0]\n            for counts in tqdm(shared_list[1:]):\n                count_dict += counts\n\n            count_dicts.append(count_dict)\n    \n    if not path.exists(args.cache_dir):\n        mkdir(args.cache_dir)\n    pickle.dump(args, open(path.join(args.cache_dir, \"args.pkl\"), \"wb\"))\n    pickle.dump(count_dicts, open(path.join(args.cache_dir, \"count_dicts.pkl\"), \"wb\"))\n\nunion_count_set = set(count_dicts[0].keys())\nfor count_dict in tqdm(count_dicts[1:]):\n    union_count_set = union_count_set.union(set(count_dict.keys()))\n\ndataframe_dict = {dataset_name: [] for dataset_name in args.input_dataset_pretty_names}\ndataframe_dict[\"domain_name\"] = []\nfor domain in tqdm(union_count_set):\n    for index in range(len(args.input_dataset_pretty_names)):\n        count_dict = count_dicts[index]\n        dataset_name = args.input_dataset_pretty_names[index]\n        dataframe_dict[dataset_name].append(count_dict.get(domain, 0))\n    dataframe_dict[\"domain_name\"].append(domain)\n\ndf = pd.DataFrame(dataframe_dict)\nplot = sns.heatmap(df.corr().iloc[::-1], cmap=\"Blues\", annot=True)\nplot.set_title(args.corr_plot_title, 
fontweight=\"bold\")\nif args.annotation is not None:\n    plot.figure.text(0.5, 0.01, args.annotation, wrap=True, horizontalalignment=\"center\", fontsize=8)\n    plot.figure.subplots_adjust(bottom=0.15)\nplot.figure.savefig(args.output_corr_filename, dpi=300)\nplot.figure.clf()\n\ndf = df.sort_values(by=args.input_dataset_pretty_names, ascending=False)\ndataframe_dict = {\"samples\": [], \"dataset\": [], \"domain\": []}\nindex = 0\nfor _, datum in df.iterrows():\n    if index >= args.hist_bins:\n        break\n    dataframe_dict[\"samples\"] += [datum[name] for name in args.input_dataset_pretty_names]\n    dataframe_dict[\"dataset\"] += args.input_dataset_pretty_names\n    dataframe_dict[\"domain\"] += [datum[\"domain_name\"]]*len(args.input_dataset_pretty_names)\n    index += 1\n\ndf = pd.DataFrame(dataframe_dict)\ncolor_palette = sns.color_palette(\"pastel\")\ncolors = color_palette[:len(args.input_dataset_names)]\nplot = sns.barplot(data=df, palette=colors, hue=\"dataset\", y=\"domain\", x=\"samples\")\nif args.annotation is not None:\n    plot.figure.text(0.4, 0.01, args.annotation, wrap=True, horizontalalignment=\"center\", fontsize=8)\n    plot.figure.subplots_adjust(bottom=0.15)\nplot.legend().set_title(\"\")\nif args.no_hist_legend:\n    plot.legend().remove()\nfor item in plot.get_yticklabels():\n    item.set_fontsize(args.hist_bin_fontsize)\nplot.set_title(args.hist_plot_title, fontweight=\"bold\")\nplot.set_xlabel(\"Count\", style=\"italic\", fontweight=\"bold\")\nplot.set_ylabel(\"Domain\", style=\"italic\", fontweight=\"bold\")\nplot.figure.savefig(args.output_hist_filename, dpi=300, bbox_inches=\"tight\")\n"
  },
  {
    "path": "pipeline_scripts/common_crawl/README.md",
    "content": "![olm_cc_pipeline](https://user-images.githubusercontent.com/20826878/199851707-64a7a026-c413-4d78-8b04-a825e07534b3.jpeg)\n\n# Quick start\n\nThis section provides all the commands that you need to generate a deduplicated and filtered dataset from Common Crawl, ready for pretraining!\n\n## One time only\n\n`bash download_pipeline_processing_models.sh`\n\n## Every time\n\nUse the following commands to get a dataset. They should take only a few min if you have lots of CPUs. Adjust `--num_proc` to be equal to however many CPUs that you have.\n\n```\npython download_common_crawl.py --snapshots CC-MAIN-2022-33 --segment_sampling_ratios 0.0001 --seed=42 --download_dir=common_crawl_wet_downloads --num_proc=224\npython get_text_dataset_from_wet_downloads.py --download_dir=common_crawl_wet_downloads --output_dataset_name=cc_raw --num_proc=224\npython remove_wikipedia_urls.py --input_dataset_name=cc_raw --output_dataset_name=cc_no_wikipedia --url_column=url --split=en --num_proc=224\npython apply_bigscience_filters.py --input_dataset_name=cc_no_wikipedia --output_dataset_name=cc_filtered --lang_id=en --text_column=text --num_proc=224\nulimit -Sn 1000000 && python deduplicate.py --input_dataset_name=cc_filtered --output_dataset_name=cc_olm --text_column=text --remove_whole_example --num_proc=224\n\n# Optionally, get the last-modified headers from the websites and add them to the dataset. 
--segment_sampling_ratios and --seed must be the same as above for this to work.\npython download_common_crawl.py --snapshots CC-MAIN-2022-33 --segment_sampling_ratios 0.0001 --seed=42 --download_dir=common_crawl_wat_downloads --paths_type=wat --num_proc=224\npython get_last_modified_dataset_from_wat_downloads.py --download_dir=common_crawl_wat_downloads --output_dataset_name=cc_raw_last_modified --num_proc=224\npython combine_last_modified_with_text_dataset.py --text_dataset_name=cc_olm --last_modified_dataset_name=cc_raw_last_modified --output_dataset_name=cc_olm_with_last_modified --url_column=url --crawl_timestamp_column=crawl_timestamp --last_modified_timestamp_column=last_modified_timestamp --num_proc=224\n\n```\n\nYou can then upload the final dataset to the Hugging Face Hub from a Python terminal like this:\n\n```\nfrom datasets import load_from_disk\n\nds = load_from_disk(\"cc_olm\")  # Or cc_olm_with_last_modified if you did the optional step above.\n\nds = ds.shuffle()  # Optionally, shuffle the dataset so you can get an idea of what a random sample of the dataset looks like in the Hugging Face Hub dataset preview.\n\nds.push_to_hub(\"cc_olm\")  # Or cc_olm_with_last_modified if you did the optional step above.\n```\n\n\n# Important notes\n\n## Finding the latest Common Crawl snapshots\n\nThey are displayed here: [https://commoncrawl.org/the-data/get-started/](https://commoncrawl.org/the-data/get-started/). Just enter the names of the snapshots you want as arguments to the `download_common_crawl.py` script.\n\n## Intermediate dataset checkpoints\n\nEach of the python scripts from the quick start commands saves a Hugging Face dataset to the disk. The dataset is then read by the next python command. These intermediate datasets are not deleted by default, so you can observe what each step of the pipeline does. This also means that you should have a large disk. 
We use a 15 terabyte disk for the Online Language Modelling Project.\n\n## How to specify the size of the dataset\n\nIncrease `--segment_sampling_ratios` to get a larger dataset (it goes up to `1`). In the above quick start code, `0.0001` means that it only uses a sample of `0.01%` of the data from a Common Crawl snapshot. To generate a dataset for the Online Language Modelling Project, we are currently pulling about 1.45 terabytes from each Common Crawl snapshot, which is about 350 gigabytes after going through the BigScience filters and finally 30 gigabytes after going through the deduplication code. For the August 2022 snapshot, 1.45 terabytes is about 20% (i.e. `--segment_sampling_ratios 0.20`). Crawl sizes vary, though. For May 2022, 1.45 terabytes is about 14%.\n\nIf you want to train a larger model than we do, then specify a higher value for `--segment_sampling_ratios`, or even use multiple Common Crawl snapshots like this:\n\n```\npython download_common_crawl.py --snapshots CC-MAIN-2022-27 CC-MAIN-2022-33 --segment_sampling_ratios 0.5 1 --download_dir=common_crawl_wet_downloads --num_proc=224\n```\n\nKeep in mind that, with more data, the deduplication script will need more RAM. Read on for limitations of the deduplication script.\n\n## Limitations of the deduplication code\n\nThere are tons of duplicates in Common Crawl data, which means that the deduplication script will need hundreds of gigabytes of RAM if you want to generate a 30 gigabyte dataset like ours. To get around this, the deduplication script also has an option to chunk the dataset and deduplicate each chunk individually. The main problem is this issue in the Google deduplication code: [https://github.com/google-research/deduplicate-text-datasets/issues/18](https://github.com/google-research/deduplicate-text-datasets/issues/18).\n\n\n# More documentation\n\nRun any of the python commands with the `--help` flag. For example, `python download_common_crawl.py --help`.\n"
  },
  {
    "path": "pipeline_scripts/common_crawl/apply_bigscience_filters.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\nfrom subprocess import run\nfrom os import path, mkdir\nfrom shutil import rmtree\nimport sys\nimport uuid\n\nsys.path.append(\"data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering\")\nfrom filtering import DatasetFiltering\n\nparser = argparse.ArgumentParser(description=\"Applies the BigScience BLOOM filters which were used on OSCAR. They are designed to improve text quality and remove pornographic content.\")\nparser.add_argument(\"--input_dataset_name\", help=\"The name of the input dataset.\", required=True)\nparser.add_argument(\"--output_dataset_name\", help=\"The name of the output dataset.\", required=True)\nparser.add_argument(\"--lang_id\", help=\"The language id of your dataset. This is necessary because the BigScience filters use a list of language-specific pornographic words, and also language-specific hyperparameters for text quality improvement.\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The split of the dataset to apply the filters to. Not all datasets have splits, so this is not a required argument.\")\nparser.add_argument(\"--text_column\", help=\"The name of the dataset column that contains the text.\", required=True)\nparser.add_argument(\"--num_proc\", type=int, help=\"The number of processes to use.\", required=True)\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.\")\nparser.add_argument(\"--tmp_dir\", default=\".tmp_apply_bigscience_filters\", help=\"Directory to store temporary files. It will be deleted afterwards. Defaults to .tmp_apply_bigscience_filters.\")\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to pull the input dataset by name from the Hugging Face Hub. 
If this argument is not used, it is assumed that there is a dataset saved to the disk with the input dataset name.\")\nargs = parser.parse_args()\n\nif args.load_from_hub_instead_of_disk:\n    if args.split is None:\n        ds = load_dataset(args.input_dataset_name)\n    else:\n        ds = load_dataset(args.input_dataset_name, split=args.split)\nelse:\n    if args.split is None:\n        ds = load_from_disk(args.input_dataset_name)\n    else:\n        ds = load_from_disk(args.input_dataset_name)[args.split]\n\n# We have to do this if the text column is not named \"text\" in the dataset,\n# because DatasetFiltering assumes that the name is \"text\".\ntemp_column_name = None\nif args.text_column != \"text\":\n    if \"text\" in ds.column_names:\n        temp_column_name = str(uuid.uuid4())\n        ds = ds.rename_column(\"text\", temp_column_name)\n    ds = ds.rename_column(args.text_column, \"text\")\n\nif path.exists(args.tmp_dir):\n    run(f\"rm -r {args.tmp_dir}\", shell=True)\n\nmkdir(args.tmp_dir)\ntmp_dataset_name = path.join(args.tmp_dir, \"intermediate_bigscience_filtered_dataset\")\n\ndataset_filtering = DatasetFiltering(\n    dataset=ds,\n    lang_dataset_id=args.lang_id,\n    path_fasttext_model=\"sp_kenlm_ft_models/lid.176.bin\",\n    path_sentencepiece_model=f\"sp_kenlm_ft_models/{args.lang_id}.sp.model\",\n    path_kenlm_model=f\"sp_kenlm_ft_models/{args.lang_id}.arpa.bin\",\n    num_proc=args.num_proc,\n    path_dir_save_dataset=tmp_dataset_name,\n)\n\ndataset_filtering.modifying_documents()\ndataset_filtering.filtering()\ndataset_filtering.save_dataset()\n\nds = load_from_disk(path.join(tmp_dataset_name, args.lang_id))\n\n# We have to do this if the text column is not named \"text\" in the dataset,\n# because DatasetFiltering assumes that the name is \"text\".\nif args.text_column != \"text\":\n    ds = ds.rename_column(\"text\", args.text_column)\n    if temp_column_name is not None:\n        ds = ds.rename_column(temp_column_name, 
\"text\")\n\nds.save_to_disk(args.output_dataset_name)\nrmtree(args.tmp_dir)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n"
  },
  {
    "path": "pipeline_scripts/common_crawl/combine_last_modified_with_text_dataset.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\nfrom multiprocessing import Manager\nfrom tqdm import tqdm\nimport uuid\n\nparser = argparse.ArgumentParser(description=\"This script takes in a text dataset with crawl timestamps and urls, and then a last-modified dataset with crawl timestamps and urls. It uses the shared urls and crawl timestamps to add last-modified timestamps to the text dataset.\")\nparser.add_argument(\"--text_dataset_name\", required=True)\nparser.add_argument(\"--last_modified_dataset_name\", required=True)\nparser.add_argument(\"--output_dataset_name\", required=True)\nparser.add_argument(\"--text_dataset_split\", default=None)\nparser.add_argument(\"--last_modified_dataset_split\", default=None)\nparser.add_argument(\"--last_modified_timestamp_column\", required=True)\nparser.add_argument(\"--crawl_timestamp_column\", required=True)\nparser.add_argument(\"--url_column\", required=True)\nparser.add_argument(\"--num_proc\", type=int, required=True)\nparser.add_argument(\"--load_text_dataset_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the text dataset from the Hugging Face hub instead of the disk (default is the disk).\")\nparser.add_argument(\"--load_last_modified_dataset_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the last modified dataset from the Hugging Face hub instead of the disk (default is the disk).\")\nparser.add_argument(\"--push_to_hub\", action=\"store_true\")\nargs = parser.parse_args()\n\nif args.load_text_dataset_from_hub_instead_of_disk:\n    if args.text_dataset_split is None:\n        text_ds = load_dataset(args.text_dataset_name)\n    else:\n        text_ds = load_dataset(args.text_dataset_name, split=args.text_dataset_split)\nelse:\n    if args.text_dataset_split is None:\n        text_ds = load_from_disk(args.text_dataset_name)\n    else:\n        text_ds = load_from_disk(args.text_dataset_name)[args.text_dataset_split]\n\nif 
args.load_last_modified_dataset_from_hub_instead_of_disk:\n    if args.last_modified_dataset_split is None:\n        last_modified_ds = load_dataset(args.last_modified_dataset_name)\n    else:\n        last_modified_ds = load_dataset(args.last_modified_dataset_name, split=args.last_modified_dataset_split)\nelse:\n    if args.last_modified_dataset_split is None:\n        last_modified_ds = load_from_disk(args.last_modified_dataset_name)\n    else:\n        last_modified_ds = load_from_disk(args.last_modified_dataset_name)[args.last_modified_dataset_split]\n\n\nwith Manager() as manager:\n    shared_list = manager.list()\n    def build_last_modified_dict(examples):\n        last_modified_dict = {}\n        for url, crawl_timestamp, last_modified_tag_timestamp in zip(examples[args.url_column], examples[args.crawl_timestamp_column], examples[args.last_modified_timestamp_column]):\n            last_modified_dict[(url, crawl_timestamp)] = last_modified_tag_timestamp\n        shared_list.append(last_modified_dict)\n\n    last_modified_ds.map(build_last_modified_dict, num_proc=args.num_proc, batched=True, batch_size=len(last_modified_ds) // args.num_proc)\n\n    aggregate_last_modified_dict = {}\n    for last_modified_dict in tqdm(shared_list):\n        aggregate_last_modified_dict |= last_modified_dict\n\n# Set the new fingerprint manually so the map function doesn't take forever hashing the huge aggregate_last_modified_dict.\ntext_ds = text_ds.map(lambda example: {args.last_modified_timestamp_column: aggregate_last_modified_dict.get((example[args.url_column], example[args.crawl_timestamp_column]), None)}, new_fingerprint=str(uuid.uuid4()))\n\ntext_ds.save_to_disk(args.output_dataset_name)\n\nif args.push_to_hub:\n    text_ds.push_to_hub(args.output_dataset_name)\n"
  },
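The join performed by `add_last_modified.py` boils down to building a dictionary keyed on `(url, crawl_timestamp)` and looking each text example up in it. A minimal sketch in plain Python, without Hugging Face datasets or multiprocessing; the field names (`url`, `crawl`, `last_modified`) and the sample records are hypothetical:

```python
def build_last_modified_index(records, url_key, crawl_key, lm_key):
    """Index last-modified timestamps by (url, crawl_timestamp)."""
    return {(r[url_key], r[crawl_key]): r[lm_key] for r in records}

def attach_last_modified(text_records, index, url_key, crawl_key, lm_key):
    """Attach a last-modified timestamp to each text record, or None
    when no matching (url, crawl_timestamp) pair exists in the index."""
    return [{**r, lm_key: index.get((r[url_key], r[crawl_key]))} for r in text_records]

last_modified_records = [
    {"url": "https://example.com/a", "crawl": "20220801", "last_modified": "20220730"},
]
text_records = [
    {"url": "https://example.com/a", "crawl": "20220801", "text": "hello"},
    {"url": "https://example.com/b", "crawl": "20220801", "text": "world"},
]

index = build_last_modified_index(last_modified_records, "url", "crawl", "last_modified")
joined = attach_last_modified(text_records, index, "url", "crawl", "last_modified")
```

The real script shards the index-building step across processes with a `multiprocessing.Manager` list and merges the per-process dicts at the end; the lookup semantics are the same.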
  {
    "path": "pipeline_scripts/common_crawl/deduplicate.py",
    "content": "from datasets import load_dataset, load_from_disk, concatenate_datasets\nfrom text_dedup.exact_dedup import GoogleSuffixArrayDeduplicator\nfrom shutil import rmtree\nfrom os import path\nimport argparse\nimport hashlib\nimport uuid\n\nparser = argparse.ArgumentParser(description=\"Applies varying levels of exact deduplication or exact suffix array deduplication to a Hugging Face dataset.\")\nparser.add_argument(\"--input_dataset_name\", help=\"Name of the input dataset.\", required=True)\nparser.add_argument(\"--output_dataset_name\", help=\"Name of the output dataset.\", required=True)\nparser.add_argument(\"--text_column\", help=\"Name of the dataset's text column.\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The split of the dataset to apply deduplication on. Not all datasets have splits, so this argument is optional.\")\nparser.add_argument(\"--num_proc\", type=int, help=\"The minimum number of processes to use.\", required=True)\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.\")\nparser.add_argument(\"--remove_whole_example\", action=\"store_true\", help= \"If an example in our courpus has a byte string of 100 or longer which is duplicated elsewhere in the corpus, then this option will result in the removal of the whole example. If this option is not specified, then only the substring is removed, not the whole example. In the paper for this deduplication method, they only remove the byte string, not the whole example. 
Removing the whole example will vastly shrink the size of the dataset, but it will ensure no gaps in text continuity.\")\nparser.add_argument(\"--only_exact_duplicates\", action=\"store_true\", help=\"Use this option if you want to skip the suffix array step and just get rid of examples that exactly match other examples in the dataset.\")\nparser.add_argument(\"--chunks\", type=int, default=1, help=\"Deduplication can be really memory-intensive. This option allows you to split the dataset up into n chunks, and perform deduplication independently on each of the chunks. Then the resulting deduplicated datasets are concatenated together at the end.\")\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input dataset from the Hugging Face Hub. If this argument is not used, then it is assumed that the input dataset is stored locally on the disk.\")\nargs = parser.parse_args()\n\nif args.load_from_hub_instead_of_disk:\n    if args.split is None:\n        ds = load_dataset(args.input_dataset_name)\n    else:\n        ds = load_dataset(args.input_dataset_name, split=args.split)\nelse:\n    if args.split is None:\n        ds = load_from_disk(args.input_dataset_name)\n    else:\n        ds = load_from_disk(args.input_dataset_name)[args.split]\n\ndeduplicated_ds_shard_list = []\nfor ds_shard_index in range(args.chunks):\n    ds_shard = ds.shard(num_shards=args.chunks, index=ds_shard_index)\n    \n    if args.remove_whole_example:\n        def check_for_ending_example_in_cluster(example, index, column, last_index):\n            if index == last_index:\n                return True\n            return ds_shard[index+1][column] != example[column]\n\n        # Sort the dataset so examples with the same first 100 bytes of text are grouped together.\n        print(\"Sorting by first 100 bytes of text\")\n        temp_column_name = str(uuid.uuid4())\n        ds_shard = ds_shard.map(lambda example: {temp_column_name: 
example[args.text_column].encode(\"u8\")[:100]}, num_proc=args.num_proc)\n        ds_shard = ds_shard.sort(temp_column_name)\n\n        # Filter away examples if their first 100 bytes of text exactly matches another example's first 100 bytes of text.\n        # This gets rid of a subset of the examples that the next step (suffix array deduplication) gets rid of, so we technically\n        # don't need to do it. But it speeds up the next step quite a bit to do this first.\n        last_index = len(ds_shard) - 1\n        len_before = len(ds_shard)\n        ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True)\n        ds_shard = ds_shard.remove_columns(temp_column_name)\n        print(f\"Got rid of all examples sharing first 100 bytes of text, as a speedup step. Removed {len_before - len(ds_shard)} from {len_before} examples.\")\n\n        # Do the same thing with the ending 100 bytes of text.\n        print(\"Sorting by last 100 bytes of text\")\n        temp_column_name = str(uuid.uuid4())\n        ds_shard = ds_shard.map(lambda example: {temp_column_name: example[args.text_column].encode(\"u8\")[-100:]}, num_proc=args.num_proc)\n        ds_shard = ds_shard.sort(temp_column_name)\n\n        last_index = len(ds_shard) - 1 \n        len_before = len(ds_shard)\n        ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True)\n        ds_shard = ds_shard.remove_columns(temp_column_name)\n        print(f\"Got rid of all examples sharing last 100 bytes of text, as a speedup step. 
Removed {len_before - len(ds_shard)} from {len_before} examples.\") \n\n    else:\n        print(\"Getting rid of exact duplicates\")\n        def check_for_ending_example_in_cluster(example, index, column, last_index):\n            if index == last_index:\n                return True\n            return ds_shard[index+1][column] != example[column]\n\n        temp_column_name = str(uuid.uuid4())\n        ds_shard = ds_shard.map(lambda example: {temp_column_name: hashlib.md5(example[args.text_column].encode()).hexdigest()}, num_proc=args.num_proc)\n        ds_shard = ds_shard.sort(temp_column_name)\n\n        last_index = len(ds_shard) - 1\n        ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True)\n        ds_shard = ds_shard.remove_columns(temp_column_name)\n        print(\"Got rid of exact duplicates\")\n\n    if path.exists(\".cache\"):\n        rmtree(\".cache\")\n\n    if not args.only_exact_duplicates:\n        # Now, do Suffix Array Substring Exact Deduplication.\n\n        deduplicator = GoogleSuffixArrayDeduplicator(k=100)\n\n        # We need to create this iterator over the dataset text column\n        # to ensure that not all of the text entries are loaded into memory at once.\n        class DatasetColumnIterator():\n            def __init__(self, dataset, column):\n                self.iterable_dataset = dataset.__iter__()\n                self.column = column\n\n            def __iter__(self):\n                return self\n\n            def __next__(self):\n                return self.iterable_dataset.__next__()[self.column]\n\n        slices = deduplicator.fit_predict(DatasetColumnIterator(ds_shard, args.text_column))\n        if args.remove_whole_example:\n            ds_shard = ds_shard.filter(lambda example, index: slices[index] == [], num_proc=args.num_proc, with_indices=True)\n        else:\n            def 
remove_slice_list(string, slice_list):\n                for s in slice_list:\n                    string = string.replace(string[s], \"\")\n                return string\n            # It's important to give this map function a uuid as its fingerprint. If we let it compute the fingerprint as a hash of the whole slice_list, then it will take too long.\n            ds_shard = ds_shard.map(lambda example, index: {args.text_column: remove_slice_list(example[args.text_column], slices[index])}, num_proc=args.num_proc, with_indices=True, new_fingerprint=str(uuid.uuid4()))\n            ds_shard = ds_shard.filter(lambda example: example[args.text_column] != \"\", num_proc=args.num_proc)\n\n    if path.exists(\".cache\"):\n        rmtree(\".cache\")\n\n    deduplicated_ds_shard_list.append(ds_shard)\n\nds = concatenate_datasets(deduplicated_ds_shard_list)\n\nds.save_to_disk(args.output_dataset_name)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n"
  },
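The exact-duplicate pass in `deduplicate.py` works by hashing each text, sorting so duplicates become adjacent, and keeping only the example that ends each run of identical values. A minimal sketch of that sort-and-compare-neighbors idea on a plain list (the sample texts are hypothetical):

```python
import hashlib

def exact_dedup(texts):
    """Keep one copy of each distinct text via hash-sort + adjacent compare."""
    hashed = sorted(texts, key=lambda t: hashlib.md5(t.encode()).hexdigest())
    kept = []
    for i, t in enumerate(hashed):
        # Keep an example only if it is the last of its cluster of identical texts.
        if i == len(hashed) - 1 or hashed[i + 1] != t:
            kept.append(t)
    return kept

deduped = exact_dedup(["a", "b", "a", "c", "a"])
```

Sorting by hash rather than by the raw text keeps the sort keys short and uniformly distributed; since identical texts produce identical hashes, every duplicate cluster ends up contiguous, so a single pass comparing each element to its successor suffices.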
  {
    "path": "pipeline_scripts/common_crawl/download_common_crawl.py",
    "content": "from os import mkdir, path\nfrom subprocess import run\nimport argparse\nimport random\n\nparser = argparse.ArgumentParser(description=\"Downloads raw Common Crawl WET files, or WAT files if you specify --paths_type=wat.\")\nparser.add_argument(\"--snapshots\", nargs='+', help=\"The Common Crawl snapshots to download files from, such as CC-MAIN-2022-33 or CC-MAIN-2022-27. Several can be specified.\", required=True)\nparser.add_argument(\"--download_dir\", help=\"The name of the directory to create and download WET files to.\", required=True)\nparser.add_argument(\"--segment_sampling_ratios\", type=float, nargs=\"+\", help=\"The ratios of each Common Crawl snapshot to use. The higher the ratio, the larger the generated dataset (but also the longer the time that the OLM pipeline runs). You should specify one for each snapshot. For example, if you specify '--snapshots CC-MAIN-2022-33 CC-MAIN-2022-27', then --segment_sampling_ratios could be '0.15 0.11'. This means that 15 percent of the segments from CC-MAIN-2022-33 will uniformly randomly sampled and used, and 11 percent of the segments from CC-MAIN-2022-27 will be uniformly randomly sampled and used.\", required=True)\nparser.add_argument(\"--tmp_dir\", default=\".tmp_download_common_crawl\", help=\"The directory where temporary files are stored. They are deleted when this script completes. 
Default is .tmp_download_common_crawl.\")\nparser.add_argument(\"--num_proc\", type=int, help=\"The number of processes to use.\", required=True)\nparser.add_argument(\"--seed\", type=int, default=42)\nparser.add_argument(\"--paths_type\", default=\"wet\")\nargs = parser.parse_args()\n\nrandom.seed(args.seed)\n\nif path.exists(args.download_dir):\n    run(f\"rm -r {args.download_dir}\", shell=True)\n\nif path.exists(args.tmp_dir):\n    run(f\"rm -r {args.tmp_dir}\", shell=True)\n\nrun(f\"mkdir {args.download_dir} {args.tmp_dir}\", shell=True)\nfor index in range(len(args.snapshots)):\n    # Download the data for a certain Common Crawl snapshot.\n    tmp_download_dir_name = f\"{args.tmp_dir}/ungoliant_downloads-{args.snapshots[index]}\"\n    run(f\"mkdir {tmp_download_dir_name}\", shell=True)\n    run(f\"wget https://data.commoncrawl.org/crawl-data/{args.snapshots[index]}/{args.paths_type}.paths.gz\", shell=True)\n    run(f\"gzip -d {args.paths_type}.paths.gz\", shell=True)\n    paths_name = f\"{args.paths_type}-{args.snapshots[index]}.paths\"\n    run(f\"mv {args.paths_type}.paths {paths_name}\", shell=True)\n    segments = open(paths_name, \"r\").readlines()\n    kept_segments = []\n    for segment in segments:\n        if random.random() <= args.segment_sampling_ratios[index]:\n            kept_segments.append(segment)\n    open(paths_name, \"w\").writelines(kept_segments)\n    run(f\"ungoliant download -t={args.num_proc} {paths_name} {tmp_download_dir_name}\", shell=True)\n    run(f\"rm {paths_name}\", shell=True)\n\n    # Now, add 0's to the filename for every downloaded file. 
We want the number of 0's to be different from those of any other Common Crawl snapshot\n    # because we want every file to have a unique name across multiple snapshot downloads.\n    if index > 0:\n        run(f\"cd {tmp_download_dir_name} && for f in * ; do mv \\\"$f\\\" {'0'*index}\\\"$f\\\" ; done\", shell=True)\n\n    # Now we can move the downloaded files into the main download dir which has the downloads from the rest of this for loop.\n    run(f\"mv {tmp_download_dir_name}/* {args.download_dir}/\", shell=True)\n    run(f\"rm -r {tmp_download_dir_name}\", shell=True)\n\nrun(f\"rm -r {args.tmp_dir}\", shell=True)\nrun(\"rm -r errors.txt\", shell=True)\n\n"
  },
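The segment-sampling step in `download_common_crawl.py` is a per-line Bernoulli trial: each entry of the snapshot's `*.paths` file is kept with probability equal to the snapshot's sampling ratio. A minimal sketch with a local RNG so the sample is reproducible; the path strings below are hypothetical:

```python
import random

def sample_segments(lines, ratio, seed=42):
    """Keep each line independently with probability `ratio`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() <= ratio]

paths = [f"crawl-data/CC-MAIN-2022-33/segments/{i}/wet/file-{i}.warc.wet.gz\n" for i in range(1000)]
kept = sample_segments(paths, 0.15)
```

The expected number of kept lines is `ratio * len(lines)` (about 150 here), though the exact count varies per seed since each line is an independent coin flip.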
  {
    "path": "pipeline_scripts/common_crawl/download_pipeline_processing_models.sh",
    "content": "# exit when any command fails\nset -e\n\npython data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering/download_sentencepiece_kenlm_models.py --output_dir_path=sp_kenlm_ft_models\nwget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P sp_kenlm_ft_models/\n"
  },
  {
    "path": "pipeline_scripts/common_crawl/experimental/add_perplexity.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\nimport sys\nsys.path.append(\"kenlm\")\nfrom model import KenlmModel\n\nparser = argparse.ArgumentParser(description=\"This script simply uses a kenlm trained on English Wikipedia to compute the perplexity of each text example in the dataset. It then sorts the dataset by perplexity so that the user can then select the range of perplexities that they want their data to be in.\")\nparser.add_argument(\"--input_dataset_name\", help=\"The name of the input dataset.\", required=True)\nparser.add_argument(\"--output_dataset_name\", help=\"The name of the output dataset.\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The split of the dataset to apply the filters to. Not all datasets have splits, so this is not a required argument.\")\nparser.add_argument(\"--text_column\", help=\"The name of the dataset column that contains the text.\", required=True)\nparser.add_argument(\"--num_proc\", type=int, help=\"The number of processes to use.\", required=True)\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.\")\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to pull the input dataset by name from the Hugging Face Hub. 
If this argument is not used, it is assumed that there is a dataset saved to the disk with the input dataset name.\")\nargs = parser.parse_args()\n\nif args.load_from_hub_instead_of_disk:\n    if args.split is None:\n        ds = load_dataset(args.input_dataset_name)\n    else:\n        ds = load_dataset(args.input_dataset_name, split=args.split)\nelse:\n    if args.split is None:\n        ds = load_from_disk(args.input_dataset_name)\n    else:\n        ds = load_from_disk(args.input_dataset_name)[args.split]\n\n\nmodel = KenlmModel.from_pretrained(\"kenlm/wikipedia\", \"en\")\nds = ds.map(lambda example: {\"kenlm_ppl\": model.get_perplexity(example[args.text_column])}, num_proc=args.num_proc)\nds = ds.sort(\"kenlm_ppl\")\nds.save_to_disk(args.output_dataset_name)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n"
  },
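The perplexity script above reduces to: score every text with a language model, then sort ascending so a band of perplexities can be sliced off either end. A sketch of that score-and-sort shape with a stand-in scorer, since the real script uses a KenLM model that is not available here; the records and the crude proxy score are hypothetical:

```python
def add_score_and_sort(records, score_fn, text_key="text", score_key="kenlm_ppl"):
    """Add a model score to each record, then sort records ascending by it."""
    scored = [{**r, score_key: score_fn(r[text_key])} for r in records]
    return sorted(scored, key=lambda r: r[score_key])

# Stand-in scorer: mean word length as a crude proxy (NOT real perplexity).
def fake_perplexity(text):
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)

records = [
    {"text": "short words here"},
    {"text": "extraordinarily sesquipedalian verbiage"},
]
ranked = add_score_and_sort(records, fake_perplexity)
```

In the real pipeline, `score_fn` would be `model.get_perplexity` from the bundled `KenlmModel`, applied via `Dataset.map` so the score becomes a column that `Dataset.sort` can order by.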
  {
    "path": "pipeline_scripts/common_crawl/experimental/filter_for_only_updated_websites.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\n\nparser = argparse.ArgumentParser(description=\"Experimental script to check and filter for a diff between examples with the same URL. It drastically reduces the size of the dataset in many cases, but it helps ensure that the text is up to date. The script only keeps an example if 1) the example shares a URL with other examples 2) the example is the most recent example with that URL 3) there was a diff between the example and an earlier example with the same URL.\")\nparser.add_argument(\"--input_dataset_name\", required=True)\nparser.add_argument(\"--output_dataset_name\", required=True)\nparser.add_argument(\"--text_column\", required=True)\nparser.add_argument(\"--timestamp_column\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The split of the datset to apply this filter to. Not all datsets have splits, so this argument is optional.\")\nparser.add_argument(\"--url_column\", required=True)\nparser.add_argument(\"--num_proc\", type=int, required=True)\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.\")\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input datset from the Hugging Face Hub. 
If this argument is not used, it is assumed that the input dataset is stored on the disk.\")\nargs = parser.parse_args()\n\nif args.load_from_hub_instead_of_disk:\n    if args.split is None:\n        ds = load_dataset(args.input_dataset_name)\n    else:\n        ds = load_dataset(args.input_dataset_name, split=args.split)\nelse:\n    if args.split is None:\n        ds = load_from_disk(args.input_dataset_name)\n    else:\n        ds = load_from_disk(args.input_dataset_name)[args.split]\n\n# Group so examples with the same URL are next to each other in the dataset.\nds = ds.sort(args.url_column)\n\n# Throw away examples with URLs occurring only once in the dataset.\nlast_index = len(ds) - 1\ndef check_for_adjacent_duplicate_url(example, index):\n    if index == last_index:\n        return ds[index-1][args.url_column] == example[args.url_column]\n    if index == 0:\n        return ds[index+1][args.url_column] == example[args.url_column]\n    return ds[index-1][args.url_column] == example[args.url_column] or ds[index+1][args.url_column] == example[args.url_column]\n\nds = ds.filter(lambda example, index: check_for_adjacent_duplicate_url(example, index), num_proc=args.num_proc, with_indices=True)\n\n# Sort the dataset so that examples with the same URL are still grouped together, but also arrange by timestamp from oldest to newest.\nds = ds.sort(args.timestamp_column)\nds = ds.sort(args.url_column, kind=\"stable\")\n\n# Keep only the pair of examples from each URL group with the oldest and newest timestamp.\nlast_index = len(ds) - 1\ndef check_for_ending_or_beginning_example_in_url_cluster(example, index):\n    if index in (last_index, 0):\n        return True\n    return ds[index-1][args.url_column] != example[args.url_column] or ds[index+1][args.url_column] != example[args.url_column]\n\nds = ds.filter(lambda example, index: check_for_ending_or_beginning_example_in_url_cluster(example, index), num_proc=args.num_proc, with_indices=True)\n\n# For each example pair, check 
to see if the text was modified between the old time and the new time.\n# If it was modified, keep the latest example and throw the old example out. We have evidence that this new example is up-to-date :D\n# If it wasn't modified, throw both examples out. We have no evidence that this new example is up-to-date :(\nlast_index = len(ds) - 1\ndef check_for_updated_example_in_url_pair(example, index):\n    if index == 0 or ds[index-1][args.url_column] != example[args.url_column]:\n        return False\n    if ds[index-1][args.text_column] != example[args.text_column]:\n        return True\n    return False\n\nds = ds.filter(lambda example, index: check_for_updated_example_in_url_pair(example, index), num_proc=args.num_proc, with_indices=True)\n\nds.save_to_disk(args.output_dataset_name)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n"
  },
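The sort-and-filter passes in `filter_for_only_updated_websites.py` implement a simple per-URL rule: keep the newest example for a URL only if its text differs from the oldest example for that URL, and drop URLs seen just once. A plain-Python sketch of that rule using grouping instead of the script's sort-based neighbor checks; the field names and sample records are hypothetical:

```python
from collections import defaultdict

def keep_updated_only(records, url_key="url", ts_key="timestamp", text_key="text"):
    """Keep the newest record per URL, but only when its text differs from
    the oldest record with the same URL (evidence the page was updated)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[url_key]].append(r)
    kept = []
    for group in groups.values():
        if len(group) < 2:
            continue  # URL crawled once: no evidence either way, drop it
        group.sort(key=lambda r: r[ts_key])
        oldest, newest = group[0], group[-1]
        if oldest[text_key] != newest[text_key]:
            kept.append(newest)  # text changed between crawls: keep newest copy
    return kept

records = [
    {"url": "a", "timestamp": 1, "text": "v1"},
    {"url": "a", "timestamp": 2, "text": "v2"},
    {"url": "b", "timestamp": 1, "text": "same"},
    {"url": "b", "timestamp": 2, "text": "same"},
    {"url": "c", "timestamp": 1, "text": "only"},
]
kept = keep_updated_only(records)
```

The real script avoids materializing groups in memory by sorting the dataset so same-URL examples are adjacent, then filtering with index-based neighbor comparisons; the kept set is the same.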
  {
    "path": "pipeline_scripts/common_crawl/experimental/kenlm/LICENSE",
    "content": "\n                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright 2021-2022 Eduardo González Ponferrada\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License."
  },
  {
    "path": "pipeline_scripts/common_crawl/experimental/kenlm/README.md",
    "content": "---\nlanguage: \n  - es\n  - af\n  - ar\n  - arz\n  - as\n  - bn\n  - fr\n  - sw\n  - eu\n  - ca\n  - zh\n  - en\n  - hi\n  - ur\n  - id\n  - pt\n  - vi\n  - gu\n  - kn\n  - ml\n  - mr\n  - ta\n  - te\n  - yo\ntags:\n- kenlm\n- perplexity\n- n-gram\n- kneser-ney\n- bigscience\nlicense: \"mit\"\ndatasets:\n- wikipedia\n- oscar\n---\n\nTaken from the amazing repo here: [https://huggingface.co/edugp/kenlm](https://huggingface.co/edugp/kenlm)\n\n# KenLM models\nThis repo contains several KenLM models trained on different tokenized datasets and languages.  \nKenLM models are probabilistic n-gram languge models that models. One use case of these models consist on fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlike to appear on Wikipedia (high perplexity), or very simple non-informative sentences that could appear repeatedly (low perplexity).\n\nAt the root of this repo you will find different directories named after the dataset models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files\n* `{language}.arpa.bin`: The trained KenLM model binary\n* `{language}.sp.model`: The trained SentencePiece model used for tokenization\n* `{language}.sp.vocab`: The vocabulary file for the SentencePiece model\n\nThe models have been trained using some of the preprocessing steps from [cc_net](https://github.com/facebookresearch/cc_net), in particular replacing numbers with zeros and normalizing punctuation. 
So, it is important to keep the default values for the parameters: `lower_case`, `remove_accents`, `normalize_numbers` and `punctuation` when using the pre-trained models in order to replicate the same pre-processing steps at inference time.\n\n# Dependencies\n* KenLM: `pip install https://github.com/kpu/kenlm/archive/master.zip`\n* SentencePiece: `pip install sentencepiece`\n\n# Example:\n```python\nfrom model import KenlmModel\n\n\n# Load model trained on English Wikipedia\nmodel = KenlmModel.from_pretrained(\"wikipedia\", \"en\")\n\n# Get perplexity\nmodel.get_perplexity(\"I am very perplexed\")\n# 341.3 (low perplexity, since the sentence style is formal and has no grammar mistakes)\n\nmodel.get_perplexity(\"im hella trippin\")\n# 46793.5 (high perplexity, since the sentence is colloquial and contains grammar mistakes)\n```\nIn the example above we see that, since Wikipedia is a collection of encyclopedic articles, a KenLM model trained on it will naturally give lower perplexity scores to sentences with formal language and no grammar mistakes than to colloquial sentences with grammar mistakes.\n"
  },
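The cc_net-style preprocessing mentioned in the README above (digits replaced with zeros, unicode punctuation normalized) can be sketched in plain Python. This is a minimal illustration: the punctuation mapping below is a small subset chosen for the example, not the full table shipped in `model.py`.

```python
import re

# Illustrative subset of the unicode punctuation mapping used by the models.
UNICODE_PUNCT = {"，": ",", "。": ".", "？": "?", "！": "!"}

# \d is Unicode-aware in Python, so fullwidth digits are replaced as well.
DIGIT_RE = re.compile(r"\d")


def normalize(line: str) -> str:
    """Replace every digit with 0, then map unicode punctuation to ASCII."""
    line = DIGIT_RE.sub("0", line)
    return "".join(UNICODE_PUNCT.get(c, c) for c in line)


print(normalize("Call 555-1234， ok？"))  # Call 000-0000, ok?
```

Applying the same normalization at training and inference time is what keeps the perplexity estimates comparable.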
  {
    "path": "pipeline_scripts/common_crawl/experimental/kenlm/model.py",
    "content": "import os\nimport re\nimport unicodedata\nfrom typing import Dict\n\nimport kenlm\nimport sentencepiece\nfrom huggingface_hub import cached_download, hf_hub_url\n\n\nclass SentencePiece:\n    def __init__(\n        self,\n        model: str,\n    ):\n        super().__init__()\n        self.sp = sentencepiece.SentencePieceProcessor()\n        self.sp.load(str(model))\n\n    def do(self, text: dict) -> dict:\n        tokenized = self.sp.encode_as_pieces(text)\n        return \" \".join(tokenized)\n\n\nclass KenlmModel:\n    digit_re: re.Pattern = re.compile(r\"\\d\")\n    unicode_punct: Dict[str, str] = {\n        \"，\": \",\",\n        \"。\": \".\",\n        \"、\": \",\",\n        \"„\": '\"',\n        \"”\": '\"',\n        \"“\": '\"',\n        \"«\": '\"',\n        \"»\": '\"',\n        \"１\": '\"',\n        \"」\": '\"',\n        \"「\": '\"',\n        \"《\": '\"',\n        \"》\": '\"',\n        \"´\": \"'\",\n        \"∶\": \":\",\n        \"：\": \":\",\n        \"？\": \"?\",\n        \"！\": \"!\",\n        \"（\": \"(\",\n        \"）\": \")\",\n        \"；\": \";\",\n        \"–\": \"-\",\n        \"—\": \" - \",\n        \"．\": \". 
\",\n        \"～\": \"~\",\n        \"’\": \"'\",\n        \"…\": \"...\",\n        \"━\": \"-\",\n        \"〈\": \"<\",\n        \"〉\": \">\",\n        \"【\": \"[\",\n        \"】\": \"]\",\n        \"％\": \"%\",\n        \"►\": \"-\",\n    }\n    unicode_punct_re = re.compile(f\"[{''.join(unicode_punct.keys())}]\")\n    non_printing_chars_re = re.compile(\n        f\"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]\"\n    )\n    kenlm_model_dir = None\n    sentence_piece_model_dir = None\n\n    def __init__(\n        self,\n        model_dataset: str,\n        language: str,\n        lower_case: bool = False,\n        remove_accents: bool = False,\n        normalize_numbers: bool = True,\n        punctuation: int = 1,\n    ):\n        self.model = kenlm.Model(os.path.join(model_dataset, f\"{language}.arpa.bin\"))\n        self.tokenizer = SentencePiece(os.path.join(model_dataset, f\"{language}.sp.model\"))\n        self.accent = remove_accents\n        self.case = lower_case\n        self.numbers = normalize_numbers\n        self.punct = punctuation\n\n    @classmethod\n    def from_pretrained(\n        cls,\n        model_dataset: str,\n        language: str,\n    ):\n        return cls(\n            model_dataset,\n            language,\n            False,\n            False,\n            True,\n            1,\n        )\n\n    def pp(self, log_score, length):\n        return 10.0 ** (-log_score / length)\n\n    def get_perplexity(self, doc: str, normalize_cc_net: bool = True):\n        if normalize_cc_net:\n            doc = self.normalize(\n                doc,\n                accent=self.accent,\n                case=self.case,\n                numbers=self.numbers,\n                punct=self.punct,\n            )\n        # Tokenize (after normalizing): See https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/mine.py#L352 for full pipeline\n        doc = self.tokenizer.do(doc)\n        
doc_log_score, doc_length = 0, 0\n        for line in doc.split(\"\\n\"):\n            log_score = self.model.score(line)\n            length = len(line.split()) + 1\n            doc_log_score += log_score\n            doc_length += length\n        return round(self.pp(doc_log_score, doc_length), 1)\n\n    def normalize(\n        self,\n        line: str,\n        accent: bool = True,\n        case: bool = True,\n        numbers: bool = True,\n        punct: int = 1,\n    ) -> str:\n        line = line.strip()\n        if not line:\n            return line\n        if case:\n            line = line.lower()\n        if accent:\n            line = self.strip_accents(line)\n        if numbers:\n            line = self.digit_re.sub(\"0\", line)\n        if punct == 1:\n            line = self.replace_unicode_punct(line)\n        elif punct == 2:\n            line = self.remove_unicode_punct(line)\n        line = self.remove_non_printing_char(line)\n        return line\n\n    def strip_accents(self, line: str) -> str:\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        nfd = unicodedata.normalize(\"NFD\", line)\n        output = [c for c in nfd if unicodedata.category(c) != \"Mn\"]\n        if len(output) == len(line):\n            return line\n        return \"\".join(output)\n\n    def replace_unicode_punct(self, text: str) -> str:\n        return \"\".join(self.unicode_punct.get(c, c) for c in text)\n\n    def remove_unicode_punct(self, text: str) -> str:\n        \"\"\"More aggressive version of replace_unicode_punct but also faster.\"\"\"\n        return self.unicode_punct_re.sub(\"\", text)\n\n    def remove_non_printing_char(self, text: str) -> str:\n        return self.non_printing_chars_re.sub(\"\", text)\n"
  },
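The perplexity computation in `model.py` above sums KenLM's per-line log10 scores and token counts over the document, then applies `10 ** (-log_score / length)`. A minimal sketch of that aggregation, with made-up log scores standing in for `kenlm.Model.score()` so no KenLM dependency is needed:

```python
# Hypothetical (log10_score, token_count + 1) pairs for a two-line document,
# standing in for kenlm.Model.score() output.
lines = [(-12.0, 5), (-8.0, 3)]

doc_log_score = sum(score for score, _ in lines)   # -20.0
doc_length = sum(length for _, length in lines)    # 8

# Same formula as KenlmModel.pp().
perplexity = 10.0 ** (-doc_log_score / doc_length)
print(round(perplexity, 1))  # 316.2
```

The `+ 1` in each line's length accounts for the end-of-sentence token that KenLM scores implicitly.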
  {
    "path": "pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.sp.model",
    "content": "version https://git-lfs.github.com/spec/v1\noid sha256:cf8147a573770b4e6c0d4df1dcb75453baa88190706dab406be7711b84f059de\nsize 931348\n"
  },
  {
    "path": "pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.sp.vocab",
    "content": "version https://git-lfs.github.com/spec/v1\noid sha256:a9c3c51a7736d736cc620cbe9a4c9430533469e57a54bc29546067a252f7d872\nsize 729017\n"
  },
  {
    "path": "pipeline_scripts/common_crawl/get_last_modified_dataset_from_wat_downloads.py",
    "content": "from datasets import load_dataset\nfrom tqdm import tqdm\nimport pandas as pd\nimport subprocess\nfrom multiprocessing import Process\nfrom os import walk, mkdir, path\nfrom shutil import rmtree\nimport dateutil\nimport dateparser\nimport argparse\nimport ujson\n\nparser = argparse.ArgumentParser(description=\"Turns WAT downloads from download_common_crawl.py into a Hugging Face dataset with Last-Modified timestamps, URLs, and crawl timestamps.\")\nparser.add_argument(\"--download_dir\", help=\"The directory of the downloaded WAT files.\", required=True)\nparser.add_argument(\"--output_dataset_name\", help=\"The name of the Hugging Face dataset which will be saved upon completion of this program.\", required=True)\nparser.add_argument(\"--num_proc\", type=int, help=\"The number of processes to use.\", required=True)\nparser.add_argument(\"--tmp_dir\", default=\".tmp_get_last_modified_dataset_from_wat_downloads\")\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the Hugging Face dataset to the Hugging Face Hub after saving a copy to the disk.\")\nargs = parser.parse_args()\n\nif path.exists(args.tmp_dir):\n    rmtree(args.tmp_dir)\n\nmkdir(args.tmp_dir)\n\nfilenames = next(walk(args.download_dir), (None, None, []))[2]\n\ndef split_a_into_n_parts(a, n):\n    k, m = divmod(len(a), n)\n    return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)]\n\nfilename_per_proc = [names for names in split_a_into_n_parts(filenames, args.num_proc) if len(names) != 0]\n\nprocesses = []\nfor filenames in filename_per_proc:\n    def get_dataset(filenames):\n        for filename in tqdm(filenames):\n            dataset_dict = {\"last_modified_timestamp\": [], \"url\": [], \"crawl_timestamp\": []}\n            file_path = path.join(args.download_dir, filename)\n            if filename.endswith(\".gz\"):\n                subprocess.run(f\"gzip -d {file_path}\", shell=True)\n                filename = filename[:-3]\n               
 file_path = path.join(args.download_dir, filename)\n            for line in open(file_path).readlines():\n                if line.startswith(\"{\"):\n                    parsed_line = ujson.loads(line)\n                    last_modified = parsed_line.get(\"Envelope\", {}).get(\"Payload-Metadata\", {}).get(\"HTTP-Response-Metadata\", {}).get(\"Headers\", {}).get(\"Last-Modified\", None)\n                    url = parsed_line.get(\"Envelope\", {}).get(\"WARC-Header-Metadata\", {}).get(\"WARC-Target-URI\", None)\n                    date = parsed_line.get(\"Envelope\", {}).get(\"WARC-Header-Metadata\", {}).get(\"WARC-Date\", None)\n                    if None not in (last_modified, url, date):\n                        try:\n                            last_modified_timestamp = dateutil.parser.parse(last_modified).timestamp()\n                        except Exception:\n                            try:\n                                last_modified_timestamp = dateparser.parse(last_modified).timestamp()\n                            except Exception:\n                                last_modified_timestamp = None\n                        if last_modified_timestamp is not None:\n                            crawl_timestamp = dateutil.parser.parse(date).timestamp()\n                            dataset_dict[\"last_modified_timestamp\"].append(last_modified_timestamp)\n                            dataset_dict[\"url\"].append(url)\n                            dataset_dict[\"crawl_timestamp\"].append(crawl_timestamp)\n            # Zip the download file again to save space.\n            subprocess.run(f\"gzip {file_path}\", shell=True)\n            pd.DataFrame(dataset_dict).to_parquet(path.join(args.tmp_dir, filename + \".filtered.parquet\"))\n    p = Process(target=get_dataset, args=(filenames,))\n    p.start()\n    processes.append(p)\n\nfor p in processes:\n    p.join()\n\nds = load_dataset(\"parquet\", data_files=path.join(args.tmp_dir, 
\"*.parquet\"))\nds.save_to_disk(args.output_dataset_name)\n\nrmtree(args.tmp_dir)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n\n"
  },
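The WAT script above parses `Last-Modified` headers with `dateutil.parser`, falling back to `dateparser` for non-standard formats, and converts the result to a Unix timestamp. For well-formed RFC 1123 headers the same idea can be shown with only the standard library; the header value below is a made-up example, not taken from the repo.

```python
from email.utils import parsedate_to_datetime

# A typical RFC 1123 Last-Modified header value (illustrative).
header = "Wed, 21 Oct 2015 07:28:00 GMT"

# parsedate_to_datetime returns a timezone-aware datetime, so .timestamp()
# gives the same epoch value regardless of the local timezone.
ts = parsedate_to_datetime(header).timestamp()
print(ts)  # 1445412480.0
```

The two-stage fallback in the script exists because real-world headers are frequently malformed; `dateparser` handles many formats the strict parsers reject, and anything neither library can parse is dropped.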
  {
    "path": "pipeline_scripts/common_crawl/get_text_dataset_from_wet_downloads.py",
    "content": "from datasets import load_dataset\nfrom tqdm import tqdm\nimport pandas as pd\nimport subprocess\nfrom multiprocessing import Process\nfrom os import walk, mkdir, path\nfrom shutil import move, rmtree\nimport dateutil\nimport argparse\n\nparser = argparse.ArgumentParser(description=\"Turns downloads from download_common_crawl.py into a Hugging Face dataset, split by language (language is identified using a FastText model). The dataset has a timestamp column for the time it was crawled, along with a url column and, of course, a text column.\")\nparser.add_argument(\"--download_dir\", help=\"The directory of the downloaded WET files.\", required=True)\nparser.add_argument(\"--output_dataset_name\", help=\"The name of the Hugging Face dataset which will be saved upon completion of this program.\", required=True)\nparser.add_argument(\"--num_proc\", type=int, help=\"The number of processes to use, at a minimum.\", required=True)\nparser.add_argument(\"--tmp_dir\", default=\".tmp_get_dataset_from_downloads\", help=\"The directory to store temporary files. The directory will be deleted upon completion of this script. 
Defaults to .tmp_get_dataset_from_downloads.\")\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the Hugging Face dataset to the Hugging Face Hub after saving a copy to the disk.\")\nargs = parser.parse_args()\n\nif path.exists(args.tmp_dir):\n    rmtree(args.tmp_dir)\n\nmkdir(args.tmp_dir)\n\ntmp_download_dir = path.join(args.tmp_dir, \"downloads\")\n\nmove(args.download_dir, tmp_download_dir)\n\nfilenames = next(walk(tmp_download_dir), (None, None, []))[2]\n\ndef split_a_into_n_parts(a, n):\n    k, m = divmod(len(a), n)\n    return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)]\n\nungoliant_pipeline_output_dirs = []\nfilename_per_directory = [names for names in split_a_into_n_parts(filenames, args.num_proc) if len(names) != 0]\nnum_files_awaiting_processing = 0\ndirs_awaiting_processing = []\ndef do_parallel_pipeline_processing(dirs_awaiting_processing):\n    processes = []\n    for obj in dirs_awaiting_processing:\n        p = subprocess.Popen(f\"ungoliant pipeline --lid-path=sp_kenlm_ft_models/lid.176.bin {obj['download_chunk_dir']} {obj['pipeline_output_dir']}\", shell=True)\n        processes.append(p)\n    for p in processes:\n        p.wait()\n\n# This loop runs the ungoliant pipeline num_proc times to generate num_proc output files.\n# The ungoliant pipeline is already parallelized, so this is not done to make the pipeline itself run faster.\n# Instead, it gives us num_proc output files that can be loaded in parallel into\n# pandas dataframes, which are eventually turned into a Hugging Face dataset.\nungoliant_pipeline_results = path.join(args.tmp_dir, \"ungoliant_pipeline_results\")\nmkdir(ungoliant_pipeline_results)\nfor i in range(len(filename_per_directory)):\n    download_chunk_dir = path.join(tmp_download_dir, \"chunk_\" + str(i))\n    mkdir(download_chunk_dir)\n    for filename in filename_per_directory[i]:\n        
num_files_awaiting_processing += 1\n        move(path.join(tmp_download_dir, filename), path.join(download_chunk_dir, filename))\n    pipeline_output_dir = path.join(ungoliant_pipeline_results, \"chunk_\" + str(i))\n    mkdir(pipeline_output_dir)\n    ungoliant_pipeline_output_dirs.append(pipeline_output_dir)\n    dirs_awaiting_processing.append({\"pipeline_output_dir\": pipeline_output_dir, \"download_chunk_dir\": download_chunk_dir})\n    if num_files_awaiting_processing >= args.num_proc:\n        do_parallel_pipeline_processing(dirs_awaiting_processing)\n        num_files_awaiting_processing = 0\n        dirs_awaiting_processing = []\n\ndo_parallel_pipeline_processing(dirs_awaiting_processing)\n\n# For some reason, datasets errors out if we try to load directly from the jsonl, so we need to do this first.\nprocesses = []\nfor ungoliant_pipeline_output_dir in ungoliant_pipeline_output_dirs:\n    language_filenames = [name for name in next(walk(ungoliant_pipeline_output_dir), (None, None, []))[2] if name.endswith(\"_meta.jsonl\")]\n    language_ids = [language_filename.split(\"_\")[0] for language_filename in language_filenames]\n    def convert_to_parquet_and_reformat(ungoliant_pipeline_output_dir):\n        for language_filename in language_filenames:\n            language_id = language_filename.split(\"_\")[0]\n            i = 0\n            print(\"Chunking the ungoliant json into several parquet files and reformatting before loading into huggingface dataset.\")\n            parquet_file_dir = path.join(ungoliant_pipeline_output_dir, language_id + \"_parquet\")\n            mkdir(parquet_file_dir)\n            for chunk in tqdm(pd.read_json(path.join(ungoliant_pipeline_output_dir, language_id + \"_meta.jsonl\"), lines=True, chunksize=10000)):\n                parquet_file_path = path.join(parquet_file_dir, str(i) + \".parquet\")\n                chunk[\"url\"] = chunk.apply(lambda row: row[\"warc_headers\"][\"warc-target-uri\"], axis=1)\n                
chunk[\"crawl_timestamp\"] = chunk.apply(lambda row: dateutil.parser.parse(row[\"warc_headers\"][\"warc-date\"]).timestamp(), axis=1)\n                chunk.drop(columns=[\"warc_headers\", \"metadata\"], inplace=True)\n                chunk.rename(columns={\"content\": \"text\"}, inplace=True)\n                chunk.to_parquet(parquet_file_path)\n                i += 1\n    p = Process(target=convert_to_parquet_and_reformat, args=(ungoliant_pipeline_output_dir,))\n    p.start()\n    processes.append(p)\n\nfor p in processes:\n    p.join()\n\ndata_files = {language_id: [path.join(ungoliant_pipeline_output_dir, language_id + \"_parquet\", \"*.parquet\") for ungoliant_pipeline_output_dir in ungoliant_pipeline_output_dirs] for language_id in language_ids}\nds = load_dataset(\"parquet\", data_files=data_files)\nds.save_to_disk(args.output_dataset_name)\nrmtree(args.tmp_dir)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n"
  },
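Both Common Crawl scripts above rely on the same `split_a_into_n_parts` helper to divide the list of downloaded files into near-equal chunks, one per worker process. Its behavior is easy to check in isolation: the first `m` parts get `k + 1` items and the rest get `k`, where `k, m = divmod(len(a), n)`.

```python
def split_a_into_n_parts(a, n):
    # First m parts get k+1 items, the remaining parts get k items.
    k, m = divmod(len(a), n)
    return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)]


print(split_a_into_n_parts(list(range(10)), 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

When `len(a) < n`, some parts come back empty, which is why both scripts filter the result with `if len(names) != 0` before spawning processes.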
  {
    "path": "pipeline_scripts/common_crawl/remove_wikipedia_urls.py",
    "content": "from datasets import load_dataset, load_from_disk\nimport argparse\n\nparser = argparse.ArgumentParser(description=\"Removes all examples from a Hugging Face dataset if they have a Wikipedia URL. This script is intened to be used if you eventually want to merge the dataset with a Wikipedia snapshot. In that case, examples from Wikipedia in this dataset are redundant.\")\nparser.add_argument(\"--input_dataset_name\", help=\"Input dataset name.\", required=True)\nparser.add_argument(\"--output_dataset_name\", help=\"Output dataset name.\", required=True)\nparser.add_argument(\"--url_column\", help=\"Name of the URL column of the dataset.\", required=True)\nparser.add_argument(\"--split\", default=None, help=\"The split of the dataset to use. Some datasets don't have splits, so it is optional.\")\nparser.add_argument(\"--num_proc\", type=int, help=\"The number of processes to use.\")\nparser.add_argument(\"--push_to_hub\", action=\"store_true\", help=\"Whether to push the output dataset to the Hugging Face hub after saving to the disk.\")\nparser.add_argument(\"--load_from_hub_instead_of_disk\", action=\"store_true\", help=\"Whether to load the input dataset by name from the Hugging Face hub. If this argument isn't specified then the input dataset will be loaded from a directory of the same name on the disk.\")\nargs = parser.parse_args()\n\nif args.load_from_hub_instead_of_disk:\n    if args.split is None:\n        ds = load_dataset(args.input_dataset_name)\n    else:\n        ds = load_dataset(args.input_dataset_name, split=args.split)\nelse:\n    if args.split is None:\n        ds = load_from_disk(args.input_dataset_name)\n    else:\n        ds = load_from_disk(args.input_dataset_name)[args.split]\n\nds = ds.filter(lambda example: not example[args.url_column].startswith(\"https://en.wikipedia.org/wiki/\"), num_proc=args.num_proc)\n\nds.save_to_disk(args.output_dataset_name)\n\nif args.push_to_hub:\n    ds.push_to_hub(args.output_dataset_name)\n"
  },
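The filter in `remove_wikipedia_urls.py` above boils down to a URL-prefix predicate applied to every example; note it only matches the English Wikipedia prefix. A dependency-free sketch (the example records are made up):

```python
WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"


def keep(example: dict) -> bool:
    # Drop examples whose URL points at English Wikipedia.
    return not example["url"].startswith(WIKIPEDIA_PREFIX)


examples = [
    {"url": "https://en.wikipedia.org/wiki/Perplexity"},
    {"url": "https://example.com/blog/post"},
]
print([e["url"] for e in examples if keep(e)])
# ['https://example.com/blog/post']
```

In the real script the same predicate runs through `datasets.Dataset.filter` with `num_proc` workers, so it scales to Common Crawl-sized datasets.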
  {
    "path": "pipeline_scripts/wikipedia/README.md",
    "content": "Per the repository [here](https://huggingface.co/datasets/olm/wikipedia), just run this Python code. It uses all CPUs available and should take less than an hour if you have a lot of CPUs (on the order of 100).\n\n```\nfrom datasets import load_dataset\n\nds = load_dataset(\"olm/wikipedia\", language=\"en\", date=\"20220920\")\n\nds.save_to_disk(\"wikipedia_en_20220920\")\nds.push_to_hub(\"wikipedia_en_20220920\")\n````\n\nThe code pulls the Wikipedia snapshot for the given date and language and does all the processing required to turn it into a clean pretraining dataset. You can get the dates for the latest wikipedia snapshots here: [https://dumps.wikimedia.org/enwiki/](https://dumps.wikimedia.org/enwiki/).\n"
  },
  {
    "path": "requirements.txt",
    "content": "datasets==2.6.1\nemoji==1.7.0\nfasttext==0.9.2\nsentencepiece==0.1.97\npypi-kenlm==0.1.20220713\ntext-dedup==0.2.1\nargparse==1.4.0\ndateparser==1.1.1\nmwparserfromhell==0.6.4\nmatplotlib==3.6.2\nmultiprocess==0.70.13\n"
  }
]