[
  {
    "path": ".github/workflows/codeql.yml",
    "content": "# For most projects, this workflow file will not need changing; you simply need\n# to commit it to your repository.\n#\n# You may wish to alter this file to override the set of languages analyzed,\n# or to provide custom queries or build logic.\n#\n# ******** NOTE ********\n# We have attempted to detect the languages in your repository. Please check\n# the `language` matrix defined below to confirm you have the correct set of\n# supported CodeQL languages.\n#\nname: \"CodeQL Advanced\"\n\non:\n  push:\n    branches: [ \"main\" ]\n  pull_request:\n    branches: [ \"main\" ]\n  schedule:\n    - cron: '24 3 * * 5'\n\njobs:\n  analyze:\n    name: Analyze (${{ matrix.language }})\n    # Runner size impacts CodeQL analysis time. To learn more, please see:\n    #   - https://gh.io/recommended-hardware-resources-for-running-codeql\n    #   - https://gh.io/supported-runners-and-hardware-resources\n    #   - https://gh.io/using-larger-runners (GitHub.com only)\n    # Consider using larger runners or machines with greater resources for possible analysis time improvements.\n    runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}\n    permissions:\n      # required for all workflows\n      security-events: write\n\n      # required to fetch internal or private CodeQL packs\n      packages: read\n\n      # only required for workflows in private repositories\n      actions: read\n      contents: read\n\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n        - language: python\n          build-mode: none\n        # CodeQL supports the following values keywords for 'language': 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'swift'\n        # Use `c-cpp` to analyze code written in C, C++ or both\n        # Use 'java-kotlin' to analyze code written in Java, Kotlin or both\n        # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both\n        # 
To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,\n        # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.\n        # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how\n        # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages\n    steps:\n    - name: Checkout repository\n      uses: actions/checkout@v4\n\n    # Initializes the CodeQL tools for scanning.\n    - name: Initialize CodeQL\n      uses: github/codeql-action/init@v3\n      with:\n        languages: ${{ matrix.language }}\n        build-mode: ${{ matrix.build-mode }}\n        # If you wish to specify custom queries, you can do so here or in a config file.\n        # By default, queries listed here will override any specified in a config file.\n        # Prefix the list here with \"+\" to use these queries and those in the config file.\n\n        # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs\n        # queries: security-extended,security-and-quality\n\n    # If the analyze step fails for one of the languages you are analyzing with\n    # \"We were unable to automatically build your code\", modify the matrix above\n    # to set the build mode to \"manual\" for that language. 
Then modify this step\n    # to build your code.\n    # ℹ️ Command-line programs to run using the OS shell.\n    # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun\n    - if: matrix.build-mode == 'manual'\n      shell: bash\n      run: |\n        echo 'If you are using a \"manual\" build mode for one or more of the' \\\n          'languages you are analyzing, replace this with the commands to build' \\\n          'your code, for example:'\n        echo '  make bootstrap'\n        echo '  make release'\n        exit 1\n\n    - name: Perform CodeQL Analysis\n      uses: github/codeql-action/analyze@v3\n      with:\n        category: \"/language:${{matrix.language}}\"\n"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "content": "# Microsoft Open Source Code of Conduct\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\n\nResources:\n\n- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)\n- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)\n- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns\n"
  },
  {
    "path": "DomainSpecific/.gitignore",
    "content": "__pycache__/\ndependency/models/\nenv_ready\nworkspace\n"
  },
  {
    "path": "DomainSpecific/configs/cc_math_filter.CC-MAIN-2023-23.json",
    "content": "{\n    \"name\": \"cc_math_extraction\",\n    \"description\": \"math extraction from cc parquet file - 202323.\",\n    \"date\": \"20240513\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"pq_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/pqs.CC-MAIN-2023-23.txt\"\n        },\n        \"filtered_pq_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_pqs/math/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        }\n    },\n    \n    \"output\":\n    {\n        \"filtered_pq_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer01\":\n        {\n            \"type\": \"From_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"pq_name_list_file_path\"],\n            \"output\": [\"pq_names\"]\n        },\n        \"layer01_par\":\n        {\n            \"type\": \"Data_Partition\",\n            \"joint\": \"Default\",\n            \"input\": [\"pq_names\"],\n            \"output\": [\"pq_names\"]\n        },\n        \"layer01_sam\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": -1\n            },\n            \"input\": [\"pq_names\"],\n            \"output\": [\"pq_names\"]\n        },\n        \"layer02\":\n        {\n            \"type\": \"Math_Filter\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_pqs/math/CC-MAIN-2023-23/\"\n            },\n            \"input\": [\"pq_names\"],\n            \"output\": [\"filtered_pq_names\"]\n        },\n  
      \"layer03\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filtered_pq_names\", \"filtered_pq_name_list_file_path\"],\n            \"output\": [\"filtered_pq_name_list_file_path\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/configs/cc_openquestion_filter.CC-MAIN-2023-23.json",
    "content": "{\n    \"name\": \"cc_openquestion_extraction\",\n    \"description\": \"open question extraction from cc parquet file - 202323.\",\n    \"date\": \"20240527\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"pq_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/pqs.CC-MAIN-2023-23.txt\"\n        },\n        \"filtered_pq_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_pqs/openquestion/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        }\n    },\n    \n    \"output\":\n    {\n        \"filtered_pq_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer01\":\n        {\n            \"type\": \"From_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"pq_name_list_file_path\"],\n            \"output\": [\"pq_names\"]\n        },\n        \"layer01_par\":\n        {\n            \"type\": \"Data_Partition\",\n            \"joint\": \"Default\",\n            \"input\": [\"pq_names\"],\n            \"output\": [\"pq_names\"]\n        },\n        \"layer01_sam\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": -1\n            },\n            \"input\": [\"pq_names\"],\n            \"output\": [\"pq_names\"]\n        },\n        \"layer02\":\n        {\n            \"type\": \"OpenQuestion_Filter\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_pqs/openquestion/CC-MAIN-2023-23/\"\n            },\n            \"input\": [\"pq_names\"],\n            
\"output\": [\"filtered_pq_names\"]\n        },\n        \"layer03\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filtered_pq_names\", \"filtered_pq_name_list_file_path\"],\n            \"output\": [\"filtered_pq_name_list_file_path\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/configs/cc_warc_download.CC-MAIN-2023-23.json",
    "content": "{\n    \"name\": \"cc_warc_download\",\n    \"description\": \"download warc files for a specific cc snapshot - CC-MAIN-2023-23.\",\n    \"date\": \"20231011\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"warc_url_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/urls.CC-MAIN-2023-23.txt\"\n        },\n        \"success_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_warcs/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        },\n        \"fail_warc_url_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_warcs/CC-MAIN-2023-23/fail_urls.{worker_id}.txt\"\n        }\n    },\n    \n    \"output\":\n    {\n        \"success_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        },\n        \"fail_warc_url_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer01\":\n        {\n            \"type\": \"From_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"warc_url_list_file_path\"],\n            \"output\": [\"warc_urls\"]\n        },\n        \"layer01_par\":\n        {\n            \"type\": \"Data_Partition\",\n            \"joint\": \"Default\",\n            \"input\": [\"warc_urls\"],\n            \"output\": [\"warc_urls\"]\n        },\n        \"layer01_sam\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": 1\n            },\n            \"input\": [\"warc_urls\"],\n            \"output\": [\"warc_urls\"]\n        },\n        \"layer02\":\n        {\n            \"type\": \"Download_Warc_File\",\n            \"joint\": \"Map\",\n            \"param\":\n      
      {\n                \"DOWNLOAD_FOLDER\": \"{workspace_dir}/cc_warcs/CC-MAIN-2023-23\",\n                \"CONNECTS\": 16,\n                \"TRIES\": 3\n            },\n            \"input\": [\"warc_urls\"],\n            \"output\": [\"success_warc_names\", \"fail_warc_urls\"]\n        },\n        \"layer03\":\n        {\n            \"type\": \"Data_Filter\",\n            \"param\":\n            {\n                \"FILTERS\": [null]\n            },\n            \"input\": [\"success_warc_names\"],\n            \"output\": [\"success_warc_names\"]\n        },\n        \"layer04\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"success_warc_names\", \"success_warc_name_list_file_path\"],\n            \"output\": [\"success_warc_name_list_file_path\"]\n        },\n        \"layer05\":\n        {\n            \"type\": \"Data_Filter\",\n            \"param\":\n            {\n                \"FILTERS\": [null]\n            },\n            \"input\": [\"fail_warc_urls\"],\n            \"output\": [\"fail_warc_urls\"]\n        },\n        \"layer06\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"fail_warc_urls\", \"fail_warc_url_list_file_path\"],\n            \"output\": [\"fail_warc_url_list_file_path\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/configs/cc_warc_filter.CC-MAIN-2023-23.json",
    "content": "{\n    \"name\": \"cc_warc_filter\",\n    \"description\": \"filter html containing specific tags on warc files - CC-MAIN-2023-23.\",\n    \"date\": \"20230825\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_warcs/CC-MAIN-2023-23/paths.txt\"\n        },\n        \"filtered_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        }\n    },\n    \n    \"output\":\n    {\n        \"filtered_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer01\":\n        {\n            \"type\": \"From_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"warc_name_list_file_path\"],\n            \"output\": [\"warc_names\"]\n        },\n        \"layer01_par\":\n        {\n            \"type\": \"Data_Partition\",\n            \"joint\": \"Default\",\n            \"input\": [\"warc_names\"],\n            \"output\": [\"warc_names\"]\n        },\n        \"layer01_sam\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": -1\n            },\n            \"input\": [\"warc_names\"],\n            \"output\": [\"warc_names\"]\n        },\n        \"layer02\":\n        {\n            \"type\": \"Warc_Filter\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_warcs/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/\",\n                \"TAGS\": [\"<math\", \"<annotation\", \"=\\\"math\", \"athjax\", 
\"math-container\", \"class=\\\"tex\\\"\", \"tex.cgi\", \"latex.php\", \"katex.min.css\", \"\\\\frac\", \"codecogs\", \"<code\", \"<pre\"]\n            },\n            \"input\": [\"warc_names\"],\n            \"output\": [\"filtered_warc_names\"]\n        },\n        \"layer03\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filtered_warc_names\", \"filtered_warc_name_list_file_path\"],\n            \"output\": [\"filtered_warc_name_list_file_path\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/configs/cc_warc_to_wet.code.CC-MAIN-2023-23.json",
    "content": "{\n    \"name\": \"cc_warc_to_wet\",\n    \"description\": \"convert cc warc to wet and keep math formula - CC-MAIN-2023-23.\",\n    \"date\": \"20230825\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"filter_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/paths.txt\"\n        },\n        \"encode_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_wets/encode_warc_code/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        },\n        \"filter_wet_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_wets/filter_wet_code/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        },\n        \"decode_wet_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_wets/decode_wet_code/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        }\n    },\n    \n    \"output\":\n    {\n        \"decode_wet_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer01\":\n        {\n            \"type\": \"From_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filter_warc_name_list_file_path\"],\n            \"output\": [\"filter_warc_names\"]\n        },\n        \"layer01_par\":\n        {\n            \"type\": \"Data_Partition\",\n            \"joint\": \"Default\",\n            \"input\": [\"filter_warc_names\"],\n            \"output\": [\"filter_warc_names\"]\n        },\n        \"layer01_sam\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": -1\n            },\n            \"input\": 
[\"filter_warc_names\"],\n            \"output\": [\"filter_warc_names\"]\n        },\n        \"layer02\":\n        {\n            \"type\": \"Warc_Encode\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_wets/encode_warc_code/CC-MAIN-2023-23\",\n                \"TAG\": \"code\"\n            },\n            \"input\": [\"filter_warc_names\"],\n            \"output\": [\"encode_warc_names\"]\n        },\n        \"layer02_out\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"encode_warc_names\", \"encode_warc_name_list_file_path\"],\n            \"output\": [\"encode_warc_name_list_file_path\"]\n        },\n        \"layer03\":\n        {\n            \"type\": \"Warc_To_Wet\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_wets/encode_warc_code/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_wets/filter_wet_code/CC-MAIN-2023-23\"\n            },\n            \"input\": [\"encode_warc_names\"],\n            \"output\": [\"filter_wet_names\"]\n        },\n        \"layer03_out\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filter_wet_names\", \"filter_wet_name_list_file_path\"],\n            \"output\": [\"filter_wet_name_list_file_path\"]\n        },\n        \"layer04\":\n        {\n            \"type\": \"Wet_Decode\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_wets/filter_wet_code/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_wets/decode_wet_code/CC-MAIN-2023-23\",\n                \"TAG\": \"code\"\n            },\n        
    \"input\": [\"filter_wet_names\"],\n            \"output\": [\"decode_wet_names\"]\n        },\n        \"layer04_out\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"decode_wet_names\", \"decode_wet_name_list_file_path\"],\n            \"output\": [\"decode_wet_name_list_file_path\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/configs/cc_warc_to_wet.math.CC-MAIN-2023-23.json",
    "content": "{\n    \"name\": \"cc_warc_to_wet\",\n    \"description\": \"convert cc warc to wet and keep math formula - CC-MAIN-2023-23.\",\n    \"date\": \"20230825\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"filter_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/paths.txt\"\n        },\n        \"encode_warc_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_wets/encode_warc_math/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        },\n        \"filter_wet_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_wets/filter_wet_math/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        },\n        \"decode_wet_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\",\n            \"value\": \"{workspace_dir}/cc_wets/decode_wet_math/CC-MAIN-2023-23/paths.{worker_id}.txt\"\n        }\n    },\n    \n    \"output\":\n    {\n        \"decode_wet_name_list_file_path\":\n        {\n            \"type\": \"Mem_Str\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer01\":\n        {\n            \"type\": \"From_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filter_warc_name_list_file_path\"],\n            \"output\": [\"filter_warc_names\"]\n        },\n        \"layer01_par\":\n        {\n            \"type\": \"Data_Partition\",\n            \"joint\": \"Default\",\n            \"input\": [\"filter_warc_names\"],\n            \"output\": [\"filter_warc_names\"]\n        },\n        \"layer01_sam\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": -1\n            },\n            \"input\": 
[\"filter_warc_names\"],\n            \"output\": [\"filter_warc_names\"]\n        },\n        \"layer02\":\n        {\n            \"type\": \"Warc_Encode\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_wets/encode_warc_math/CC-MAIN-2023-23\",\n                \"TAG\": \"math\"\n            },\n            \"input\": [\"filter_warc_names\"],\n            \"output\": [\"encode_warc_names\"]\n        },\n        \"layer02_out\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"encode_warc_names\", \"encode_warc_name_list_file_path\"],\n            \"output\": [\"encode_warc_name_list_file_path\"]\n        },\n        \"layer03\":\n        {\n            \"type\": \"Warc_To_Wet\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_wets/encode_warc_math/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_wets/filter_wet_math/CC-MAIN-2023-23\"\n            },\n            \"input\": [\"encode_warc_names\"],\n            \"output\": [\"filter_wet_names\"]\n        },\n        \"layer03_out\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"filter_wet_names\", \"filter_wet_name_list_file_path\"],\n            \"output\": [\"filter_wet_name_list_file_path\"]\n        },\n        \"layer04\":\n        {\n            \"type\": \"Wet_Decode\",\n            \"joint\": \"FlatMap\",\n            \"param\":\n            {\n                \"INPUT_FOLDER\": \"{workspace_dir}/cc_wets/filter_wet_math/CC-MAIN-2023-23\",\n                \"OUTPUT_FOLDER\": \"{workspace_dir}/cc_wets/decode_wet_math/CC-MAIN-2023-23\",\n                \"TAG\": \"math\"\n            },\n        
    \"input\": [\"filter_wet_names\"],\n            \"output\": [\"decode_wet_names\"]\n        },\n        \"layer04_out\":\n        {\n            \"type\": \"To_Line_File\",\n            \"joint\": \"Default\",\n            \"input\": [\"decode_wet_names\", \"decode_wet_name_list_file_path\"],\n            \"output\": [\"decode_wet_name_list_file_path\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/configs/network_template.json",
    "content": "{\n    \"name\": \"template_network\",\n    \"description\": \"Toy example of network.\",\n    \"date\": \"20230713\",\n    \"version\": \"1.0.0\",\n    \"author\": \"yanghuan\",\n    \"backend\": \"Native\",\n    \n    \"input\":\n    {\n        \"data1\":\n        {\n            \"type\": \"Mem_StrList\",\n            \"value\": [\"1\", \"2\", \"3\", \"4\", \"5\"]\n        }\n    },\n    \n    \"output\":\n    {\n        \"data2\":\n        {\n            \"type\": \"Mem_StrList\"\n        }\n    },\n    \n    \"layer\":\n    {\n        \"layer1\":\n        {\n            \"type\": \"Data_Sample\",\n            \"joint\": \"Default\",\n            \"param\":\n            {\n                \"N\": 2\n            },\n            \"input\": [\"data1\"],\n            \"output\": [\"data2\"]\n        }\n    }\n}\n"
  },
  {
    "path": "DomainSpecific/core/__init__.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nfrom .data import DataType\nfrom .layer import Layer, JointType\nfrom .layers import LayerType, LayerType2Func\nfrom .network import Network\n\n__all__ = [\"DataType\", \"Layer\", \"JointType\", \"LayerType\", \"LayerType2Func\", \"Network\"]\n"
  },
  {
    "path": "DomainSpecific/core/data.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nfrom enum import Enum\n\nclass DataType(Enum):\n    # Memory Data\n    Mem_Any          = 0\n    Mem_Binary       = 1\n    Mem_Int          = 2\n    Mem_Float        = 3\n    Mem_Str          = 4\n    Mem_Warc         = 5\n    Mem_Dict         = 6\n    Mem_Index        = 7\n    Mem_Vector       = 8\n    Mem_Record       = 9\n    Mem_List         = 10\n    Mem_BinaryList   = 11\n    Mem_IntList      = 12\n    Mem_FloatList    = 13\n    Mem_StrList      = 14\n    Mem_WarcList     = 15\n    Mem_DictList     = 16\n    Mem_IndexList    = 17\n    Mem_VectorList   = 18\n    Mem_RecordList   = 19\n\n    # Disk Data (Deprecated)\n    File_Any         = 100\n    File_Binary      = 101\n    File_Text        = 102\n    File_Warc        = 103\n    File_Parquet     = 104\n    File_Json        = 105\n    File_Index       = 106\n    File_Vector      = 107\n    File_AnyLines    = 110\n    File_TextLines   = 111\n    File_JsonLines   = 112\n    File_VectorLines = 113\n\n    @staticmethod\n    def belong(a, b):\n        if not isinstance(a, DataType) or not isinstance(b, DataType):\n            return False\n        return a == b or \\\n               (b.value % 10 == 0 and 0 <= a.value - b.value < 10) or \\\n               (b == DataType.Mem_Any and a.value < 100) or \\\n               (b == DataType.File_Any and a.value >= 100)\n\nclass Data:\n    \"\"\"\n    Data class (Deprecated).\n    \"\"\"\n    def __init__(self, type=DataType.Mem_Any, value=None):\n        self.type = type if isinstance(type, DataType) else DataType[type]\n        self.value = value\n\n\nif __name__ == \"__main__\":\n    data = Data()\n    print(data)\n"
  },
  {
    "path": "DomainSpecific/core/layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nfrom enum import Enum\nfrom tqdm import tqdm\nfrom core.layers import LayerType, LayerType2Func\n\nclass JointType(Enum):\n    Default = 0 # Only process data as whole (frequently used in data IO and control layers).\n    Map     = 1 # Firstly split data list into data unit, then process data unit to any type, finnaly return the list of processed data unit.\n    FlatMap = 2 # Firstly split data list into data unit, then process data unit to list type, then concat the whole processed data lists, finnally return the concated data list.\n\nclass Layer:\n    def __init__(self, type, joint=JointType.Default, repetition=1, param=dict(), input_names=list(), output_names=list()):\n        self.type = type if isinstance(type, LayerType) else LayerType[type]\n        self.func, self.input_types, self.output_types, self.enabled = LayerType2Func[self.type]\n        self.joint = joint if isinstance(joint, JointType) else JointType[joint]\n        self.repetition = repetition\n        self.param = param\n        self.input_names = input_names\n        self.output_names = output_names\n\n    def __call__(self, inputs, worker_id=0, worker_num=1, variables=dict()):\n        outputs = list()\n        try:\n            variables[\"worker_id\"] = worker_id\n            variables[\"worker_num\"] = worker_num\n\n            if not isinstance(inputs, list):\n                raise Exception(f\"The inputs of layer should be list data type.\")\n            if len(inputs) != len(self.input_types):\n                raise Exception(f\"The number of inputs is not {len(self.input_types)}.\")\n            for i, (data, input_type) in enumerate(zip(inputs, self.input_types)):\n                # TODO: add the check of input type.\n                # check the data type of input.\n                #if 
data.type != DataType[input_type]:\n                #    raise Exception(f\"The {i}th data, whose type is {data.type.name}, does not match the input type {input_type}\")\n                # Condition of empty input.\n                if data is None:\n                    outputs = [None for _ in self.output_types]\n                    return outputs\n\n            # TODO: to address the situation of repetition > 1.\n            for i in range(self.repetition):\n                if self.joint == JointType.Default:\n                    values = list(self.func(*inputs, variables, **self.param))\n                else:\n                    n = min([len(data) for data in inputs])\n                    if n != max([len(data) for data in inputs]):\n                        raise Exception(f\"Element amount of input datas are not equal.\")\n\n                    values = [[] for _ in self.output_types]\n                    for i in tqdm(range(n), desc=f\"Layer: {self.type.name}, worker_id: {worker_id}/{worker_num}\"):\n                        _values = self.func(*[data[i] for data in inputs], variables, **self.param)\n                        for value, _value in zip(values, _values):\n                            if _value is None:\n                                continue\n                            if self.joint == JointType.Map:\n                                value.append(_value)\n                            elif self.joint == JointType.FlatMap:\n                                if not isinstance(_value, list):\n                                    raise Exception(f\"The output of layer should be list data type.\")\n                                value.extend(_value)\n                            else:\n                                raise Exception(f\"Using unsupported joint type for {self.type.name} layer.\")\n\n                outputs = values\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception as ex:\n            traceback.print_exc()\n   
     return outputs\n\n\nif __name__ == \"__main__\":\n    inputs = [[\"a\", \"b\", \"c\", \"d\", \"e\"]]\n    layer = Layer(LayerType.Data_Sample, param={\"N\": 2})\n    outputs = layer(inputs)\n    print(layer)\n"
  },
  {
    "path": "DomainSpecific/core/layers/__init__.py",
    "content": "from enum import Enum\nfrom ..data import DataType\n\nfrom .template_layer import template_layer\n\n# Control layers\nfrom .control import *\n\n# Network (download/upload) layers\nfrom .network import *\n\n# IO (read/write) layers\nfrom .io import *\n\n# Extract layers\nfrom .extract import *\n\n# Transform layers\nfrom .transform import *\n\nclass LayerType(Enum):\n    Template                     = 0\n\n    # Control\n    Data_Sample                  = 1\n    Data_Concat                  = 2\n    Data_Order                   = 3\n    Data_Partition               = 4\n    Data_Filter                  = 5\n    Data_Shuffle                 = 6\n\n    # Network - download/upload\n    Upload_File_To_Blob          = 101\n    Upload_Bytes_To_Blob         = 102\n    Download_File_From_Blob      = 103\n    Download_Bytes_From_Blob     = 104\n    Download_File_From_Internet  = 105\n    Download_Bytes_From_Internet = 106\n    Download_Url_List            = 107\n    Download_Warc_Indice         = 108\n    Download_Warc_File           = 109\n    Download_Urls_From_Website   = 110\n    Download_Image_From_Jsonl    = 111\n    Download_StarCoder           = 112\n\n    # IO - read/write\n    To_Binary_File               = 201\n    To_Line_File                 = 202\n    To_Jsonl_File                = 203\n    To_Parquet_File              = 204\n    To_Index_File                = 205\n    To_Warc_File                 = 206\n    From_Binary_File             = 207\n    From_Line_File               = 208\n    From_Jsonl_File              = 209\n    From_Parquet_File            = 210\n    From_Index_File              = 211\n    From_Wet_File                = 212\n    From_Warc_File               = 213\n\n    # Extract\n    Extract_Article              = 301\n    Build_Index                  = 302\n    Search_Index                 = 303\n    \n    # Transform\n    Tokenize_Article             = 401\n    Ngrams                       = 402\n    Minhash_Tokens              
 = 403\n    LSH_Minhash                  = 404\n    Warc_Filter                  = 405\n    Warc_Encode                  = 406\n    Warc_To_Wet                  = 407\n    Wet_Decode                   = 408\n    Text_Embedding               = 409\n    Sentence_Embedding           = 410\n    Sentence_Filter              = 411\n    Code_Generation              = 412\n    Url_To_Record                = 413\n    Extract_Link_From_Warc       = 414\n    Wet_To_Imageinfos            = 415\n    Warc_To_Screenshot_MD        = 416\n    MCQ_Filter                   = 417\n    OpenQuestion_Filter          = 418\n    Convert_PDF                  = 419\n    Extract_HTML                 = 420\n    MD_Filter                    = 421\n    Cascaded_Filter              = 422\n    Math_Filter                  = 423\n\n\nLayerType2Func = \\\n{\n    LayerType.Template                     : (template_layer, [DataType.Mem_Any], [DataType.Mem_Any], True),\n\n    # Control\n    LayerType.Data_Sample                  : (data_sample_layer, [DataType.Mem_List], [DataType.Mem_List], True),\n    LayerType.Data_Concat                  : (data_concat_layer, [DataType.Mem_List], [DataType.Mem_List], True),\n    LayerType.Data_Order                   : (data_order_layer, [DataType.Mem_List], [DataType.Mem_List], True),\n    LayerType.Data_Filter                  : (data_filter_layer, [DataType.Mem_List], [DataType.Mem_List], True),\n    LayerType.Data_Partition               : (data_partition_layer, [DataType.Mem_List], [DataType.Mem_List], True),\n    LayerType.Data_Shuffle                 : (data_shuffle_layer, [DataType.Mem_List], [DataType.Mem_List], True),\n\n    # Network - download/upload\n    LayerType.Upload_File_To_Blob          : (upload_file_to_blob_layer, [DataType.Mem_Str, DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),\n    LayerType.Upload_Bytes_To_Blob         : (upload_bytes_to_blob_layer, [DataType.Mem_Binary, DataType.Mem_Str], [DataType.Mem_Str, 
DataType.Mem_Str], True),\n    LayerType.Download_File_From_Blob      : (download_file_from_blob_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),\n    LayerType.Download_Bytes_From_Blob     : (download_bytes_from_blob_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Binary, DataType.Mem_Str], True),\n    LayerType.Download_File_From_Internet  : (download_file_from_internet_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),\n    LayerType.Download_Bytes_From_Internet : (download_bytes_from_internet_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Binary, DataType.Mem_Str], True),\n    LayerType.Download_Url_List            : (download_url_list_layer, [DataType.Mem_Str], [DataType.Mem_StrList, DataType.Mem_StrList], True),\n    LayerType.Download_Warc_File           : (download_warc_file_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),\n    LayerType.Download_Warc_Indice         : (download_warc_indice_layer, [DataType.Mem_Str], [DataType.Mem_StrList, DataType.Mem_StrList], True),\n    LayerType.Download_Urls_From_Website   : (download_urls_from_website_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.Download_StarCoder           : (download_starcoder_layer, [DataType.Mem_Str], [DataType.Mem_Int], True),\n\n    # IO - read/write\n    LayerType.To_Binary_File               : (to_binary_file_layer, [DataType.Mem_Binary, DataType.Mem_Str], [DataType.Mem_Str], True),\n    LayerType.To_Line_File                 : (to_line_file_layer, [DataType.Mem_StrList, DataType.Mem_Str], [DataType.Mem_Str], True),\n    LayerType.To_Jsonl_File                : (to_jsonl_file_layer, [DataType.Mem_DictList, DataType.Mem_Str], [DataType.Mem_Str], True),\n    LayerType.To_Parquet_File              : (to_parquet_file_layer, [DataType.Mem_DictList, DataType.Mem_Str], [DataType.Mem_Str], True),\n    LayerType.To_Index_File                : (to_index_file_layer, [DataType.Mem_Index, 
DataType.Mem_Str], [DataType.Mem_Str], True),\n    LayerType.From_Binary_File             : (from_binary_file_layer, [DataType.Mem_Str], [DataType.Mem_Binary], True),\n    LayerType.From_Line_File               : (from_line_file_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.From_Jsonl_File              : (from_jsonl_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),\n    LayerType.From_Parquet_File            : (from_parquet_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),\n    LayerType.From_Index_File              : (from_index_file_layer, [DataType.Mem_Str], [DataType.Mem_Index], True),\n    LayerType.From_Wet_File                : (from_wet_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),\n    LayerType.From_Warc_File               : (from_warc_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),\n\n    # Extract\n    LayerType.Extract_Article              : (extract_article_layer, [DataType.Mem_Warc], [DataType.Mem_Dict], True),\n    LayerType.Build_Index                  : (build_index_layer, [DataType.Mem_VectorList], [DataType.Mem_Index], True),\n    LayerType.Search_Index                 : (search_index_layer, [DataType.Mem_Index, DataType.Mem_VectorList], [DataType.Mem_VectorList, DataType.Mem_VectorList], True),\n    \n    # Transform\n    LayerType.Tokenize_Article             : (tokenize_article_layer, [DataType.Mem_Dict], [DataType.Mem_StrList], True),\n    LayerType.Ngrams                       : (ngrams_layer, [DataType.Mem_StrList], [DataType.Mem_StrList], True),\n    LayerType.Minhash_Tokens               : (minhash_tokens_layer, [DataType.Mem_StrList], [DataType.Mem_StrList], True),\n    LayerType.LSH_Minhash                  : (lsh_minhash_layer, [DataType.Mem_StrList], [DataType.Mem_StrList], True),\n    LayerType.Warc_Filter                  : (warc_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.Warc_Encode                  : 
(warc_encode_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.Warc_To_Wet                  : (warc_to_wet_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.Wet_Decode                   : (wet_decode_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.Math_Filter                  : (math_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.OpenQuestion_Filter          : (openquestion_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n    LayerType.MCQ_Filter                   : (mcq_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),\n}\n\n\n__all__ = [\n    \"LayerType\", \n    \"LayerType2Func\", \n    \"template_layer\", \n    \"data_sample_layer\", \n    \"data_concat_layer\", \n    \"data_order_layer\", \n    \"data_partition_layer\", \n    \"data_filter_layer\", \n    \"data_shuffle_layer\", \n    \"upload_file_to_blob_layer\", \n    \"upload_bytes_to_blob_layer\", \n    \"download_file_from_blob_layer\", \n    \"download_bytes_from_blob_layer\", \n    \"download_file_from_internet_layer\", \n    \"download_bytes_from_internet_layer\", \n    \"download_url_list_layer\", \n    \"download_warc_file_layer\", \n    \"download_warc_indice_layer\", \n    \"download_urls_from_website_layer\", \n    \"download_starcoder_layer\", \n    \"to_binary_file_layer\", \n    \"to_line_file_layer\", \n    \"to_jsonl_file_layer\", \n    \"to_parquet_file_layer\", \n    \"to_index_file_layer\", \n    \"from_binary_file_layer\", \n    \"from_line_file_layer\", \n    \"from_jsonl_file_layer\", \n    \"from_parquet_file_layer\", \n    \"from_index_file_layer\", \n    \"from_wet_file_layer\", \n    \"from_warc_file_layer\", \n    \"extract_article_layer\", \n    \"build_index_layer\", \n    \"search_index_layer\", \n    \"tokenize_article_layer\", \n    \"ngrams_layer\", \n    \"minhash_tokens_layer\", \n    \"lsh_minhash_layer\", \n    \"warc_filter_layer\", \n  
  \"warc_encode_layer\", \n    \"warc_to_wet_layer\", \n    \"wet_decode_layer\", \n    \"math_filter_layer\", \n    \"openquestion_filter_layer\", \n    \"mcq_filter_layer\", \n]\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/__init__.py",
    "content": "# Control\nfrom .data_sample_layer import data_sample_layer\nfrom .data_filter_layer import data_filter_layer\nfrom .data_order_layer import data_order_layer\nfrom .data_partition_layer import data_partition_layer\nfrom .data_shuffle_layer import data_shuffle_layer\nfrom .data_concat_layer import data_concat_layer\n\n__all__ = [\n    \"data_sample_layer\", \n    \"data_filter_layer\",\n    \"data_order_layer\",\n    \"data_partition_layer\",\n    \"data_shuffle_layer\", \n    \"data_concat_layer\", \n]\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/data_concat_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\n\ndef data_concat_layer(lists, variables=dict()):\n    ret = list()\n    try:\n        for a_list in lists[::-1]:\n            if a_list is not None:\n                ret[0:0] = a_list\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lists = [[\"a\"], [\"b\", \"c\"], None, [\"d\", \"e\", \"f\"]]\n    lines = data_concat_layer(lists)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/data_filter_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\n\ndef data_filter_layer(lines, variables=dict(), IN=False, FILTERS=(None,)):\n    ret = list()\n    try:\n        ret = list(filter(lambda line: line in FILTERS if IN else line not in FILTERS, lines))\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lines = [\"a\", None, \"b\"]\n    FILTERS = (None,)\n    lines = data_filter_layer(lines, FILTERS=FILTERS)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/data_order_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\n\ndef data_order_layer(lines, variables=dict(), REVERSE=False):\n    ret = list()\n    try:\n        ret = sorted(lines, reverse=REVERSE)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lines = [1, 3, 2]\n    lines = data_order_layer(lines)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/data_partition_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\n\ndef data_partition_layer(lines, variables=dict(), WORKER_ID=-1):\n    ret = list()\n    try:\n        worker_id = variables.get(\"worker_id\", 0)\n        worker_num = variables.get(\"worker_num\", 1)\n        n = len(lines)\n        if WORKER_ID == -1:\n            ret = [lines[i] for i in range(worker_id, n, worker_num)]\n        else:\n            ret = lines if WORKER_ID == worker_id else list()\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lines = [0, 1, 2, 3, 4, 5, 6, 7, 8]\n    variables = {\"worker_id\": 0, \"worker_num\": 2}\n    lines = data_partition_layer(lines, variables=variables)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/data_sample_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport random\nimport traceback\n\ndef data_sample_layer(lines, variables=dict(), N=-1, SEED=1):\n    ret = list()\n    try:\n        random.seed(SEED)\n        N = min(N, len(lines))\n        if N >= 0:\n            ret = random.sample(lines, N)\n        else:\n            ret = lines\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lines = [\"a\", \"b\"]\n    N = 1\n    lines = data_sample_layer(lines, N=N)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/control/data_shuffle_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport random\nimport traceback\n\ndef data_shuffle_layer(lines, variables=dict(), SEED=1):\n    ret = list()\n    try:\n        random.seed(SEED)\n        random.shuffle(lines)\n        ret = lines\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lines = [\"a\", \"b\"]\n    lines = data_shuffle_layer(lines)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/extract/__init__.py",
    "content": "# Extract\nfrom .extract_article_layer import extract_article_layer\nfrom .build_index_layer import build_index_layer\nfrom .search_index_layer import search_index_layer\n\n__all__ = [\n    \"extract_article_layer\", \n    \"build_index_layer\", \n    \"search_index_layer\", \n]\n"
  },
  {
    "path": "DomainSpecific/core/layers/extract/build_index_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport sys\nimport faiss\nimport numpy as np\nimport traceback\n\ndef build_index_layer(base_vectors, variables=dict(), SEED=1, DIM=4096, CLUSTERS=100):\n    ret = None\n    try:\n        np.random.seed(SEED)\n\n        quantizer = faiss.IndexFlatL2(DIM)\n        index = faiss.IndexIVFFlat(quantizer, DIM, CLUSTERS, faiss.METRIC_L2)\n\n        assert not index.is_trained\n        base_vectors = np.array(base_vectors)\n        index.train(base_vectors)\n        assert index.is_trained\n\n        index.add(base_vectors)\n        ret = index\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == '__main__':\n    D = 64\n    base_vectors = np.random.random((100000, D)).astype('float32')\n    base_vectors[:, 0] += np.arange(100000) / 1000.\n    index = build_index_layer(base_vectors, D=D)\n    print(index)\n"
  },
  {
    "path": "DomainSpecific/core/layers/extract/extract_article_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport re\nimport fasttext\nimport traceback\nfrom unittest.mock import patch\nfrom bs4 import BeautifulSoup\nfrom markdownify import MarkdownConverter, chomp\nfrom newspaper import Article\nimport global_var\n\ndef filter_tags_in_html(soup):\n    def del_tags(soup):\n        del_tags = ['style', 'script', 'img']\n        for tag in del_tags:\n            tags = soup.find_all(tag)\n            for tag in tags:\n                tag.decompose()\n\n        tags = soup.find_all('table')\n        for tag in tags:\n            if len(tag.text.strip()) == 0:\n                for tag in tags:\n                    tag.decompose()\n\n    def modify_text(soup):\n        modify_tags = ['a']\n        for i in range(len(modify_tags)):\n            for tag in soup.find_all(modify_tags[i]):\n                tag_text = tag.text\n                new_tag_text = tag_text.replace('\\n', '')\n                if len(new_tag_text) != len(tag_text):\n                    tag.string = new_tag_text\n    del_tags(soup)\n    modify_text(soup)\n\n    return soup\n\ndef lid(soup, model):\n    LID_WIN_SIZE=256\n    text = ''.join(soup.text.split())\n    span_start, span_end = 0, len(text)\n    if len(text) > LID_WIN_SIZE:\n        mid = len(text) // 2\n        mid_win = LID_WIN_SIZE // 2\n        span_start = max(0, int(mid - mid_win))\n        span_end = min(len(text), int(mid + mid_win))\n\n    det_text = text[span_start: span_end]\n    res = model.predict(det_text)\n    la = res[0][0].replace(\"__label__\", \"\")\n    prob = float(res[1][0])\n    return la, prob\n\ndef get_main_text_html(soup):\n    article = Article(\"padding_url\", fetch_images=False, keep_article_html=True)\n    article.download(input_html=str(soup))\n    article.parse()\n    # assert len(article.text.strip()) >= 128\n    main_html = 
article.article_html\n    main_text = article.text\n    return main_html, main_text\n\ndef remove_dup_newline(text):\n    fields = text.split('\\n')\n    for i in range(len(fields)):\n        fields[i] = fields[i].strip()\n    return re.sub('\\n{2,}', '\\n\\n', '\\n'.join(fields)).strip()\n\nclass User_MarkdownConverter(MarkdownConverter):\n    def convert_tr(self, el, text, convert_as_inline):\n        cells = el.find_all(['td', 'th'])\n        is_headrow = all([cell.name == 'th' for cell in cells])\n        overline = ''\n        underline = ''\n        if is_headrow and not el.previous_sibling:\n            # first row and is headline: print headline underline\n            underline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\\n'\n        elif (not el.previous_sibling\n            and (el.parent.name == 'table'\n                or (el.parent.name == 'tbody'\n                    and not el.parent.previous_sibling))):\n            # first row, not headline, and:\n            # - the parent is table or\n            # - the parent is tbody at the beginning of a table.\n            # print empty headline above this row\n            overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\\n'\n            overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\\n'\n        if len(text.replace('|', ' ').strip()) == 0:\n            return overline + underline\n        else:\n            return overline + '|' + text.replace('\\n', ' ') + '\\n' + underline\n\n    def convert_a(self, el, text, convert_as_inline):\n        prefix, suffix, text = chomp(text)\n        if not text:\n            return ''\n        href = el.get('href')\n        title = el.get('title')\n        # For the replacement see #29: text nodes underscores are escaped\n        if (self.options['autolinks']\n                and text.replace(r'\\_', '_') == href\n                and not title\n                and not self.options['default_title']):\n            # Shortcut syntax\n       
     return '<%s>' % href\n        if self.options['default_title'] and not title:\n            title = href\n        title_part = ' \"%s\"' % title.replace('\"', r'\\\"') if title else ''\n        # return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text\n        return '%s %s %s' % (prefix, text.replace('\\n', ' '), suffix) if href else text\n\n    def convert_pre(self, el, text, convert_as_inline):\n        if not text:\n            return ''\n        code_language = self.options['code_language']\n\n        if self.options['code_language_callback']:\n            code_language = self.options['code_language_callback'](el) or code_language\n\n        return '\\n```%s\\n%s\\n```\\n' % (code_language, text)\n\ndef html2text(soup, **options):\n    def clean_markdown(md):\n        fields = md.split('\\n')\n        for i in range(len(fields)):\n            fields[i] = fields[i].strip()\n\n        new_fields = []\n        for i in range(len(fields)):\n            field_set = list(set(fields[i]))\n            if len(field_set) == 1 and field_set[0] in ['#', '*', '+', '-']:\n                continue\n            new_fields.append(fields[i])\n\n        fields = new_fields\n        md = '\\n'.join(fields)\n\n        return re.sub('\\n{2,}', '\\n\\n', md).strip()\n\n    return clean_markdown(User_MarkdownConverter(**options).convert_soup(soup))\n\ndef trans2md(html):\n    soup = BeautifulSoup(html, 'html5lib')\n    markdown_text = html2text(soup)\n    # assert len(markdown_text) > 50 and len(markdown_text.split('\\n')) != 1\n    if markdown_text.startswith('.') and markdown_text.endswith('.'):\n        markdown_text = markdown_text[1:-1]\n    main_text = remove_dup_newline(soup.text)\n    return markdown_text, main_text\n\n@classmethod\ndef _patch_newspaper_parser_clean(cls, node):\n    return node\n\n@patch('newspaper.parsers.Parser.clean_article_html', new=_patch_newspaper_parser_clean)\ndef extract(soup):\n    main_html, main_text = 
get_main_text_html(soup)\n    markdown_text, _new_main_text = trans2md(main_html)\n    return markdown_text, main_text\n\ndef extract_article_layer(id_html, variables=dict()):\n    ret = None\n    try:\n        LA_TIER1 = [\"en\", \"es\", \"ja\", \"fr\", \"de\", \"pt\", \"it\", \"zh\"]\n        LA_TIER2 = [\"nl\", \"sv\", \"da\", \"fi\", \"ru\", \"no\", \"ko\", \"zh\", \"pl\", \"tr\", \"ar\", \"he\", \"pt\", \"cs\", \"hu\", \"th\", \"hi\"]\n        LA_TIER = LA_TIER1 + LA_TIER2\n        article_id, html = id_html\n        \n        soup = BeautifulSoup(html, 'html5lib')\n        soup = filter_tags_in_html(soup)\n        la, la_prob = lid(soup, global_var.lid_model)\n        if la in LA_TIER:\n            main_md, main_text = extract(soup)\n            if len(main_text) >= 128:\n                ret = {\"id\": article_id, \"text\": main_text, \"lang\": la, \"lang_prob\": la_prob}\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == '__main__':\n    id_html = (None, None)\n    id_text_la = extract_article_layer(id_html)\n    print(id_text_la)\n"
  },
  {
    "path": "DomainSpecific/core/layers/extract/search_index_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nimport faiss\nimport numpy as np\nimport traceback\n\ndef search_index_layer(index, query_vectors, variables=dict(), TOPK=1):\n    ret = (None, None)\n    try:\n        query_vectors = np.array(query_vectors)\n        D, I = index.search(query_vectors, TOPK)\n        ret = (I, D)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    DIM = 4096\n    CLUSTERS = 2\n    base_vectors = np.random.random((100000, DIM)).astype('float32')\n    base_vectors[:, 0] += np.arange(100000) / 1000.\n    \n    quantizer = faiss.IndexFlatL2(DIM)\n    index = faiss.IndexIVFFlat(quantizer, DIM, CLUSTERS, faiss.METRIC_L2)\n\n    assert not index.is_trained\n    index.train(base_vectors)\n    assert index.is_trained\n    index.add(base_vectors)\n\n    query_vectors = np.random.random((10000, DIM)).astype('float32')\n    query_vectors[:, 0] += np.arange(10000) / 1000.\n\n    I, D = search_index_layer(index, query_vectors, D=D)\n    print(D[:1])\n"
  },
  {
    "path": "DomainSpecific/core/layers/global_var.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nimport traceback\n#import torch\nimport fasttext\nfrom transformers import AutoTokenizer, RobertaForSequenceClassification\nfrom dependency.gpt_api import GPTAPI\n\ntry:\n    # silences warnings as the package does not properly use the python 'warnings' package\n    # see https://github.com/facebookresearch/fastText/issues/1056\n    fasttext.FastText.eprint = lambda *args,**kwargs: None\nexcept:\n    pass\n\n\"\"\"\nclass OpenQuestionModel:\n    def __init__(self, pretrained_model_path, token_model_path=\"cardiffnlp/twitter-roberta-base-emotion\", local_files_only=False):\n        # load tokenizer model.\n        self.tokenizer = AutoTokenizer.from_pretrained(token_model_path)\n\n        # load trained model.\n        self.model = RobertaForSequenceClassification.from_pretrained(pretrained_model_path, local_files_only=local_files_only)\n\n    def run(self, text, thred=0.5, max_length=512):\n        # tokenization.\n        inputs = self.tokenizer(text, return_tensors=\"pt\", padding=\"max_length\", truncation=True, max_length=max_length)\n\n        # inference.\n        with torch.no_grad():\n            logits = self.model(**inputs).logits\n        logits = logits.softmax(dim=1)[0]\n        predicted_idx = logits.argmax().item()\n        predicted_label = self.model.config.id2label[predicted_idx]\n        predicted_conf = logits[predicted_idx].item()\n        if predicted_label == \"LABEL_0\" and predicted_conf < thred:\n            predicted_idx = 1\n            predicted_label = \"LABEL_1\"\n        #return predicted_idx, predicted_label, predicted_conf\n        return predicted_label\n\"\"\"\n\n# language detection by fasttext.\nLID_MODEL_PATH = \"./dependency/models/lid.176.bin\"\nif os.path.exists(LID_MODEL_PATH):\n    lid_model = fasttext.load_model(LID_MODEL_PATH)\nelse:\n    lid_model = None\n\n# math detection by fasttext.\nMATH_FT_MODEL_PATH = 
\"./dependency/models/math.bin\"\nif os.path.exists(MATH_FT_MODEL_PATH):\n    ft_math_model = fasttext.load_model(MATH_FT_MODEL_PATH)\nelse:\n    ft_math_model = None\n\n# openquestion detection by fasttext.\nOPENQUESTION_MODEL_PATH = \"./dependency/models/openquestion.bin\"\nif os.path.exists(OPENQUESTION_MODEL_PATH):\n    ft_openquestion_model = fasttext.load_model(OPENQUESTION_MODEL_PATH)\nelse:\n    ft_openquestion_model = None\n\n# multiple-choice question detection by fasttext.\nMCQ_MODEL_PATH = \"./dependency/models/mcq.bin\"\nif os.path.exists(MCQ_MODEL_PATH):\n    ft_mcq_model = fasttext.load_model(MCQ_MODEL_PATH)\nelse:\n    ft_mcq_model = None\n\n\"\"\"\n# multiple-choice question detection by pytorch.\nMCQ_PT_MODEL_PATH = \"./dependency/models/mcq.pytorch\"\nif os.path.exists(MCQ_PT_MODEL_PATH):\n    py_mcq_model = OpenQuestionModel(MCQ_PT_MODEL_PATH, local_files_only=True)\nelse:\n    py_mcq_model = None\n\"\"\"\n\n# gpt agent.\ngpt_api = GPTAPI()\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/__init__.py",
    "content": "# IO - read/write\nfrom .to_binary_file_layer import to_binary_file_layer\nfrom .to_line_file_layer import to_line_file_layer\nfrom .to_jsonl_file_layer import to_jsonl_file_layer\nfrom .to_parquet_file_layer import to_parquet_file_layer\nfrom .to_index_file_layer import to_index_file_layer\nfrom .from_binary_file_layer import from_binary_file_layer\nfrom .from_line_file_layer import from_line_file_layer\nfrom .from_jsonl_file_layer import from_jsonl_file_layer\nfrom .from_parquet_file_layer import from_parquet_file_layer\nfrom .from_index_file_layer import from_index_file_layer\nfrom .from_wet_file_layer import from_wet_file_layer\nfrom .from_warc_file_layer import from_warc_file_layer\n\n__all__ = [\n    \"to_binary_file_layer\", \n    \"to_line_file_layer\", \n    \"to_jsonl_file_layer\", \n    \"to_parquet_file_layer\", \n    \"to_index_file_layer\",\n    \"from_binary_file_layer\", \n    \"from_line_file_layer\", \n    \"from_jsonl_file_layer\", \n    \"from_parquet_file_layer\",\n    \"from_index_file_layer\",\n    \"from_wet_file_layer\", \n    \"from_warc_file_layer\",\n]\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_binary_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport util\n\ndef from_binary_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        with open(file_path, \"rb\") as f:\n            data = f.read()\n        ret = data\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    file_path = \"test.binary\"\n    data = from_binary_file_layer(file_path)\n    print(data)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_index_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport faiss\nimport traceback\nimport util\n\ndef from_index_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        index = faiss.read_index(file_path)\n        ret = index\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == '__main__':\n    file_path = \"index.faiss\"\n    index = from_index_file_layer(file_path)\n    print(index)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_jsonl_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport jsonlines\nimport util\n\ndef from_jsonl_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = list()\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        with jsonlines.open(file_path) as reader:\n            for line in reader:\n                ret.append(line)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    file_path = \"test.jsonl\"\n    data = from_jsonl_file_layer(file_path)\n    print(data)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_line_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport util\n\ndef from_line_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = list()\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        for line in open(file_path, \"r\"):\n            line = line.strip()\n            ret.append(line)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    file_path = \"test.line\"\n    lines = from_line_file_layer(file_path)\n    print(lines)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_parquet_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport util\n\ndef from_parquet_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        table = pq.read_table(file_path)\n        ret = table.to_pylist()\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    file_path = \"test.parquet\"\n    data = from_parquet_file_layer(file_path)\n    print(data)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_warc_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nfrom warcio.archiveiterator import ArchiveIterator\nimport util\n\ndef from_warc_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        if os.path.exists(file_path):\n            items = list()\n            with open(file_path, \"rb\") as input:\n                records = ArchiveIterator(input, arc2warc=True)\n                for idx, record in enumerate(records):\n                    if record.rec_type == \"response\" and record.http_headers.get_header(\"Content-Type\", \"\").startswith(\"text/html\"):\n                        item = dict()\n                        item[\"uri\"] = record.rec_headers.get(\"WARC-Target-URI\")\n                        item[\"lang\"] = record.rec_headers.get(\"Detected-Language\")\n                        item[\"content_length\"] = record.rec_headers[\"Content-Length\"]\n                        item[\"html\"] = record.content_stream().read()\n                        items.append(item)\n            ret = items\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    file_path = \"test.warc.gz\"\n    data = from_warc_file_layer(file_path)\n    print(data)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/from_wet_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nfrom warcio.archiveiterator import ArchiveIterator\nimport util\n\ndef from_wet_file_layer(file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        if STORAGE_PATH is not None:\n            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)\n\n        if os.path.exists(file_path):\n            items = list()\n            with open(file_path, \"rb\") as input:\n                records = ArchiveIterator(input, arc2warc=False)\n                for idx, record in enumerate(records):\n                    if record.rec_type == \"conversion\":\n                        item = dict()\n                        item[\"uri\"] = record.rec_headers.get(\"WARC-Target-URI\")\n                        item[\"lang\"] = record.rec_headers.get(\"Detected-Language\")\n                        item[\"content_length\"] = record.rec_headers[\"Content-Length\"]\n                        item[\"text\"] = record.content_stream().read()\n                        items.append(item)\n            ret = items\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    file_path = \"test.warc.wet.gz\"\n    data = from_wet_file_layer(file_path)\n    print(data)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/to_binary_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport util\n\ndef to_binary_file_layer(bytes, file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        util.create_folder_by_file_path(file_path)\n\n        with open(file_path, \"wb\") as f:\n            f.write(bytes)\n\n        if STORAGE_PATH is not None:\n            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)\n\n        ret = file_path\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    bytes = b\"hello\"\n    file_path = \"test.binary\"\n    file_path = to_binary_file_layer(bytes, file_path)\n    print(file_path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/to_index_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport faiss\nimport traceback\nimport util\n\ndef to_index_file_layer(index, file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        util.create_folder_by_file_path(file_path)\n\n        faiss.write_index(index, file_path)\n\n        if STORAGE_PATH is not None:\n            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)\n\n        ret = file_path\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == '__main__':\n    D = 64\n    NLIST = 100\n    base_vectors = np.random.random((100000, D)).astype('float32')\n    base_vectors[:, 0] += np.arange(100000) / 1000.\n    \n    quantizer = faiss.IndexFlatL2(D)\n    index = faiss.IndexIVFFlat(quantizer, D, NLIST, faiss.METRIC_L2)\n\n    assert not index.is_trained\n    index.train(base_vectors)\n    assert index.is_trained\n    index.add(base_vectors)\n\n    file_path = \"index.faiss\"\n    file_path = to_index_file_layer(index, file_path)\n    print(file_path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/to_jsonl_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport jsonlines\nimport util\n\ndef to_jsonl_file_layer(data, file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        util.create_folder_by_file_path(file_path)\n\n        with jsonlines.open(file_path, \"w\") as writer:\n            writer.write_all(data)\n\n        if STORAGE_PATH is not None:\n            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)\n\n        ret = file_path\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    data = [{'id': \"1\", 'html': \"hello\"}, {'id': \"2\", 'html': \"hi\"}]\n    file_path = \"test.jsonl\"\n    file_path = to_jsonl_file_layer(data, file_path)\n    print(file_path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/to_line_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport util\n\ndef to_line_file_layer(lines, file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        util.create_folder_by_file_path(file_path)\n\n        with open(file_path, \"w\") as f:\n            for line in lines:\n                f.write(line + \"\\n\")\n\n        if STORAGE_PATH is not None:\n            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)\n\n        ret = file_path\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    lines = [\"line1\", \"line2\"]\n    file_path = \"test.line\"\n    file_path = to_line_file_layer(lines, file_path)\n    print(file_path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/io/to_parquet_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport util\n\ndef to_parquet_file_layer(data, file_path, variables=dict(), STORAGE_PATH=None):\n    ret = None\n    try:\n        file_path = util.to_real_path(file_path, variables)\n        util.create_folder_by_file_path(file_path)\n\n        table = pa.Table.from_pylist(data)\n        pq.write_table(table, file_path)\n\n        if STORAGE_PATH is not None:\n            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)\n\n        ret = file_path\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    data = [{'id': \"1\", 'html': \"hello\"}, {'id': \"2\", 'html': \"hi\"}]\n    file_path = \"test.parquet\"\n    file_path = to_parquet_file_layer(data, file_path)\n    print(file_path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/__init__.py",
    "content": "# Network - download/upload\nfrom .upload_file_to_blob_layer import upload_file_to_blob_layer\nfrom .upload_bytes_to_blob_layer import upload_bytes_to_blob_layer\nfrom .download_file_from_blob_layer import download_file_from_blob_layer\nfrom .download_bytes_from_blob_layer import download_bytes_from_blob_layer\nfrom .download_file_from_internet_layer import download_file_from_internet_layer\nfrom .download_bytes_from_internet_layer import download_bytes_from_internet_layer\nfrom .download_url_list_layer import download_url_list_layer\nfrom .download_warc_file_layer import download_warc_file_layer\nfrom .download_warc_indice_layer import download_warc_indice_layer\nfrom .download_urls_from_website_layer import download_urls_from_website_layer\nfrom .download_starcoder_layer import download_starcoder_layer\n\n__all__ = [\n    \"upload_file_to_blob_layer\",\n    \"upload_bytes_to_blob_layer\",\n    \"download_file_from_blob_layer\", \n    \"download_bytes_from_blob_layer\", \n    \"download_file_from_internet_layer\", \n    \"download_bytes_from_internet_layer\", \n    \"download_url_list_layer\", \n    \"download_warc_file_layer\", \n    \"download_warc_indice_layer\", \n    \"download_urls_from_website_layer\", \n    \"download_starcoder_layer\", \n]\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_bytes_from_blob_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef download_bytes_from_blob_layer(blob_path, variables=dict(), STORAGE_PATH=None, TRIES=1):\n    ret = (None, None, blob_path)\n    try:\n        for _ in range(TRIES):\n            try:\n                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)\n                storage_config = util.load_yaml(STORAGE_PATH)\n                blob_path = util.to_real_path(blob_path, variables)\n                file_name = util.md5(blob_path) + util.suffix(blob_path)\n                bytes = util.download_bytes_from_blob(storage_config, blob_path)\n                ret = (file_name, bytes, None)\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    blob_path = \"$(azure_blob_path)\"\n    STORAGE_PATH = \"resources/environment/llmstore.yaml\"\n    bytes = download_bytes_from_blob_layer(blob_path, STORAGE_PATH=STORAGE_PATH)\n    print(bytes)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_bytes_from_internet_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef download_bytes_from_internet_layer(url, variables=dict(), TRIES=1):\n    ret = (None, None, url)\n    try:\n        for _ in range(TRIES):\n            try:\n                url = util.to_real_path(url, variables)\n                file_name = util.md5(url) + util.suffix(url)\n                bytes = util.download_bytes_from_internet(url)\n                ret = (file_name, bytes, None)\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    url = \"https://upload.wikimedia.org/wikipedia/commons/4/4f/SVG_Logo.svg\"\n    bytes = download_bytes_from_internet_layer(url)\n    print(bytes)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_file_from_blob_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef download_file_from_blob_layer(blob_path, variables=dict(), DOWNLOAD_PATH=\".\", STORAGE_PATH=None, TRIES=1):\n    ret = (None, blob_path)\n    try:\n        for _ in range(TRIES):\n            try:\n                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)\n                storage_config = util.load_yaml(STORAGE_PATH)\n                blob_path = util.to_real_path(blob_path, variables)\n                file_name = util.md5(blob_path) + util.suffix(blob_path)\n                file_path = os.path.join(DOWNLOAD_PATH, file_name)\n                file_path = util.to_real_path(file_path, variables)\n                util.download_file_from_blob(storage_config, blob_path, file_path)\n                ret = (file_path, None)\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    blob_path = \"$(azure_blob_path)\"\n    DOWNLOAD_PATH = \"$(local_folder_path)\"\n    STORAGE_PATH = \"resources/environment/llmstore.yaml\"\n    path = download_file_from_blob_layer(blob_path, DOWNLOAD_PATH=DOWNLOAD_PATH, STORAGE_PATH=STORAGE_PATH)\n    print(path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_file_from_internet_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef download_file_from_internet_layer(url, variables=dict(), DOWNLOAD_PATH=\".\", TRIES=1):\n    ret = (None, url)\n    try:\n        for _ in range(TRIES):\n            try:\n                url = util.to_real_path(url, variables)\n                file_name = util.md5(url) + util.suffix(url)\n                file_path = os.path.join(DOWNLOAD_PATH, file_name)\n                file_path = util.to_real_path(file_path, variables)\n                util.download_file_from_internet(url, file_path)\n                #bytes = util.download_bytes_from_internet(url)\n                #util.upload_bytes_to_blob(variables[\"storage_config\"], bytes, file_path)\n                ret = (file_path, None)\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    url = \"https://upload.wikimedia.org/wikipedia/commons/4/4f/SVG_Logo.svg\"\n    DOWNLOAD_PATH = \"$(local_folder_path)\"\n    path = download_file_from_internet_layer(url, DOWNLOAD_PATH=DOWNLOAD_PATH)\n    print(path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_starcoder_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport json\nfrom datetime import datetime\nimport boto3\nfrom botocore import UNSIGNED\nfrom botocore.config import Config\nimport smart_open\nfrom datasets import load_dataset\nimport util\n\ns3 = boto3.client(\"s3\", config=Config(signature_version=UNSIGNED))\n\ndef download_contents(blob_id, src_encoding):\n    s3_url = f\"s3://softwareheritage/content/{blob_id}\"\n    with smart_open.open(s3_url, \"rb\", compression=\".gz\", transport_params={\"client\": s3}) as fin:\n        content = fin.read().decode(src_encoding)\n    return content\n\ndef download_starcoder_layer(data_repo, variables=dict(), OUTPUT_FOLDER=\"./\", STORAGE_PATH=None, HUGGINGFACE_TOKEN=None):\n    ret = 0\n    try:\n        worker_id = variables[\"worker_id\"]\n        worker_num = variables[\"worker_num\"]\n        data_repo = util.to_real_path(data_repo, variables)\n        output_folder = util.to_real_path(OUTPUT_FOLDER, variables)\n        if STORAGE_PATH is not None:\n            storage_config = util.load_yaml(STORAGE_PATH)\n\n        ds = load_dataset(data_repo, split=\"train\", streaming=True, token=HUGGINGFACE_TOKEN, cache_dir=f\"./cache.{worker_id}/\")\n        ds = ds.filter(lambda row, idx: idx % worker_num == worker_id, with_indices=True)\n\n        item_count = 0\n        for i, row in enumerate(ds):\n            for key in row.keys():\n                if isinstance(row[key], datetime):\n                    row[key] = datetime.timestamp(row[key])\n\n            blob_id = row[\"blob_id\"]\n            src_encoding = row[\"src_encoding\"]\n\n            snapshot_prefix = row[\"snapshot_id\"][:4]\n            repo_name = row[\"repo_name\"].replace(\"/\", \"@\")\n            branch_name = row[\"branch_name\"].replace(\"/\", \"@\")\n            language = row[\"language\"].replace(\" \", 
\"_\")\n            path = row[\"path\"].lstrip(\"/\")\n            filename = row[\"filename\"].strip()\n            filename = path\n            extension = row[\"extension\"].strip()\n\n            content = download_contents(blob_id, src_encoding)\n\n            code_path = os.path.join(output_folder, snapshot_prefix, repo_name, branch_name, blob_id)\n            metadata_path = os.path.join(output_folder, snapshot_prefix, repo_name, branch_name, blob_id + \".json\")\n\n            try:\n                util.create_folder_by_file_path(code_path)\n                with open(code_path, \"w\") as f:\n                    f.write(content)\n                if STORAGE_PATH is not None:\n                    util.upload_file_to_blob(storage_config, code_path, code_path)\n\n                util.create_folder_by_file_path(metadata_path)\n                with open(metadata_path, \"w\") as f:\n                    f.write(json.dumps(row, indent=4) + \"\\n\")\n                if STORAGE_PATH is not None:\n                    util.upload_file_to_blob(storage_config, metadata_path, metadata_path)\n\n                if STORAGE_PATH is not None:\n                    try:\n                        os.remove(code_path)\n                        os.remove(metadata_path)\n                    except OSError:\n                        pass\n            except:\n                traceback.print_exc()\n            \n            item_count += 1\n\n        ret = item_count\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == '__main__':\n    data_repo = \"$(local_the_stack_v2_dedup_metadata_path)\"\n    variables = {\"workspace_dir\": r\"workspace\", \"worker_id\": 0, \"worker_num\": 1}\n    OUTPUT_FOLDER = \"$(local_the_stack_v2_dedup_data_path)\"\n    STORAGE_PATH = \"resources/storage/llmstore.yaml\"\n    HUGGINGFACE_TOKEN = None\n    item_count = download_starcoder_layer(data_repo, 
variables=variables, OUTPUT_FOLDER=OUTPUT_FOLDER, STORAGE_PATH=STORAGE_PATH, HUGGINGFACE_TOKEN=HUGGINGFACE_TOKEN)\n    print(item_count)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_url_list_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport gzip\nimport json\nimport requests\nimport traceback\n\ndef download_url_list_layer(index_url, variables=dict(), FILTER_SUFFIXES=(), TRIES=1):\n    ret = list()\n    try:\n        for _ in range(TRIES):\n            try:\n                resp = requests.get(index_url, stream=True)\n                urls = list()\n                with gzip.open(resp.raw, 'rt') as f:\n                    for line in f.readlines():\n                        text = \"{\" + line.strip().split(\" {\")[1]\n                        item = json.loads(text)\n                        url = item[\"url\"]\n                        suffix = os.path.splitext(url)[1]\n                        if suffix in FILTER_SUFFIXES:\n                            urls.append(url)\n                ret[0:0] = urls\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, [index_url] if len(ret) == 0 else [])\n\n\nif __name__ == '__main__':\n    index_url = \"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-23/indexes/cdx-00000.gz\"\n    FILTER_SUFFIXES = (\".svg\",)\n    urls = download_url_list_layer(index_url, FILTER_SUFFIXES=FILTER_SUFFIXES)\n    print(urls)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_urls_from_website_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport requests\nimport logging\nimport traceback\nimport xml.etree.ElementTree as ET\n\ndef download_urls_from_website_layer(website_url, variables=dict(), FILTER=None):\n    ret = list()\n    try:\n        robot_url = website_url + \"/robots.txt\"\n        logging.disable(logging.WARNING)\n\n        # get sitemap.\n        xml_urls = list()\n        whilte_url_prefixs = list()\n        black_url_prefixs = list()\n        resp = requests.get(robot_url)\n        crawler = None\n        for line in resp.text.split(\"\\n\"):\n            line = line.strip()\n            if len(line) == 0:\n                continue\n            if line.startswith(\"#\"):\n                continue\n\n            if line.startswith(\"User-agent:\"):\n                crawler = line.split(\":\")[-1].strip()\n                continue\n\n            if crawler != \"*\":\n                continue\n            if crawler == \"*\" and line.startswith(\"Disallow:\"):\n                url_prefix = line.replace(\"Disallow:\", \"\").strip()\n                black_url_prefixs.append(url_prefix)\n                continue\n            if crawler == \"*\" and line.startswith(\"Allow:\"):\n                url_prefix = line.replace(\"Allow:\", \"\").strip()\n                whilte_url_prefixs.append(url_prefix)\n                continue\n            if crawler == \"*\" and line.startswith(\"Sitemap:\"):\n                xml_url = line.replace(\"Sitemap:\", \"\").strip()\n                if (FILTER is None or FILTER in xml_url) and xml_url.endswith(\".xml\"):\n                    xml_urls.append(xml_url)\n                continue\n\n        # get urls.\n        html_urls = set()\n        for xml_url in xml_urls:\n            try:\n                resp = requests.get(xml_url)\n                root = 
ET.fromstring(resp.content)\n                for sitemap in root:\n                    html_url = list(sitemap)[0].text\n                    html_urls.add(html_url)\n                #nodes = tree.xpath('//a/@href')\n                #nodes = tree.xpath(\"//loc\")\n            except:\n                pass\n\n        ret = list(html_urls)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == '__main__':\n    website_url = \"https://byjus.com/\"\n    FILTER = \"math\"\n    urls = download_urls_from_website_layer(website_url, FILTER=FILTER)\n    print(urls[0][0])\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_warc_file_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef download_warc_file_layer(warc_url, variables=dict(), DOWNLOAD_FOLDER=\"./\", CONNECTS=16, TRIES=1, OVERWRITE=False):\n    ret = (None, warc_url)\n    try:\n        if not warc_url.startswith(\"https://\"):\n            warc_url = \"https://data.commoncrawl.org/\" + warc_url\n        #warc_url = warc_url.replace(\"https://data.commoncrawl.org/\", \"https://ds5q9oxwqwsfj.cloudfront.net/\")# debug\n        warc_name = warc_url.split(\"/\")[-3] + \"_\" + os.path.basename(warc_url)\n        warc_path = os.path.join(DOWNLOAD_FOLDER, warc_name)\n        warc_path = util.to_real_path(warc_path, variables)\n\n        for _ in range(TRIES):\n            if OVERWRITE or not os.path.exists(warc_path):\n                util.create_folder_by_file_path(warc_path)\n                commandline = f\"axel -q -n {CONNECTS} -o {warc_path} {warc_url}\"\n                exit_status = os.system(commandline)\n            else:\n                exit_status = 0\n\n            if exit_status == 0:\n                break\n            time.sleep(1)\n\n        if exit_status == 0:\n            ret = (warc_name, None)\n        else:\n            ret = (None, warc_url)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    warc_url = \"https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00000.warc.gz\"\n    DOWNLOAD_FOLDER = \"$(local_folder_path)\"\n    (success_warc_url, failed_warc_url) = download_warc_file_layer(warc_url, DOWNLOAD_FOLDER=DOWNLOAD_FOLDER)\n    print(success_warc_url, failed_warc_url)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/download_warc_indice_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport gzip\nimport requests\nimport traceback\n\ndef download_warc_indice_layer(index_url, variables=dict(), TRIES=1, URL_PREFIX=\"https://data.commoncrawl.org/\"):\n    ret = list()\n    try:\n        for _ in range(TRIES):\n            try:\n                resp = requests.get(index_url, stream=True)\n                urls = list()\n                with gzip.open(resp.raw, 'rt') as f:\n                    for line in f.readlines():\n                        warc_url = URL_PREFIX + line.strip()\n                        urls.append(warc_url)\n                ret = urls\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, [index_url] if len(ret) == 0 else [])\n\n\nif __name__ == '__main__':\n    index_url = \"https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-49/warc.paths.gz\"\n    warc_urls = download_warc_indice_layer(index_url)\n    print(warc_urls[0][0])\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/upload_bytes_to_blob_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef upload_bytes_to_blob_layer(bytes, blob_path, variables=dict(), STORAGE_PATH=None, BLOB_PREFIX=\"\", TRIES=1):\n    ret = (None, blob_path)\n    try:\n        for _ in range(TRIES):\n            try:\n                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)\n                storage_config = util.load_yaml(STORAGE_PATH)\n                blob_path = util.to_real_path(os.path.join(BLOB_PREFIX, blob_path), variables)\n                util.upload_bytes_to_blob(storage_config, bytes, blob_path)\n                ret = (blob_path, None)\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    bytes = b\"hello\"\n    blob_path = \"$(azure_blob_path)\"\n    STORAGE_PATH = \"resources/environment/llmstore.yaml\"\n    path = upload_bytes_to_blob_layer(bytes, blob_path, STORAGE_PATH=STORAGE_PATH)\n    print(path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/network/upload_file_to_blob_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport util\n\ndef upload_file_to_blob_layer(file_path, blob_path, variables=dict(), STORAGE_PATH=None, BLOB_PREFIX=\"\", TRIES=1):\n    ret = (None, blob_path)\n    try:\n        for _ in range(TRIES):\n            try:\n                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)\n                storage_config = util.load_yaml(STORAGE_PATH)\n                file_path = util.to_real_path(file_path, variables)\n                blob_path = util.to_real_path(os.path.join(BLOB_PREFIX, blob_path), variables)\n                util.upload_file_to_blob(storage_config, file_path, blob_path)\n                ret = (blob_path, None)\n                break\n            except:\n                time.sleep(1)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == '__main__':\n    blob_path = \"$(azure_blob_path)\"\n    file_path = \"$(local_file_path)\"\n    STORAGE_PATH = \"resources/environment/llmstore.yaml\"\n    path = upload_file_to_blob_layer(file_path, blob_path, STORAGE_PATH=STORAGE_PATH)\n    print(path)\n"
  },
  {
    "path": "DomainSpecific/core/layers/template_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport sys\nimport traceback\n\n# Spec for adding a new layer:\n# 1. The layer function should be registered in the __init__.py file of the current folder.\n# 2. The layer function should return a tuple, even when the return value is empty.\n# 3. The layer function should take a \"variables\" dictionary parameter for access to globally shared variables.\n# 4. Prefer to implement a unit test and put it in the \"__main__\" block.\n# 5. Prefer to add exception handling around the function logic.\n# 6. Prefer to end the function name with \"_layer\".\n# 7. Prefer to document the function's purpose, input and output in comments.\n# 8. Prefer lowercase names for input data.\n# 9. Prefer uppercase names for input parameters.\n\ndef template_layer(input, variables=dict(), PARAM=None):\n    ret = None\n    try:\n        ret = input\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret,)\n\n\nif __name__ == \"__main__\":\n    input = None\n    output = template_layer(input)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/__init__.py",
    "content": "# Transform\nfrom .tokenize_article_layer import tokenize_article_layer\nfrom .ngrams_layer import ngrams_layer\nfrom .minhash_tokens_layer import minhash_tokens_layer\nfrom .lsh_minhash_layer import lsh_minhash_layer\nfrom .warc_filter_layer import warc_filter_layer\nfrom .warc_encode_layer import warc_encode_layer\nfrom .warc_to_wet_layer import warc_to_wet_layer\nfrom .wet_decode_layer import wet_decode_layer\nfrom .math_filter_layer import math_filter_layer\nfrom .openquestion_filter_layer import openquestion_filter_layer\nfrom .mcq_filter_layer import mcq_filter_layer\n\n__all__ = [\n    \"tokenize_article_layer\", \n    \"ngrams_layer\", \n    \"minhash_tokens_layer\", \n    \"lsh_minhash_layer\", \n    \"warc_filter_layer\", \n    \"warc_encode_layer\", \n    \"warc_to_wet_layer\", \n    \"wet_decode_layer\", \n    \"math_filter_layer\",\n    \"openquestion_filter_layer\",\n    \"mcq_filter_layer\",\n]\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/lsh_minhash_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport time\nimport traceback\nimport numpy as np\nfrom scipy.integrate import quad as integrate\n\n# different from datasketch's implementation, will use 2^61-1 as the maximum hash value instead of 2^32-1\nNUM_PERM = 256\nLSH_THRESHOLD = 0.8\n\nclass LSH:\n    def __init__(self):\n        # gen lsh range\n        b, r = self.optimal_param(LSH_THRESHOLD, NUM_PERM, 0.5, 0.5)\n        self.hashranges = [(i*r, (i+1)*r) for i in range(b)]\n        \n    # gen lsh param\n    # https://github.com/ekzhu/datasketch/blob/44077457d32887a91297f15c3efee2c1982f690e/datasketch/lsh.py\n    def false_positive_probability(self, threshold, b, r):\n        _probability = lambda s : 1 - (1 - s**float(r))**float(b)\n        a, err = integrate(_probability, 0.0, threshold)\n        return a\n\n    def false_negative_probability(self, threshold, b, r):\n        _probability = lambda s : 1 - (1 - (1 - s**float(r))**float(b))\n        a, err = integrate(_probability, threshold, 1.0)\n        return a\n\n    def optimal_param(self, threshold, num_perm, false_positive_weight,\n            false_negative_weight):\n        '''\n        Compute the optimal `MinHashLSH` parameter that minimizes the weighted sum\n        of probabilities of false positive and false negative.\n        '''\n        min_error = float(\"inf\")\n        opt = (0, 0)\n        for b in range(1, num_perm+1):\n            max_r = int(num_perm / b)\n            for r in range(1, max_r+1):\n                fp = self.false_positive_probability(threshold, b, r)\n                fn = self.false_negative_probability(threshold, b, r)\n                error = fp*false_positive_weight + fn*false_negative_weight\n                if error < min_error:\n                    min_error = error\n                    opt = (b, r)\n        return opt\n\n    def 
gen_lsh(self, minhash):\n        return [bytearray(minhash[start:end]) for start, end in self.hashranges]\n\nlsh = LSH()\n\ndef lsh_minhash_layer(minhash, variables=dict()):\n    ret = list()\n    try:\n        minhash = np.array(minhash, dtype=np.uint64)\n        #assert minhash.dtype == np.uint64 and minhash.shape == (NUM_PERM,)\n        lshvalues = lsh.gen_lsh(minhash)\n        for i, value in enumerate(lshvalues):\n            key = f'{i}_'.encode() + value\n            ret.append(key)\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == \"__main__\":\n    minhash = [2170239837623632,1287605064391826,7877338491737559,1522708576701298,1959803855170230,136353893425081,3067530819312822,19822079906565762,14191953696745176,371933081470560,2359093478290026,24211742396711177,5207401883495830,3386445753675098,6482843287028185,14956790165792002,7760994632330526,3801562091963312,654119844389846,6118541550243605,1058268864309841,19648312785892006,5519054639081138,17769255728697304,1326859272534844,6541616202650748,11131462447891679,11540424367241221,6416091255362971,1178274890175074,9516296843449206,5019313649584786,556043434180166,3170749841321737,788403856226243,16256424180717928,11536645058081246,13331271075979702,5603975614240490,11332978618315755,49833277925775,28529817665769800,5399529123965416,5804862109442032,10516842515700528,1383775130067327,9593857895450592,344120332429946,3650720026287843,4927677784872807,3114522307389328,1054088699310940,11453703275676121,17145094372333782,11943406601641085,429519913626747,3559765888081715,6380853683568781,13142954055708448,1122751140539670,7679037943867431,23532369906879837,4460946791673399,6284691595180437,5534632051525650,4326069154983305,6645880540672905,1199004738171304,2741143312089611,3315947713975755,33325056362165,17905224452748795,11081894870845940,2429362824597352,8796539339687473,17606225237179401,2406479086961618,25285711
888782525,1847958183256316,4198878926995358,5057832224878357,10146090240130753,2413082792037196,3530471135853536,7672611456084586,2230458118023706,9790058494528486,3351632677682193,6902744571969727,4063006572456150,2761280786272613,6242978327908865,26924233559187524,2214283527827093,951652422014210,1577851399523074,282734099627651,4284321096276342,1571021659718705,2064444079057042,25995837896147107,3642452037001290,615591136529782,2579917399379439,10350113780305730,141093940432428,9292013714641581,16926413460125,4351013271280123,4492914008491347,3885988895709230,3643655265951773,4028855757933683,10480484972551973,2399277677842610,391439629014342,4511050103292841,13930059233224697,10142483490268814,10209387364437517,10291028774837120,1963510243393060,6698235608219585,10249974506598137,2090329927024291,19452257405817527,5395347850501660,1466647506773938,18271233688875585,17909487123073655,22732716574954981,28208124344155426,16118266291737203,6436198404802809,935143955767639,4692764892567773,8853071216371112,1600664618209927,39702070969452097,7552579352900360,2729546584440357,12309935356310386,426760114692333,1297488733224877,153415463561661,18948566290952420,8432980683248649,21321844297374743,8265174613176795,905258690673816,705406607744747,9105597597214747,517772088040257,1591136193162784,27511729624229236,3634922285407283,1831578225426174,13255266977668852,15312685554649660,722931468693513,1049089865098577,3498618026981595,4820015824926872,21126162808808528,27814106051492575,4822875592156961,14999120736412943,10825146296544249,6314954554132894,937945964737656,5459760788750366,3819227047549912,6591064604768721,7907494363943122,3486632627636937,9384132089104933,22104346516322826,6658745931891482,34093012584282609,4995951742943174,3517485897161771,135044219482780,7630383357514628,5162177136386332,10728488430543051,5828055747100055,6893511170015442,11011121196423559,2528283999013590,5080079240873515,19593423843180365,6822359610856040,191087978655560,8846708703413576,331
46998994366094,3940701969864300,3507581990705859,6201879648552385,27956522101531374,10178358282977630,2205391899838384,2614926987404300,1090899715885363,6945147978151211,5432157012678156,1250518799355535,3948407147690489,10306927288370802,4580562167416191,8475303907451120,2243101892749971,2451601302451002,2180238663422921,3834240093757495,12119880871693653,12134080723101916,1805202361835209,31781168568203930,42987808989068825,41914343122681270,7985132073155851,16763654385115268,1387995454655588,2351466328427087,3139781779642664,27792958762616566,11961004800461011,6612181571493100,22715857059525182,689087660337260,244785061275028,11511948953811059,8237401627755449,8214914423544509,5470929524034644,9110614658125771,17166417582628999,18571246019891132,3766276759071421,1226388404627669,9965671498507403,1214978610204088,7808074359603991,1313444080667563,9031456783378283,3783393382666945,34163041205217466,3314866608200743,3451870308271748,11716681494447625,1667361573332888,13859255454740261,7299000064706400,6085019581018810,4996856251238621,5666642298303467]\n    lsh_values = lsh_minhash_layer(minhash)\n    print(lsh_values)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/math_filter_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport re\nimport requests\nimport fasttext\nfrom gensim.utils import simple_preprocess\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport util\nimport global_var\n\nwhilte_list = {r\"\\\\displaystyle\", r\"\\\\alpha\", r\"\\\\beta\", r\"\\\\gamma\", r\"\\\\delta\", r\"\\\\zeta\", r\"\\\\eta\", r\"\\\\iota\", r\"\\\\kappa\", r\"\\\\mu\", r\"\\\\nu\", r\"\\\\xi\", r\"\\\\rho\", r\"\\\\tau\", r\"\\\\phi\", r\"\\\\chi\", r\"\\\\psi\", r\"\\\\omicron\", r\"\\\\epsilon\", r\"\\\\pi\", r\"\\\\lambda\", r\"\\\\omega\", r\"\\\\sigma\", r\"\\\\theta\", r\"\\\\vartheta\", r\"\\\\times\", r\"\\\\cdot\", r\"\\\\dot\", r\"\\\\div\", r\"\\\\frac\", r\"\\\\log\", r\"\\\\exp\", r\"\\\\poly\", r\"\\\\eq\", r\"\\\\neq\", r\"\\\\leq\", r\"\\\\geq\", r\"\\\\approx\", r\"\\\\infty\", r\"\\\\int\", r\"\\\\sum\", r\"\\\\lim\", r\"\\\\begin\", r\"\\\\subset\", r\"\\\\supset\", r\"\\\\top\", r\"\\\\star\", r\"\\\\sim\", r\"\\\\simeq\", r\"\\\\ne\", r\"\\\\ll\", r\"\\\\gg\", r\"\\\\pm\", r\"\\\\mp\", r\"\\\\triangleleft\", r\"\\\\triangleright\", r\"\\\\ast\", r\"\\\\circ\", r\"\\\\bullet\", r\"\\\\oplus\", r\"\\\\odot\", r\"\\\\otimes\", r\"\\\\ominus\", r\"\\\\oslash\", r\"\\\\bigcirc\", r\"\\\\wr\", r\"\\\\dagger\", r\"\\\\bigtriangleup\", r\"\\\\bigtriangledown\", r\"\\\\setminus\", r\"\\\\sqcup\", r\"\\\\wedge\", r\"\\\\dotplus\", r\"\\\\centerdot\", r\"\\\\ltimes\", r\"\\\\rtimes\", r\"\\\\prod\", r\"\\\\coprod\", r\"\\\\iint\", r\"\\\\iiint\", r\"\\\\iiiint\", r\"\\\\idotsint\", r\"\\\\bigoplus\", r\"\\\\big\", r\"\\\\oint\", r\"\\\\rightarrow\", r\"\\\\to\", r\"\\\\leftarrow\", r\"\\\\gets\", r\"\\\\uparrow\", r\"\\\\downarrow\", r\"\\\\forall\", r\"\\\\exists\", r\"\\\\pmod\", r\"\\\\cup\", r\"\\\\cap\", r\"\\\\hat\", r\"\\\\acute\", r\"\\\\check\", r\"\\\\grave\", 
r\"\\\\vec\", r\"\\\\ddot\", r\"\\\\tilde\", r\"\\\\breve\", r\"\\\\mathring\", r\"\\\\land\", r\"\\\\lor\", r\"\\\\lnot\", r\"\\\\in\", r\"\\\\smile\", r\"\\\\frown\", r\"\\\\infty\", r\"\\\\mid\", r\"\\\\sin\", r\"\\\\cos\", r\"\\\\tan\", r\"\\\\equiv\", r\"\\\\circ\", r\"\\\\dfrac\", r\"\\\\prec\", r\"\\\\preccurlyeq\", r\"\\\\sqrt\",}\nblack_list = {r\"\\\\text\", r\"\\\\if\", r\"\\\\local\", r\"\\\\usr\", r\"\\\\include\", r\"\\\\lib\", r\"\\\\bin\", r\"\\\\url\", r\"\\\\program\", r\"\\\\microsoft\", r\"\\\\temp\", r\"\\\\windows\", r\"\\\\documents\", r\"\\\\users\", r\"\\\\my\", r\"\\\\the\",}\nkeywords1 = whilte_list - black_list\nkeywords1 = set(map(lambda x: x + \"[^a-zA-Z]\", keywords1))\n\nkeywords2 = {r\"\\+\", r\"\\-\", r\"\\*\", r\"\\/\", r\"\\%\", r\"\\=\", r\"\\!\\=\", r\"\\<\", r\"\\>\", r\"\\^\", r\"\\_\", r\"\\(\", r\"\\)\", r\"\\[\", r\"\\]\", r\"\\{\", r\"\\}\", r\"\\|\\|\", r\"\\&\\&\", r\"sqrt\", r\"sum\", r\"int\", r\"\\$\", r\"\\<math\\>\", r\"\\[math\\]\", }\n\npattern0 = re.compile(r\"\\\\[A-Z]{0,9}[a-z]{2,9}\")\npattern1 = re.compile(\"|\".join(keywords1))\npattern2 = re.compile(\"|\".join(keywords2))\n\ndef ismath_by_model(text, model, thred=0.5):\n    if model is None:\n        return False\n    if not isinstance(text, str) or len(text.strip()) == 0:\n        return False\n    try:\n        x = \" \".join(simple_preprocess(text))\n        ret = model.predict(x)\n        label, prob = ret[0][0], ret[1][0]\n        return label != \"__label__0\"\n    except:\n        traceback.print_exc()\n        return False\n\ndef math_filter_layer(pq_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", OVERWRITE=False):\n    ret = list()\n    try:\n        in_pq_path = os.path.join(INPUT_FOLDER, pq_name)\n        in_pq_path = util.to_real_path(in_pq_path, variables)\n        out_pq_path = os.path.join(OUTPUT_FOLDER, pq_name)\n        out_pq_path = util.to_real_path(out_pq_path, variables)\n\n        if os.path.exists(in_pq_path) and 
(OVERWRITE or not os.path.exists(out_pq_path)):\n            util.create_folder_by_file_path(out_pq_path)\n\n            # read parquet file.\n            try:\n                table = pq.read_table(in_pq_path)\n            except:\n                traceback.print_exc()\n            \n            # filter records containing math.\n            records = list()\n            for record in table.to_pylist():\n                try:\n                    text = record[\"text\"]\n\n                    if record[\"la\"] != \"en\":\n                        continue\n\n                    #if item[\"la_prob\"] < 0.65:\n                    #    continue\n                    #if text is None or len(text) < 64:\n                    #    continue\n                    #if text.count(\"\\\\u\") >= 10:\n                    #    continue\n\n                    #if not check_quality(record):\n                    #    continue\n\n                    symbols0 = set(pattern0.findall(text))\n                    if len(symbols0) <= 0:\n                        continue\n\n                    symbols1 = set(pattern1.findall(text.lower()))\n                    symbols1 = set(map(lambda sym: sym[:-1], symbols1))\n                    if len(symbols1) <= 0:\n                        continue\n\n                    symbols2 = set(pattern2.findall(text.lower()))\n                    if len(symbols1) == 1 and len(symbols2) <= 0:\n                        continue\n\n                    ismath = len(symbols1) >= 5 or ismath_by_model(text, global_var.ft_math_model)\n                    if not ismath:\n                        continue\n\n                    records.append(record)\n                except:\n                    traceback.print_exc()\n\n            # write parquet file.\n            try:\n                table = pa.Table.from_pylist(records)\n                pq.write_table(table, out_pq_path)\n            except:\n                traceback.print_exc()\n            \n            ret = 
[out_pq_path]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == '__main__':\n    snapshot = \"CC-MAIN-2022-49\"\n    variables = {\"workspace_dir\": r\"workspace\", \"worker_id\": 0, \"worker_num\": 1}\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    # math_filter_layer takes no STORAGE_PATH parameter, so none is passed here.\n    ret = math_filter_layer(snapshot, variables=variables, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER)\n    print(ret)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/mcq_filter_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport re\nimport json\nimport fasttext\nimport requests\nfrom io import BytesIO\nfrom gensim.utils import simple_preprocess\nfrom warcio.limitreader import LimitReader\nfrom warcio.warcwriter import WARCWriter\nfrom warcio.archiveiterator import ArchiveIterator\nimport util\nimport global_var\n\n\ndef detect_lang(text):\n    try:\n        LID_WIN_SIZE = 256\n        text = ''.join(text.split())\n        span_start, span_end = 0, len(text)\n        if len(text) > LID_WIN_SIZE:\n            mid = len(text) // 2\n            mid_win = LID_WIN_SIZE // 2\n            span_start = max(0, int(mid - mid_win))\n            span_end = min(len(text), int(mid + mid_win))\n        det_text = text[span_start: span_end]\n        res = global_var.lid_model.predict(det_text)\n        lang = res[0][0].replace(\"__label__\", \"\")\n        prob = float(res[1][0])\n        return lang\n    except:\n        return \"unkown\"\n\n\ndef detect_choice_exercise_by_rule(uri, html):\n    uri = uri.lower()\n    html = html.lower()\n    contain_cnt = 0\n\n    keywords_in_text = [b\"choice question\"]\n    for keyword in keywords_in_text:\n        if keyword in html:\n            contain_cnt += 1\n            break\n\n    combo_keywords_in_text = [\n        (b\"a.\",   b\"b.\",   b\"c.\",   b\"d.\"),\n        (b\"a)\",   b\"b)\",   b\"c)\",   b\"d)\"),\n        (b\"\\na \", b\"\\nb \", b\"\\nc \", b\"\\nd \"),\n        (b\">a<\",  b\">b<\",  b\">c<\",  b\">d<\"),\n\n        (b\"1.\",   b\"2.\",   b\"3.\",   b\"4.\"),\n        (b\"1)\",   b\"2)\",   b\"3)\",   b\"4)\"),\n        (b\"\\n1 \", b\"\\n2 \", b\"\\n3 \", b\"\\n4 \"),\n        (b\">1<\",  b\">2<\",  b\">3<\",  b\">4<\"),\n\n        (b\"i.\",   b\"ii.\",   b\"iii.\",   b\"iv.\"),\n        (b\"i)\",   b\"ii)\",   b\"iii)\",   b\"iv)\"),\n     
   (b\"\\ni \", b\"\\nii \", b\"\\niii \", b\"\\niv \"),\n        (b\">i<\",  b\">ii<\",  b\">iii<\",  b\">iv<\"),\n    ]\n\n    for combo_keyword in combo_keywords_in_text:\n        if combo_keyword[0] in html and combo_keyword[1] in html and combo_keyword[2] in html and combo_keyword[3] in html:\n            contain_cnt += 1\n            break\n\n    return contain_cnt == 2\n\n\ndef detect_choice_exercise_by_ft_model(uri, text, thred=0.5):\n    try:\n        if not isinstance(text, str) or len(text.strip()) == 0:\n            return False\n        x = \" \".join(simple_preprocess(text))\n        ret = global_var.ft_mcq_model.predict(x)\n        label, prob = ret[0][0], ret[1][0]\n        if label == \"__label__0\" and prob < thred:\n            return True\n        return label == \"__label__1\"\n    except:\n        return False\n\n\"\"\"\ndef detect_choice_exercise_by_pt_model(uri, text, thred=0.5):\n    try:\n        if not isinstance(text, str) or len(text.strip()) == 0:\n            return False\n        label = global_var.py_mcq_model.run(text, thred)\n        return label == \"LABEL_1\"\n    except:\n        return False\n\"\"\"\n\n\ndef detect_choice_exercise_by_LLM(text, engine=None):\n    system = '''\nYou will be given a text converted from a webpage. 
Your task is to detect whether it contains choice question by responding with 'yes' or 'no'.\n'''\n    answer = global_var.gpt_api.run(system=system, question=text, engine=engine)\n    answer = answer.lower().strip()\n    if answer.startswith(\"yes\"):\n        return True\n    elif answer.startswith(\"no\"):\n        return False\n    else:\n        return False\n\n\ndef LCS(str1, str2):\n    m = len(str1)\n    n = len(str2)\n\n    dp = [[0 for _ in range(n+1)] for _ in range(m+1)]\n\n    for i in range(1, m+1):\n        for j in range(1, n+1):\n            if str1[i-1] == str2[j-1]:\n                dp[i][j] = dp[i-1][j-1] + 1\n            else:\n                dp[i][j] = max(dp[i-1][j], dp[i][j-1])\n\n    return round(1.0 * dp[m][n] / n, 6)\n\n\ndef localize_choice_exercise_by_LLM(text, engine=None):\n    system = '''\nPurpose:\nCreate a multiple-choice question dataset.\n\nTask:\nExtract all multiple-choice questions from the provided text.\n\nRequirements:\n1. If the given text does not contain multiple-choice questions, respond only with \"No multiple-choice questions found\".\n2. Do not modify the original multiple-choice questions.\n3. Ensure all multiple-choice questions are copied without omissions.\n4. Ensure all multiple-choice questions are copied in order.\n5. Ensure all multiple-choice questions are copied under the original layout.\n6. Copy the questions along with their options.\n7. If answers and explanations are provided, copy them as well.\n8. If source materials or reading passage is provided, copy it as well.\n9. 
Don't add content not from original given text.\n\nPlease strictly adhere to these requirements while performing the task.\n'''\n    exercises = global_var.gpt_api.run(system=system, question=text, engine=engine)\n    exercises = exercises.strip()\n    if len(exercises) == 0 or \"no multiple-choice question\" in exercises.lower():\n        return None\n    else:\n        exercises = exercises.replace(\"Multiple Choice Questions\\n\", \"\")\n        exercises = exercises.replace(\"Multiple-choice questions:\\n\", \"\")\n        exercises = exercises.replace(\"No other multiple-choice questions found.\", \"\")\n        exercises = exercises.replace(\"No other multiple-choice questions found in the text.\", \"\")\n        exercises = exercises.replace(\"No multiple-choice questions found.\", \"\")\n        exercises = exercises.replace(\"No more multiple-choice questions found.\", \"\")\n\n        # the replacements above can leave an empty string, which would make LCS divide by zero.\n        if len(exercises) == 0:\n            return None\n\n        sim = LCS(text, exercises)\n        if sim < 0.9:\n            return None\n        else:\n            return exercises\n\n\n# rule + model + GPT-3.5 Turbo.\ndef mcq_filter_layer(wet_file_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", OVERWRITE=False):\n    ret = list()\n    try:\n        src_wet_file_path = os.path.join(INPUT_FOLDER, wet_file_name)\n        src_wet_file_path = util.to_real_path(src_wet_file_path, variables)\n        jsonl_file_name = wet_file_name.replace(\".warc.wet.gz\", \".jsonl\")\n        dst_jsonl_file_path = os.path.join(OUTPUT_FOLDER, jsonl_file_name)\n        dst_jsonl_file_path = util.to_real_path(dst_jsonl_file_path, variables)\n\n        if os.path.exists(src_wet_file_path) and (OVERWRITE or not os.path.exists(dst_jsonl_file_path)):\n            items = list()\n            with open(src_wet_file_path, \"rb\") as input:\n                records = ArchiveIterator(input, arc2warc=False)\n                for id, record in enumerate(records):\n                    if record.rec_type == \"conversion\":\n                        try:\n                            # read raw html.\n                            uri = record.rec_headers[\"WARC-Target-URI\"]\n                            bs = record.content_stream().read()\n                            if bs is None:\n                                continue\n\n                            text = str(bs, \"utf-8\")\n                            if text is None:\n                                continue\n\n                            # 1st round filter.\n                            round1_contain_exercise = detect_choice_exercise_by_rule(uri, bs)\n                            if not round1_contain_exercise:\n                                continue\n\n                            # 2nd round filter.\n                            round2_contain_exercise = detect_choice_exercise_by_ft_model(uri, text, thred=0.825)\n                            if not round2_contain_exercise:\n                                continue\n                            #round2_contain_exercise = detect_choice_exercise_by_pt_model(uri, text, thred=0.99)\n                            #if not round2_contain_exercise:\n                            #    continue\n\n                            \"\"\"\n                            # 3rd round filter.\n                            round3_contain_exercise = detect_choice_exercise_by_LLM(text, \"gpt-35-turbo\")\n                            if not round3_contain_exercise:\n                                continue\n                            \"\"\"\n\n                            item = dict()\n                            item[\"uri\"] = uri\n                            item[\"text\"] = text\n                            lang = detect_lang(text)\n                            item[\"lang\"] = lang\n                            #exercises = localize_choice_exercise_by_LLM(text, \"gpt-35-turbo\")\n                            #item[\"exercises\"] = exercises\n                            items.append(item)\n                        except Exception:\n                            traceback.print_exc()\n                            pass\n            with open(dst_jsonl_file_path, \"w\") as output:\n                for item in items:\n                    output.write(json.dumps(item) + \"\\n\")\n            ret = [dst_jsonl_file_path]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == '__main__':\n    wet_file_name = \"CC-MAIN-20210115134101-20210115164101-00005_5.warc.wet.gz\"\n    variables = {\"workspace_dir\": r\"workspace\", \"worker_id\": 0, \"worker_num\": 1}\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    ret = mcq_filter_layer(wet_file_name, variables=variables, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, OVERWRITE=True)\n    print(ret)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/minhash_tokens_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport hashlib\nimport traceback\nimport numpy as np\nfrom itertools import tee\n\nMER = 2**61 - 1\nNUM_PERM = 256\nSEED = 1\n\nclass MinHasher:\n    def __init__(self):\n        np.random.seed(1)\n        self.gen = np.random.RandomState(SEED)\n        self.a = self.gen.randint(1, MER, (NUM_PERM,), dtype='u8')\n        self.b = self.gen.randint(0, MER, (NUM_PERM,), dtype='u8')\n\n    def _sha1_hash(self, val):\n        val = int.from_bytes(hashlib.sha1(val).digest()[:8], 'little')\n        val &= MER\n        return np.uint64(val)\n    \n    def hash(self, sequence):\n        res = np.ones(NUM_PERM, dtype='u8') * MER\n        for token in sequence:\n            hash0 = self._sha1_hash(token.encode('utf8'))\n            hash_vec = hash0 * self.a + self.b\n            hash_vec %= MER\n            res = np.minimum(res, hash_vec)\n        return res\n\nminhasher = MinHasher()\n\ndef minhash_tokens_layer(tokens, variables=dict()):\n    ret = None\n    try:\n        minhash = minhasher.hash(tokens)\n        ret = minhash\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == \"__main__\":\n    tokens = {'产权 份额 为 土地 出让', '商品 住房 市场 价格 合理', '确定 , 在 售 房', ', 可 向 代 持', '住房 , 划 拨 土地', '增 购 政府 份额 的', '向社会公布 。 划 拨 土地', '为 商品 住房 , 划', '▁来源 : 中国 网 地产', '出让 土地 共有 产权 保障', '的 , 可 向 代', '售 房 阶段 向社会公布 。', '商品 住房 , 划 拨', '以及 累计 缴纳 社保 或', '性质 转 为 商品 住房', '的 非 市区 户籍 家庭', '购房 款 。 ▁在 使用', '地产 ▁ 杭州市 1 日', '《 杭州市 共有 产权 保障', '住房 享有 与 购买 商品', '类型 商品 住房 市场 价格', '的 申请 , 增 购', '价 按 同 地段 、', '款 。 ▁在 使用 管理', '可根据 支付 能力 在 50%', '按照 单 套 销售 价格', '方可 通过 买卖 等方式 上市', '年限 等相关 条件 。 ▁', '10 年后 , 方可 通过', '市场 价格 合理 优惠 后', '拨 土地 共有 产权 保障', '杭州市 共有 产权 保障 住房', '销售 基准 价 按 同', '能力 在 50% 至 80%', '等相关 条件 。 ▁ 办法', '年 的 , 可 向', '至 80% 范围内 选择 产权', '共有 
产权 保障 住房 销售', '符合 限购 政策 前提 下', '购房 家庭 可根据 支付 能力', '提出 共有 产权 保障 住房', '住房 , 购房 家庭 可根据', '。 ▁在 使用 管理 方面', '-12- 03 ▁记者 : ▁来源', '保障 住房 面向 符合条件的 市区', '住房 以及 累计 缴纳 社保', '。 ▁ 办法 明确 ,', '购房 家庭 产权 份额 为', '社保 或 个 税 年限', '价 及其 浮动 幅度 确定', '非 市区 户籍 家庭 供应', '购房 款 。 出让 土地', ', 购房 家庭 可根据 支付', '单 套 销售 价格 对应的', '权利 性质 调整为 出让 。', '03 ▁记者 : ▁来源 :', '▁2021 -12- 03 ▁记者 :', '产权 保障 住房 面向 符合条件的', '日 对外 发布 《 杭州市', '就业 的 非 市区 户籍', '增 购 后 住房 性质', ', 购买 共有 产权 保障', '、 同 类型 商品 住房', '同等 的 公共服务 权益 。', '对应的 不同 比例 支付 购房', '的 公共服务 权益 。 ▁根据', '网 地产 ▁ 杭州市 1', '款 。 出让 土地 共有', '套 销售 价格 对应的 产权', '管理 方面 , 杭州 提出', '住房 , 购房 家庭 产权', '和 稳定 就业 的 非', '土地 权利 性质 调整为 出让', '浮动 幅度 确定 , 在', '不动产 权 证 满 10', '▁ 办法 明确 , 共有', '机构 提出 一次性 增 购', '》 , 其中 明确 ,', '权 证 满 10 年后', '在 50% 至 80% 范围内', '方面 , 杭州 提出 共有', '满 10 年后 , 方可', '基准 价 按 同 地段', '产权 份额 比例 , 按照', '保障 住房 管理办法 》 ,', '居住证 、 住房 以及 累计', '销售 价格 对应的 产权 比例', '住房 面向 符合条件的 市区 户籍', '。 ▁根据 办法 , 市区', '单 套 销售 价格 按照', '销售 基准 价 及其 浮动', ': 中国 网 地产 ▁', '持 机构 提出 一次性 增', '价格 按照 销售 基准 价', '家庭 供应 , 购买 共有', '购买 共有 产权 保障 住房', '稳定 就业 的 非 市区', '购买 商品 住房 同等 的', '其中 明确 , 共有 产权', '▁记者 : ▁来源 : 中国', '价格 对应的 不同 比例 支付', '与 购买 商品 住房 同等', '、 住房 等相关 条件 ,', '条件 。 ▁ 办法 明确', '证 满 5 年 的', '满 5 年 的 ,', '管理办法 》 , 其中 明确', '市区 户籍 家庭 需 满足', '份额 的 申请 , 增', '商品 住房 同等 的 公共服务', '支付 能力 在 50% 至', '权 证 满 5 年', '户籍 家庭 需 满足 居住证', ', 方可 通过 买卖 等方式', ', 在 售 房 阶段', '对应的 产权 比例 支付 购房', '产权 保障 住房 购房 家庭', '家庭 需 满足 居住证 、', '杭州 提出 共有 产权 保障', '1 日 对外 发布 《', ', 其中 明确 , 共有', '满足 居住证 、 住房 以及', '选择 产权 份额 比例 ,', '同时 满足 户籍 、 住房', ', 市区 户籍 家庭 要在', '销售 价格 对应的 不同 比例', '个 税 年限 等相关 条件', '住房 市场 价格 合理 优惠', '产权 保障 住房 , 购房', '、 住房 以及 累计 缴纳', '产权 保障 住房 销售 基准', '后 住房 性质 转 为', '土地 出让 时 已 确定的', '比例 , 按照 单 套', '发布 《 杭州市 共有 产权', '住房 性质 转 为 商品', '累计 缴纳 社保 或 个', '份额 比例 , 按照 单', '时 已 确定的 份额 比例', '划 拨 土地 权利 性质', '基准 价 及其 浮动 幅度', '。 出让 土地 共有 产权', '为 土地 出让 时 已', ', 购房 家庭 产权 份额', '等相关 条件 , 非 市区', '按 同 地段 、 同', '按照 销售 基准 价 及其', '不同 比例 支付 购房 款', '住房 销售 基准 价 按', '家庭 产权 份额 为 土地', '可 向 代 持 机构', '▁在 使用 管理 方面 ,', '家庭 取得 不动产 权 证', '性质 调整为 出让 。 取得', '取得 不动产 权 证 
满', '市区 户籍 家庭 要在 符合', ', 杭州 提出 共有 产权', '政策 前提 下 同时 满足', '▁根据 办法 , 市区 户籍', '办法 , 市区 户籍 家庭', '缴纳 社保 或 个 税', '。 划 拨 土地 共有', '家庭 可根据 支付 能力 在', '满足 户籍 、 住房 等相关', '一次性 增 购 政府 份额', '购 政府 份额 的 申请', '需 满足 居住证 、 住房', '同 地段 、 同 类型', '供应 , 购买 共有 产权', '使用 管理 方面 , 杭州', '保障 住房 享有 与 购买', '共有 产权 保障 住房 享有', '限购 政策 前提 下 同时', '套 销售 价格 按照 销售', '户籍 和 稳定 就业 的', '优惠 后 确定 。 单', '住房 管理办法 》 , 其中', '市区 户籍 和 稳定 就业', '支付 购房 款 。 ▁在', '户籍 家庭 供应 , 购买', '同 类型 商品 住房 市场', '保障 住房 购房 家庭 取得', '及其 浮动 幅度 确定 ,', '共有 产权 保障 住房 管理办法', '共有 产权 保障 住房 面向', '在 售 房 阶段 向社会公布', '共有 产权 保障 住房 ,', '政府 份额 的 申请 ,', '买卖 等方式 上市 交易 。', '市区 户籍 家庭 供应 ,', '出让 时 已 确定的 份额', '家庭 要在 符合 限购 政策', '申请 , 增 购 后', ', 非 市区 户籍 家庭', '前提 下 同时 满足 户籍', '划 拨 土地 共有 产权', ', 划 拨 土地 权利', '产权 保障 住房 管理办法 》', '阶段 向社会公布 。 划 拨', '明确 , 共有 产权 保障', '确定的 份额 比例 , 按照', '证 满 10 年后 ,', '通过 买卖 等方式 上市 交易', '已 确定的 份额 比例 ,', '不动产 权 证 满 5', '提出 一次性 增 购 政府', '对外 发布 《 杭州市 共有', '价格 合理 优惠 后 确定', '。 取得 不动产 权 证', '范围内 选择 产权 份额 比例', '房 阶段 向社会公布 。 划', '▁ 杭州市 1 日 对外', '份额 为 土地 出让 时', ', 增 购 后 住房', '地段 、 同 类型 商品', '杭州市 1 日 对外 发布', '户籍 家庭 要在 符合 限购', '保障 住房 销售 基准 价', '调整为 出让 。 取得 不动产', ', 共有 产权 保障 住房', '权益 。 ▁根据 办法 ,', '比例 支付 购房 款 。', '保障 住房 , 购房 家庭', '或 个 税 年限 等相关', '年后 , 方可 通过 买卖', '出让 。 取得 不动产 权', '价格 对应的 产权 比例 支付', '购 后 住房 性质 转', '确定 。 单 套 销售', '支付 购房 款 。 出让', '要在 符合 限购 政策 前提', '拨 土地 权利 性质 调整为', '转 为 商品 住房 ,', '享有 与 购买 商品 住房', '公共服务 权益 。 ▁根据 办法', '中国 网 地产 ▁ 杭州市', '5 年 的 , 可', '合理 优惠 后 确定 。', '办法 明确 , 共有 产权', '共有 产权 保障 住房 购房', '套 销售 价格 对应的 不同', '户籍 、 住房 等相关 条件', '下 同时 满足 户籍 、', '产权 保障 住房 享有 与', '面向 符合条件的 市区 户籍 和', '购房 家庭 取得 不动产 权', '条件 , 非 市区 户籍', '幅度 确定 , 在 售', ': ▁来源 : 中国 网', '代 持 机构 提出 一次性', '产权 比例 支付 购房 款', '80% 范围内 选择 产权 份额', '向 代 持 机构 提出', '住房 同等 的 公共服务 权益', '税 年限 等相关 条件 。', '土地 共有 产权 保障 住房', ', 按照 单 套 销售', '非 市区 户籍 家庭 需', '。 单 套 销售 价格', '符合条件的 市区 户籍 和 稳定', '住房 等相关 条件 , 非', '50% 至 80% 范围内 选择', '后 确定 。 单 套', '住房 购房 家庭 取得 不动产', '销售 价格 按照 销售 基准'}\n    minhash = minhash_tokens_layer(tokens)\n    print(minhash)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/ngrams_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nfrom itertools import tee\n\nNGRAM_SIZE = 5\n\ndef ngrams_layer(sequence, variables=dict()):\n    ret = None\n    try:\n        # https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/utils/tokenization.py\n        if len(sequence) < NGRAM_SIZE:\n            return iter([sequence])\n        iterables = tee(iter(sequence), NGRAM_SIZE)\n        for i, sub_iterable in enumerate(iterables):\n            for _ in range(i):\n                next(sub_iterable, None)\n        tokens = zip(*iterables)\n        tokens = {\" \".join(t).strip() for t in tokens}\n        #tokens = list(tokens)\n        ret = tokens\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == \"__main__\":\n    tokens = ['▁2021', '-12-', '03', '▁记者', ':', '▁来源', ':', '中国', '网', '地产', '▁', '杭州市', '1', '日', '对外', '发布', '《', '杭州市', '共有', '产权', '保障', '住房', '管理办法', '》', ',', '其中', '明确', ',', '共有', '产权', '保障', '住房', '面向', '符合条件的', '市区', '户籍', '和', '稳定', '就业', '的', '非', '市区', '户籍', '家庭', '供应', ',', '购买', '共有', '产权', '保障', '住房', '享有', '与', '购买', '商品', '住房', '同等', '的', '公共服务', '权益', '。', '▁根据', '办法', ',', '市区', '户籍', '家庭', '要在', '符合', '限购', '政策', '前提', '下', '同时', '满足', '户籍', '、', '住房', '等相关', '条件', ',', '非', '市区', '户籍', '家庭', '需', '满足', '居住证', '、', '住房', '以及', '累计', '缴纳', '社保', '或', '个', '税', '年限', '等相关', '条件', '。', '▁', '办法', '明确', ',', '共有', '产权', '保障', '住房', '销售', '基准', '价', '按', '同', '地段', '、', '同', '类型', '商品', '住房', '市场', '价格', '合理', '优惠', '后', '确定', '。', '单', '套', '销售', '价格', '按照', '销售', '基准', '价', '及其', '浮动', '幅度', '确定', ',', '在', '售', '房', '阶段', '向社会公布', '。', '划', '拨', '土地', '共有', '产权', '保障', '住房', ',', '购房', '家庭', '可根据', '支付', '能力', '在', '50%', '至', '80%', '范围内', '选择', '产权', '份额', '比例', ',', '按照', '单', '套', 
'销售', '价格', '对应的', '不同', '比例', '支付', '购房', '款', '。', '出让', '土地', '共有', '产权', '保障', '住房', ',', '购房', '家庭', '产权', '份额', '为', '土地', '出让', '时', '已', '确定的', '份额', '比例', ',', '按照', '单', '套', '销售', '价格', '对应的', '产权', '比例', '支付', '购房', '款', '。', '▁在', '使用', '管理', '方面', ',', '杭州', '提出', '共有', '产权', '保障', '住房', '购房', '家庭', '取得', '不动产', '权', '证', '满', '5', '年', '的', ',', '可', '向', '代', '持', '机构', '提出', '一次性', '增', '购', '政府', '份额', '的', '申请', ',', '增', '购', '后', '住房', '性质', '转', '为', '商品', '住房', ',', '划', '拨', '土地', '权利', '性质', '调整为', '出让', '。', '取得', '不动产', '权', '证', '满', '10', '年后', ',', '方可', '通过', '买卖', '等方式', '上市', '交易', '。']\n    tokens = ngrams_layer(tokens)\n    print(tokens)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/openquestion_filter_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport re\nimport gc\nimport requests\nimport fasttext\nfrom gensim.utils import simple_preprocess\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nsys.path.append(\".\")\nimport util\nimport global_var\n\nquestion_keywords = (\"q&a\", \"q & a\", \"q:\", \"que:\", \"question:\", \"quiz:\", \"exam:\", \"examination:\", \"probe:\", \"request:\", \"challenge:\", \"test:\", \"query:\", \"survey:\")\n#question_keywords2 = (\"what \", \"where \", \"why \", \"when \", \"who \", \"whoes \", \"how \", \"\\?\")\nquestion_keywords2 = (\"what\", \"where\", \"why\", \"when\", \"who\", \"whoes\", \"how\")\nquestion_keywords += question_keywords2\nquestion_keywords = set(map(lambda x: \"[^a-zA-Z]\" + x + \"[^a-zA-Z]\", question_keywords))\nquestion_pattern = re.compile(\"|\".join(question_keywords))\n\nanswer_keywords = (\"q&a\", \"q & a\", \"a:\", \"ans:\", \"answer:\", \"solution:\", \"reply:\", \"response:\", \"result:\", \"outcome:\", \"explanation:\", \"conclusion:\", \"finding:\", \"assertion:\", \"statement:\", \"clarification:\")\nanswer_keywords = set(map(lambda x: \"[^a-zA-Z]\" + x + \"[^a-zA-Z]\", answer_keywords))\nanswer_pattern = re.compile(\"|\".join(answer_keywords))\n\n\ndef is_openquestion_by_model(text, model, thred=0.5):\n    if model is None:\n        return False\n    if not isinstance(text, str) or len(text.strip()) == 0:\n        return False\n    try:\n        x = \" \".join(simple_preprocess(text))\n        ret = model.predict(x)\n        label, prob = ret[0][0], ret[1][0]\n        return label != \"__label__0\"\n    except:\n        traceback.print_exc()\n        return False\n\ndef check_yes_no_question(text_before, text_after):\n    text_after = text_after.lower().strip()\n    keywords = (\"yes\", \"y\", \"no\", \"n\")\n    for keyword in keywords:\n   
     if text_after.startswith(keyword) and not text_after[len(keyword):len(keyword) + 1].isalnum():\n            return True\n    return False\n\ndef check_multiple_choise_question(text_before, text_after):\n    combo_keywords_list = [\n        (\"a.\",   \"b.\",   \"c.\",   \"d.\"),\n        (\"a)\",   \"b)\",   \"c)\",   \"d)\"),\n        (\"\\na \", \"\\nb \", \"\\nc \", \"\\nd \"),\n        (\">a<\",  \">b<\",  \">c<\",  \">d<\"),\n\n        (\"1.\",   \"2.\",   \"3.\",   \"4.\"),\n        (\"1)\",   \"2)\",   \"3)\",   \"4)\"),\n        (\"\\n1 \", \"\\n2 \", \"\\n3 \", \"\\n4 \"),\n        (\">1<\",  \">2<\",  \">3<\",  \">4<\"),\n\n        (\"i.\",   \"ii.\",   \"iii.\",   \"iv.\"),\n        (\"i)\",   \"ii)\",   \"iii)\",   \"iv)\"),\n        (\"\\ni \", \"\\nii \", \"\\niii \", \"\\niv \"),\n        (\">i<\",  \">ii<\",  \">iii<\",  \">iv<\"),\n    ]\n    text_before = text_before.lower().strip()\n    for combo_keywords in combo_keywords_list:\n        t = 0\n        for combo_keyword in combo_keywords:\n            t = text_before.find(combo_keyword, t)\n            if t == -1:\n                break\n        if t != -1:\n            return True\n        #if combo_keywords[0] in text_before and combo_keywords[1] in text_before and combo_keywords[2] in text_before:\n        #    return True\n    return False\n\ndef check_fill_in_question(text_before, text_after):\n    text_before = text_before.lower().strip()\n    if \"___\" in text_before or \"()\" in text_before or \"...\" in text_before:\n        return True\n    return False\n\ndef check_quality(item):\n    text = item[\"text\"]\n    lines = text.split(\"\\n\")\n    lens = list(map(lambda l: len(l.strip()), lines))\n    max_len = max(lens)\n\n    #if max_len > 1024:\n    if max_len > 2048:\n        return False\n    if max_len <= 128:\n        return False\n\n    if len(lens) <= 3:\n        return False\n    if len(lens) > 256:\n        return False\n\n    if len(text) < 256:\n        return False\n    if len(text) > 
1024 * 16:\n        return False\n\n    if 1.0 * text.count(\" \") / len(text) > 0.33:\n        return False\n\n    if 1.0 * text.count(\"  \") / len(text) > 0.1:\n        return False\n\n    if 1.0 * text.count(\"\\t\") / len(text) > 0.1:\n        return False\n\n    if 1.0 * text.count(\".\") / len(text) > 0.1:\n        return False\n\n    if 1.0 * text.count(\"-\") / len(text) > 0.1:\n        return False\n\n    if 1.0 * text.count(\"#\") / len(text) > 0.1:\n        return False\n\n    if 1.0 * text.count(\"|\") / len(text) > 0.1:\n        return False\n\n    if 1.0 * text.count(\",\") / len(text) > 0.1:\n        return False\n\n    sl_cnt = 1.0 * len(list(filter(lambda x: len(x.strip()) <= 32, lines))) / len(lines)\n    if sl_cnt > 0.67:\n        return False\n\n    return True\n\ndef openquestion_filter_layer(pq_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", OVERWRITE=False):\n    ret = list()\n    try:\n        in_pq_path = os.path.join(INPUT_FOLDER, pq_name)\n        in_pq_path = util.to_real_path(in_pq_path, variables)\n        out_pq_path = os.path.join(OUTPUT_FOLDER, pq_name)\n        out_pq_path = util.to_real_path(out_pq_path, variables)\n\n        if os.path.exists(in_pq_path) and (OVERWRITE or not os.path.exists(out_pq_path)):\n            util.create_folder_by_file_path(out_pq_path)\n\n            # read parquet file.\n            try:\n                table = pq.read_table(in_pq_path)\n                records = table.to_pylist()\n            except:\n                traceback.print_exc()\n            \n            # filter records containing open question.\n            openquestion_records = list()\n            for record_idx, record in enumerate(records):\n                try:\n                    text = record[\"text\"]\n                    text_low = text.lower()\n\n                    if record[\"la\"] != \"en\":\n                        continue\n\n                    #if item[\"la_prob\"] < 0.65:\n                    #    
continue\n                    #if text is None or len(text) < 64:\n                    #    continue\n                    #if text.count(\"\\\\u\") >= 10:\n                    #    continue\n\n                    #if not check_quality(record):\n                    #    continue\n\n                    contain_question = len(question_pattern.findall(text_low)) >= 2\n                    if not contain_question:\n                        continue\n                    \n                    contain_answer = len(answer_pattern.findall(text_low)) >= 2\n                    if not contain_answer:\n                        continue\n\n                    contain_openquestion = is_openquestion_by_model(text, global_var.ft_openquestion_model)\n                    if not contain_openquestion:\n                        continue\n\n                    openquestion_records.append(record)\n                except:\n                    traceback.print_exc()\n\n            # write parquet file.\n            try:\n                openquestion_table = pa.Table.from_pylist(openquestion_records)\n                pq.write_table(openquestion_table, out_pq_path)\n            except:\n                traceback.print_exc()\n            \n            ret = [out_pq_path]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == '__main__':\n    snapshot = \"CC-MAIN-2022-49\"\n    variables = {\"workspace_dir\": r\"workspace\", \"worker_id\": 0, \"worker_num\": 1}\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    ret = openquestion_filter_layer(snapshot, variables=variables, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER)\n    print(ret)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/tokenize_article_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport sentencepiece as spm\n\n\ntokenizer = None\n\ndef tokenize_article_layer(article, variables=dict(), SPM_MODEL_PATH=\"./dependency/models/sentencepiece.bpe.model\"):\n    ret = None\n    try:\n        global tokenizer\n        if tokenizer is None:\n            tokenizer = spm.SentencePieceProcessor(SPM_MODEL_PATH)\n        tokens = tokenizer.encode(article, out_type=str)\n        ret = tokens\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return ret\n\n\nif __name__ == \"__main__\":\n    article = \"2021-12-03 记者： 来源：中国网地产\\n\\n杭州市1日对外发布《杭州市共有产权保障住房管理办法》，其中明确，共有产权保障住房面向符合条件的市区户籍和稳定就业的非市区户籍家庭供应，购买共有产权保障住房享有与购买商品住房同等的公共服务权益。\\n\\n根据办法，市区户籍家庭要在符合限购政策前提下同时满足户籍、住房等相关条件，非市区户籍家庭需满足居住证、住房以及累计缴纳社保或个税年限等相关条件。\\n\\n办法明确，共有产权保障住房销售基准价按同地段、同类型商品住房市场价格合理优惠后确定。单套销售价格按照销售基准价及其浮动幅度确定，在售房阶段向社会公布。划拨土地共有产权保障住房，购房家庭可根据支付能力在50%至80%范围内选择产权份额比例，按照单套销售价格对应的不同比例支付购房款。出让土地共有产权保障住房，购房家庭产权份额为土地出让时已确定的份额比例，按照单套销售价格对应的产权比例支付购房款。\\n\\n在使用管理方面，杭州提出共有产权保障住房购房家庭取得不动产权证满5年的，可向代持机构提出一次性增购政府份额的申请，增购后住房性质转为商品住房，划拨土地权利性质调整为出让。取得不动产权证满10年后，方可通过买卖等方式上市交易。\"\n    tokens = tokenize_article_layer(article)\n    print(tokens)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/warc_encode_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\n# coding=utf-8\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport re\nimport codecs\nimport logging\nimport traceback\nimport requests\nfrom pathlib import Path\nfrom urllib.parse import urlparse\nfrom io import BytesIO\nfrom warcio.limitreader import LimitReader\nfrom warcio.warcwriter import WARCWriter\nfrom warcio.archiveiterator import ArchiveIterator\nimport lxml.etree as ET\nimport lxml.html as HT\nfrom py_asciimath.translator.translator import MathML2Tex\nfrom pylatexenc.latexwalker import LatexWalker\nfrom charset_normalizer import detect\nimport util\n\ndef tex_in_script_tag(text):\n    return text.startswith('<script type=\"math/tex\"') or \\\n           text.startswith(\"<script type='math/tex'\") or \\\n           text.startswith('<script type=\"math/latex\"') or \\\n           text.startswith(\"<script type='math/latex'\") or \\\n           text.startswith('<script type=\"math/asciimath\"') or \\\n           text.startswith(\"<script type='math/asciimath'\") or \\\n           text.startswith('<span class=\"math-formula\">') or \\\n           text.startswith(\"<span class='math-formula'>\")\n\ndef tex_in_math_tag(text):\n    return text.startswith(\"<annotation encoding='application/x-tex'>\") or \\\n           text.startswith('<annotation encoding=\"application/x-tex\">')\n\ndef tex_in_math_tag2(text):\n    return text.startswith(\"<math\") and \"</annotation>\" in text\n\ndef mathml_in_script_tag(text):\n    return text.startswith('<script type=\"math/mml\"') or \\\n           text.startswith(\"<script type='math/mml'\")\n\ndef mathml_in_math_tag(text):\n    return text.startswith(\"<math \") and 'xmlns=\"http://www.w3.org/1998/Math/MathML\"' in text\n    #return text.startswith('<math xmlns=\"http://www.w3.org/1998/Math/MathML\"') or \\\n    #       text.startswith(\"<math 
xmlns='http://www.w3.org/1998/Math/MathML'\")\n    #return text.startswith(\"<math \")\n\ndef is_tex(text):\n    return re.match(r\"(\\$\\$.*?\\$\\$)\", text) is not None\n\ndef contain_tex(text):\n    return re.search(r\"(\\$\\$.*?\\$\\$)\", text) is not None\n\ndef check_latex(latex):\n    try:\n        w = LatexWalker(latex, tolerant_parsing=False)\n        (nodelist, pos, len_) = w.get_latex_nodes(pos=0)\n        return True\n    except:\n        return False\n\ndef remove_hidden_content(html):\n    text = html\n    root = HT.document_fromstring(text)\n\n    hidden_nodes = root.xpath('//*[@aria-hidden=\"true\"]')\n    for hidden_node in hidden_nodes:\n        hidden_node.drop_tree()\n\n    doctype = '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">'\n    if html.strip().startswith(b'<!DOCTYPE'):\n        index = html.find(b\"<html\")\n        if index != -1:\n            doctype = html[:index].strip()\n    new_text = HT.tostring(root, method=\"html\", doctype=doctype)\n    new_html = new_text\n    return new_html\n\ndef remove_attr(text, attr):\n    index = text.find(attr)\n    if index == -1:\n        return text, False\n    before = text[:index-1]\n    text = text[index:]\n    index = len(attr) + 1\n    index = text.find(text[index:index+1], index+1) + 1\n    after = text[index:]\n    text = text[:index]\n    text = before + after\n    return text, True\n\ndef mathml_to_latex1(text):\n    mml_dom = ET.fromstring(text)\n    xslt = ET.parse(\"./dependency/xsltml_2.0/mmltex.xsl\")\n    transform = ET.XSLT(xslt)\n    mmldom = transform(mml_dom)\n    text = str(mmldom)\n    return text\n\ndef mathml_to_latex2(text):\n    symbol_mappings = {\n        \"&alpha;\": \"α\",\n        \"&Alpha;\": \"A\",\n        \"&beta;\": \"β\",\n        \"&Beta;\": \"B\",\n        \"&epsilon;\": \"ε\",\n        \"&Epsilon;\": \"Ε\",\n        \"&Mu;\": \"M\",\n        \"&Nu;\": \"N\",\n        \"&omicron;\": \"o\",\n        \"&Omicron;\": 
\"O\",\n        \"&iot;\": \"ι\",\n        \"&conjugate0;\": \"&#x2015;\",\n    }\n    for key1, key2 in symbol_mappings.items():\n        text = text.replace(key1, key2)\n\n    # add xml head.\n    head = \"<?xml version='1.0' encoding='UTF-8'?>\\n\" + \\\n           '<!DOCTYPE math PUBLIC \"-//W3C//DTD MathML 2.0//EN\" \"http://www.w3.org/Math/DTD/mathml2/mathml2.dtd\">'\n    text = head + text\n\n    # remove unrecognized attributes.\n    attrs = (\"fontstyle\", \"ignorefont\", \"mathcolor\", \"rtableid\", \"altimg-valign\", \"dspmath\", \"xmlns:md\", \"specific-use\")\n    for attr in attrs:\n        find = True\n        while find:\n            text, find = remove_attr(text, attr)\n    text = text.replace(' xmlns=\"\"', '')\n\n    logging.disable(logging.WARNING)\n    mathml2tex = MathML2Tex()\n    text = mathml2tex.translate(text, network=False, from_file=False,)\n    #logging.enable(logging.WARNING)\n    return text\n\ndef separate_content_and_tag(html, start_str, end_str, s=0):\n    index = html.find(start_str, s)\n    before = html[:index]\n    html = html[index:]\n    index = html.find(end_str) + len(end_str)\n    content = html[:index]\n    after = html[index:]\n    return content, before, after\n\ndef detect_code(text):\n    keywords = (\n        'if', 'else', 'for', 'while', 'def', 'class', 'include', 'switch', 'case', \n        'default', 'const', 'static', 'try', 'catch', 'exception', 'continue', 'open', \n        'close', 'import', 'var', 'None', 'null', 'true', 'True', 'false', 'False', 'print', 'return',\n        'sudo', 'apt-get', 'wget',\n        '\\+', '-', '\\*', '/', '=',\n        #'//', '#', '/*', '*/',\n    )\n    patterns = [\n        rf'\\b(?:{\"|\".join(keywords)})\\b', # keywords\n        r'[{};]', # code indicators (curly braces, semicolon)\n        r'\\w+\\s*\\(.*\\)', # function calls or declarations\n        r'\\w+\\s*=\\s*\\w+', # variable assignments\n    ]\n\n    for pattern in patterns:\n        if re.search(pattern, text):\n    
        return True\n\n    return False\n\ndef encode_code(node, code_tag, not_code_tag):\n    # situation 1. <pre><code>\n    # situation 2. <pre><span>\n    # situation 3. <pre><code><span>\n    # situation 4. <table><tbody>\n    # situation 5. <table><tbody><pre>...\n\n    if node.tag == \"code\":\n        parent_node = node.getparent()\n        parent_tag = parent_node.tag\n\n        if parent_tag == \"tbody\":\n            code_node = parent_node\n        elif parent_tag == \"pre\":\n            code_node = parent_node\n            # below could be commentted.\n            while parent_node is not None:\n                parent_node = parent_node.getparent()\n                if parent_node is not None and parent_node.tag == \"tbody\":\n                    code_node = parent_node\n                    break\n        else:\n            #code_node = node\n            code_node = None\n\n        if code_node is not None:\n            text = code_node.text_content()\n\n            # delete the whole attributes.\n            for key, value in code_node.attrib.items():\n                code_node.attrib.pop(key)\n            if detect_code(text):\n                code_node.tag = code_tag# + \"-\" + lang\n                return True\n            else:\n                #code_node.tag = not_code_tag# debug\n                return False\n\n    child_nodes = node.getchildren()\n    contain = False\n    for child_node in child_nodes:\n        if encode_code(child_node, code_tag, not_code_tag):\n            contain = True\n    return contain\n\ndef filter_code(html, code_tag, not_code_tag):\n    root = HT.document_fromstring(html)\n\n    contain = encode_code(root, code_tag, not_code_tag)\n\n    doctype = '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">'\n    if html.strip().startswith(b'<!DOCTYPE'):\n        index = html.find(b\"<html\")\n        if index != -1:\n            doctype = html[:index].strip()\n    new_text = 
HT.tostring(root, method=\"html\", doctype=doctype)\n    new_html = new_text\n\n    return new_html, contain\n\ndef encode_image(uri, node, image_tag):\n    if node.tag == \"img\":\n        node.tag = image_tag\n\n        link = node.attrib.get(\"src\")\n        if link is not None:\n            link = util.relative2absolute_path(uri, link)\n        alt = node.attrib.get(\"alt\")\n        width = node.attrib.get(\"width\")\n        height = node.attrib.get(\"height\")\n        name = util.md5(link) + Path(urlparse(link).path).suffix if link is not None else None\n        attrs = {\"link\": link, \"alt\": alt, \"width\": width, \"height\": height, \"name\": name}\n        node.text = str(attrs)\n\n        # delete the whole attributes.\n        for key, value in node.attrib.items():\n            node.attrib.pop(key)\n        return True\n\n    child_nodes = node.getchildren()\n    contain = False\n    for child_node in child_nodes:\n        if encode_image(uri, child_node, image_tag):\n            contain = True\n    return contain\n\ndef filter_image(uri, html, image_tag):\n    root = HT.document_fromstring(html)\n\n    contain = encode_image(uri, root, image_tag)\n\n    doctype = '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">'\n    if html.strip().startswith(b'<!DOCTYPE'):\n        index = html.find(b\"<html\")\n        if index != -1:\n            doctype = html[:index].strip()\n    new_text = HT.tostring(root, method=\"html\", doctype=doctype)\n    new_html = new_text\n\n    return new_html, contain\n\ndef encode_video(uri, node, video_tag):\n    if node.tag == \"video\":\n        node.tag = video_tag\n\n        link = node.attrib.get(\"src\")\n        if link is not None:\n            link = util.relative2absolute_path(uri, link)\n        alt = node.attrib.get(\"alt\")\n        width = node.attrib.get(\"width\")\n        height = node.attrib.get(\"height\")\n        name = util.md5(link) + 
Path(urlparse(link).path).suffix if link is not None else None\n        attrs = {\"link\": link, \"alt\": alt, \"width\": width, \"height\": height, \"name\": name}\n        node.text = str(attrs)\n\n        # delete the whole attributes.\n        for key, value in node.attrib.items():\n            node.attrib.pop(key)\n        return True\n\n    child_nodes = node.getchildren()\n    contain = False\n    for child_node in child_nodes:\n        if encode_video(uri, child_node, video_tag):\n            contain = True\n    return contain\n\ndef filter_video(uri, html, video_tag):\n    root = HT.document_fromstring(html)\n\n    contain = encode_video(uri, root, video_tag)\n\n    doctype = '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">'\n    if html.strip().startswith(b'<!DOCTYPE'):\n        index = html.find(b\"<html\")\n        if index != -1:\n            doctype = html[:index].strip()\n    new_text = HT.tostring(root, method=\"html\", doctype=doctype)\n    new_html = new_text\n\n    return new_html, contain\n\ndef encode_math_html(uri, html, encoding):\n    encode_table = {\n        b\"<\": b\"[[[less]]]\",\n        b\">\": b\"[[[large]]]\",\n    }\n\n    tag_head_mathml  = b\"[[[math-ml]]]\"\n    tag_tail_mathml  = b\"[[[/math-ml]]]\"\n    tag_head_mathtex = b\"[[[math-tex]]]\"\n    tag_tail_mathtex = b\"[[[/math-tex]]]\"\n    \n    start_end_strs = (\n        (b\"<maths\", b\"</maths>\"),#1\n        (b\"<math>\", b\"</math>\"),#2\n        (b\"<math \", b\"</math>\"),#2\n        (b\"<annotation encoding='application/x-tex'>\", b\"</annotation>\"),\n        (b'<annotation encoding=\"application/x-tex\">', b'</annotation>'),\n        (b\"<span class='math-formula'>\", b\"</span>\"),\n        (b'<span class=\"math-formula\">', b'</span>'),\n        (b'<script type=\"math/mml\"', b'</script>'),\n        (b\"<script type='math/mml'\", b\"</script>\"),\n        (b'<script type=\"math/tex\"', b'</script>'),\n        
(b\"<script type='math/tex'\", b\"</script>\"),\n        (b'<script type=\"math/latex\"', b'</script>'),\n        (b\"<script type='math/latex'\", b\"</script>\"),\n        (b'<script type=\"math/asciimath\"', b'</script>'),\n        (b\"<script type='math/asciimath'\", b\"</script>\"),\n    )\n\n    sub_start_end_strs = (\n        (b\"<math\", b\"</math>\"),#1\n        (b\"<annotation encoding='application/x-tex'>\", b\"</annotation>\"),#2\n        (b'<annotation encoding=\"application/x-tex\">', b'</annotation>'),#2\n    )\n\n    assert tag_head_mathml not in html and tag_tail_mathml not in html\n    assert tag_head_mathtex not in html and tag_tail_mathtex not in html\n\n    contain_tag = False\n    for (start_str, end_str) in start_end_strs:\n        while start_str in html:\n            content, before, after = separate_content_and_tag(html, start_str, end_str)\n\n            if start_str[:5] == b\"<math\":\n                for sub_start_str, sub_end_str in sub_start_end_strs:\n                    if sub_start_str in content[len(start_str):-len(end_str)]:\n                        content = content[len(start_str):-len(end_str)]\n                        content, sub_before, sub_after = separate_content_and_tag(content, sub_start_str, sub_end_str)\n\n            contain = True\n            try:\n                content_str = str(content, encoding)\n            except:\n                return html, False\n\n            if contain and (tex_in_script_tag(content_str) or tex_in_math_tag(content_str)):\n                try:\n                    index1 = content.find(b\">\") + 1\n                    index2 = content.rfind(b\"<\")\n                    formula = content[index1:index2]\n                    formula = formula.strip()\n                    formula_str = str(formula, encoding)\n\n                    if not check_latex(formula_str):\n                        return html, False\n                    for key1, key2 in encode_table.items():\n                        
formula = formula.replace(key1, key2)\n                    content = b\"<span>\" + tag_head_mathtex + formula + tag_tail_mathtex + b\"</span>\"\n                except:\n                    contain = False\n            elif contain and (tex_in_math_tag2(content_str)):\n                try:\n                    index2 = content_str.find(\"</annotation>\")\n                    index1 = content_str[:index2].rfind(\"</mrow>\") + len(\"</mrow>\")\n                    formula = content_str[index1:index2]\n                    formula = formula.strip()\n                    formula_str = str(formula, encoding)\n\n                    if not check_latex(formula_str):\n                        return html, False\n                    for key1, key2 in encode_table.items():\n                        formula = formula.replace(key1, key2)\n                    content = b\"<span>\" + tag_head_mathtex + formula + tag_tail_mathtex + b\"</span>\"\n                except:\n                    contain = False\n            elif contain and (mathml_in_script_tag(content_str) or mathml_in_math_tag(content_str)):\n                try:\n                    # convert mathml to latex.\n                    if \"<semantics>\" in content_str and \"</semantics>\" not in content_str:\n                        content_str = content_str.replace(\"<semantics>\", \"\")\n                    try:\n                        formula_str = mathml_to_latex1(content_str)\n                    except:\n                        formula_str = mathml_to_latex2(content_str)\n                    formula = bytes(formula_str, encoding)\n                    formula = formula.replace(codecs.BOM_UTF8, b\"\")\n                    formula = formula.strip(b\"$\")\n                    formula = formula.strip()\n                    formula_str = str(formula, encoding)\n\n                    if not check_latex(formula_str):\n                        return html, False\n                    for key1, key2 in encode_table.items():\n     
                   formula = formula.replace(key1, key2)\n                    content = b\"<span>\" + tag_head_mathml + formula + tag_tail_mathml + b\"</span>\"\n                except:\n                    contain = False\n            else:\n                contain = False\n\n            if contain:\n                html = before + content + after\n                contain_tag = True\n            else:\n                html = before + after\n\n    return html, contain_tag\n\ndef get_tag_info(tag):\n    start_tag = f\"<{tag}>\".encode()\n    end_tag = f\"</{tag}>\".encode()\n    encode_start_tag = f\"[[[{tag}]]]\".encode()\n    encode_end_tag = f\"[[[/{tag}]]]\".encode()\n    tag = tag.encode()\n    return tag, start_tag, end_tag, encode_start_tag, encode_end_tag\n\ndef encode_code_html(uri, html, encoding):\n    code_tag_str = \"code-encode\"\n    not_code_tag_str = \"not-code-encode\"\n    code_tag, code_start_tag, code_end_tag, code_encode_start_tag, code_encode_end_tag = get_tag_info(code_tag_str)\n    not_code_tag, not_code_start_tag, not_code_end_tag, not_code_encode_start_tag, not_code_encode_end_tag = get_tag_info(not_code_tag_str)\n    assert code_start_tag not in html and code_end_tag not in html\n    assert not_code_start_tag not in html and not_code_end_tag not in html\n\n    try:\n        html, contain = filter_code(html, code_tag_str, not_code_tag_str)\n\n        if contain:\n            html = html.replace(code_start_tag, b\"<pre>\" + b\"\\n\" + code_encode_start_tag + b\"\\n\")\n            html = html.replace(code_end_tag, b\"\\n\" + code_encode_end_tag + b\"\\n\" + b\"</pre>\")\n\n            #html = html.replace(not_code_start_tag, b\"<pre>\" + b\"\\n\" + not_code_encode_start_tag + b\"\\n\")# debug\n            #html = html.replace(not_code_end_tag, b\"\\n\" + not_code_encode_end_tag + b\"\\n\" + b\"</pre>\")# debug\n    except:\n        contain = False\n\n    return html, contain\n\ndef encode_image_html(uri, html, encoding):\n    image_tag_str 
= \"image-encode\"\n    image_tag, image_start_tag, image_end_tag, image_encode_start_tag, image_encode_end_tag = get_tag_info(image_tag_str)\n    assert image_start_tag not in html and image_end_tag not in html\n\n    try:\n        html, contain = filter_image(uri, html, image_tag_str)\n\n        if contain:\n            #html = html.replace(image_start_tag, b\"<pre>\" + b\"\\n\" + image_encode_start_tag + b\"\\n\")\n            #html = html.replace(image_end_tag, b\"\\n\" + image_encode_end_tag + b\"\\n\" + b\"</pre>\")\n            html = html.replace(image_start_tag, b\"<span>\" + b\"\\n\" + image_encode_start_tag + b\"\\n\")\n            html = html.replace(image_end_tag, b\"\\n\" + image_encode_end_tag + b\"\\n\" + b\"</span>\")\n    except:\n        contain = False\n\n    return html, contain\n\ndef encode_video_html(uri, html, encoding):\n    video_tag_str = \"video-encode\"\n    video_tag, video_start_tag, video_end_tag, video_encode_start_tag, video_encode_end_tag = get_tag_info(video_tag_str)\n    assert video_start_tag not in html and video_end_tag not in html\n\n    try:\n        html, contain = filter_video(uri, html, video_tag_str)\n\n        if contain:\n            #html = html.replace(video_start_tag, b\"<pre>\" + b\"\\n\" + video_encode_start_tag + b\"\\n\")\n            #html = html.replace(video_end_tag, b\"\\n\" + video_encode_end_tag + b\"\\n\" + b\"</pre>\")\n            html = html.replace(video_start_tag, b\"<span>\" + b\"\\n\" + video_encode_start_tag + b\"\\n\")\n            html = html.replace(video_end_tag, b\"\\n\" + video_encode_end_tag + b\"\\n\" + b\"</span>\")\n    except:\n        contain = False\n\n    return html, contain\n\ndef encode_html(uri, html, encoding, TAG):\n    if html is None:\n        return None, False\n\n    if TAG == \"math\":\n        html, contain_tag = encode_math_html(uri, html, encoding)\n    elif TAG == \"code\":\n        html, contain_tag = encode_code_html(uri, html, encoding)\n    elif TAG == 
\"image\":\n        html, contain_tag = encode_image_html(uri, html, encoding)\n    elif TAG == \"video\":\n        html, contain_tag = encode_video_html(uri, html, encoding)\n    return html, contain_tag\n\n\ndef warc_encode_layer(warc_file_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", TAG=None, DEFAULT_ENCODING=None, OVERWRITE=False):\n    ret = list()\n    try:\n        src_warc_file_path = os.path.join(INPUT_FOLDER, warc_file_name)\n        src_warc_file_path = util.to_real_path(src_warc_file_path, variables)\n        dst_warc_file_path = os.path.join(OUTPUT_FOLDER, warc_file_name)\n        dst_warc_file_path = util.to_real_path(dst_warc_file_path, variables)\n\n        if os.path.exists(src_warc_file_path) and (OVERWRITE or not os.path.exists(dst_warc_file_path)):\n            util.create_folder_by_file_path(dst_warc_file_path)\n            with open(dst_warc_file_path, \"wb\") as output:\n                writer = WARCWriter(output, gzip=True)\n                with open(src_warc_file_path, \"rb\") as input:\n                    records = ArchiveIterator(input, arc2warc=True)\n                    for id, record in enumerate(records):\n                        if record.rec_type == \"response\" and record.http_headers.get_header(\"Content-Type\", \"\").startswith(\"text/html\"):\n                            try:\n                                uri = record.rec_headers[\"WARC-Target-URI\"]\n\n                                # read raw html.\n                                html = record.content_stream().read()\n\n                                # check html codec.\n                                charset = record.http_headers[\"Content-Type\"].split(\";\")[-1].split(\"=\")\n                                if charset[0].strip().lower() == \"charset\":\n                                    encoding = charset[1]\n                                else:\n                                    index1 = html.find(b'<meta charset=\"')\n                    
                if index1 >= 0:\n                                        index1 += len(b'<meta charset=\"')\n                                        index2 = html.find(b'\"', index1)\n                                        encoding = str(html[index1:index2], encoding=\"ascii\")\n                                    else:\n                                        try:\n                                            logging.disable(logging.WARNING)\n                                            encoding = detect(html)[\"encoding\"]\n                                            #logging.enable(logging.WARNING)\n                                        except:\n                                            encoding = \"\"\n                                if encoding is not None:\n                                    encoding = encoding.strip().strip('\"').lower()\n\n                                if encoding in (\"\",):\n                                    encoding = DEFAULT_ENCODING\n                                \n                                # remove hidden tag.\n                                if encoding is not None and b'aria-hidden=\"true\"' in html:\n                                #if encoding is not None and (b'aria-hidden=\"true\"' in html or b'aria-readonly=\"true\"' in html):\n                                    try:\n                                        html = remove_hidden_content(html)\n                                    except:\n                                        encoding = DEFAULT_ENCODING\n\n                                # encode html.\n                                if encoding is not None:\n                                    if TAG is not None:\n                                        html, contain_tag = encode_html(uri, html, encoding, TAG)\n                                    else:\n                                        contain_tag_cnt = 0\n                                        TAGS = (\"math\", \"code\", \"image\")# \"video\"\n       
                                 for tag in TAGS:\n                                            html, contain_tag = encode_html(uri, html, encoding, tag)\n                                            if contain_tag:\n                                                contain_tag_cnt += 1\n                                        contain_tag = contain_tag_cnt > 0\n                                else:\n                                    html = None\n                                    contain_tag = False\n\n                                # write encoded html.\n                                if contain_tag and html is not None:\n                                    content = BytesIO(html)\n                                    assert content.getbuffer().nbytes == len(html)\n                                    raw_length = len(html)\n                                    record.raw_stream = LimitReader(content, raw_length)\n\n                                    record.rec_headers[\"Content-Length\"] = None\n                                    record.length = None\n\n                                    writer.write_record(record)\n                            except:\n                                traceback.print_exc()\n\n            ret = [warc_file_name]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == \"__main__\":\n    warc_file_name = \"CC-MAIN-20221127073607-20221127103607-00007.warc.gz\"\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    TAG = \"math\"\n    output = warc_encode_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, TAG=TAG)\n    print(output)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/warc_filter_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport re\nfrom io import BytesIO\nfrom warcio.warcwriter import WARCWriter\nfrom warcio.limitreader import LimitReader\nfrom warcio.archiveiterator import ArchiveIterator\nimport util\n\ndef warc_filter_layer(warc_file_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", TAGS=(), OVERWRITE=False):\n    ret = list()\n    try:\n        src_warc_file_path = os.path.join(INPUT_FOLDER, warc_file_name)\n        src_warc_file_path = util.to_real_path(src_warc_file_path, variables)\n        dst_warc_file_path = os.path.join(OUTPUT_FOLDER, warc_file_name)\n        dst_warc_file_path = util.to_real_path(dst_warc_file_path, variables)\n        TAGS = list(map(lambda tag: bytes(tag, \"ascii\"), TAGS))\n        regex = re.compile(b'|'.join(TAGS))\n\n        if os.path.exists(src_warc_file_path) and (OVERWRITE or not os.path.exists(dst_warc_file_path)):\n            util.create_folder_by_file_path(dst_warc_file_path)\n            with open(dst_warc_file_path, \"wb\") as output:\n                writer = WARCWriter(output, gzip=True)\n                with open(src_warc_file_path, \"rb\") as input:\n                    reader = ArchiveIterator(input, arc2warc=True)\n                    for i, record in enumerate(reader):\n                        if record.rec_type == \"response\" and record.http_headers.get_header(\"Content-Type\", \"\").startswith(\"text/html\"):\n                            try:\n                                # read raw html.\n                                html = record.content_stream().read()\n\n                                # filter.\n                                if regex.search(html):\n                                    content = BytesIO(html)\n                                    assert len(html) == record.payload_length\n                  
                  record.raw_stream = LimitReader(content, record.payload_length)\n                                    writer.write_record(record)\n                            except:\n                                traceback.print_exc()\n            \n            ret = [warc_file_name]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == \"__main__\":\n    warc_file_name = \"CC-MAIN-20221127073607-20221127103607-00007.warc.gz\"\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    TAGS = (\n        \"<math\",\n        \"MathJax\",\n    )\n    output = warc_filter_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, TAGS=TAGS)\n    print(output)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/warc_to_wet_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport util\n\ndef warc_to_wet_layer(warc_file_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", OVERWRITE=False):\n    ret = list()\n    try:\n        wet_file_name = warc_file_name.replace(\".warc.gz\", \".warc.wet.gz\")\n        wat_file_name = warc_file_name.replace(\".warc.gz\", \".warc.wat.gz\")\n\n        src_warc_file_path = os.path.join(INPUT_FOLDER, warc_file_name)\n        src_warc_file_path = util.to_real_path(src_warc_file_path, variables)\n\n        dst_wet_file_path = os.path.join(OUTPUT_FOLDER, wet_file_name)\n        dst_wet_file_path = util.to_real_path(dst_wet_file_path, variables)\n\n        if os.path.exists(src_warc_file_path) and (OVERWRITE or not os.path.exists(dst_wet_file_path)):\n            util.create_folder_by_file_path(dst_wet_file_path)\n\n            # export SPARK_USER=$USER\n            java_package = \"./dependency/ia-hadoop-tools-jar-with-dependencies.jar\"\n            commandline = f\"sudo java -jar {java_package} WEATGenerator -strictMode -skipExisting batch-id-xyz {src_warc_file_path}\"\n            exit_status1 = os.system(commandline)\n            assert exit_status1 == 0\n\n            tmp_base_path = os.path.dirname(src_warc_file_path)\n            tmp_wet_file_path = os.path.join(tmp_base_path, \"..\", \"wet/\", wet_file_name)\n            tmp_wat_file_path = os.path.join(tmp_base_path, \"..\", \"wat/\", wat_file_name)\n            exit_status2 = os.system(f\"sudo cp -f {tmp_wet_file_path} {dst_wet_file_path}\")\n            assert exit_status2 == 0\n\n            os.system(f\"sudo rm {tmp_wet_file_path}\")\n            os.system(f\"sudo rm {tmp_wat_file_path}\")\n\n            ret = [wet_file_name]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception as ex:\n        
traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == \"__main__\":\n    warc_file_name = \"CC-MAIN-20221127073607-20221127103607-00007.warc.gz\"\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    output = warc_to_wet_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER)\n    print(output)\n"
  },
  {
    "path": "DomainSpecific/core/layers/transform/wet_decode_layer.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport re\nfrom io import BytesIO\nfrom warcio.limitreader import LimitReader\nfrom warcio.warcwriter import WARCWriter\nfrom warcio.archiveiterator import ArchiveIterator\nfrom pylatexenc.latex2text import LatexNodes2Text\nfrom guesslang import Guess\nimport util\n\ndef decode_tag(tag):\n    return tag.replace(b\"[[[\", b\"<\").replace(b\"]]]\", b\">\")\n\ndef latex2text(latex, encoding=\"utf-8\"):\n    latexNodes2Text = LatexNodes2Text()\n    latex = str(latex, encoding)\n    text = latexNodes2Text.latex_to_text(latex)\n    text = bytes(text, encoding)\n    return text\n\ndef separate_content_and_tag(html, start_str, end_str):\n    index = html.find(start_str)\n    before = html[:index]\n    html = html[index:]\n    index = html.find(end_str) + len(end_str)\n    content = html[:index]\n    after = html[index:]\n    return content, before, after\n\ndef remove_number_and_merge_snippet(html, NumberThred = 7):\n    lines = html.split(b'\\n')\n\n    for interval in (1, 2, 3, 4):\n        line_no_list = list()\n        last_code_no = -1\n        for line_no in range(0, len(lines), interval):\n            try:\n                code_no = int(lines[line_no].strip())\n            except:\n                code_no = -1\n            if (last_code_no == -1 and code_no == 1) or last_code_no + 1 == code_no:\n                last_code_no = code_no\n                line_no_list.append(line_no)\n            else:\n                if last_code_no > NumberThred:\n                    for hist_line_no in line_no_list:\n                        lines[hist_line_no] = b''\n                line_no_list = list()\n                last_code_no = -1\n        lines = list(filter(lambda line: len(line) > 0, lines))\n\n    for i in range(2):\n        line_no_list = list()\n        last_code_no = -1\n   
     for line_no in range(len(lines)):\n            try:\n                code_no = int(lines[line_no].strip())\n            except:\n                code_no = -1\n            if (last_code_no == -1 and code_no == 1) or last_code_no + 1 == code_no:\n                last_code_no = code_no\n                line_no_list.append(line_no)\n            elif code_no == 0 or code_no == 1:\n                if last_code_no > NumberThred:\n                    for hist_line_no in line_no_list:\n                        lines[hist_line_no] = b''\n                line_no_list = [line_no]\n                last_code_no = code_no\n        lines = list(filter(lambda line: len(line) > 0, lines))\n    \n    for line_no in range(len(lines)):\n        if len(lines[line_no].strip()) == 0:\n            lines[line_no] = b''\n    lines = list(filter(lambda line: len(line) > 0, lines))\n\n    # merge code snippets which are locate continously with single line.\n    #html = re.sub(b\"</code-encode>\\n<code-encode>\\n\", b\"\\n\", html)\n    code_head = b\"<code-encode>\"\n    code_tail = b\"</code-encode>\"\n    for line_no in range(max(0, len(lines)-3)):\n        if code_tail in lines[line_no] and code_head in lines[line_no+1] and code_tail in lines[line_no+3]:\n            lines[line_no] = b''\n            lines[line_no+1] = b''\n    lines = list(filter(lambda line: len(line) > 0, lines))\n\n    # filter issue html.\n    cnt = 0\n    for line in lines:\n        if code_head in line:\n            cnt += 1\n        elif code_tail in line:\n            cnt -= 1\n        # error happens.\n        if cnt != 0 and cnt != 1:\n            return b''\n    \n    html = b'\\n'.join(lines)\n    return html\n\nguess = None\ndef identify_code(text):\n    global guess\n    if guess is None:\n        guess = Guess()\n    try:\n        #name = guess.language_name(text)\n        name, prob = guess.probabilities(text)[0]\n    except:\n        name, prob = \"unknown\", 1.0\n    return name, prob\n\ndef 
decode_html(uri, html, encoding, TAG):\n    if html is None:\n        return None, False\n\n    if TAG == \"math\":\n        decode_table = {\n            b\"[[[less]]]\": b\"<\",\n            b\"[[[large]]]\": b\">\",\n        }\n\n        tag_head_mathml = b\"[[[math-ml]]]\"\n        tag_tail_mathml = b\"[[[/math-ml]]]\"\n        tag_head_mathtex = b\"[[[math-tex]]]\"\n        tag_tail_mathtex = b\"[[[/math-tex]]]\"\n\n        start_end = (\n            (tag_head_mathml, tag_tail_mathml),\n            (tag_head_mathtex, tag_tail_mathtex),\n        )\n\n        for (start, end) in start_end:\n            while start in html:\n                content, before, after = separate_content_and_tag(html, start, end)\n                formula = content[len(start): -len(end)]\n\n                if len(formula.strip()) != 0:\n                    # decode < and >.\n                    for key1, key2 in decode_table.items():\n                        formula = formula.replace(key1, key2)\n                    \n                    # decode math tag.\n                    content = decode_tag(start) + formula + decode_tag(end)\n\n                    # dedup math formula around context.\n                    formula_ascii = latex2text(formula).strip()\n                    n = len(formula_ascii)\n                    if n > 0 and before.rstrip()[-n:] == formula_ascii:\n                        before = before.rstrip()[:-n]\n                    elif n > 0 and after.lstrip()[:n] == formula_ascii:\n                        after = after.lstrip()[n:]\n                    html = before + content + after\n                else:\n                    # remove empty formula.\n                    html = before + after\n\n    elif TAG == \"code\":\n        tag_head_code = b\"[[[code-encode]]]\"\n        tag_tail_code = b\"[[[/code-encode]]]\"\n        #tag_head_notcode = b\"[[[not-code-encode]]]\"# debug\n        #tag_tail_notcode = b\"[[[/not-code-encode]]]\"# debug\n\n        start_end = (\n       
     (tag_head_code, tag_tail_code),\n            #(tag_head_notcode, tag_tail_notcode),# debug\n        )\n\n        for (start, end) in start_end:\n            while start in html:\n                content, before, after = separate_content_and_tag(html, start, end)\n                code = content[len(start): -len(end)].strip()\n\n                if len(code) != 0:\n                    lang, prob = identify_code(code)\n                    #lcnt = code.count(b\"\\n\")\n                    #meta_lang = bytes(f\"<metadata lang={lang} prob={prob:.2f} lines={lcnt} />\", encoding=encoding)\n                    meta_lang = bytes(f\"<metadata lang={lang} prob={prob:.2f} />\", encoding=encoding)\n                    decode_start = decode_tag(start)\n                    decode_end = decode_tag(end)\n                    #content = decode_start + b\"\\n\" + code + b\"\\n\" + decode_end\n                    content = decode_start + meta_lang + b\"\\n\" + code + b\"\\n\" + decode_end\n                    html = before + content + after\n                else:\n                    # remove empty code.\n                    html = before + after\n\n        # remove number of code block.\n        html = remove_number_and_merge_snippet(html)\n\n    elif TAG == \"image\":\n        tag_head_image = b\"[[[image-encode]]]\"\n        tag_tail_image = b\"[[[/image-encode]]]\"\n\n        start_end = (\n            (tag_head_image, tag_tail_image),\n        )\n\n        for (start, end) in start_end:\n            while start in html:\n                content, before, after = separate_content_and_tag(html, start, end)\n                image_meta = content[len(start): -len(end)].strip()\n\n                if len(image_meta) != 0:\n                    decode_start = decode_tag(start)\n                    decode_end = decode_tag(end)\n                    content = decode_start + image_meta + decode_end\n                    html = before + content + after\n                else:\n                  
  # remove empty image.\n                    html = before + after\n                    return None, False\n\n    elif TAG == \"video\":\n        tag_head_video = b\"[[[video-encode]]]\"\n        tag_tail_video = b\"[[[/video-encode]]]\"\n\n        start_end = (\n            (tag_head_video, tag_tail_video),\n        )\n\n        for (start, end) in start_end:\n            while start in html:\n                content, before, after = separate_content_and_tag(html, start, end)\n                video_meta = content[len(start): -len(end)].strip()\n\n                if len(video_meta) != 0:\n                    decode_start = decode_tag(start)\n                    decode_end = decode_tag(end)\n                    content = decode_start + video_meta + decode_end\n                    html = before + content + after\n                else:\n                    # remove empty video.\n                    html = before + after\n                    return None, False\n\n    # remove continous empty lines.\n    if html is not None and len(html) > 0:\n        html = re.sub(b\"(\\n\\r)+\", b\"\\n\", html)\n        html = re.sub(b\"(\\r\\n)+\", b\"\\n\", html)\n        html = re.sub(b\"\\n+\", b\"\\n\", html)\n\n    contain = False\n    for (start, end) in start_end:\n        decode_start = decode_tag(start)\n        if decode_start in html:\n            contain = True\n\n    return html, contain\n\ndef wet_decode_layer(wet_file_name, variables=dict(), INPUT_FOLDER=\"./\", OUTPUT_FOLDER=\"./\", TAG=None, OVERWRITE=False):\n    ret = list()\n    try:\n        BLACK_URLS = (\"blame.php\", \"diff.php\")\n        regex = re.compile('|'.join(BLACK_URLS))\n        src_wet_file_path = os.path.join(INPUT_FOLDER, wet_file_name)\n        src_wet_file_path = util.to_real_path(src_wet_file_path, variables)\n        dst_wet_file_path = os.path.join(OUTPUT_FOLDER, wet_file_name)\n        dst_wet_file_path = util.to_real_path(dst_wet_file_path, variables)\n\n        if 
os.path.exists(src_wet_file_path) and (OVERWRITE or not os.path.exists(dst_wet_file_path)):\n            util.create_folder_by_file_path(dst_wet_file_path)\n            with open(dst_wet_file_path, \"wb\") as output:\n                writer = WARCWriter(output, gzip=True)\n                with open(src_wet_file_path, \"rb\") as input:\n                    records = ArchiveIterator(input, arc2warc=False)\n                    for id, record in enumerate(records):\n                        #lang = record.rec_headers[\"WARC-Identified-Content-Language\"]\n                        #if lang != \"en\":\n                        #    continue\n\n                        if record.rec_type == \"conversion\":\n                            try:\n                                uri = record.rec_headers[\"WARC-Target-URI\"]\n                                if regex.search(uri):\n                                    continue\n\n                                # read raw html.\n                                html = record.content_stream().read()\n                                encoding = \"utf-8\"\n\n                                # decode html.\n                                if encoding is not None:\n                                    if TAG is not None:\n                                        html, contain_tag = decode_html(uri, html, encoding, TAG)\n                                    else:\n                                        contain_tag_cnt = 0\n                                        TAGS = (\"math\", \"code\", \"image\")# \"video\"\n                                        for tag in TAGS:\n                                            html, contain_tag = decode_html(uri, html, encoding, tag)\n                                            if contain_tag:\n                                                contain_tag_cnt += 1\n                                        contain_tag = contain_tag_cnt > 0\n                                else:\n                                    
html = None\n                                    contain_tag = False\n\n                                # write decoded html.\n                                if contain_tag and html is not None:\n                                    content = BytesIO(html)\n                                    assert content.getbuffer().nbytes == len(html)\n                                    raw_length = len(html)\n                                    record.raw_stream = LimitReader(content, raw_length)\n\n                                    record.rec_headers[\"Content-Length\"] = None\n                                    record.length = None\n\n                                    writer.write_record(record)\n                            except Exception:\n                                traceback.print_exc()\n            #ret = [wet_file_name]\n            ret = [dst_wet_file_path]\n    except KeyboardInterrupt:\n        sys.exit()\n    except Exception:\n        traceback.print_exc()\n    return (ret, )\n\n\nif __name__ == \"__main__\":\n    warc_file_name = \"CC-MAIN-20221127073607-20221127103607-00007.warc.gz\"\n    INPUT_FOLDER = \"$(input_data_folder)\"\n    OUTPUT_FOLDER = \"$(output_data_folder)\"\n    TAG = \"math\"\n    output = wet_decode_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, TAG=TAG)\n    print(output)\n"
  },
  {
    "path": "DomainSpecific/core/layers/util.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport copy\nimport yaml\nimport hashlib\nimport logging\nimport datetime\nimport requests\nfrom urllib.parse import urljoin\nfrom azure.storage.blob import ContainerClient, BlobSasPermissions, generate_blob_sas\nfrom azure.identity import DefaultAzureCredential\nlogging.getLogger(\"requests\").setLevel(logging.WARNING)\nlogging.getLogger(\"urllib3\").setLevel(logging.WARNING)\n\ndef load_yaml(config_path):\n    config = None\n    if os.path.exists(config_path):\n        with open(config_path, \"r\") as file:\n            config = yaml.safe_load(file)\n    return config\n\ndef save_yaml(config, config_path):\n    if os.path.exists(os.path.dirname(config_path)):\n        with open(config_path, \"w\") as file:\n            yaml.safe_dump(config, file)\n\ndef str2bytes(data):\n    data = bytes(data, \"utf-8\")\n    return data\n\ndef md5(data):\n    if isinstance(data, str):\n        data = str2bytes(data)\n    md5 = hashlib.md5(data).hexdigest()\n    return md5\n\ndef sha256(data):\n    if isinstance(data, str):\n        data = str2bytes(data)\n    sha256 = hashlib.sha256(data).hexdigest()\n    return sha256\n\ndef suffix(path):\n    suffix = os.path.splitext(path)[1]\n    return suffix\n\ndef relative2absolute_path(prefix, link):\n    # Root-relative path.\n    if link.startswith(\"/\"):\n        link = urljoin(prefix, link)\n    else:\n        colon_count = link[:10].count(\":\")\n        # Document-relative path.\n        if link.startswith(\".\") or colon_count == 0:\n            link = urljoin(prefix, link)\n        # Absolute paths, such as `http://`, `https://`, `ftp://`, or 'file://'.\n        else:\n            link = link\n    return link\n\ndef create_folder_by_file_path(local_file_path):\n    local_folder_path = os.path.dirname(local_file_path)\n    if not os.path.exists(local_folder_path) and len(local_folder_path.strip()) != 0:\n        try:\n            
os.makedirs(local_folder_path, exist_ok=True)\n        except:\n            pass\n\ndef to_real_path(path, variables):\n    keys = (\"workspace_dir\", \"worker_id\", \"worker_num\")\n    path = copy.copy(path)\n    for name, value in variables.items():\n        if name in keys:\n            path = path.replace(\"{%s}\" % name, str(value))\n    return path\n\ndef get_container_client(storage_config):\n    if isinstance(storage_config, ContainerClient):\n        return storage_config\n\n    if isinstance(storage_config, str) and os.path.exists(storage_config):\n        storage_config = load_yaml(storage_config)\n\n    account_domain = \"blob.core.windows.net\"\n    account_name = storage_config[\"azstorage\"][\"account-name\"]\n    #account_key = storage_config[\"azstorage\"][\"account-key\"]\n    container_name = storage_config[\"azstorage\"][\"container\"]\n    identity_id = storage_config[\"azstorage\"][\"appid\"]\n    credential = DefaultAzureCredential(managed_identity_client_id=identity_id)\n\n    container_client = ContainerClient(\n        account_url=f\"https://{account_name}.{account_domain}/\",\n        container_name=container_name,\n        credential=credential#account_key\n    )\n\n    return container_client\n\ndef get_blob_client(storage_config, blob_path):\n    container_client = get_container_client(storage_config)\n    blob_client = container_client.get_blob_client(blob_path)\n    return blob_client\n\ndef exist_blob(container_client, blob_path):\n    with container_client.get_blob_client(blob_path) as blob_client:\n        blob_path_exists = blob_client.exists()\n        return blob_path_exists\n\ndef get_blob_size(container_client, blob_path):\n    with container_client.get_blob_client(blob_path) as blob_client:\n        properties = blob_client.get_blob_properties()\n        size = properties.size\n        return size\n\ndef list_blob_dir(container_client, blob_path):\n    names = list()\n    for blob in 
container_client.walk_blobs(name_starts_with=blob_path):\n        names.append(blob.name)\n    return names\n\ndef create_blob_dir(container_client, blob_path):\n    container_client.upload_blob(name=os.path.join(blob_path, \"_\"), data=b\"\", overwrite=True)\n\ndef upload_bytes_to_blob(storage_config, content, blob_path):\n    with get_blob_client(storage_config, blob_path) as blob_client:\n        blob_client.upload_blob(content, overwrite=True)\n    return blob_path\n\ndef upload_file_to_blob(storage_config, local_path, blob_path):\n    with open(local_path, \"rb\") as content:\n        upload_bytes_to_blob(storage_config, content, blob_path)\n    return blob_path\n\ndef upload_bytes_to_internet(content, blob_path):\n    # TODO: to be implemented.\n    return blob_path\n\ndef upload_file_to_internet(local_path, blob_path):\n    # TODO: to be implemented.\n    return blob_path\n\ndef download_bytes_from_blob(storage_config, blob_path):\n    with get_blob_client(storage_config, blob_path) as blob_client:\n        content = blob_client.download_blob().readall()\n    return content\n\ndef download_file_from_blob(storage_config, blob_path, local_path):\n    content = download_bytes_from_blob(storage_config, blob_path)\n    create_folder_by_file_path(local_path)\n    with open(local_path, \"wb\") as data:\n        data.write(content)\n    return local_path\n\ndef download_bytes_from_internet(url, timeout=3):\n    try:\n        resp = requests.get(url, allow_redirects=True, timeout=timeout)\n        if resp.status_code == 200:\n            content = resp.content\n            return content\n        else:\n            return None\n    except:\n        return None\n\ndef download_file_from_internet(url, local_path):\n    try:\n        content = download_bytes_from_internet(url)\n        if content is not None:\n            create_folder_by_file_path(local_path)\n            with open(local_path, \"wb\") as data:\n                data.write(content)\n            return 
local_path, len(content)\n        else:\n            return None, 0\n    except Exception:\n        return None, 0\n"
  },
  {
    "path": "DomainSpecific/core/network.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nfrom core.layers import LayerType, util\n\nclass Network:\n    def __init__(self):\n        self.type = None\n        self.input_names = list()\n        self.output_names = list()\n        self.datas = dict()\n        self.layers = dict()\n\n    def set_input_names(self, input_names):\n        self.input_names = input_names\n\n    def set_output_names(self, output_names):\n        self.output_names = output_names\n\n    def add_data(self, name, value):\n        self.datas[name] = value\n\n    def add_layer(self, name, value):\n        self.layers[name] = value\n\n    def next_layer(self, invisited_layer_names):\n        for name in invisited_layer_names:\n            layer = self.layers[name]\n            input_names = layer.input_names\n            if set(input_names) <= set(self.datas.keys()):\n                input_values = [self.datas[input_name] for input_name in input_names]\n                invisited_layer_names.remove(name)\n                return layer, name, input_values\n        return None\n    \n    def __call__(self, inputs=list(), worker_id=0, worker_num=1, variables=dict()):\n        outputs = list()\n        try:\n            if len(inputs) == len(self.input_names):\n                for name, value in zip(self.input_names, inputs):\n                    self.add_data(name, value)\n            \n            invisited_layer_names = sorted(list(self.layers.keys()))\n            while len(invisited_layer_names) > 0:\n                item = self.next_layer(invisited_layer_names)\n                if item is None:\n                    raise Exception(\"There are some layers which misses input data.\")\n                layer, layer_name, input_values = item\n                print(f\"{layer_name} - input: {layer.input_names}, output: {layer.output_names}\", 
flush=True)\n\n                output_values = layer(input_values, worker_id=worker_id, worker_num=worker_num, variables=variables)\n                for name, value in zip(layer.output_names, output_values):\n                    self.add_data(name, value)\n            outputs = [self.datas[output_name] for output_name in self.output_names]\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception as ex:\n            traceback.print_exc()\n        return outputs\n\n    \"\"\"\n    def spark(self, inputs, spark_session, spark_context, worker_num=1, variables=dict()):\n        from pyspark import TaskContext, StorageLevel\n\n        def merge(x, n):\n            if n == 0:\n                return []\n            elif n == 1:\n                return [x]\n            elif n == 2:\n                return list(x)\n            else:\n                for _ in range(n - 2):\n                    x = x[0] + x[1:]\n                return list(x)\n        \n        def func(layer, input, worker_id, worker_num, variables):\n            input = list(input)\n            assert len(input) == 1\n            input = input[0]\n            output = layer(input, worker_id=worker_id, worker_num=worker_num, variables=variables)\n            return [output]\n        \n        outputs = list()\n        try:\n            if len(inputs) == len(self.input_names):\n                for name, value in zip(self.input_names, inputs):\n                    self.add_data(name, value)\n            \n            for name, data in self.datas.items():\n                input_rdd = spark_context.parallelize(worker_num * [data], worker_num)\n                # Avoid recomputation, because each rdd may be used multiple times.\n                input_rdd.persist(StorageLevel.MEMORY_AND_DISK)\n                self.add_data(name, input_rdd)\n            \n            invisited_layer_names = sorted(list(self.layers.keys()))\n            while len(invisited_layer_names) > 0:\n               
 item = self.next_layer(invisited_layer_names)\n                if item is None:\n                    raise Exception(\"There are some layers which misses input data.\")\n                layer, layer_name, input_values = item\n\n                input_rdds = None\n                for i, input_rdd in enumerate(input_values):\n                    input_rdds = input_rdd if i == 0 else input_rdds.zip(input_rdd)\n                input_rdds = input_rdds.map(lambda x: merge(x, len(layer.input_names)))\n\n                native_io = True\n                if native_io:\n                    output_rdds = input_rdds.mapPartitionsWithIndex(\n                        lambda worker_id, input: \n                        func(layer, input, worker_id, worker_num, variables), preservesPartitioning=True\n                    )\n                else:# (Deprecated)\n                    #if layer.type in (LayerType.To_Line_File, LayerType.To_Jsonl_File, LayerType.To_Parquet_File):\n                    if layer.type == LayerType.To_Line_File:\n                        inputs = input_rdds.collect()\n                        outputs = list()\n                        for worker_id, input in enumerate(inputs):\n                            variables[\"worker_id\"] = worker_id\n                            variables[\"worker_num\"] = worker_num\n                            assert len(input) == 2\n                            file_path = util.to_real_path(input[1], variables)\n                            \n                            spark_context.parallelize(input[0], 1).saveAsTextFile(file_path)\n                            #rdd = spark_context.parallelize(input[0], 1)\n                            #rdd.toDF().write.mode(\"overwrite\").text(file_path)\n                            #rdd.toDF().write.mode(\"overwrite\").json(file_path)\n                            #rdd.toDF().write.mode(\"overwrite\").parquet(file_path)\n                            \n                            output = [file_path]\n     
                       outputs.append(output)\n                        output_rdds = spark_context.parallelize(outputs, worker_num)\n                    #elif layer.type in (LayerType.From_Line_File, LayerType.From_Jsonl_File, LayerType.From_Parquet_File):\n                    elif layer.type == LayerType.From_Line_File:\n                        inputs = input_rdds.collect()\n                        outputs = list()\n                        for worker_id, input in enumerate(inputs):\n                            variables[\"worker_id\"] = worker_id\n                            variables[\"worker_num\"] = worker_num\n                            assert len(input) == 1\n                            file_path = util.to_real_path(input[0], variables)\n                            \n                            lines = spark_context.textFile(file_path).collect()\n                            #rdd = spark_session.read.option(\"mode\", \"DROPMALFORMED\").text(file_path).rdd\n                            #rdd = spark_session.read.option(\"mode\", \"DROPMALFORMED\").json(file_path).rdd\n                            #rdd = spark_session.read.option(\"mode\", \"DROPMALFORMED\").parquet(file_path).rdd\n                            #lines = rdd.collect()\n                            \n                            output = [lines]\n                            outputs.append(output)\n                        output_rdds = spark_context.parallelize(outputs, worker_num)\n                    else:\n                        output_rdds = input_rdds.mapPartitionsWithIndex(\n                            lambda worker_id, input: \n                            func(layer, input, worker_id, worker_num, variables), preservesPartitioning=True\n                        )\n\n                # Avoid recomputation, because each rdd may be used multiple times.\n                output_rdds.persist(StorageLevel.MEMORY_AND_DISK)\n                for i, name in enumerate(layer.output_names):\n                    
output_rdd = output_rdds.map(lambda row, i=i: row[i])  # bind i now so the closure does not depend on when it is serialized.\n                    # Avoid recomputation, because each rdd may be used multiple times.\n                    output_rdd.persist(StorageLevel.MEMORY_AND_DISK)\n                    self.add_data(name, output_rdd)\n\n                print(f\"{layer_name} - {layer.input_names}, {layer.output_names}\", flush=True)\n            outputs = [self.datas[output_name].collect() for output_name in self.output_names]\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception:\n            traceback.print_exc()\n        return outputs\n    \"\"\"\n\n\nif __name__ == \"__main__\":\n    network = Network()\n    print(network)\n"
  },
  {
    "path": "DomainSpecific/dependency/gpt_api.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nimport time\nimport traceback\nimport tiktoken\nimport collections\nfrom datetime import datetime\nimport openai\nfrom openai import AzureOpenAI\nfrom azure.identity import DefaultAzureCredential, get_bearer_token_provider\n\n\nclass GPTAPI:\n    def __init__(self, engine, endpoint, identity_id):\n        \"\"\"\n        Detail setting method could refer to: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/managed-identity\n        The authentication methods include key-based method, cli-based method, identity-based method, etc.\n        We use identity-based method, you could switch to other method.\n        \"\"\"\n        self.keep_history = False\n        self.user_QAs = collections.defaultdict(list)\n        self.max_tokens_per_requests = 8192 - 800 - 192\n        self.quato_tokens_per_minute = 120000#140000\n        self.quato_requests_per_minute = 720#840\n        self.last_minute = -1\n        self.acc_tokens = 0\n        self.acc_requests = 0\n\n        try:\n            self.enc = tiktoken.encoding_for_model(\"gpt-4\")\n        except:\n            self.enc = None\n        self.engine = engine\n        self.endpoint = endpoint\n\n        token_provider = get_bearer_token_provider(DefaultAzureCredential(managed_identity_client_id=identity_id), \"https://cognitiveservices.azure.com/.default\")\n        self.client = AzureOpenAI(\n            azure_endpoint=endpoint,\n            azure_ad_token_provider=token_provider,\n            #api_version=\"2024-02-15-preview\",\n            api_version=\"2024-08-01-preview\",\n            max_retries=0,\n        )\n\n    def switch_api(self, api_idx=-1):\n        # TBD: not implemented yet. 
\n        pass\n\n    def get_tokens(self, text):\n        # rough token estimate: the larger of the word count and chars / 4.\n        tokens = max(len(text.split()), len(text) // 4)\n        return tokens\n\n    def run(self, system, question, engine=None, uid=None, temperature=0.0, max_tokens=800):\n        if engine is None:\n            engine = self.engine\n        \n        if self.enc is None:\n            return \"\"\n\n        # question check: truncate an over-long question, keeping its middle tokens.\n        #if self.get_tokens(question) > self.max_tokens_per_requests:\n        #    question = question[:self.max_tokens_per_requests * 4]\n        tokens = self.enc.encode(question)\n        tokens_len = len(tokens)\n        if tokens_len > self.max_tokens_per_requests:\n            offset = (tokens_len - self.max_tokens_per_requests) // 2\n            cut_tokens = tokens[offset:offset+self.max_tokens_per_requests]\n            question = self.enc.decode(cut_tokens)\n\n        # system setting.\n        messages = [{\"role\": \"system\", \"content\": system}]\n        \n        # chat setting.\n        if self.keep_history:\n            for Q, A in self.user_QAs[uid]:\n                messages.append({\"role\": \"user\", \"content\": Q})\n                messages.append({\"role\": \"assistant\", \"content\": A})\n        messages.append({\"role\": \"user\", \"content\": question})\n\n        # quota check (currently disabled).\n        \"\"\"\n        while True:\n            cur_minute = datetime.now().minute\n            cur_tokens = self.get_tokens(str(messages))\n            if self.last_minute != cur_minute:\n                self.last_minute = cur_minute\n                self.acc_tokens = 0\n                self.acc_requests = 0\n            if self.acc_requests + 1  < self.quato_requests_per_minute and self.acc_tokens + cur_tokens < self.quato_tokens_per_minute:\n                self.acc_requests += 1\n                self.acc_tokens += cur_tokens\n                break\n            time.sleep(1)\n        \"\"\"\n\n        # send the request.\n        try:\n            response = 
self.client.chat.completions.create(\n                model=engine,\n                messages=messages,\n                temperature=temperature,\n                max_tokens=max_tokens,\n                #top_p=0.95,\n                #frequency_penalty=0,\n                #presence_penalty=0,\n                #stop=None\n            )\n            answer = response.choices[0].message.content\n        # https://github.com/openai/openai-python/blob/main/openai/error.py\n        except (openai.RateLimitError, openai.APITimeoutError, openai.APIConnectionError) as e:\n            time.sleep(2)\n            #seconds = int(str(e).split(\"Please retry after\")[1].split(\"second\")[0].strip())\n            #time.sleep(seconds)\n            #traceback.print_exc()\n            self.switch_api()\n            # forward max_tokens so the retry uses the caller's limit, not the default.\n            return self.run(system, question, engine, uid, temperature, max_tokens)\n        except openai.BadRequestError as e:\n            answer = \"\"\n            if e.code == \"context_length_exceeded\":\n                try:\n                    # trim one eighth (at least one character) from each end and retry.\n                    offset = max(len(question) // 8, 1)\n                    return self.run(system, question[offset:len(question) - offset], engine, uid, temperature, max_tokens)\n                except Exception:\n                    traceback.print_exc()\n            elif e.code != \"content_filter\":\n                traceback.print_exc()\n        except Exception:\n            # `response` is unbound when the request itself raised, so it is not inspected here.\n            answer = \"\"\n            traceback.print_exc()\n        \n        # update history chat.\n        if self.keep_history:\n            self.user_QAs[uid].append((question, answer))\n            while len(self.user_QAs[uid]) > 10:\n                self.user_QAs[uid].pop(0)\n        \n        return answer\n\nif __name__ == \"__main__\":\n    engine = \"gpt-4\"\n    endpoint = 
\"https://XXX.openai.azure.com/\"# to be filled.\n    identity_id = \"\"# to be filled.\n    gpt_api = GPTAPI(engine, endpoint, identity_id)\n    system = \"You are my assistant\"\n    question = \"give me a latex math formula\"\n    answer = gpt_api.run(system=system, question=question)\n    print(answer)\n"
  },
  {
    "path": "DomainSpecific/dependency/install.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/../wrapper/utility\")\nimport time\nimport argparse\nfrom load_yaml import load_yaml\nfrom save_yaml import save_yaml\nfrom azure_env import get_local_rank, get_world_rank\n\nENV_READY = \"env_ready\"\nOS_VERSION = \"ubuntu/18.04\"# ubuntu/18.04, ubuntu/20.04, ubuntu/22.04\n\ndef install(local_id, storage_path):\n    local_id = get_local_rank() if get_local_rank() is not None else local_id\n    if local_id == 0:\n        if os.path.exists(ENV_READY):\n            return\n\n        # install python dependencies.\n        os.system(f\"pip install --upgrade pip\")\n        os.system(f\"pip install -r dependency/requirements.txt\")\n        os.system(f\"pip install guesslang==2.2.1 --no-deps\")# don't change the version.\n\n        # install others.\n        os.system(f\"sudo wget https://packages.microsoft.com/config/{OS_VERSION}/packages-microsoft-prod.deb\")\n        os.system(f\"sudo dpkg -i packages-microsoft-prod.deb\")\n        os.system(f\"sudo apt-get -y update\")\n        os.system(f\"sudo apt-get -y install axel\")# for fast file download.\n\n        os.system(f\"sudo apt update\")\n        os.system(f\"sudo apt -y install git\")\n        os.system(f\"sudo apt -y install git-lfs\")\n        os.system(f\"sudo apt -y install maven\")\n        os.system(f\"sudo apt -y install openjdk-11-jdk\")# java-related 3rd-part library.\n        os.system(f\"ulimit -n 65536\")\n\n        # mount folder: default mount the storage.\n        storage_config = load_yaml(storage_path)\n        if storage_config.get(\"mount\", True):\n            # install fuseblob library\n            os.system(f\"sudo apt-get -y install libcurl3-gnutls\")\n            os.system(f\"sudo apt-get -y install blobfuse\")\n            os.system(f\"sudo apt-get -y install libfuse2\")\n            os.system(f\"sudo apt-get -y 
install blobfuse2\")\n\n            # create folder to be mounted\n            workspace_dir = storage_config[\"workspace_dir\"]\n            filecache_dir = storage_config[\"file_cache\"][\"path\"]\n\n            try:\n                os.system(f\"sudo umount -l {workspace_dir}\")# debug\n                #os.system(\"ps -ef | grep blobfuse | grep -v grep | awk -F ' ' '{print $2}' | xargs sudo kill -9\")# debug\n            except:\n                pass\n            \n            os.system(f\"sudo mkdir -p {workspace_dir}\")\n            os.system(f\"sudo chown $(whoami) {workspace_dir}\")\n\n            if os.path.exists(filecache_dir):\n                try:\n                    os.system(f\"sudo rm -rf {filecache_dir}\")# debug\n                except:\n                    pass\n            \n            os.system(f\"sudo mkdir -p {filecache_dir}\")\n            os.system(f\"sudo chown $(whoami) {filecache_dir}\")\n\n            os.system(f\"sudo blobfuse2 mount {workspace_dir} --config-file={storage_path}\")\n            print(\"mount azure storage account.\")\n        else:\n            print(\"not mount azure storage account.\")\n\n        # create env tag\n        os.system(f\"sudo rm -rf packages-microsoft-prod.deb\")\n        os.system(f\"sudo touch {ENV_READY}\")\n    else:\n        mounting = True\n        while mounting:\n            mounting = not os.path.exists(ENV_READY)\n            time.sleep(1)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Install dependencies of Data Network.\")\n    parser.add_argument('--local_id', type=int, default=0, help=\"The id of local worker.\")\n    parser.add_argument('--storage_path', type=str, default=\"./resources/storage/llmstore.yaml\", help=\"The path of storage config file.\")\n    args = parser.parse_args()\n    install(args.local_id, args.storage_path)\n"
  },
  {
    "path": "DomainSpecific/dependency/requirements.txt",
    "content": "lxml==5.1.0\n#fasttext==0.9.2\nfasttext-wheel==0.9.2\nsentencepiece==0.1.99\ntrafilatura==1.6.1\nhtml5lib==1.1\nnewspaper3k==0.2.8\nbeautifulsoup4==4.12.2\nwarcio==1.7.4\nmarkdownify==0.11.6\n#cchardet==2.1.7\nnumpy==1.24.4\nscipy==1.10.1\nrequests==2.32.2\npyarrow==14.0.1\njsonlines==3.1.0\n#networkx==3.1\nmatplotlib==3.7.2\npyyaml==6.0\npsutil==5.9.5\ntqdm==4.66.3\npy_asciimath==0.3.0\npylatexenc==2.10\ncharset-normalizer==3.2.0\ntensorflow==2.12.1\n#guesslang==2.2.1\n#typing_extensions==4.12.0\nfaiss-cpu==1.7.4\n#torch==2.0.1\n#fairscale==0.4.13\nsentence_transformers==2.2.2\n#PyMuPDF==1.23.6\ntiktoken==0.5.2\ngensim==4.3.2\nopenai==1.30.2\nboto3==1.34.100\ndatasets==2.16.0\nazure-ai-ml==1.16.0\nazure-batch==14.2.0\nazure-identity==1.16.1\nazure-storage-blob==12.19.1\nazure.keyvault.secrets==4.8.0\n"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/cmarkup.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet\n\t\txmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n\t\tversion='1.0'>\n                \n<!-- ====================================================================== -->\n<!-- $id: tokens.xsl, 2002/22/11 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<!-- 4.4.1.1 cn -->\n<xsl:template match=\"m:cn\"><xsl:apply-templates/></xsl:template>\n\n<xsl:template match=\"m:cn[@type='complex-cartesian']\">\n\t<xsl:apply-templates select=\"text()[1]\"/>\n  \t<xsl:text>+</xsl:text>\n\t<xsl:apply-templates select=\"text()[2]\"/>\n\t<xsl:text>i</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:cn[@type='rational']\">\n\t<xsl:apply-templates select=\"text()[1]\"/>\n\t<xsl:text>/</xsl:text>\n\t<xsl:apply-templates select=\"text()[2]\"/>\n</xsl:template>\n\n<xsl:template match=\"m:cn[@type='integer' and @base!=10]\">\n\t\t<xsl:apply-templates/>\n\t\t<xsl:text>_{</xsl:text><xsl:value-of select=\"@base\"/><xsl:text>}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:cn[@type='complex-polar']\">\n\t<xsl:apply-templates select=\"text()[1]\"/>\n\t<xsl:text>e^{i </xsl:text>\n\t<xsl:apply-templates select=\"text()[2]\"/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:cn[@type='e-notation']\">\n    <xsl:apply-templates select=\"text()[1]\"/>\n    <xsl:text>E</xsl:text>\n    <xsl:apply-templates select=\"text()[2]\"/>\n</xsl:template>\n\n<!-- 4.4.1.1 ci 4.4.1.2 csymbol -->\n<xsl:template match=\"m:ci | m:csymbol\">\n\t<xsl:choose>\n\t\t<xsl:when 
test=\"string-length(normalize-space(text()))>1\">\n\t\t\t<xsl:text>\\mathrm{</xsl:text><xsl:apply-templates/><xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise><xsl:apply-templates/></xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<!-- 4.4.2.1 apply 4.4.2.2 reln -->\n<xsl:template match=\"m:apply | m:reln\">\n\t<xsl:apply-templates select=\"*[1]\">\n\t<!-- <? -->\n\t\t<xsl:with-param name=\"p\" select=\"10\"/>\n\t</xsl:apply-templates>\n\t<!-- ?> -->\n \t<xsl:text>(</xsl:text>\n\t<xsl:for-each select=\"*[position()>1]\">\n\t\t<xsl:apply-templates select=\".\"/>\n\t\t<xsl:if test=\"not(position()=last())\"><xsl:text>, </xsl:text></xsl:if>\n\t</xsl:for-each>\n \t<xsl:text>)</xsl:text>\n</xsl:template>\n\n<!-- 4.4.2.3 fn -->\n<xsl:template match=\"m:fn[m:apply[1]]\"> <!-- for m:fn using default rule -->\n\t<xsl:text>(</xsl:text><xsl:apply-templates/><xsl:text>)</xsl:text>\n</xsl:template>\n\n<!-- 4.4.2.4 interval -->\n<xsl:template match=\"m:interval[*[2]]\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"@closure='open' or @closure='open-closed'\">\n\t\t\t<xsl:text>\\left(</xsl:text>\t\t\n\t\t</xsl:when>\n\t\t<xsl:otherwise><xsl:text>\\left[</xsl:text></xsl:otherwise> \n\t</xsl:choose>\n\t<xsl:apply-templates select=\"*[1]\"/>\n\t<xsl:text> , </xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:choose>\n\t\t<xsl:when test=\"@closure='open' or @closure='closed-open'\">\n\t\t\t<xsl:text>\\right)</xsl:text>\t\t\n\t\t</xsl:when>\n\t\t<xsl:otherwise><xsl:text>\\right]</xsl:text></xsl:otherwise> \n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template match=\"m:interval\">\n\t<xsl:text>\\left\\{</xsl:text><xsl:apply-templates/><xsl:text>\\right\\}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.2.5 inverse -->\n<xsl:template match=\"m:apply[*[1][self::m:inverse]]\">\n\t<xsl:apply-templates select=\"*[2]\"/><xsl:text>^{(-1)}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.2.6 sep 4.4.2.7 condition -->\n<xsl:template match=\"m:sep | 
m:condition\"><xsl:apply-templates/></xsl:template>\n\n<!-- 4.4.2.9 lambda -->\n<xsl:template match=\"m:lambda\">\n\t<xsl:text>\\mathrm{lambda}\\: </xsl:text>\n  \t<xsl:apply-templates select=\"m:bvar/*\"/>\n  \t<xsl:text>.\\: </xsl:text>\n  <xsl:apply-templates select=\"*[last()]\"/>\n</xsl:template>\n\n<!-- 4.4.2.10 compose -->\n<xsl:template match=\"m:apply[*[1][self::m:compose]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\circ </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.2.11 ident -->\n<xsl:template match=\"m:ident\"><xsl:text>\\mathrm{id}</xsl:text></xsl:template>\n\n<!-- 4.4.2.12 domain 4.4.2.13 codomain 4.4.2.14 image 4.4.3.21 arg 4.4.3.24 lcm\n\t\t4.4.5.9 grad 4.4.5.10 curl 4.4.9.4 median 4.4.9.5 mode-->\n<xsl:template match=\"m:domain | m:codomain | m:image | m:arg | m:lcm | m:grad |\n\t\t\t\t\t\t\t\t m:curl | m:median | m:mode\">\n\t<xsl:text>\\mathop{\\mathrm{</xsl:text>\n\t<xsl:value-of select=\"local-name()\"/>\n\t<xsl:text>}}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.2.15 domainofapplication -->\n<xsl:template match=\"m:domainofapplication\"/>\n\n<!-- 4.4.2.16 piecewise -->\n<xsl:template match=\"m:piecewise\">\n\t<xsl:text>\\begin{cases}</xsl:text>\n\t<xsl:apply-templates select=\"m:piece\"/>\n\t<xsl:apply-templates select=\"m:otherwise\"/>\n\t<xsl:text>\\end{cases}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:piece\">\n\t\t<xsl:apply-templates select=\"*[1]\"/>\n\t\t<xsl:text> &amp; \\text{if $</xsl:text>\n\t\t<xsl:apply-templates select=\"*[2]\"/>\n\t\t<xsl:text>$}</xsl:text>\n\t\t<xsl:if test=\"not(position()=last()) or ../m:otherwise\"><xsl:text>\\\\ </xsl:text></xsl:if>\n</xsl:template>\n\n<xsl:template match=\"m:otherwise\">\n\t<xsl:apply-templates select=\"*[1]\"/>\n\t<xsl:text> &amp; \\text{otherwise}</xsl:text>\n</xsl:template>\n\n<!-- 
4.4.3.1 quotient -->\n<xsl:template match=\"m:apply[*[1][self::m:quotient]]\">\n\t<xsl:text>\\left\\lfloor\\frac{</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>}{</xsl:text>\n\t<xsl:apply-templates select=\"*[3]\"/>\n\t<xsl:text>}\\right\\rfloor </xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.2 factorial -->\n<xsl:template match=\"m:apply[*[1][self::m:factorial]]\">\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n\t<xsl:text>!</xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.3 divide -->\n<xsl:template match=\"m:apply[*[1][self::m:divide]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n  <xsl:param name=\"this-p\" select=\"3\"/>\n  <xsl:if test=\"$this-p &lt; $p\"><xsl:text>\\left(</xsl:text></xsl:if>\n  <xsl:text>\\frac{</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n<!--\t\t<xsl:with-param name=\"p\" select=\"$this-p\"/>\n\t</xsl:apply-templates>-->\n\t<xsl:text>}{</xsl:text>\n\t<xsl:apply-templates select=\"*[3]\"/>\n<!--    \t<xsl:with-param name=\"p\" select=\"$this-p\"/>\n\t</xsl:apply-templates>-->\n\t<xsl:text>}</xsl:text>\n\t<xsl:if test=\"$this-p &lt; $p\"><xsl:text>\\right)</xsl:text></xsl:if>\n</xsl:template>\n\n<!-- 4.4.3.4 max min -->\n<xsl:template match=\"m:apply[*[1][self::m:max or self::m:min]]\">\n\t<xsl:text>\\</xsl:text>\n\t<xsl:value-of select=\"local-name(*[1])\"/>\n\t<xsl:text>\\{</xsl:text>\n   <xsl:choose>\n\t\t<xsl:when test=\"m:condition\">\n   \t\t<xsl:apply-templates select=\"*[last()]\"/>\n   \t\t<xsl:text>, </xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:condition/node()\"/>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:for-each select=\"*[position() &gt; 1]\">\n\t\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t\t<xsl:if test=\"position() !=last()\"><xsl:text> , </xsl:text></xsl:if>\n\t\t\t</xsl:for-each>\n\t\t</xsl:otherwise>\n   </xsl:choose>\n\t<xsl:text>\\}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.5  minus-->\n<xsl:template 
match=\"m:apply[*[1][self::m:minus] and count(*)=2]\">\n\t<xsl:text>-</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"5\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<xsl:template match=\"m:apply[*[1][self::m:minus] and count(*)&gt;2]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"mo\">-</xsl:with-param>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.3.6  plus-->\n<xsl:template match=\"m:apply[*[1][self::m:plus]]\">\n  <xsl:param name=\"p\" select=\"0\"/>\n  <xsl:if test=\"$p &gt; 2\">\n\t\t<xsl:text>(</xsl:text>\n\t</xsl:if>\n  <xsl:for-each select=\"*[position()&gt;1]\">\n   <xsl:if test=\"position() &gt; 1\">\n    <xsl:choose>\n      <xsl:when test=\"self::m:apply[*[1][self::m:times] and\n      *[2][self::m:apply/*[1][self::m:minus] or self::m:cn[not(m:sep) and\n      (number(.) &lt; 0)]]]\">-</xsl:when>\n      <xsl:otherwise>+</xsl:otherwise>\n    </xsl:choose>\n   </xsl:if>   \n    <xsl:choose>\n      <xsl:when test=\"self::m:apply[*[1][self::m:times] and\n      *[2][self::m:cn[not(m:sep) and (number(.) 
&lt;0)]]]\">\n\t\t\t<xsl:value-of select=\"-(*[2])\"/>\n\t\t\t<xsl:apply-templates select=\".\">\n\t\t     <xsl:with-param name=\"first\" select=\"2\"/>\n\t\t     <xsl:with-param name=\"p\" select=\"2\"/>\n\t\t   </xsl:apply-templates>\n       </xsl:when>\n      <xsl:when test=\"self::m:apply[*[1][self::m:times] and\n      *[2][self::m:apply/*[1][self::m:minus]]]\">\n\t\t\t\t<xsl:apply-templates select=\"./*[2]/*[2]\"/>\n\t\t\t\t<xsl:apply-templates select=\".\">\n\t\t\t\t\t<xsl:with-param name=\"first\" select=\"2\"/>\n\t\t\t\t\t<xsl:with-param name=\"p\" select=\"2\"/>\n\t\t\t\t</xsl:apply-templates>\n\t\t\t</xsl:when>\n\t\t\t<xsl:otherwise>\n\t\t\t\t<xsl:apply-templates select=\".\">\n\t\t\t\t\t<xsl:with-param name=\"p\" select=\"2\"/>\n\t\t\t\t</xsl:apply-templates>\n\t\t\t</xsl:otherwise>\n\t\t</xsl:choose>\n\t</xsl:for-each>\n\t<xsl:if test=\"$p &gt; 2\">\n\t\t<xsl:text>)</xsl:text>\n\t</xsl:if>\n</xsl:template>\n\n<!-- 4.4.3.7 power -->\n<xsl:template match=\"m:apply[*[1][self::m:power]]\">\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"5\"/>\n\t</xsl:apply-templates>\n\t<xsl:text>^{</xsl:text>\n\t<xsl:apply-templates select=\"*[3]\">\n\t\t<xsl:with-param name=\"p\" select=\"5\"/>\n\t</xsl:apply-templates>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.8 remainder -->\n<xsl:template match=\"m:apply[*[1][self::m:rem]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"mo\">\\mod </xsl:with-param>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.3.9  times-->\n<xsl:template match=\"m:apply[*[1][self::m:times]]\" name=\"times\">\n  <xsl:param name=\"p\" select=\"0\"/>\n  <xsl:param name=\"first\" select=\"1\"/>\n  <xsl:if test=\"$p &gt; 3\"><xsl:text>(</xsl:text></xsl:if>\n  <xsl:for-each select=\"*[position()&gt;1]\">\n\t\t<xsl:if test=\"position() &gt; 
1\">\n\t\t\t<xsl:choose>\n\t\t\t\t<xsl:when test=\"self::m:cn\">\\times <!-- times --></xsl:when>\n\t\t\t\t<xsl:otherwise><!--invisible times--></xsl:otherwise>\n\t\t\t</xsl:choose>\n\t\t</xsl:if> \n\t\t<xsl:if test=\"position()&gt;= $first\">\n\t\t\t<xsl:apply-templates select=\".\">\n\t\t\t\t<xsl:with-param name=\"p\" select=\"3\"/>\n\t\t\t</xsl:apply-templates>\n\t\t</xsl:if>\n\t</xsl:for-each>\n  <xsl:if test=\"$p &gt; 3\"><xsl:text>)</xsl:text></xsl:if>\n</xsl:template>\n\n<!-- 4.4.3.10 root -->\n<xsl:template match=\"m:apply[*[1][self::m:root]]\">\n\t<xsl:text>\\sqrt</xsl:text>\n\t<xsl:if test=\"m:degree!=2\">\n\t\t<xsl:text>[</xsl:text>\n\t\t<xsl:apply-templates select=\"m:degree/*\"/>\n\t\t<xsl:text>]</xsl:text>\n\t</xsl:if>\n\t<xsl:text>{</xsl:text>\n\t<xsl:apply-templates select=\"*[position()&gt;1 and not(self::m:degree)]\"/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.11 gcd -->\n<xsl:template match=\"m:gcd\"><xsl:text>\\gcd </xsl:text></xsl:template>\n\n<!-- 4.4.3.12 and -->\n<xsl:template match=\"m:apply[*[1][self::m:and]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\land <!-- and --></xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.3.13 or -->\n<xsl:template match=\"m:apply[*[1][self::m:or]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\lor </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.3.14 xor -->\n<xsl:template match=\"m:apply[*[1][self::m:xor]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param 
name=\"mo\">\\mathop{\\mathrm{xor}}</xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.3.15 not -->\n<xsl:template match=\"m:apply[*[1][self::m:not]]\">\n\t<xsl:text>\\neg </xsl:text>\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<!-- 4.4.3.16 implies -->\n<xsl:template match=\"m:apply[*[1][self::m:implies]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"mo\">\\implies </xsl:with-param>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.3.17 forall 4.4.3.18 exists -->\n<xsl:template match=\"m:apply[*[1][self::m:forall or self::m:exists]]\">\n\t<xsl:text>\\</xsl:text>\n\t<xsl:value-of select=\"local-name(*[1])\"/>\n\t<xsl:text> </xsl:text>\n\t<xsl:apply-templates select=\"m:bvar\"/>\n\t<xsl:if test=\"m:condition\">\n\t\t<xsl:text>, </xsl:text><xsl:apply-templates select=\"m:condition\"/>\n\t</xsl:if>\n\t<xsl:if test=\"*[last()][local-name()!='condition'][local-name()!='bvar']\">\n\t\t<xsl:text>\\colon </xsl:text>\n\t  <xsl:apply-templates select=\"*[last()]\"/>\n  </xsl:if>\n</xsl:template>\n\n<!-- 4.4.3.19 abs -->\n<xsl:template match=\"m:apply[*[1][self::m:abs]]\">\n\t<xsl:text>\\left|</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>\\right|</xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.20 conjugate -->\n<xsl:template match=\"m:apply[*[1][self::m:conjugate]]\">\n\t<xsl:text>\\overline{</xsl:text><xsl:apply-templates select=\"*[2]\"/><xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.22 real -->\n<xsl:template match=\"m:real\"><xsl:text>\\Re </xsl:text></xsl:template>\n\n<!-- 4.4.3.23 imaginary -->\n<xsl:template match=\"m:imaginary\"><xsl:text>\\Im </xsl:text></xsl:template>\n\n<!-- 4.4.3.25 floor -->\n<xsl:template match=\"m:apply[*[1][self::m:floor]]\">\n\t<xsl:text>\\lfloor 
</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>\\rfloor </xsl:text>\n</xsl:template>\n\n<!-- 4.4.3.25 ceiling -->\n<xsl:template match=\"m:apply[*[1][self::m:ceiling]]\">\n\t<xsl:text>\\lceil </xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>\\rceil </xsl:text>\n</xsl:template>\n\n<!-- 4.4.4.1 eq -->\n<xsl:template match=\"m:apply[*[1][self::m:eq]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">=</xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.2 neq -->\n<xsl:template match=\"m:apply[*[1][self::m:neq]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\neq </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.3 gt -->\n<xsl:template match=\"m:apply[*[1][self::m:gt]]\">\n<xsl:param name=\"p\" select=\"0\"/>\n<xsl:call-template name=\"infix\">\n\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t<xsl:with-param name=\"mo\">&gt; </xsl:with-param>\n</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.4 lt -->\n<xsl:template match=\"m:apply[*[1][self::m:lt]]\">\n<xsl:param name=\"p\" select=\"0\"/>\n<xsl:call-template name=\"infix\">\n\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t<xsl:with-param name=\"mo\">&lt; </xsl:with-param>\n</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.5 geq -->\n<xsl:template match=\"m:apply[*[1][self::m:geq]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\ge 
</xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.6 leq -->\n<xsl:template match=\"m:apply[*[1][self::m:leq]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\le </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.7 equivalent -->\n<xsl:template match=\"m:apply[*[1][self::m:equivalent]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\equiv </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.8 approx -->\n<xsl:template match=\"m:apply[*[1][self::m:approx]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"1\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\approx </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.4.9 factorof -->\n<xsl:template match=\"m:apply[*[1][self::m:factorof]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"mo\"> | </xsl:with-param>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.5.1 int -->\n<xsl:template match=\"m:apply[*[1][self::m:int]]\">\n\t<xsl:text>\\int</xsl:text>\n\t<xsl:if test=\"m:lowlimit/*|m:interval/*[1]|m:condition/*\">\n\t\t<xsl:text>_{</xsl:text>\n\t\t<xsl:apply-templates select=\"m:lowlimit/*|m:interval/*[1]|m:condition/*\"/>\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"m:uplimit/*|m:interval/*[2]\">\n\t\t<xsl:text>^{</xsl:text>\n\t\t<xsl:apply-templates 
select=\"m:uplimit/*|m:interval/*[2]\"/>\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:text> </xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\"/>\n\t<xsl:text>\\,d </xsl:text>\n\t<xsl:apply-templates select=\"m:bvar\"/>\n</xsl:template>\n\n<!-- 4.4.5.2 diff -->\n<xsl:template match=\"m:apply[*[1][self::m:diff] and m:ci and count(*)=2]\" priority=\"2\">\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>^\\prime </xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:apply[*[1][self::m:diff]]\" priority=\"1\">\n\t<xsl:text>\\frac{</xsl:text>\n\t<xsl:choose>\n\t\t<xsl:when test=\"m:bvar/m:degree\">\n\t\t\t<xsl:text>d^{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:bvar/m:degree/node()\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t\t<xsl:apply-templates select=\"*[last()]\"/>\n\t\t\t<xsl:text>}{d</xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:bvar/node()\"/>\n\t\t\t<xsl:text>^{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:bvar/m:degree/node()\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:text>d </xsl:text>\n\t\t\t<xsl:apply-templates select=\"*[last()]\"/>\n\t\t\t<xsl:text>}{d </xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:bvar\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.5.3 partialdiff -->\n<xsl:template match=\"m:apply[*[1][self::m:partialdiff] and m:list and m:ci and count(*)=3]\" priority=\"2\">\n\t<xsl:text>D_{</xsl:text>\n\t<xsl:for-each select=\"m:list[1]/*\">\n\t\t<xsl:apply-templates select=\".\"/>\n\t\t<xsl:if test=\"position()&lt;last()\"><xsl:text>, </xsl:text></xsl:if>\n\t</xsl:for-each>\n\t<xsl:text>}</xsl:text>\n\t<xsl:apply-templates select=\"*[3]\"/>\n</xsl:template>\n\n<xsl:template match=\"m:apply[*[1][self::m:partialdiff]]\" priority=\"1\">\n\t<xsl:text>\\frac{\\partial^{</xsl:text>\n\t<xsl:choose>\n\t\t<xsl:when test=\"m:degree\">\n\t\t\t<xsl:apply-templates 
select=\"m:degree/node()\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"m:bvar/m:degree[string(number(.))='NaN']\">\n\t\t\t<xsl:for-each select=\"m:bvar/m:degree\">\n\t\t\t\t<xsl:apply-templates select=\"node()\"/>\n\t\t\t\t<xsl:if test=\"position()&lt;last()\"><xsl:text>+</xsl:text></xsl:if>\n\t\t\t</xsl:for-each>\n\t\t\t<xsl:if test=\"count(m:bvar[not(m:degree)])&gt;0\">\n\t\t\t\t<xsl:text>+</xsl:text>\n\t\t\t\t<xsl:value-of select=\"count(m:bvar[not(m:degree)])\"/>\n\t\t\t</xsl:if>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:value-of select=\"sum(m:bvar/m:degree)+count(m:bvar[not(m:degree)])\"/>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n\t<xsl:text>}</xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\"/>\n\t<xsl:text>}{</xsl:text>\n\t<xsl:for-each select=\"m:bvar\">\n\t\t<xsl:text>\\partial </xsl:text>\n\t\t<xsl:apply-templates select=\"node()\"/>\n\t\t<xsl:if test=\"m:degree\">\n\t\t\t<xsl:text>^{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:degree/node()\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:if>\n\t</xsl:for-each>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.2.8 declare 4.4.5.4 lowlimit 4.4.5.5 uplimit 4.4.5.7 degree 4.4.9.5 momentabout -->\n<xsl:template match=\"m:declare | m:lowlimit | m:uplimit | m:degree | m:momentabout\"/>\n\n<!-- 4.4.5.6  bvar-->\n<xsl:template match=\"m:bvar\">\n\t<xsl:apply-templates/>\n\t<xsl:if test=\"following-sibling::m:bvar\"><xsl:text>, </xsl:text></xsl:if>\n</xsl:template>\n\n<!-- 4.4.5.8 divergence-->\n<xsl:template match=\"m:divergence\"><xsl:text>\\mathop{\\mathrm{div}}</xsl:text></xsl:template>\n\n<!-- 4.4.5.11 laplacian-->\n<xsl:template match=\"m:laplacian\"><xsl:text>\\nabla^2 </xsl:text></xsl:template>\n\n<!-- 4.4.6.1 set -->\n<xsl:template match=\"m:set\">\n\t<xsl:text>\\{</xsl:text><xsl:call-template name=\"set\"/><xsl:text>\\}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.6.2 list -->\n<xsl:template match=\"m:list\">\n\t<xsl:text>\\left[</xsl:text><xsl:call-template 
name=\"set\"/><xsl:text>\\right]</xsl:text>\n</xsl:template>\n\n<xsl:template name=\"set\">\n   <xsl:choose>\n\t\t<xsl:when test=\"m:condition\">\n   \t\t<xsl:apply-templates select=\"m:bvar/*[not(self::bvar or self::condition)]\"/>\n   \t\t<xsl:text>\\colon </xsl:text>\n\t\t\t<xsl:apply-templates select=\"m:condition/node()\"/>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:for-each select=\"*\">\n\t\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t\t<xsl:if test=\"position()!=last()\"><xsl:text>, </xsl:text></xsl:if>\n\t\t\t</xsl:for-each>\n\t\t</xsl:otherwise>\n   </xsl:choose>\n</xsl:template>\n\n<!-- 4.4.6.3 union -->\n<xsl:template match=\"m:apply[*[1][self::m:union]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\cup </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.4 intersect -->\n<xsl:template match=\"m:apply[*[1][self::m:intersect]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\cap </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.5 in -->\n<xsl:template match=\"m:apply[*[1][self::m:in]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"mo\">\\in </xsl:with-param>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"this-p\" select=\"3\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.5 notin -->\n<xsl:template match=\"m:apply[*[1][self::m:notin]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"mo\">\\notin </xsl:with-param>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"this-p\" 
select=\"3\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.7 subset -->\n<xsl:template match=\"m:apply[*[1][self::m:subset]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\subseteq </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.8 prsubset -->\n<xsl:template match=\"m:apply[*[1][self::m:prsubset]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\subset </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.9 notsubset -->\n<xsl:template match=\"m:apply[*[1][self::m:notsubset]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\nsubseteq </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.10 notprsubset -->\n<xsl:template match=\"m:apply[*[1][self::m:notprsubset]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\not\\subset </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.11 setdiff -->\n<xsl:template match=\"m:apply[*[1][self::m:setdiff]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\setminus </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.6.12 card -->\n<xsl:template 
match=\"m:apply[*[1][self::m:card]]\">\n\t<xsl:text>|</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>|</xsl:text>\n</xsl:template>\n\n<!-- 4.4.6.13 cartesianproduct 4.4.10.6 vectorproduct -->\n<xsl:template match=\"m:apply[*[1][self::m:cartesianproduct or self::m:vectorproduct]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\times </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<xsl:template\nmatch=\"m:apply[*[1][self::m:cartesianproduct][count(following-sibling::m:reals)=count(following-sibling::*)]]\"\npriority=\"2\">\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"5\"/>\n\t</xsl:apply-templates>\n\t<xsl:text>^{</xsl:text>\n\t<xsl:value-of select=\"count(*)-1\"/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.7.1 sum -->\n<xsl:template match=\"m:apply[*[1][self::m:sum]]\">\n\t<xsl:text>\\sum</xsl:text><xsl:call-template name=\"series\"/>\n</xsl:template>\n\n<!-- 4.4.7.2 product -->\n<xsl:template match=\"m:apply[*[1][self::m:product]]\">\n\t<xsl:text>\\prod</xsl:text><xsl:call-template name=\"series\"/>\n</xsl:template>\n\t\n<xsl:template name=\"series\">\n\t<xsl:if test=\"m:lowlimit/*|m:interval/*[1]|m:condition/*\">\n\t\t<xsl:text>_{</xsl:text>\n\t\t<xsl:if test=\"not(m:condition)\">\n\t\t\t<xsl:apply-templates select=\"m:bvar\"/>\n\t\t\t<xsl:text>=</xsl:text>\n\t\t</xsl:if>\n\t\t<xsl:apply-templates select=\"m:lowlimit/*|m:interval/*[1]|m:condition/*\"/>\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"m:uplimit/*|m:interval/*[2]\">\n\t\t<xsl:text>^{</xsl:text>\n\t\t<xsl:apply-templates select=\"m:uplimit/*|m:interval/*[2]\"/>\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:text> </xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\"/>\n</xsl:template>\n\n<!-- 4.4.7.3 limit -->\n<xsl:template 
match=\"m:apply[*[1][self::m:limit]]\">\n\t<xsl:text>\\lim_{</xsl:text>\n\t<xsl:apply-templates select=\"m:lowlimit|m:condition/*\"/>\n\t<xsl:text>}</xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\"/>\n</xsl:template>\n\n<xsl:template match=\"m:apply[m:limit]/m:lowlimit\" priority=\"3\">\n\t<xsl:apply-templates select=\"../m:bvar/node()\"/>\n\t<xsl:text>\\to </xsl:text>\n\t<xsl:apply-templates/>\n</xsl:template>\n\n<!-- 4.4.7.4 tendsto -->\n<xsl:template match=\"m:apply[*[1][self::m:tendsto]]\">\n\t<xsl:param name=\"p\"/>\n\t<xsl:call-template name=\"binary\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\n\t\t\t<xsl:choose>\n\t\t\t\t<xsl:when test=\"@type='above'\">\\searrow </xsl:when>\n\t\t\t\t<xsl:when test=\"@type='below'\">\\nearrow </xsl:when>\n\t\t\t\t<xsl:when test=\"@type='two-sided'\">\\rightarrow </xsl:when>\n\t\t\t\t<xsl:otherwise>\\to </xsl:otherwise>\n\t\t\t</xsl:choose>\n\t\t</xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.8.1 common trigonometric functions 4.4.8.3 natural logarithm -->\n<xsl:template match=\"m:apply[*[1][\n self::m:sin or \t\tself::m:cos or \tself::m:tan or\t\tself::m:sec or\n self::m:csc or \t\tself::m:cot or \tself::m:sinh or\t \tself::m:cosh or\n self::m:tanh or \t\tself::m:coth or\tself::m:arcsin or \tself::m:arccos or\n self::m:arctan or \tself::m:ln]]\">\n\t<xsl:text>\\</xsl:text>\n\t<xsl:value-of select=\"local-name(*[1])\"/>\n\t<xsl:text> </xsl:text>\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<xsl:template match=\"m:sin | m:cos | m:tan | m:sec | m:csc |\n\t\t\t\t\t\t\t\t m:cot | m:sinh | m:cosh | m:tanh | m:coth |\n\t\t\t\t\t\t\t\t m:arcsin | m:arccos | m:arctan | m:ln\">\n\t<xsl:text>\\</xsl:text>\n\t<xsl:value-of select=\"local-name(.)\"/>\n\t<xsl:text> </xsl:text>\n</xsl:template>\n\n<xsl:template 
match=\"m:apply[*[1][\n self::m:sech or \t\tself::m:csch or\t\tself::m:arccosh or\n self::m:arccot or \tself::m:arccoth or \tself::m:arccsc or\n self::m:arccsch or self::m:arcsec or \tself::m:arcsech or\n self::m:arcsinh or self::m:arctanh]]\">\n\t<xsl:text>\\mathrm{</xsl:text>\n\t<xsl:value-of select=\"local-name(*[1])\"/>\n\t<xsl:text>\\,}</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<xsl:template match=\"m:sech | m:csch | m:arccosh | m:arccot |\n\t\t\t\t\t\t\t\t m:arccoth | m:arccsc |m:arccsch |m:arcsec |\n\t\t\t\t\t\t\t\t m:arcsech | m:arcsinh | m:arctanh\">\n\t<xsl:text>\\mathrm{</xsl:text>\n\t<xsl:value-of select=\"local-name(.)\"/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.8.2 exp -->\n<xsl:template match=\"m:apply[*[1][self::m:exp]]\">\n\t<xsl:text>e^{</xsl:text><xsl:apply-templates select=\"*[2]\"/><xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.8.4 log -->\n<xsl:template match=\"m:apply[*[1][self::m:log]]\">\n\t<xsl:text>\\lg </xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<xsl:template match=\"m:apply[*[1][self::m:log] and m:logbase != 10]\">\n\t<xsl:text>\\log_{</xsl:text>\n\t<xsl:apply-templates select=\"m:logbase/node()\"/>\n\t<xsl:text>}</xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<!-- 4.4.9.1 mean -->\n<xsl:template match=\"m:apply[*[1][self::m:mean]]\">\n\t<xsl:text>\\langle </xsl:text>\n\t<xsl:for-each select=\"*[position()&gt;1]\">\n\t\t<xsl:apply-templates select=\".\"/>\n\t\t<xsl:if test=\"position() !=last()\"><xsl:text>, </xsl:text></xsl:if>\n\t</xsl:for-each>\n\t<xsl:text>\\rangle </xsl:text>\n</xsl:template>\n\n<!-- 4.4.9.2 sdev -->\n<xsl:template match=\"m:sdev\"><xsl:text>\\sigma </xsl:text></xsl:template>\n\n<!-- 
4.4.9.3 variance -->\n<xsl:template match=\"m:apply[*[1][self::m:variance]]\">\n\t<xsl:text>\\sigma(</xsl:text>\n\t<xsl:apply-templates select=\"*[2]\"/>\n\t<xsl:text>)^2</xsl:text>\n</xsl:template>\n\n<!-- 4.4.9.5 moment -->\n<xsl:template match=\"m:apply[*[1][self::m:moment]]\">\n\t<xsl:text>\\langle </xsl:text>\n\t<xsl:apply-templates select=\"*[last()]\"/>\n\t<xsl:text>^{</xsl:text>\n\t<xsl:apply-templates select=\"m:degree/node()\"/>\n\t<xsl:text>}\\rangle</xsl:text>\n\t<xsl:if test=\"m:momentabout\">\n\t\t<xsl:text>_{</xsl:text>\n\t\t<xsl:apply-templates select=\"m:momentabout/node()\"/>\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:text> </xsl:text>\n</xsl:template>\n\n<!-- 4.4.10.1 vector  -->\n<xsl:template match=\"m:vector\">\n\t<xsl:text>\\left(\\begin{array}{c}</xsl:text>\n\t<xsl:for-each select=\"*\">\n\t\t<xsl:apply-templates select=\".\"/>\n\t\t<xsl:if test=\"position()!=last()\"><xsl:text>\\\\ </xsl:text></xsl:if>\n\t</xsl:for-each>\n\t<xsl:text>\\end{array}\\right)</xsl:text>\n</xsl:template>\n\n<!-- 4.4.10.2 matrix  -->\n<xsl:template match=\"m:matrix\">\n\t<xsl:text>\\begin{pmatrix}</xsl:text>\n\t<xsl:apply-templates/>\n\t<xsl:text>\\end{pmatrix}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.10.3 matrixrow  -->\n<xsl:template match=\"m:matrixrow\">\n\t<xsl:for-each select=\"*\">\n\t\t<xsl:apply-templates select=\".\"/>\n\t\t<xsl:if test=\"position()!=last()\"><xsl:text> &amp; </xsl:text></xsl:if>\n\t</xsl:for-each>\n\t<xsl:if test=\"position()!=last()\"><xsl:text>\\\\ </xsl:text></xsl:if>\n</xsl:template>\n\n<!-- 4.4.10.4 determinant  -->\n<xsl:template match=\"m:apply[*[1][self::m:determinant]]\">\n\t<xsl:text>\\det </xsl:text>\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n</xsl:template>\n\n<xsl:template match=\"m:apply[*[1][self::m:determinant]][*[2][self::m:matrix]]\" priority=\"2\">\n\t<xsl:text>\\begin{vmatrix}</xsl:text>\n\t<xsl:apply-templates 
select=\"m:matrix/*\"/>\n\t<xsl:text>\\end{vmatrix}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.10.5 transpose -->\n<xsl:template match=\"m:apply[*[1][self::m:transpose]]\">\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n\t<xsl:text>^T</xsl:text>\n</xsl:template>\n\n<!-- 4.4.10.5 selector -->\n<xsl:template match=\"m:apply[*[1][self::m:selector]]\">\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"7\"/>\n\t</xsl:apply-templates>\n\t<xsl:text>_{</xsl:text>\n\t<xsl:for-each select=\"*[position()&gt;2]\">\n\t\t<xsl:apply-templates select=\".\"/>\n\t\t<xsl:if test=\"position() !=last()\"><xsl:text>, </xsl:text></xsl:if>\n\t</xsl:for-each>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<!-- 4.4.10.7 scalarproduct 4.4.10.8 outerproduct -->\n<xsl:template match=\"m:apply[*[1][self::m:scalarproduct or self::m:outerproduct]]\">\n\t<xsl:param name=\"p\" select=\"0\"/>\n\t<xsl:call-template name=\"infix\">\n\t\t<xsl:with-param name=\"this-p\" select=\"2\"/>\n\t\t<xsl:with-param name=\"p\" select=\"$p\"/>\n\t\t<xsl:with-param name=\"mo\">\\dot </xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<!-- 4.4.11.2 semantics -->\n<xsl:template match=\"m:semantics\"><xsl:apply-templates select=\"*[1]\"/></xsl:template>\n\n<xsl:template match=\"m:semantics[m:annotation/@encoding='TeX']\">\n\t<xsl:apply-templates select=\"m:annotation[@encoding='TeX']/node()\"/>\n</xsl:template>\n\n<!-- 4.4.12.1 integers -->\n<xsl:template match=\"m:integers\"><xsl:text>\\mathbb{Z}</xsl:text></xsl:template>\n\n<!-- 4.4.12.2 reals -->\n<xsl:template match=\"m:reals\"><xsl:text>\\mathbb{R}</xsl:text></xsl:template>\n\n<!-- 4.4.12.3 rationals -->\n<xsl:template match=\"m:rationals\"><xsl:text>\\mathbb{Q}</xsl:text></xsl:template>\n\n<!-- 4.4.12.4 naturalnumbers -->\n<xsl:template match=\"m:naturalnumbers\"><xsl:text>\\mathbb{N}</xsl:text></xsl:template>\n\n<!-- 4.4.12.5 complexes -->\n<xsl:template 
match=\"m:complexes\"><xsl:text>\\mathbb{C}</xsl:text></xsl:template>\n\n<!-- 4.4.12.6 primes -->\n<xsl:template match=\"m:primes\"><xsl:text>\\mathbb{P}</xsl:text></xsl:template>\n\t\n<!-- 4.4.12.7 exponentiale -->\n<xsl:template match=\"m:exponentiale\"><xsl:text>e</xsl:text></xsl:template>\n\n<!-- 4.4.12.8 imaginaryi -->\n<xsl:template match=\"m:imaginaryi\"><xsl:text>i</xsl:text></xsl:template>\n\n<!-- 4.4.12.9 notanumber -->\n<xsl:template match=\"m:notanumber\"><xsl:text>NaN</xsl:text></xsl:template>\n\n<!-- 4.4.12.10 true -->\n<xsl:template match=\"m:true\"><xsl:text>\\mbox{true}</xsl:text></xsl:template>\n\n<!-- 4.4.12.11 false -->\n<xsl:template match=\"m:false\"><xsl:text>\\mbox{false}</xsl:text></xsl:template>\n\n<!-- 4.4.12.12 emptyset -->\n<xsl:template match=\"m:emptyset\"><xsl:text>\\emptyset </xsl:text></xsl:template>\n\n<!-- 4.4.12.13 pi -->\n<xsl:template match=\"m:pi\"><xsl:text>\\pi </xsl:text></xsl:template>\n\n<!-- 4.4.12.14 eulergamma -->\n<xsl:template match=\"m:eulergamma\"><xsl:text>\\gamma </xsl:text></xsl:template>\n\n<!-- 4.4.12.15 infinity -->\n<xsl:template match=\"m:infinity\"><xsl:text>\\infty </xsl:text></xsl:template>\n\n<!-- ****************************** -->\n<xsl:template name=\"infix\" >\n  <xsl:param name=\"mo\"/>\n  <xsl:param name=\"p\" select=\"0\"/>\n  <xsl:param name=\"this-p\" select=\"0\"/>\n  <xsl:if test=\"$this-p &lt; $p\"><xsl:text>(</xsl:text></xsl:if>\n  <xsl:for-each select=\"*[position()&gt;1]\">\n\t\t<xsl:if test=\"position() &gt; 1\">\n\t\t\t<xsl:copy-of select=\"$mo\"/>\n\t\t</xsl:if>   \n\t\t<xsl:apply-templates select=\".\">\n\t\t\t<xsl:with-param name=\"p\" select=\"$this-p\"/>\n\t\t</xsl:apply-templates>\n\t</xsl:for-each>\n  <xsl:if test=\"$this-p &lt; $p\"><xsl:text>)</xsl:text></xsl:if>\n</xsl:template>\n\n<xsl:template name=\"binary\" >\n  <xsl:param name=\"mo\"/>\n  <xsl:param name=\"p\" select=\"0\"/>\n  <xsl:param name=\"this-p\" select=\"0\"/>\n  <xsl:if test=\"$this-p &lt; 
$p\"><xsl:text>(</xsl:text></xsl:if>\n\t<xsl:apply-templates select=\"*[2]\">\n\t\t<xsl:with-param name=\"p\" select=\"$this-p\"/>\n\t</xsl:apply-templates>\n\t<xsl:value-of select=\"$mo\"/>\n\t<xsl:apply-templates select=\"*[3]\">\n    \t<xsl:with-param name=\"p\" select=\"$this-p\"/>\n\t</xsl:apply-templates>\n\t<xsl:if test=\"$this-p &lt; $p\"><xsl:text>)</xsl:text></xsl:if>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/entities.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n                version='1.0'>\n                \n<!-- ====================================================================== -->\n<!-- $id: entities.xsl, 2002/22/11 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<xsl:template name=\"replaceEntities\">\n\t<xsl:param name=\"content\"/>\n\t<xsl:if test=\"string-length($content)>0\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"starts-with($content,'&#x0025B;')\"><xsl:value-of select=\"'\\varepsilon '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0025B;')\"/></xsl:call-template></xsl:when>\t<!--/varepsilon -->\n\n<!-- ====================================================================== -->\n<!-- \tUnicode 3.2\n\tGreek\n\tRange: 0370-03FF\n\thttp://www.unicode.org/charts/PDF/U0370.pdf\t                    -->\n<!-- ====================================================================== -->\t\n\t\t<xsl:when test=\"starts-with($content,'&#x00393;')\"><xsl:value-of select=\"'\\Gamma '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x00393;')\"/></xsl:call-template></xsl:when>\t<!--/Gamma capital Gamma, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x00394;')\"><xsl:value-of select=\"'\\Delta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x00394;')\"/></xsl:call-template></xsl:when>\t<!--/Delta capital Delta, Greek -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x00398;')\"><xsl:value-of select=\"'\\Theta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x00398;')\"/></xsl:call-template></xsl:when>\t<!--/Theta capital Theta, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0039B;')\"><xsl:value-of select=\"'\\Lambda '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0039B;')\"/></xsl:call-template></xsl:when>\t<!--/Lambda capital Lambda, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0039E;')\"><xsl:value-of select=\"'\\Xi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0039E;')\"/></xsl:call-template></xsl:when>\t<!--/Xi capital Xi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003A0;')\"><xsl:value-of select=\"'\\Pi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003A0;')\"/></xsl:call-template></xsl:when>\t<!--/Pi capital Pi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003A3;')\"><xsl:value-of select=\"'\\Sigma '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003A3;')\"/></xsl:call-template></xsl:when>\t<!--/Sigma capital Sigma, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003A6;')\"><xsl:value-of select=\"'\\Phi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003A6;')\"/></xsl:call-template></xsl:when>\t<!--/Phi capital Phi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003A8;')\"><xsl:value-of select=\"'\\Psi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003A8;')\"/></xsl:call-template></xsl:when>\t<!--/Psi capital Psi, 
Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003A9;')\"><xsl:value-of select=\"'\\Omega '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003A9;')\"/></xsl:call-template></xsl:when>\t<!--/Omega capital Omega, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B1;')\"><xsl:value-of select=\"'\\alpha '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B1;')\"/></xsl:call-template></xsl:when>\t<!--/alpha small alpha, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B2;')\"><xsl:value-of select=\"'\\beta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B2;')\"/></xsl:call-template></xsl:when>\t<!--/beta small beta, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B3;')\"><xsl:value-of select=\"'\\gamma '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B3;')\"/></xsl:call-template></xsl:when>\t<!--/gamma small gamma, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B4;')\"><xsl:value-of select=\"'\\delta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B4;')\"/></xsl:call-template></xsl:when>\t<!--/delta small delta, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B5;')\"><xsl:value-of select=\"'\\epsilon '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B5;')\"/></xsl:call-template></xsl:when>\t<!--/straightepsilon, small epsilon, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B6;')\"><xsl:value-of select=\"'\\zeta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x003B6;')\"/></xsl:call-template></xsl:when>\t<!--/zeta small zeta, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B7;')\"><xsl:value-of select=\"'\\eta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B7;')\"/></xsl:call-template></xsl:when>\t<!--/eta small eta, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B8;')\"><xsl:value-of select=\"'\\theta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B8;')\"/></xsl:call-template></xsl:when>\t<!--/theta straight theta, small theta, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003B9;')\"><xsl:value-of select=\"'\\iota '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003B9;')\"/></xsl:call-template></xsl:when>\t<!--/iota small iota, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003BA;')\"><xsl:value-of select=\"'\\kappa '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003BA;')\"/></xsl:call-template></xsl:when>\t<!--/kappa small kappa, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003BB;')\"><xsl:value-of select=\"'\\lambda '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003BB;')\"/></xsl:call-template></xsl:when>\t<!--/lambda small lambda, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003BC;')\"><xsl:value-of select=\"'\\mu '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003BC;')\"/></xsl:call-template></xsl:when>\t<!--/mu small mu, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003BD;')\"><xsl:value-of select=\"'\\nu '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" 
select=\"substring-after($content, '&#x003BD;')\"/></xsl:call-template></xsl:when>\t<!--/nu small nu, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003BE;')\"><xsl:value-of select=\"'\\xi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003BE;')\"/></xsl:call-template></xsl:when>\t<!--/xi small xi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C0;')\"><xsl:value-of select=\"'\\pi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C0;')\"/></xsl:call-template></xsl:when>\t<!--/pi small pi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C1;')\"><xsl:value-of select=\"'\\rho '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C1;')\"/></xsl:call-template></xsl:when>\t<!--/rho small rho, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C2;')\"><xsl:value-of select=\"'\\varsigma '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C2;')\"/></xsl:call-template></xsl:when>\t<!--/varsigma -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C3;')\"><xsl:value-of select=\"'\\sigma '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C3;')\"/></xsl:call-template></xsl:when>\t<!--/sigma small sigma, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C4;')\"><xsl:value-of select=\"'\\tau '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C4;')\"/></xsl:call-template></xsl:when>\t<!--/tau small tau, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C5;')\"><xsl:value-of select=\"'\\upsilon '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" 
select=\"substring-after($content, '&#x003C5;')\"/></xsl:call-template></xsl:when>\t<!--/upsilon small upsilon, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C6;')\"><xsl:value-of select=\"'\\phi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C6;')\"/></xsl:call-template></xsl:when>\t<!--/straightphi - small phi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C7;')\"><xsl:value-of select=\"'\\chi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C7;')\"/></xsl:call-template></xsl:when>\t<!--/chi small chi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C8;')\"><xsl:value-of select=\"'\\psi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C8;')\"/></xsl:call-template></xsl:when>\t<!--/psi small psi, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003C9;')\"><xsl:value-of select=\"'\\omega '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003C9;')\"/></xsl:call-template></xsl:when>\t<!--/omega small omega, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003D1;')\"><xsl:value-of select=\"'\\vartheta '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003D1;')\"/></xsl:call-template></xsl:when>\t<!--/vartheta - curly or open theta -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003D2;')\"><xsl:value-of select=\"'\\Upsilon '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003D2;')\"/></xsl:call-template></xsl:when>\t<!--/Upsilon capital Upsilon, Greek -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003D5;')\"><xsl:value-of select=\"'\\varphi '\" /><xsl:call-template 
name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003D5;')\"/></xsl:call-template></xsl:when>\t<!--/varphi - curly or open phi -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003D6;')\"><xsl:value-of select=\"'\\varpi '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003D6;')\"/></xsl:call-template></xsl:when>\t\t<!--/varpi -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003F0;')\"><xsl:value-of select=\"'\\varkappa '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003F0;')\"/></xsl:call-template></xsl:when>\t<!--/varkappa -->\n\t\t<xsl:when test=\"starts-with($content,'&#x003F1;')\"><xsl:value-of select=\"'\\varrho '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x003F1;')\"/></xsl:call-template></xsl:when>\t<!--/varrho -->\n\t\t\n<!-- ====================================================================== -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0200B;')\"><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0200B;')\"/></xsl:call-template></xsl:when>\t\t\t\t\t\t<!--short form of  &InvisibleComma; -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02026;')\"><xsl:value-of select=\"'\\dots '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02026;')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test=\"starts-with($content,'&#x02032;')\"><xsl:value-of select=\"'\\prime '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02032;')\"/></xsl:call-template></xsl:when>\t\t<!--/prime prime or minute -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02061;')\"><xsl:call-template 
name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02061;')\"/></xsl:call-template></xsl:when>\t\t\t\t\t\t<!-- ApplyFunction -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02062;')\"><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02062;')\"/></xsl:call-template></xsl:when>\t\t\t\t\t\t<!-- InvisibleTimes -->\n<!-- ====================================================================== -->\n<!-- \tUnicode 3.2\n\tLetterlike Symbols\n\tRange: 2100-214F\n\thttp://www.unicode.org/charts/PDF/U2100.pdf\t                    -->\n<!-- ====================================================================== -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0210F;&#x0FE00;')\"><xsl:value-of select=\"'\\hbar '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0210F;&#x0FE00;')\"/></xsl:call-template></xsl:when>\t<!--/hbar - Planck's over 2pi -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0210F;')\"><xsl:value-of select=\"'\\hslash '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0210F;')\"/></xsl:call-template></xsl:when>\t<!--/hslash - variant Planck's over 2pi --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02111;')\"><xsl:value-of select=\"'\\Im '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02111;')\"/></xsl:call-template></xsl:when>\t\t<!--/Im - imaginary   -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02113;')\"><xsl:value-of select=\"'\\ell '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02113;')\"/></xsl:call-template></xsl:when>\t\t<!--/ell - cursive small l -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x02118;')\"><xsl:value-of select=\"'\\wp '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02118;')\"/></xsl:call-template></xsl:when>\t\t<!--/wp - Weierstrass p -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0211C;')\"><xsl:value-of select=\"'\\Re '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0211C;')\"/></xsl:call-template></xsl:when>\t\t<!--/Re - real -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02127;')\"><xsl:value-of select=\"'\\mho '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02127;')\"/></xsl:call-template></xsl:when>\t\t<!--/mho - conductance -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02135;')\"><xsl:value-of select=\"'\\aleph '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02135;')\"/></xsl:call-template></xsl:when>\t\t<!--/aleph aleph, Hebrew -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02136;')\"><xsl:value-of select=\"'\\beth '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02136;')\"/></xsl:call-template></xsl:when>\t\t<!--/beth - beth, Hebrew --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02137;')\"><xsl:value-of select=\"'\\gimel '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02137;')\"/></xsl:call-template></xsl:when>\t\t<!--/gimel - gimel, Hebrew --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02138;')\"><xsl:value-of select=\"'\\daleth '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x02138;')\"/></xsl:call-template></xsl:when>\t<!--/daleth - daleth, Hebrew --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02145;')\"><xsl:value-of select=\"'D'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02145;')\"/></xsl:call-template></xsl:when>\t\t<!--D for use in differentials, e.g., within integrals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02146;')\"><xsl:value-of select=\"'d'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02146;')\"/></xsl:call-template></xsl:when>\t\t<!--d for use in differentials, e.g., within integrals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02147;')\"><xsl:value-of select=\"'e'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02147;')\"/></xsl:call-template></xsl:when>\t\t<!--e use for the exponential base of the natural logarithms -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02148;')\"><xsl:value-of select=\"'i'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02148;')\"/></xsl:call-template></xsl:when>\t\t<!--i for use as a square root of -1 -->\n\n<!-- ====================================================================== -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02192;')\"><xsl:value-of select=\"'\\to '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02192;')\"/></xsl:call-template></xsl:when>\t\t<!--/rightarrow /to A: =rightward arrow -->\n\t\t\n<!-- ====================================================================== -->\n<!-- \tUnicode 3.2\n\tMathematical Operators\n\tRange: 2200-22FF\n\thttp://www.unicode.org/charts/PDF/U2200.pdf                         -->\n<!-- 
====================================================================== -->\t\n\t\t<xsl:when test=\"starts-with($content,'&#x02200;')\"><xsl:value-of select=\"'\\forall '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02200;')\"/></xsl:call-template></xsl:when>\t<!--/forall for all -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02201;')\"><xsl:value-of select=\"'\\complement '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02201;')\"/></xsl:call-template></xsl:when>\t<!--/complement - complement sign --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02202;')\"><xsl:value-of select=\"'\\partial '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02202;')\"/></xsl:call-template></xsl:when>\t<!--/partial partial differential -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02203;')\"><xsl:value-of select=\"'\\exists '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02203;')\"/></xsl:call-template></xsl:when>\t<!--/exists at least one exists -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02204;')\"><xsl:value-of select=\"'\\nexists '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02204;')\"/></xsl:call-template></xsl:when>\t<!--/nexists - negated exists --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02205;&#x0FE00;')\"><xsl:value-of select=\"'\\emptyset '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02205;&#x0FE00;')\"/></xsl:call-template></xsl:when>\t<!--/emptyset - zero, slash -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02205;')\"><xsl:value-of select=\"'\\varnothing 
'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02205;')\"/></xsl:call-template></xsl:when>\t<!--/varnothing - circle, slash --> <!-- Required amssymb -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02206;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02206;')\"/></xsl:call-template></xsl:when>-->\n\t\t<xsl:when test=\"starts-with($content,'&#x02207;')\"><xsl:value-of select=\"'\\nabla '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02207;')\"/></xsl:call-template></xsl:when>\t\t<!--/nabla del, Hamilton operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02208;')\"><xsl:value-of select=\"'\\in '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02208;')\"/></xsl:call-template></xsl:when>\t\t<!--/in R: set membership  -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02209;')\"><xsl:value-of select=\"'\\notin '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02209;')\"/></xsl:call-template></xsl:when>\t\t<!--/notin N: negated set membership -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0220B;')\"><xsl:value-of select=\"'\\ni '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0220B;')\"/></xsl:call-template></xsl:when>\t\t<!--/ni /owns R: contains -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0220C;')\"><xsl:value-of select=\"'\\not\\ni '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0220C;')\"/></xsl:call-template></xsl:when>\t<!--negated contains -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x0220F;')\"><xsl:value-of select=\"'\\prod '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0220F;')\"/></xsl:call-template></xsl:when>\t\t<!--/prod L: product operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02210;')\"><xsl:value-of select=\"'\\coprod '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02210;')\"/></xsl:call-template></xsl:when>\t<!--/coprod L: coproduct operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02211;')\"><xsl:value-of select=\"'\\sum '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02211;')\"/></xsl:call-template></xsl:when>\t\t<!--/sum L: summation operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02212;')\"><xsl:value-of select=\"'-'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02212;')\"/></xsl:call-template></xsl:when>\t\t<!--B: minus sign -->\t\t\n\t\t<xsl:when test=\"starts-with($content,'&#x02213;')\"><xsl:value-of select=\"'\\mp '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02213;')\"/></xsl:call-template></xsl:when>\t\t<!--/mp B: minus-or-plus sign -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02214;')\"><xsl:value-of select=\"'\\dotplus '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02214;')\"/></xsl:call-template></xsl:when>\t<!--/dotplus B: plus sign, dot above --> <!-- Required amssymb -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02215;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x02215;')\"/></xsl:call-template></xsl:when>-->\n\t\t<xsl:when test=\"starts-with($content,'&#x02216;')\"><xsl:value-of select=\"'\\setminus '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02216;')\"/></xsl:call-template></xsl:when>\t<!--/setminus B: reverse solidus -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02217;')\"><xsl:value-of select=\"'\\ast '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02217;')\"/></xsl:call-template></xsl:when>\t\t<!--low asterisk -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02218;')\"><xsl:value-of select=\"'\\circ '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02218;')\"/></xsl:call-template></xsl:when>\t\t<!--/circ B: composite function (small circle) -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02219;')\"><xsl:value-of select=\"'\\bullet '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02219;')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test=\"starts-with($content,'&#x0221A;')\"><xsl:value-of select=\"'\\surd '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0221A;')\"/></xsl:call-template></xsl:when>\t\t<!--/surd radical -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0221D;')\"><xsl:value-of select=\"'\\propto '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0221D;')\"/></xsl:call-template></xsl:when>\t<!--/propto R: is proportional to -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0221E;')\"><xsl:value-of select=\"'\\infty '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x0221E;')\"/></xsl:call-template></xsl:when>\t\t<!--/infty infinity -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0221F;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0221F;')\"/></xsl:call-template></xsl:when>\t\tright (90 degree) angle -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02220;')\"><xsl:value-of select=\"'\\angle '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02220;')\"/></xsl:call-template></xsl:when>\t\t<!--/angle - angle -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02221;')\"><xsl:value-of select=\"'\\measuredangle '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02221;')\"/></xsl:call-template></xsl:when>\t<!--/measuredangle - angle-measured -->\t<!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02222;')\"><xsl:value-of select=\"'\\sphericalangle '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02222;')\"/></xsl:call-template></xsl:when><!--/sphericalangle angle-spherical -->\t<!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02223;')\"><xsl:value-of select=\"'\\mid '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02223;')\"/></xsl:call-template></xsl:when>\t\t<!--/mid R: -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02224;&#x0FE00;')\"><xsl:value-of select=\"'\\nshortmid '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02224;&#x0FE00;')\"/></xsl:call-template></xsl:when>\t<!--/nshortmid --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02224;')\"><xsl:value-of select=\"'\\nmid '\" 
/><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02224;')\"/></xsl:call-template></xsl:when>\t\t<!--/nmid --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02225;')\"><xsl:value-of select=\"'\\parallel '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02225;')\"/></xsl:call-template></xsl:when>\t<!--/parallel R: parallel -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02226;&#x0FE00;')\"><xsl:value-of select=\"'\\nshortparallel '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02226;&#x0FE00;')\"/></xsl:call-template></xsl:when>\t<!--/nshortparallel N: not short par --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02226;')\"><xsl:value-of select=\"'\\nparallel '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02226;')\"/></xsl:call-template></xsl:when>\t<!--/nparallel N: not parallel --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02227;')\"><xsl:value-of select=\"'\\wedge '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02227;')\"/></xsl:call-template></xsl:when>\t\t<!--/wedge /land B: logical and -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02228;')\"><xsl:value-of select=\"'\\vee '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02228;')\"/></xsl:call-template></xsl:when>\t\t<!--/vee /lor B: logical or -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02229;')\"><xsl:value-of select=\"'\\cap '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x02229;')\"/></xsl:call-template></xsl:when>\t\t<!--/cap B: intersection -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0222A;')\"><xsl:value-of select=\"'\\cup '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0222A;')\"/></xsl:call-template></xsl:when>\t\t<!--/cup B: union or logical sum -->\t\t\n\t\t<xsl:when test=\"starts-with($content,'&#x0222B;')\"><xsl:value-of select=\"'\\int '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0222B;')\"/></xsl:call-template></xsl:when>\t\t<!--/int L: integral operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0222C;')\"><xsl:value-of select=\"'\\iint '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0222C;')\"/></xsl:call-template></xsl:when>\t\t<!--double integral operator --> <!-- Required amsmath -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0222D;')\"><xsl:value-of select=\"'\\iiint '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0222D;')\"/></xsl:call-template></xsl:when>\t\t<!--/iiint triple integral operator -->\t<!-- Required amsmath -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0222E;')\"><xsl:value-of select=\"'\\oint '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0222E;')\"/></xsl:call-template></xsl:when>\t\t<!--/oint L: contour integral operator -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0222F;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0222F;')\"/></xsl:call-template></xsl:when>-->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02230;')\"><xsl:value-of select=\"' '\" /><xsl:call-template 
name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02230;')\"/></xsl:call-template></xsl:when>-->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02231;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02231;')\"/></xsl:call-template></xsl:when>-->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02232;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02232;')\"/></xsl:call-template></xsl:when>-->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02233;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02233;')\"/></xsl:call-template></xsl:when>-->\n\t\t<xsl:when test=\"starts-with($content,'&#x02234;')\"><xsl:value-of select=\"'\\therefore '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02234;')\"/></xsl:call-template></xsl:when>\t<!--/therefore R: therefore --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02235;')\"><xsl:value-of select=\"'\\because '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02235;')\"/></xsl:call-template></xsl:when>\t<!--/because R: because --> <!-- Required amssymb -->\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x02236;')\"><xsl:value-of select=\"':'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02236;')\"/></xsl:call-template></xsl:when>\t\t<!--/ratio -->\n<!-- ? 
-->\t<xsl:when test=\"starts-with($content,'&#x02237;')\"><xsl:value-of select=\"'\\colon\\colon '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02237;')\"/></xsl:call-template></xsl:when>\t<!--/Colon, two colons -->\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x02238;')\"><xsl:value-of select=\"'\\dot{-}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02238;')\"/></xsl:call-template></xsl:when>\t\t<!--/dotminus B: minus sign, dot above -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02239;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02239;')\"/></xsl:call-template></xsl:when>\t\t-->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0223A;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0223A;')\"/></xsl:call-template></xsl:when>\t\tminus with four dots, geometric properties -->\t\t\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0223B;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0223B;')\"/></xsl:call-template></xsl:when>\t\thomothetic -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0223C;')\"><xsl:value-of select=\"'\\sim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0223C;')\"/></xsl:call-template></xsl:when>\t\t<!--/sim R: similar -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0223D;')\"><xsl:value-of select=\"'\\backsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0223D;')\"/></xsl:call-template></xsl:when>\t<!--/backsim R: 
reverse similar --> <!-- Required amssymb -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0223E;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0223E;')\"/></xsl:call-template></xsl:when>\t\tmost positive -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0223F;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0223F;')\"/></xsl:call-template></xsl:when>\t\tac current -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02240;')\"><xsl:value-of select=\"'\\wr '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02240;')\"/></xsl:call-template></xsl:when>\t\t<!--/wr B: wreath product -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02241;')\"><xsl:value-of select=\"'\\nsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02241;')\"/></xsl:call-template></xsl:when>\t\t<!--/nsim N: not similar --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02242;')\"><xsl:value-of select=\"'\\eqsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02242;')\"/></xsl:call-template></xsl:when>\t\t<!--/esim R: equals, similar --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02243;')\"><xsl:value-of select=\"'\\simeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02243;')\"/></xsl:call-template></xsl:when>\t\t<!--/simeq R: similar, equals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02244;')\"><xsl:value-of select=\"'\\not\\simeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" 
select=\"substring-after($content, '&#x02244;')\"/></xsl:call-template></xsl:when>\t<!--/nsimeq N: not similar, equals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02245;')\"><xsl:value-of select=\"'\\cong '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02245;')\"/></xsl:call-template></xsl:when>\t\t<!--/cong R: congruent with -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02246;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02246;')\"/></xsl:call-template></xsl:when>\t\tsimilar, not equals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02247;')\"><xsl:value-of select=\"'\\ncong '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02247;')\"/></xsl:call-template></xsl:when>\t\t<!--/ncong N: not congruent with --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02248;')\"><xsl:value-of select=\"'\\approx '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02248;')\"/></xsl:call-template></xsl:when>\t<!--/approx R: approximate -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02249;&#x00338;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02249;&#x00338;')\"/></xsl:call-template></xsl:when>\tnot, vert, approximate -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02249;')\"><xsl:value-of select=\"'\\not\\approx '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02249;')\"/></xsl:call-template></xsl:when>\t<!--/napprox N: not approximate -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0224A;')\"><xsl:value-of select=\"'\\approxeq '\" 
/><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0224A;')\"/></xsl:call-template></xsl:when>\t<!--/approxeq R: approximate, equals --> <!-- Required amssymb -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0224B;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0224B;')\"/></xsl:call-template></xsl:when>\t\tapproximately identical to -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0224C;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0224C;')\"/></xsl:call-template></xsl:when>\t\t/backcong R: reverse congruent -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0224D;')\"><xsl:value-of select=\"'\\asymp '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0224D;')\"/></xsl:call-template></xsl:when>\t\t<!--/asymp R: asymptotically equal to -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0224E;')\"><xsl:value-of select=\"'\\Bumpeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0224E;')\"/></xsl:call-template></xsl:when>\t<!--/Bumpeq R: bumpy equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0224F;')\"><xsl:value-of select=\"'\\bumpeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0224F;')\"/></xsl:call-template></xsl:when>\t<!--/bumpeq R: bumpy equals, equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02250;')\"><xsl:value-of select=\"'\\doteq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x02250;')\"/></xsl:call-template></xsl:when>\t\t<!--/doteq R: equals, single dot above -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02251;')\"><xsl:value-of select=\"'\\doteqdot '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02251;')\"/></xsl:call-template></xsl:when>\t<!--/doteqdot /Doteq R: eq, even dots --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02252;')\"><xsl:value-of select=\"'\\fallingdotseq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02252;')\"/></xsl:call-template></xsl:when>\t<!--/fallingdotseq R: eq, falling dots --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02253;')\"><xsl:value-of select=\"'\\risingdotseq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02253;')\"/></xsl:call-template></xsl:when>\t<!--/risingdotseq R: eq, rising dots --> <!-- Required amssymb -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02254;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02254;')\"/></xsl:call-template></xsl:when>\t\t/coloneq R: colon, equals -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02255;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02255;')\"/></xsl:call-template></xsl:when>\t\t/eqcolon R: equals, colon -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02256;')\"><xsl:value-of select=\"'\\eqcirc '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02256;')\"/></xsl:call-template></xsl:when>\t<!--/eqcirc R: circle on equals sign --> <!-- Required amssymb -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x02257;')\"><xsl:value-of select=\"'\\circeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02257;')\"/></xsl:call-template></xsl:when>\t<!--/circeq R: circle, equals --> <!-- Required amssymb -->\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x02258;')\"><xsl:value-of select=\"'\\stackrel{\\frown}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02258;')\"/></xsl:call-template></xsl:when>\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x02259;')\"><xsl:value-of select=\"'\\stackrel{\\wedge}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02259;')\"/></xsl:call-template></xsl:when>\t<!--/wedgeq R: corresponds to (wedge, equals) -->\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x0225A;')\"><xsl:value-of select=\"'\\stackrel{\\vee}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0225A;')\"/></xsl:call-template></xsl:when>\t<!--logical or, equals -->\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x0225B;')\"><xsl:value-of select=\"'\\stackrel{\\star}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0225B;')\"/></xsl:call-template></xsl:when>\t<!--equal, asterisk above -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0225C;')\"><xsl:value-of select=\"'\\triangleq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0225C;')\"/></xsl:call-template></xsl:when>\t<!--/triangleq R: triangle, equals --> <!-- Required amssymb -->\n<!-- ? 
-->\t<xsl:when test=\"starts-with($content,'&#x0225D;')\"><xsl:value-of select=\"'\\stackrel{\\scriptscriptstyle\\mathrm{def}}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0225D;')\"/></xsl:call-template></xsl:when>\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x0225E;')\"><xsl:value-of select=\"'\\stackrel{\\scriptscriptstyle\\mathrm{m}}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0225E;')\"/></xsl:call-template></xsl:when>\t\n<!-- ? -->\t<xsl:when test=\"starts-with($content,'&#x0225F;')\"><xsl:value-of select=\"'\\stackrel{?}{=}'\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0225F;')\"/></xsl:call-template></xsl:when>\t<!--/questeq R: equal with questionmark -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02260;&#x0FE00;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02260;&#x0FE00;')\"/></xsl:call-template></xsl:when>\tnot equal, dot -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02260;')\"><xsl:value-of select=\"'\\ne '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02260;')\"/></xsl:call-template></xsl:when>\t\t<!--/ne /neq R: not equal -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02261;&#x020E5;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02261;&#x020E5;')\"/></xsl:call-template></xsl:when>\treverse not equivalent -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02261;')\"><xsl:value-of select=\"'\\equiv '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x02261;')\"/></xsl:call-template></xsl:when>\t\t<!--/equiv R: identical with -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02262;')\"><xsl:value-of select=\"'\\not\\equiv '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02262;')\"/></xsl:call-template></xsl:when>\t<!--/nequiv N: not identical with -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x02263;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02263;')\"/></xsl:call-template></xsl:when>\t\t-->\n\t\t<xsl:when test=\"starts-with($content,'&#x02264;')\"><xsl:value-of select=\"'\\le '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02264;')\"/></xsl:call-template></xsl:when>\t\t<!--/leq /le R: less-than-or-equal -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02265;')\"><xsl:value-of select=\"'\\ge '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02265;')\"/></xsl:call-template></xsl:when>\t\t<!--/geq /ge R: greater-than-or-equal -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02266;')\"><xsl:value-of select=\"'\\leqq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02266;')\"/></xsl:call-template></xsl:when>\t\t<!--/leqq R: less, double equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02267;')\"><xsl:value-of select=\"'\\geqq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02267;')\"/></xsl:call-template></xsl:when>\t\t<!--/geqq R: greater, double equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02268;')\"><xsl:value-of select=\"'\\lneqq '\" /><xsl:call-template 
name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02268;')\"/></xsl:call-template></xsl:when>\t\t<!--/lneqq N: less, not double equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02269;')\"><xsl:value-of select=\"'\\gneqq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02269;')\"/></xsl:call-template></xsl:when>\t\t<!--/gneqq N: greater, not dbl equals --> <!-- Required amssymb -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0226A;&#x00338;&#x0FE00;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226A;&#x00338;&#x0FE00;')\"/></xsl:call-template></xsl:when>\tnot much less than, variant -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0226A;&#x00338;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226A;&#x00338;')\"/></xsl:call-template></xsl:when>\tnot, vert, much less than -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0226A;')\"><xsl:value-of select=\"'\\ll '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226A;')\"/></xsl:call-template></xsl:when>\t\t<!--/ll R: double less-than sign -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0226B;&#x00338;&#x0FE00;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226B;&#x00338;&#x0FE00;')\"/></xsl:call-template></xsl:when>\tnot much greater than, variant -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x0226B;&#x00338;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, 
'&#x0226B;&#x00338;')\"/></xsl:call-template></xsl:when>\tnot, vert, much greater than -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0226B;')\"><xsl:value-of select=\"'\\gg '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226B;')\"/></xsl:call-template></xsl:when>\t\t<!--/gg R: dbl greater-than sign -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0226C;')\"><xsl:value-of select=\"'\\between '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226C;')\"/></xsl:call-template></xsl:when>\t<!--/between R: between --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0226D;')\"><xsl:value-of select=\"'\\not\\asymp '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226D;')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test=\"starts-with($content,'&#x0226E;')\"><xsl:value-of select=\"'\\nless '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226E;')\"/></xsl:call-template></xsl:when>\t\t<!--/nless N: not less-than --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0226F;')\"><xsl:value-of select=\"'\\ngtr '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0226F;')\"/></xsl:call-template></xsl:when>\t\t<!--/ngtr N: not greater-than --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02270;&#x020E5;')\"><xsl:value-of select=\"'\\nleq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02270;&#x020E5;')\"/></xsl:call-template></xsl:when>\t<!--/nleq N: not less-than-or-equal --> <!-- Required amssymb -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x02270;')\"><xsl:value-of select=\"'\\nleqq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02270;')\"/></xsl:call-template></xsl:when>\t\t<!--/nleqq N: not less, dbl equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02271;&#x020E5;')\"><xsl:value-of select=\"'\\ngeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02271;&#x020E5;')\"/></xsl:call-template></xsl:when>\t<!--/ngeq N: not greater-than-or-equal --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02271;')\"><xsl:value-of select=\"'\\ngeqq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02271;')\"/></xsl:call-template></xsl:when>\t\t<!--/ngeqq N: not greater, dbl equals --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02272;')\"><xsl:value-of select=\"'\\lesssim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02272;')\"/></xsl:call-template></xsl:when>\t<!--/lesssim R: less, similar --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02273;')\"><xsl:value-of select=\"'\\gtrsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02273;')\"/></xsl:call-template></xsl:when>\t<!--/gtrsim R: greater, similar --> <!-- Required amssymb -->\t\t\n\t\t<xsl:when test=\"starts-with($content,'&#x02274;')\"><xsl:value-of select=\"'\\not\\lesssim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02274;')\"/></xsl:call-template></xsl:when>\t<!--not less, similar --> <!-- Required amssymb -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x02275;')\"><xsl:value-of select=\"'\\not\\gtrsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02275;')\"/></xsl:call-template></xsl:when>\t<!--not greater, similar --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02276;')\"><xsl:value-of select=\"'\\lessgtr '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02276;')\"/></xsl:call-template></xsl:when>\t<!--/lessgtr R: less, greater --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02277;')\"><xsl:value-of select=\"'\\gtrless '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02277;')\"/></xsl:call-template></xsl:when>\t<!--/gtrless R: greater, less --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02278;')\"><xsl:value-of select=\"'\\not\\lessgtr '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02278;')\"/></xsl:call-template></xsl:when>\t<!--not less, greater --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02279;')\"><xsl:value-of select=\"'\\not\\gtrless '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02279;')\"/></xsl:call-template></xsl:when>\t<!--not greater, less --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0227A;')\"><xsl:value-of select=\"'\\prec '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0227A;')\"/></xsl:call-template></xsl:when>\t\t<!--/prec R: precedes -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0227B;')\"><xsl:value-of select=\"'\\succ '\" /><xsl:call-template 
name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0227B;')\"/></xsl:call-template></xsl:when>\t\t<!--/succ R: succeeds -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0227C;')\"><xsl:value-of select=\"'\\preccurlyeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0227C;')\"/></xsl:call-template></xsl:when>\t<!--/preccurlyeq R: precedes, curly eq --> <!-- Required amssymb -->\t\t\n\t\t<xsl:when test=\"starts-with($content,'&#x0227D;')\"><xsl:value-of select=\"'\\succcurlyeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0227D;')\"/></xsl:call-template></xsl:when>\t<!--/succcurlyeq R: succeeds, curly eq --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0227E;')\"><xsl:value-of select=\"'\\precsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0227E;')\"/></xsl:call-template></xsl:when>\t<!--/precsim R: precedes, similar --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0227F;')\"><xsl:value-of select=\"'\\succsim '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0227F;')\"/></xsl:call-template></xsl:when>\t<!--/succsim R: succeeds, similar --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02280;')\"><xsl:value-of select=\"'\\nprec '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02280;')\"/></xsl:call-template></xsl:when>\t\t<!--/nprec N: not precedes --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02281;')\"><xsl:value-of select=\"'\\nsucc '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" 
select=\"substring-after($content, '&#x02281;')\"/></xsl:call-template></xsl:when>\t\t<!--/nsucc N: not succeeds --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02282;')\"><xsl:value-of select=\"'\\subset '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02282;')\"/></xsl:call-template></xsl:when>\t<!--/subset R: subset or is implied by -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02283;')\"><xsl:value-of select=\"'\\supset '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02283;')\"/></xsl:call-template></xsl:when>\t<!--/supset R: superset or implies -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02284;')\"><xsl:value-of select=\"'\\not\\subset '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02284;')\"/></xsl:call-template></xsl:when>\t<!--not subset -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02285;')\"><xsl:value-of select=\"'\\not\\supset '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02285;')\"/></xsl:call-template></xsl:when>\t<!--not superset -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02286;')\"><xsl:value-of select=\"'\\subseteq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02286;')\"/></xsl:call-template></xsl:when>\t<!--/subseteq R: subset, equals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02287;')\"><xsl:value-of select=\"'\\supseteq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02287;')\"/></xsl:call-template></xsl:when>\t<!--/supseteq R: superset, equals -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0228E;')\"><xsl:value-of select=\"'\\uplus '\" 
/><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0228E;')\"/></xsl:call-template></xsl:when>\t\t<!--/uplus B: plus sign in union -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02293;')\"><xsl:value-of select=\"'\\sqcap '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02293;')\"/></xsl:call-template></xsl:when>\t\t<!--/sqcap B: square intersection -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02294;')\"><xsl:value-of select=\"'\\sqcup '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02294;')\"/></xsl:call-template></xsl:when>\t\t<!--/sqcup B: square union -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02295;')\"><xsl:value-of select=\"'\\oplus '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02295;')\"/></xsl:call-template></xsl:when>\t\t<!--/oplus B: plus sign in circle -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02296;')\"><xsl:value-of select=\"'\\ominus '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02296;')\"/></xsl:call-template></xsl:when>\t<!--/ominus B: minus sign in circle -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02297;')\"><xsl:value-of select=\"'\\otimes '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02297;')\"/></xsl:call-template></xsl:when>\t<!--/otimes B: multiply sign in circle -->\n\t\t<xsl:when test=\"starts-with($content,'&#x02298;')\"><xsl:value-of select=\"'\\oslash '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02298;')\"/></xsl:call-template></xsl:when>\t<!--/oslash B: solidus in circle -->\n<!-- ? 
-->\t<xsl:when test=\"starts-with($content,'&#x02299;')\"><xsl:value-of select=\"'\\odot '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x02299;')\"/></xsl:call-template></xsl:when>\t\t<!--/odot B: middle dot in circle --> <!--/bigodot L: circle dot operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x0229F;')\"><xsl:value-of select=\"'\\boxminus '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x0229F;')\"/></xsl:call-template></xsl:when>\t<!--/boxminus B: minus sign in box --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022A4;')\"><xsl:value-of select=\"'\\top '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022A4;')\"/></xsl:call-template></xsl:when>\t\t<!--/top top -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022A5;')\"><xsl:value-of select=\"'\\perp '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022A5;')\"/></xsl:call-template></xsl:when>\t\t<!--/perp R: perpendicular --><!--/bot bottom -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022A6;')\"><xsl:value-of select=\"'\\vdash '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022A6;')\"/></xsl:call-template></xsl:when>\t\t<!--/vdash R: vertical, dash -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022A7;')\"><xsl:value-of select=\"'\\vDash '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022A7;')\"/></xsl:call-template></xsl:when>\t\t<!--/vDash R: vertical, dbl dash --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022A8;')\"><xsl:value-of select=\"'\\models '\" /><xsl:call-template 
name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022A8;')\"/></xsl:call-template></xsl:when>\t<!--/models R: -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022AA;')\"><xsl:value-of select=\"'\\Vvdash '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022AA;')\"/></xsl:call-template></xsl:when>\t<!--/Vvdash R: triple vertical, dash --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C0;')\"><xsl:value-of select=\"'\\bigwedge '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C0;')\"/></xsl:call-template></xsl:when>\t<!--/bigwedge L: logical and operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C1;')\"><xsl:value-of select=\"'\\bigvee '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C1;')\"/></xsl:call-template></xsl:when>\t<!--/bigvee L: logical or operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C2;')\"><xsl:value-of select=\"'\\bigcap '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C2;')\"/></xsl:call-template></xsl:when>\t<!--/bigcap L: intersection operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C3;')\"><xsl:value-of select=\"'\\bigcup '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C3;')\"/></xsl:call-template></xsl:when>\t<!--/bigcup L: union operator -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C4;')\"><xsl:value-of select=\"'\\diamond '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C4;')\"/></xsl:call-template></xsl:when>\t<!--/diamond B: open diamond -->\n\t\t<xsl:when 
test=\"starts-with($content,'&#x022C5;')\"><xsl:value-of select=\"'\\cdot '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C5;')\"/></xsl:call-template></xsl:when>\t\t<!--/cdot B: small middle dot -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C6;')\"><xsl:value-of select=\"'\\star '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C6;')\"/></xsl:call-template></xsl:when>\t\t<!--/star B: small star, filled -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C7;')\"><xsl:value-of select=\"'\\divideontimes '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C7;')\"/></xsl:call-template></xsl:when>\t<!--/divideontimes B: division on times --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022C8;')\"><xsl:value-of select=\"'\\bowtie '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022C8;')\"/></xsl:call-template></xsl:when>\t<!--/bowtie R: -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022CD;')\"><xsl:value-of select=\"'\\backsimeq '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022CD;')\"/></xsl:call-template></xsl:when>\t<!--/backsimeq R: reverse similar, eq --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022EF;')\"><xsl:value-of select=\"'\\cdots '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022EF;')\"/></xsl:call-template></xsl:when>\t\t<!--/cdots, three dots, centered -->\n<!--\t\t<xsl:when test=\"starts-with($content,'&#x022F0;')\"><xsl:value-of select=\"' '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" 
select=\"substring-after($content, '&#x022F0;')\"/></xsl:call-template></xsl:when>\t\tthree dots, ascending -->\n\t\t<xsl:when test=\"starts-with($content,'&#x022F1;')\"><xsl:value-of select=\"'\\ddots '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x022F1;')\"/></xsl:call-template></xsl:when>\t\t<!--/ddots, three dots, descending -->\n\n<!-- ====================================================================== -->\t\t\n\t\t<xsl:when test=\"starts-with($content,'&#x025A1;')\"><xsl:value-of select=\"'\\square '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x025A1;')\"/></xsl:call-template></xsl:when>\t<!--/square, square --> <!-- Required amssymb -->\n\t\t<xsl:when test=\"starts-with($content,'&#x025AA;')\"><xsl:value-of select=\"'\\blacksquare '\" /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '&#x025AA;')\"/></xsl:call-template></xsl:when>\t<!--/blacksquare, square, filled  --> <!-- Required amssymb -->\n\t\t\n\t\t<xsl:when test='starts-with($content,\"&apos;\")'><xsl:value-of select='\"\\text{&apos;}\"' /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select='substring-after($content, \"&apos;\")'/></xsl:call-template></xsl:when><!-- \\text required amslatex -->\n\t\t<xsl:when test='starts-with($content,\"(\")'><xsl:value-of select='\"\\left(\"' /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '(')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test='starts-with($content,\")\")'><xsl:value-of select='\"\\right)\"' /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, ')')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test='starts-with($content,\"[\")'><xsl:value-of select='\"\\left[\"' 
/><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '[')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test='starts-with($content,\"]\")'><xsl:value-of select='\"\\right]\"' /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, ']')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test='starts-with($content,\"{\")'><xsl:value-of select='\"\\left\\{\"' /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '{')\"/></xsl:call-template></xsl:when>\n\t\t<xsl:when test='starts-with($content,\"}\")'><xsl:value-of select='\"\\right\\}\"' /><xsl:call-template name=\"replaceEntities\"><xsl:with-param name=\"content\" select=\"substring-after($content, '}')\"/></xsl:call-template></xsl:when>\n\t\t\n\n\t\t<xsl:otherwise>\n\t\t\t<xsl:value-of select=\"substring($content,1,1)\"/>\n\t\t\t<xsl:call-template name=\"replaceEntities\">\n\t\t\t\t<xsl:with-param name=\"content\" select=\"substring($content, 2)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:otherwise>\n\t</xsl:choose></xsl:if>\n</xsl:template>\n\n<xsl:template name=\"replaceMtextEntities\">\n\t<xsl:param name=\"content\"/>\n\t<xsl:choose>\n\t<xsl:when test=\"contains($content,'&#x02009;&#x0200A;&#x0200A;')\">\t<!-- ThickSpace - space of width 5/18 em -->\n\t\t<xsl:call-template name=\"replaceMtextEntities\">\n\t\t\t<xsl:with-param name=\"content\" select=\"concat(substring-before($content,'&#x02009;&#x0200A;&#x0200A;'),'\\hspace{0.28em}',substring-after($content,'&#x02009;&#x0200A;&#x0200A;'))\"/>\n\t\t</xsl:call-template>\n\t</xsl:when>\n\t<xsl:when test=\"contains($content,'&#x02009;')\">\t<!-- ThinSpace - space of width 3/18 em -->\n\t\t<xsl:call-template name=\"replaceMtextEntities\">\n\t\t\t<xsl:with-param name=\"content\" 
select=\"concat(substring-before($content,'&#x02009;'),'\\hspace{0.17em}',substring-after($content,'&#x02009;'))\"/>\n\t\t</xsl:call-template>\n\t</xsl:when>\n\t<xsl:otherwise>\n\t\t<xsl:value-of select=\"normalize-space($content)\"/>\n\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/glayout.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n                version='1.0'>\n\n<!-- ====================================================================== -->\n<!-- $id: glayout.xsl, 2002/17/05 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<xsl:template match=\"m:mfrac\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"@bevelled='true'\">\n<!--\t\t\t<xsl:text>\\raisebox{1ex}{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}\\!\\left/ \\!\\raisebox{-1ex}{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t\t<xsl:text>}\\right.</xsl:text>-->\n\t\t</xsl:when>\n\t\t<xsl:when test=\"@linethickness\">\n\t\t\t<xsl:text>\\genfrac{}{}{</xsl:text>\n\t\t\t<xsl:choose>\n\t\t\t\t<xsl:when test=\"number(@linethickness)\">\n\t\t\t\t\t<xsl:value-of select=\"@linethickness div 10\"/>\n\t\t\t\t\t<xsl:text>ex</xsl:text>\n\t\t\t\t</xsl:when>\n\t\t\t\t<xsl:when test=\"@linethickness='thin'\">\n\t\t\t\t\t<xsl:text>.05ex</xsl:text>\n\t\t\t\t</xsl:when>\n\t\t\t\t<xsl:when test=\"@linethickness='medium'\"/>\n\t\t\t\t<xsl:when test=\"@linethickness='thick'\">\n\t\t\t\t\t<xsl:text>.2ex</xsl:text>\n\t\t\t\t</xsl:when>\n\t\t\t\t<xsl:otherwise>\n\t\t\t\t\t<xsl:value-of select=\"@linethickness\"/>\n\t\t\t\t</xsl:otherwise>\n\t\t\t</xsl:choose>\n\t\t\t<xsl:text>}{}{</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:text>\\frac{</xsl:text>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n\t<xsl:if test=\"@numalign='right'\">\n\t\t<xsl:text>\\hfill </xsl:text>\n\t</xsl:if>\n\t<xsl:apply-templates select=\"./*[1]\"/>\n\t<xsl:if test=\"@numalign='left'\">\n\t\t<xsl:text>\\hfill 
</xsl:text>\n\t</xsl:if>\n\t<xsl:text>}{</xsl:text>\t\n\t<xsl:if test=\"@denomalign='right'\">\n\t\t<xsl:text>\\hfill </xsl:text>\n\t</xsl:if>\n\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t<xsl:if test=\"@denomalign='left'\">\n\t\t<xsl:text>\\hfill </xsl:text>\n\t</xsl:if>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:mroot\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"count(./*)=2\">\n\t\t\t<xsl:text>\\sqrt[</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t\t<xsl:text>]{</xsl:text>\t\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\t\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t<!-- number of arguments is not 2 - code 25 -->\n\t\t\t<xsl:message>exception 25:</xsl:message>\n\t\t\t<xsl:text>\\text{exception 25:}</xsl:text> \n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template match=\"m:msqrt\">\n\t<xsl:text>\\sqrt{</xsl:text>\n\t<xsl:apply-templates/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:mfenced\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"@open\">\n\t\t\t<xsl:if test=\"translate(@open,'{}[]()|','{{{{{{{')='{'\">\n\t\t\t\t<xsl:text>\\left</xsl:text>\n\t\t\t</xsl:if>\n\t\t\t<xsl:if test=\"@open='{' or @open='}'\">\n\t\t\t\t<xsl:text>\\</xsl:text>\n\t\t\t</xsl:if>\n\t\t\t<xsl:value-of select=\"@open\"/>\n\t\t</xsl:when>\n\t\t<xsl:otherwise><xsl:text>\\left(</xsl:text></xsl:otherwise>\n\t</xsl:choose>\n\t<xsl:choose>\n\t\t<xsl:when test=\"count(./*)>1\">\n\t\t\t<xsl:variable name=\"symbol\">\n\t\t\t\t<xsl:choose>\n\t\t\t\t\t<xsl:when test=\"@separators\">\n\t\t\t\t\t\t<xsl:call-template name=\"startspace\">\n\t\t\t\t\t\t\t<xsl:with-param name=\"symbol\" select=\"@separators\"/>\n\t\t\t\t\t\t</xsl:call-template>\n\t\t\t\t\t</xsl:when>\n\t\t\t\t\t<xsl:otherwise>,</xsl:otherwise>\n\t\t\t\t</xsl:choose>\n\t\t\t</xsl:variable>\n\t\t\t<xsl:for-each select=\"./*\">\n\t\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t\t<xsl:if 
test=\"not(position()=last())\">\n\t\t\t\t\t<xsl:choose>\n\t\t\t\t\t\t<xsl:when test=\"position()>string-length($symbol)\">\n\t\t\t\t\t\t\t<xsl:value-of select=\"substring($symbol,string-length($symbol))\"/>\n\t\t\t\t\t\t</xsl:when>\n\t\t\t\t\t\t<xsl:otherwise>\n\t\t\t\t\t\t\t<xsl:value-of select=\"substring($symbol,position(),1)\"/>\n\t\t\t\t\t\t</xsl:otherwise>\n\t\t\t\t\t</xsl:choose>\n\t\t\t\t</xsl:if>\n\t\t\t</xsl:for-each>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:apply-templates/>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n\t<xsl:choose>\n\t\t<xsl:when test=\"@close\">\n\t\t\t<xsl:if test=\"translate(@open,'{}[]()|','{{{{{{{')='{'\">\n\t\t\t\t<xsl:text>\\right</xsl:text>\n\t\t\t</xsl:if>\n\t\t\t<xsl:if test=\"@close='{' or @close='}'\">\n\t\t\t\t<xsl:text>\\</xsl:text>\n\t\t\t</xsl:if>\t\t\n\t\t\t<xsl:value-of select=\"@close\"/>\n\t\t</xsl:when>\n\t\t<xsl:otherwise><xsl:text>\\right)</xsl:text></xsl:otherwise>\n\t</xsl:choose>\t\n</xsl:template>\n\n<xsl:template match=\"m:mphantom\">\n\t<xsl:text>\\phantom{</xsl:text>\n\t<xsl:apply-templates/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:menclose\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"@notation = 'actuarial'\">\n\t\t\t<xsl:text>\\overline{</xsl:text>\n\t\t\t<xsl:apply-templates/>\n\t\t\t<xsl:text>\\hspace{.2em}|}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"@notation = 'radical'\">\n\t\t\t<xsl:text>\\sqrt{</xsl:text>\n\t\t\t<xsl:apply-templates/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:text>\\overline{)</xsl:text>\n\t\t\t<xsl:apply-templates/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template match=\"m:mrow\">\n\t<xsl:apply-templates/>\n</xsl:template>\n\n<xsl:template match=\"m:mstyle\">\n\t<xsl:if test=\"@background\">\n\t\t<xsl:text>\\colorbox[rgb]{</xsl:text>\n\t\t<xsl:call-template name=\"color\">\n\t\t\t<xsl:with-param name=\"color\" 
select=\"@background\"/>\n\t\t</xsl:call-template>\n\t\t<xsl:text>}{$</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@color\">\n\t\t<xsl:text>\\textcolor[rgb]{</xsl:text>\n\t\t<xsl:call-template name=\"color\">\n\t\t\t<xsl:with-param name=\"color\" select=\"@color\"/>\n\t\t</xsl:call-template>\n\t\t<xsl:text>}{</xsl:text>\n\t</xsl:if>\n\t<xsl:apply-templates/>\n\t<xsl:if test=\"@color\">\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@background\">\n\t\t<xsl:text>$}</xsl:text>\n\t</xsl:if>\n</xsl:template>\n<!--\n\n<xsl:template match=\"m:mstyle\">\n\t<xsl:if test=\"@displaystyle='true'\">\n\t\t<xsl:text>{\\displaystyle</xsl:text>\n\t</xsl:if>\t\t\t\n\t<xsl:if test=\"@scriptlevel=2\">\n\t\t<xsl:text>{\\scriptscriptstyle</xsl:text>\t\n\t</xsl:if>\n\t<xsl:apply-templates/>\n\t<xsl:if test=\"@scriptlevel=2\">\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@displaystyle='true'\">\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n</xsl:template>\n-->\n\n<xsl:template match=\"m:merror\">\n\t<xsl:apply-templates/>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/mmltex.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n                version='1.0'>\n                \n<xsl:output method=\"text\" indent=\"no\" encoding=\"UTF-8\"/>\n\n<!-- ====================================================================== -->\n<!-- $id: mmltex.xsl, 2002/22/11 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<xsl:include href=\"tokens.xsl\"/>\n<xsl:include href=\"glayout.xsl\"/>\n<xsl:include href=\"scripts.xsl\"/>\n<xsl:include href=\"tables.xsl\"/>\n<xsl:include href=\"entities.xsl\"/>\n<xsl:include href=\"cmarkup.xsl\"/>\n\n<!-- Note: variables colora (template color) and symbola (template startspace) only for Sablotron -->\n\n<xsl:template name=\"startspace\">\n\t<xsl:param name=\"symbol\"/>\n\t<xsl:if test=\"contains($symbol,' ')\">\n\t\t<xsl:variable name=\"symbola\" select=\"concat(substring-before($symbol,' '),substring-after($symbol,' '))\"/>\n\t\t<xsl:call-template name=\"startspace\">\n\t\t\t<xsl:with-param name=\"symbol\" select=\"$symbola\"/>\n\t\t</xsl:call-template>\n\t</xsl:if>\n\t<xsl:if test=\"not(contains($symbol,' '))\">\n\t\t<xsl:value-of select=\"$symbol\"/>\n\t</xsl:if>\n</xsl:template>\n\n<xsl:strip-space elements=\"m:*\"/>\n\n<xsl:template match=\"m:math\">\n\t<xsl:text>&#x00024;</xsl:text>\n\t<xsl:apply-templates/>\n\t<xsl:text>&#x00024;</xsl:text>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/scripts.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n                version='1.0'>\n                \n<!-- ====================================================================== -->\n<!-- $Id: scripts.xsl,v 1.1.1.1 2002/10/26 14:20:06 shade33 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<xsl:template match=\"m:munderover\">\n\t<xsl:variable name=\"base\">\n\t\t<xsl:call-template name=\"startspace\">\n\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[1]\"/>\n\t\t</xsl:call-template>\n\t</xsl:variable>\n\t<xsl:variable name=\"under\">\n\t\t<xsl:call-template name=\"startspace\">\n\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[2]\"/>\n\t\t</xsl:call-template>\n\t</xsl:variable>\n\t<xsl:variable name=\"over\">\n\t\t<xsl:call-template name=\"startspace\">\n\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[3]\"/>\n\t\t</xsl:call-template>\n\t</xsl:variable>\n\t\n\t<xsl:choose>\n\t\t<xsl:when test=\"$over='&#x000AF;'\">\t<!-- OverBar - over bar -->\n\t\t\t<xsl:text>\\overline{</xsl:text>\n\t\t\t<xsl:call-template name=\"munder\">\n\t\t\t\t<xsl:with-param name=\"base\" select=\"$base\"/>\n\t\t\t\t<xsl:with-param name=\"under\" select=\"$under\"/>\n\t\t\t</xsl:call-template>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$over='&#x0FE37;'\">\t<!-- OverBrace - over brace -->\n\t\t\t<xsl:text>\\overbrace{</xsl:text>\n\t\t\t<xsl:call-template name=\"munder\">\n\t\t\t\t<xsl:with-param name=\"base\" select=\"$base\"/>\n\t\t\t\t<xsl:with-param name=\"under\" select=\"$under\"/>\n\t\t\t</xsl:call-template>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when 
test=\"$under='&#x00332;'\">\t<!-- UnderBar - combining low line -->\n\t\t\t<xsl:text>\\underline{</xsl:text>\n\t\t\t<xsl:call-template name=\"mover\">\n\t\t\t\t<xsl:with-param name=\"base\" select=\"$base\"/>\n\t\t\t\t<xsl:with-param name=\"over\" select=\"$over\"/>\n\t\t\t\t<xsl:with-param name=\"pos_over\" select=\"3\"/>\n\t\t\t</xsl:call-template>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$under='&#x0FE38;'\">\t<!-- UnderBrace - under brace -->\n\t\t\t<xsl:text>\\underbrace{</xsl:text>\n\t\t\t<xsl:call-template name=\"mover\">\n\t\t\t\t<xsl:with-param name=\"base\" select=\"$base\"/>\n\t\t\t\t<xsl:with-param name=\"over\" select=\"$over\"/>\n\t\t\t\t<xsl:with-param name=\"pos_over\" select=\"3\"/>\n\t\t\t</xsl:call-template>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"translate($base,'&#x0220F;&#x02210;&#x022c2;&#x022c3;&#x02294;',\n\t\t\t\t\t\t'&#x02211;&#x02211;&#x02211;&#x02211;&#x02211;')='&#x02211;'\">\n<!-- if $base is operator, such as\n\t\t\t&#x02211;\t/sum L: summation operator\n\t\t\t&#x0220F;\t/prod L: product operator\n\t\t\t&#x02210;\t/coprod L: coproduct operator\n\t\t\t&#x022c2;\t/bigcap\n\t\t\t&#x022c3;\t/bigcup\n\t\t\t&#x02294;\t/bigsqcup\n-->\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>_{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t\t<xsl:text>}^{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[3]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:text>\\underset{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t\t<xsl:text>}{\\overset{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[3]\"/>\n\t\t\t<xsl:text>}{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}}</xsl:text>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template match=\"m:mover\">\n\t<xsl:call-template name=\"mover\">\n\t\t<xsl:with-param name=\"base\">\n\t\t\t<xsl:call-template 
name=\"startspace\">\n\t\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[1]\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:with-param>\n\t\t<xsl:with-param name=\"over\">\n\t\t\t<xsl:call-template name=\"startspace\">\n\t\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[2]\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<xsl:template match=\"m:munder\">\n\t<xsl:call-template name=\"munder\">\n\t\t<xsl:with-param name=\"base\">\n\t\t\t<xsl:call-template name=\"startspace\">\n\t\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[1]\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:with-param>\n\t\t<xsl:with-param name=\"under\">\n\t\t\t<xsl:call-template name=\"startspace\">\n\t\t\t\t<xsl:with-param name=\"symbol\" select=\"./*[2]\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:with-param>\n\t</xsl:call-template>\n</xsl:template>\n\n<xsl:template name=\"mover\">\n\t<xsl:param name=\"base\"/>\n\t<xsl:param name=\"over\"/>\n\t<xsl:param name=\"pos_over\" select=\"2\"/>\n\t<xsl:choose>\n\t\t<xsl:when test=\"$over='&#x000AF;'\">\t<!-- OverBar - over bar -->\n\t\t\t<xsl:text>\\overline{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$over='&#x0FE37;'\">\t<!-- OverBrace - over brace -->\n\t\t\t<xsl:text>\\overbrace{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"translate($base,'&#x0220F;&#x02210;&#x022c2;&#x022c3;&#x02294;',\n\t\t\t\t\t\t'&#x02211;&#x02211;&#x02211;&#x02211;&#x02211;')='&#x02211;'\">\n<!-- if $base is operator, such as\n\t\t\t&#x02211;\t/sum L: summation operator\n\t\t\t&#x0220F;\t/prod L: product operator\n\t\t\t&#x02210;\t/coprod L: coproduct operator\n\t\t\t&#x022c2;\t/bigcap\n\t\t\t&#x022c3;\t/bigcup\n\t\t\t&#x02294;\t/bigsqcup\n-->\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>^{</xsl:text>\n\t\t\t<xsl:apply-templates 
select=\"./*[$pos_over]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:text>\\stackrel{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[$pos_over]\"/>\n\t\t\t<xsl:text>}{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t\t<!--\n\t\t\t<xsl:text>\\overset{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[$pos_over]\"/>\n\t\t\t<xsl:text>}{</xsl:text>\t\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>-->\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template name=\"munder\">\n\t<xsl:param name=\"base\"/>\n\t<xsl:param name=\"under\"/>\n\t<xsl:choose>\n\t\t<xsl:when test=\"$under='&#x00332;'\">\t<!-- UnderBar - combining low line -->\n\t\t\t<xsl:text>\\underline{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$under='&#x0FE38;'\">\t<!-- UnderBrace - under brace -->\n\t\t\t<xsl:text>\\underbrace{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"translate($base,'&#x0220F;&#x02210;&#x022c2;&#x022c3;&#x02294;',\n\t\t\t\t\t\t'&#x02211;&#x02211;&#x02211;&#x02211;&#x02211;')='&#x02211;'\">\n<!-- if $base is operator, such as\n\t\t\t&#x02211;\t/sum L: summation operator\n\t\t\t&#x0220F;\t/prod L: product operator\n\t\t\t&#x02210;\t/coprod L: coproduct operator\n\t\t\t&#x022c2;\t/bigcap\n\t\t\t&#x022c3;\t/bigcup\n\t\t\t&#x02294;\t/bigsqcup\n-->\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:text>_{</xsl:text>\n\t\t\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:text>\\underset{</xsl:text>\t\t<!-- Required AmsMath package -->\n\t\t\t<xsl:apply-templates select=\"./*[2]\"/>\n\t\t\t<xsl:text>}{</xsl:text>\t\n\t\t\t<xsl:apply-templates 
select=\"./*[1]\"/>\n\t\t\t<xsl:text>}</xsl:text>\t\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template match=\"m:msubsup\">\n\t<xsl:text>{</xsl:text>\t\n\t<xsl:apply-templates select=\"./*[1]\"/>\n\t<xsl:text>}_{</xsl:text>\n\t<xsl:apply-templates select=\"./*[2]\"/>\n\t<xsl:text>}^{</xsl:text>\t\n\t<xsl:apply-templates select=\"./*[3]\"/>\n\t<xsl:text>}</xsl:text>\t\n</xsl:template>\n\n<xsl:template match=\"m:msup\">\n\t<xsl:text>{</xsl:text>\t\n\t<xsl:apply-templates select=\"./*[1]\"/>\n\t<xsl:text>}^{</xsl:text>\t\n\t<xsl:apply-templates select=\"./*[2]\"/>\n\t<xsl:text>}</xsl:text>\t\n</xsl:template>\n\n<xsl:template match=\"m:msub\">\n\t<xsl:text>{</xsl:text>\t\n\t<xsl:apply-templates select=\"./*[1]\"/>\n\t<xsl:text>}_{</xsl:text>\t\n\t<xsl:apply-templates select=\"./*[2]\"/>\n\t<xsl:text>}</xsl:text>\t\n</xsl:template>\n\n<xsl:template match=\"m:mmultiscripts\" mode=\"mprescripts\">\n\t<xsl:for-each select=\"m:mprescripts/following-sibling::*\">\n\t\t<xsl:if test=\"position() mod 2 and local-name(.)!='none'\">\n\t\t\t<xsl:text>{}_{</xsl:text>\t\n\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t<xsl:text>}</xsl:text>\t\n\t\t</xsl:if>\n\t\t<xsl:if test=\"not(position() mod 2) and local-name(.)!='none'\">\n\t\t\t<xsl:text>{}^{</xsl:text>\t\n\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t<xsl:text>}</xsl:text>\t\n\t\t</xsl:if>\n\t</xsl:for-each>\n\t<xsl:apply-templates select=\"./*[1]\"/>\n\t<xsl:for-each select=\"m:mprescripts/preceding-sibling::*[position()!=last()]\">\n\t\t<xsl:if test=\"position()>2 and local-name(.)!='none'\">\n\t\t\t<xsl:text>{}</xsl:text>\t\n\t\t</xsl:if>\n\t\t<xsl:if test=\"position() mod 2 and local-name(.)!='none'\">\n\t\t\t<xsl:text>_{</xsl:text>\t\n\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t<xsl:text>}</xsl:text>\t\n\t\t</xsl:if>\n\t\t<xsl:if test=\"not(position() mod 2) and local-name(.)!='none'\">\n\t\t\t<xsl:text>^{</xsl:text>\t\n\t\t\t<xsl:apply-templates 
select=\".\"/>\n\t\t\t<xsl:text>}</xsl:text>\t\n\t\t</xsl:if>\n\t</xsl:for-each>\n</xsl:template>\n\n<xsl:template match=\"m:mmultiscripts\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"m:mprescripts\">\n\t\t\t<xsl:apply-templates select=\".\" mode=\"mprescripts\"/>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:apply-templates select=\"./*[1]\"/>\n\t\t\t<xsl:for-each select=\"*[position()>1]\">\n\t\t\t\t<xsl:if test=\"position()>2 and local-name(.)!='none'\">\n\t\t\t\t\t<xsl:text>{}</xsl:text>\t\n\t\t\t\t</xsl:if>\n\t\t\t\t<xsl:if test=\"position() mod 2 and local-name(.)!='none'\">\n\t\t\t\t\t<xsl:text>_{</xsl:text>\t\n\t\t\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t\t\t<xsl:text>}</xsl:text>\t\n\t\t\t\t</xsl:if>\n\t\t\t\t<xsl:if test=\"not(position() mod 2) and local-name(.)!='none'\">\n\t\t\t\t\t<xsl:text>^{</xsl:text>\t\n\t\t\t\t\t<xsl:apply-templates select=\".\"/>\n\t\t\t\t\t<xsl:text>}</xsl:text>\t\n\t\t\t\t</xsl:if>\n\t\t\t</xsl:for-each>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/tables.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n                version='1.0'>\n                \n<!-- ====================================================================== -->\n<!-- $id: tables.xsl, 2002/17/05 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<xsl:template match=\"m:mtd[@columnspan]\">\n\t<xsl:text>\\multicolumn{</xsl:text>\n\t<xsl:value-of select=\"@columnspan\"/>\n\t<xsl:text>}{c}{</xsl:text>\n\t<xsl:apply-templates/>\n\t<xsl:text>}</xsl:text>\n\t<xsl:if test=\"count(following-sibling::*)>0\">\n\t\t<xsl:text>&amp; </xsl:text>\n\t</xsl:if>\n</xsl:template>\n\n\n<xsl:template match=\"m:mtd\">\n\t<xsl:if test=\"@columnalign='right' or @columnalign='center'\">\n\t\t<xsl:text>\\hfill </xsl:text>\n\t</xsl:if>\n\t<xsl:apply-templates/>\n\t<xsl:if test=\"@columnalign='left' or @columnalign='center'\">\n\t\t<xsl:text>\\hfill </xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"count(following-sibling::*)>0\">\n<!--    this test valid for Sablotron, another form - test=\"not(position()=last())\".\n\tAlso for m:mtd[@columnspan] and m:mtr  -->\n\t\t<xsl:text>&amp; </xsl:text>\n\t</xsl:if>\n</xsl:template>\n\n<xsl:template match=\"m:mtr\">\n\t<xsl:apply-templates/>\n\t<xsl:if test=\"count(following-sibling::*)>0\">\n\t\t<xsl:text>\\\\ </xsl:text>\n\t</xsl:if>\n</xsl:template>\n\n<xsl:template match=\"m:mtable\">\n\t<xsl:text>\\begin{array}{</xsl:text>\n\t<xsl:if test=\"@frame='solid'\">\n\t\t<xsl:text>|</xsl:text>\n\t</xsl:if>\n\t<xsl:variable name=\"numbercols\" select=\"count(./m:mtr[1]/m:mtd[not(@columnspan)])+sum(./m:mtr[1]/m:mtd/@columnspan)\"/>\n\t<xsl:choose>\n\t\t<xsl:when 
test=\"@columnalign\">\n\t\t\t<xsl:variable name=\"colalign\">\n\t\t\t\t<xsl:call-template name=\"colalign\">\n\t\t\t\t\t<xsl:with-param name=\"colalign\" select=\"@columnalign\"/>\n\t\t\t\t</xsl:call-template>\n\t\t\t</xsl:variable>\n\t\t\t<xsl:choose>\n\t\t\t\t<xsl:when test=\"string-length($colalign) > $numbercols\">\n\t\t\t\t\t<xsl:value-of select=\"substring($colalign,1,$numbercols)\"/>\n\t\t\t\t</xsl:when>\n\t\t\t\t<xsl:when test=\"string-length($colalign) &lt; $numbercols\">\n\t\t\t\t\t<xsl:value-of select=\"$colalign\"/>\n\t\t\t\t\t<xsl:call-template name=\"generate-string\">\n\t\t\t\t\t\t<xsl:with-param name=\"text\" select=\"substring($colalign,string-length($colalign))\"/>\n\t\t\t\t\t\t<xsl:with-param name=\"count\" select=\"$numbercols - string-length($colalign)\"/>\n\t\t\t\t\t</xsl:call-template>\n\t\t\t\t</xsl:when>\n\t\t\t\t<xsl:otherwise>\n\t\t\t\t\t<xsl:value-of select=\"$colalign\"/>\n\t\t\t\t</xsl:otherwise>\n\t\t\t</xsl:choose>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:call-template name=\"generate-string\">\n\t\t\t\t<xsl:with-param name=\"text\" select=\"'c'\"/>\n\t\t\t\t<xsl:with-param name=\"count\" select=\"$numbercols\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n\t<xsl:if test=\"@frame='solid'\">\n\t\t<xsl:text>|</xsl:text>\n\t</xsl:if>\n\t<xsl:text>}</xsl:text>\n\t<xsl:if test=\"@frame='solid'\">\n\t\t<xsl:text>\\hline </xsl:text>\n\t</xsl:if>\n\t<xsl:apply-templates/>\n\t<xsl:if test=\"@frame='solid'\">\n\t\t<xsl:text>\\\\ \\hline</xsl:text>\n\t</xsl:if>\n\t<xsl:text>\\end{array}</xsl:text>\n</xsl:template>\n\n<xsl:template name=\"colalign\">\n\t<xsl:param name=\"colalign\"/>\n\t<xsl:choose>\n\t\t<xsl:when test=\"contains($colalign,' ')\">\n\t\t\t<xsl:value-of select=\"substring($colalign,1,1)\"/>\n\t\t\t<xsl:call-template name=\"colalign\">\n\t\t\t\t<xsl:with-param name=\"colalign\" select=\"substring-after($colalign,' 
')\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:value-of select=\"substring($colalign,1,1)\"/>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template name=\"generate-string\">\n<!-- template from XSLT Standard Library v1.1 -->\n    <xsl:param name=\"text\"/>\n    <xsl:param name=\"count\"/>\n\n    <xsl:choose>\n      <xsl:when test=\"string-length($text) = 0 or $count &lt;= 0\"/>\n\n      <xsl:otherwise>\n\t<xsl:value-of select=\"$text\"/>\n\t<xsl:call-template name=\"generate-string\">\n\t  <xsl:with-param name=\"text\" select=\"$text\"/>\n\t  <xsl:with-param name=\"count\" select=\"$count - 1\"/>\n\t</xsl:call-template>\n      </xsl:otherwise>\n    </xsl:choose>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/dependency/xsltml_2.0/tokens.xsl",
    "content": "<?xml version='1.0' encoding=\"UTF-8\"?>\n<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n\t\txmlns:m=\"http://www.w3.org/1998/Math/MathML\"\n                version='1.0'>\n                \n<!-- ====================================================================== -->\n<!-- $id: tokens.xsl, 2002/22/11 Exp $\n     This file is part of the XSLT MathML Library distribution.\n     See ./README or http://www.raleigh.ru/MathML/mmltex for\n     copyright and other information                                        -->\n<!-- ====================================================================== -->\n\n<xsl:template match=\"m:mi|m:mn|m:mo|m:mtext|m:ms\">\n\t<xsl:call-template name=\"CommonTokenAtr\"/>\n</xsl:template>\n\n<xsl:template name=\"mi\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"string-length(normalize-space(.))>1 and not(@mathvariant)\">\n\t\t\t<xsl:text>\\mathrm{</xsl:text>\n\t\t\t\t<xsl:apply-templates/>\n\t\t\t<xsl:text>}</xsl:text>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:apply-templates/>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template name=\"mn\">\n\t<xsl:apply-templates/>\n</xsl:template>\n\n<xsl:template name=\"mo\">\n\t<xsl:apply-templates/>\n</xsl:template>\n\n<xsl:template name=\"mtext\">\n\t<xsl:variable name=\"content\">\n\t\t<xsl:call-template name=\"replaceMtextEntities\">\n\t\t\t<xsl:with-param name=\"content\" select=\".\"/>\n\t\t</xsl:call-template>\n\t</xsl:variable>\n\t<xsl:text>\\text{</xsl:text>\n\t<xsl:value-of select=\"$content\"/>\n\t<xsl:text>}</xsl:text>\n</xsl:template>\n\n<xsl:template match=\"m:mspace\">\n\t<xsl:text>\\phantom{\\rule</xsl:text>\n\t<xsl:if test=\"@depth\">\n\t\t<xsl:text>[-</xsl:text>\n\t\t<xsl:value-of select=\"@depth\"/>\n\t\t<xsl:text>]</xsl:text>\n\t</xsl:if>\n\t<xsl:text>{</xsl:text>\n\t<xsl:if test=\"not(@width)\">\n\t\t<xsl:text>0ex</xsl:text>\n\t</xsl:if>\n\t<xsl:value-of select=\"@width\"/>\n\t<xsl:text>}{</xsl:text>\n\t<xsl:if 
test=\"not(@height)\">\n\t\t<xsl:text>0ex</xsl:text>\n\t</xsl:if>\n\t<xsl:value-of select=\"@height\"/>\n\t<xsl:text>}}</xsl:text>\n</xsl:template>\n\n<xsl:template name=\"ms\">\n\t<xsl:choose>\n\t\t<xsl:when test=\"@lquote\"><xsl:value-of select=\"@lquote\"/></xsl:when>\n\t\t<xsl:otherwise><xsl:text>\"</xsl:text></xsl:otherwise>\n\t</xsl:choose><xsl:apply-templates/><xsl:choose>\n\t\t<xsl:when test=\"@rquote\"><xsl:value-of select=\"@rquote\"/></xsl:when>\n\t\t<xsl:otherwise><xsl:text>\"</xsl:text></xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template name=\"CommonTokenAtr\">\n\t<xsl:if test=\"@mathbackground\">\n\t\t<xsl:text>\\colorbox[rgb]{</xsl:text>\n\t\t<xsl:call-template name=\"color\">\n\t\t\t<xsl:with-param name=\"color\" select=\"@mathbackground\"/>\n\t\t</xsl:call-template>\n\t\t<xsl:text>}{$</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@color or @mathcolor\"> <!-- Note: @color is deprecated in MathML 2.0 -->\n\t\t<xsl:text>\\textcolor[rgb]{</xsl:text>\n\t\t<xsl:call-template name=\"color\">\n\t\t\t<xsl:with-param name=\"color\" select=\"@color|@mathcolor\"/>\n\t\t</xsl:call-template>\n\t\t<xsl:text>}{</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@mathvariant\">\n\t\t<xsl:choose>\n\t\t\t<xsl:when test=\"@mathvariant='normal'\">\n\t\t\t\t<xsl:text>\\mathrm{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='bold'\">\n\t\t\t\t<xsl:text>\\mathbf{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='italic'\">\n\t\t\t\t<xsl:text>\\mathit{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='bold-italic'\">\t<!-- Required definition -->\n\t\t\t\t<xsl:text>\\mathbit{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='double-struck'\">\t<!-- Required amsfonts -->\n\t\t\t\t<xsl:text>\\mathbb{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='bold-fraktur'\">\t<!-- Error -->\n\t\t\t\t<xsl:text>{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when 
test=\"@mathvariant='script'\">\n\t\t\t\t<xsl:text>\\mathcal{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='bold-script'\">\t<!-- Error -->\n\t\t\t\t<xsl:text>\\mathsc{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='fraktur'\">\t<!-- Required amsfonts -->\n\t\t\t\t<xsl:text>\\mathfrak{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='sans-serif'\">\n\t\t\t\t<xsl:text>\\mathsf{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='bold-sans-serif'\"> <!-- Required definition -->\n\t\t\t\t<xsl:text>\\mathbsf{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='sans-serif-italic'\"> <!-- Required definition -->\n\t\t\t\t<xsl:text>\\mathsfit{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='sans-serif-bold-italic'\">\t<!-- Error -->\n\t\t\t\t<xsl:text>\\mathbsfit{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:when test=\"@mathvariant='monospace'\">\n\t\t\t\t<xsl:text>\\mathtt{</xsl:text>\n\t\t\t</xsl:when>\n\t\t\t<xsl:otherwise>\n\t\t\t\t<xsl:text>{</xsl:text>\n\t\t\t</xsl:otherwise>\n\t\t</xsl:choose>\n\t</xsl:if>\n\t<xsl:call-template name=\"selectTemplate\"/>\n\t<xsl:if test=\"@mathvariant\">\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@color or @mathcolor\">\n\t\t<xsl:text>}</xsl:text>\n\t</xsl:if>\n\t<xsl:if test=\"@mathbackground\">\n\t\t<xsl:text>$}</xsl:text>\n\t</xsl:if>\n</xsl:template>\n\n<xsl:template name=\"selectTemplate\">\n<!--\t<xsl:variable name=\"name\" select=\"local-name()\"/>\n\t<xsl:call-template name=\"{$name}\"/>-->\n\t<xsl:choose>\n\t\t<xsl:when test=\"local-name(.)='mi'\">\n\t\t\t<xsl:call-template name=\"mi\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"local-name(.)='mn'\">\n\t\t\t<xsl:call-template name=\"mn\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"local-name(.)='mo'\">\n\t\t\t<xsl:call-template name=\"mo\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"local-name(.)='mtext'\">\n\t\t\t<xsl:call-template 
name=\"mtext\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"local-name(.)='ms'\">\n\t\t\t<xsl:call-template name=\"ms\"/>\n\t\t</xsl:when>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template name=\"color\">\n<!-- NB: Variables colora and valueColor{n} only for Sablotron -->\n\t<xsl:param name=\"color\"/>\n\t<xsl:variable name=\"colora\" select=\"translate($color,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')\"/>\n\t<xsl:choose>\n\t<xsl:when test=\"starts-with($colora,'#') and string-length($colora)=4\">\n\t\t<xsl:variable name=\"valueColor\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,2,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:value-of select=\"$valueColor div 15\"/><xsl:text>,</xsl:text>\n\t\t<xsl:variable name=\"valueColor1\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,3,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:value-of select=\"$valueColor1 div 15\"/><xsl:text>,</xsl:text>\n\t\t<xsl:variable name=\"valueColor2\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,4,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:value-of select=\"$valueColor2 div 15\"/>\n\t</xsl:when>\n\t<xsl:when test=\"starts-with($colora,'#') and string-length($colora)=7\">\n\t\t<xsl:variable name=\"valueColor1\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,2,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:variable name=\"valueColor2\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,3,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:value-of select=\"($valueColor1*16 + $valueColor2) div 255\"/><xsl:text>,</xsl:text>\n\t\t<xsl:variable 
name=\"valueColor1a\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,4,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:variable name=\"valueColor2a\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,5,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:value-of select=\"($valueColor1a*16 + $valueColor2a) div 255\"/><xsl:text>,</xsl:text>\n\t\t<xsl:variable name=\"valueColor1b\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,6,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:variable name=\"valueColor2b\">\n\t\t\t<xsl:call-template name=\"Hex2Decimal\">\n\t\t\t\t<xsl:with-param name=\"arg\" select=\"substring($colora,7,1)\"/>\n\t\t\t</xsl:call-template>\n\t\t</xsl:variable>\n\t\t<xsl:value-of select=\"($valueColor1b*16 + $valueColor2b) div 255\"/>\n\t</xsl:when>\n<!-- ======================= if color specifed as an html-color-name ========================================== -->\n\t<xsl:when test=\"$colora='aqua'\"><xsl:text>0,1,1</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='black'\"><xsl:text>0,0,0</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='blue'\"><xsl:text>0,0,1</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='fuchsia'\"><xsl:text>1,0,1</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='gray'\"><xsl:text>.5,.5,.5</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='green'\"><xsl:text>0,.5,0</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='lime'\"><xsl:text>0,1,0</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='maroon'\"><xsl:text>.5,0,0</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='navy'\"><xsl:text>0,0,.5</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='olive'\"><xsl:text>.5,.5,0</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='purple'\"><xsl:text>.5,0,.5</xsl:text></xsl:when>\n\t<xsl:when 
test=\"$colora='red'\"><xsl:text>1,0,0</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='silver'\"><xsl:text>.75,.75,.75</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='teal'\"><xsl:text>0,.5,.5</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='white'\"><xsl:text>1,1,1</xsl:text></xsl:when>\n\t<xsl:when test=\"$colora='yellow'\"><xsl:text>1,1,0</xsl:text></xsl:when>\n\t<xsl:otherwise>\n\t\t<xsl:message>Exception at color template</xsl:message>\n\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template name=\"Hex2Decimal\">\n\t<xsl:param name=\"arg\"/>\n\t<xsl:choose>\n\t\t<xsl:when test=\"$arg='f'\">\n\t\t\t<xsl:value-of select=\"15\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$arg='e'\">\n\t\t\t<xsl:value-of select=\"14\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$arg='d'\">\n\t\t\t<xsl:value-of select=\"13\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$arg='c'\">\n\t\t\t<xsl:value-of select=\"12\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$arg='b'\">\n\t\t\t<xsl:value-of select=\"11\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"$arg='a'\">\n\t\t\t<xsl:value-of select=\"10\"/>\n\t\t</xsl:when>\n\t\t<xsl:when test=\"translate($arg, '0123456789', '9999999999')='9'\"> <!-- if $arg is number -->\n\t\t\t<xsl:value-of select=\"$arg\"/>\n\t\t</xsl:when>\n\t\t<xsl:otherwise>\n\t\t\t<xsl:message>Exception at Hex2Decimal template</xsl:message>\n\t\t</xsl:otherwise>\n\t</xsl:choose>\n</xsl:template>\n\n<xsl:template match=\"m:*/text()\">\n\t<xsl:call-template name=\"replaceEntities\">\n\t\t<xsl:with-param name=\"content\" select=\"normalize-space()\"/>\n\t</xsl:call-template>\n</xsl:template>\n\n</xsl:stylesheet>"
  },
  {
    "path": "DomainSpecific/readme.md",
    "content": "# Domain-specific Knowledge Extraction from CommonCrawl\n\n## Introduction \nDeveloping data workflows for specific requirements in distributed computing environments can be challenging for data engineers. They often face the following hurdles:\n\n - Learning to use distributed computing platforms from scratch.\n - Developing data processing modules, even when many are standard and reusable.\n - Constructing data pipelines by assembling these modules into their workflows.\n\nActually, many of these challenges can be mitigated with a unified framework. To address this, we propose the DataNetwork project. This initiative aims to enable engineers to efficiently meet customized and diverse data requirements using distributed computing resources and shared data storage.\n\n## Getting Started\nThis section will guide you through setting up and running the DataNetwork framework on your system.\nThe framework is supported in the following environments. While other operating systems, such as Ubuntu 18.04/22.04 or Windows, are theoretically supported, they have not been tested yet.\n\n1.\tEnvironment\n - [Ubuntu-20.04.1](https://ubuntu.com/download/desktop)\n - [Git-2.41.0](https://git-scm.com/downloads)\n - [Git-lfs-3.4.0](https://git-lfs.com/)\n - [Conda-23.3.1](https://conda.io/projects/conda/en/stable/user-guide/install/download.html)\n - [Python-3.10.14](https://www.python.org/downloads/)\n - Python dependencies in [requirements.txt](requirements.txt) file\n\n2.  Installation\n \n```\n# The depended libraries will be installed.\npip install -r requirements.txt\n```\n\n3. Download filters\n   \nPlease download all the filtering models used for domain-specific data processing [here](https://drive.google.com/file/d/1TQ112I1rjNqkH8acmile7i9ERQzSEmC4/view?usp=sharing), and then unzip them. 
For sample code that applies these models, refer to core/layers/transform/{math/mcq/openquestion}_filter_layer.py\n\n```\ntar -zxvf models.tar.gz\nrm models.tar.gz\n```\n\n4.\tUsage:\n\n```\n# The runtime dependencies are installed and an 'env_ready' file is generated on first use.\npython submit.py --network_path=${network_path} --run_mode=${run_mode} --computation_path=${computation_path} --storage_path=${storage_path} --docker_path=${docker_path}\n``` \n\n - network_path: the path of the configuration file that defines an instance of a data network.\n - run_mode: the running mode of the data network; supported values are Single, MultiProcess, and Batch.\n - computation_path: the path of the settings file that describes the computation resource.\n - storage_path: the path of the settings file that describes the storage resource.\n - docker_path: the path of the settings file that describes the environment resource (currently not implemented; it can be ignored).\n\n5.\tExamples:\n - Toy Sample:\n```\n# Run this command first to verify that the installation is correct.\n# If it fails (for example, because of a mismatched environment), manually fix the missing dependencies in the dependency/requirements.txt file.\npython submit.py --network_path=./configs/network_template.json --run_mode=Single\n```\n\n - Domain-specific Knowledge Data Extraction from CommonCrawl:\n```\n# Refer to sample_run.sh script for details.\nbash sample_run.sh\n```\n"
  },
  {
    "path": "DomainSpecific/requirements.txt",
    "content": "pyyaml==6.0\nwheel==0.43.0\nsetuptools==70.0.0\nazure-ai-ml==1.16.0\nazure-batch==14.2.0\nazure-identity==1.16.1\nazure-storage-blob==12.19.1\n"
  },
  {
    "path": "DomainSpecific/resources/computation/batch_dca_eastus.yaml",
    "content": "# To be filled.\nbatch_url: ${batch_url}\nbatch_pool_id: ${pool_id}\nbatch_node_num: ${node_num}\nbatch_process_per_node: ${process_per_node}\n"
  },
  {
    "path": "DomainSpecific/resources/computation/local.yaml",
    "content": "#worker_num: 1\nworker_num: 2\n"
  },
  {
    "path": "DomainSpecific/resources/environment/amlt_sing.yaml",
    "content": "name: datanetwork\ndescription: Environment for DataNetwork\n# To be filled.\nimage: ${image_repo}\n"
  },
  {
    "path": "DomainSpecific/resources/environment/local.yaml",
    "content": "name: datanetwork\ndescription: Environment for DataNetwork\nimage: local\n"
  },
  {
    "path": "DomainSpecific/resources/storage/llmstore.yaml",
    "content": "allow-other: true\n\nlogging:\n  type: syslog\n  level: log_debug\n\ncomponents:\n  - libfuse\n  - file_cache\n  - attr_cache\n  - azstorage\n\nlibfuse:\n  attribute-expiration-sec: 120\n  entry-expiration-sec: 120\n  negative-entry-expiration-sec: 240\n\nfile_cache:\n  path: /mnt/resource/blobfusetmp\n  timeout-sec: 360\n  max-size-mb: 4096\n\nattr_cache:\n  timeout-sec: 7200\n\n# To be filled.\nazstorage:\n  type: adls\n  account-name: ${account_name}\n  container: ${container_name}\n  endpoint: ${az_storage_endpoint}\n  mode: msi\n  appid: ${appid}\n\n# To be filled.\nresource_id: ${resource_id}\n\n# The upper part is configuration of azure storage account.\n\nworkspace_dir: ./workspace/\nmount: true\n"
  },
  {
    "path": "DomainSpecific/resources/storage/local.yaml",
    "content": "workspace_dir: ./workspace/\nmount: false\n"
  },
  {
    "path": "DomainSpecific/sample_run.sh",
    "content": "#!/usr/bin/env bash\n\n# --------------------------------------------------------------------------------------------------------------\n# Part 1 - knowledge extraction from html page.\n# step 1 - download CC warc url list.\n#Put one (or lots of) url(s) of Common Crawl WARC file to workspace/urls.CC-MAIN-2023-23.txt file.\n#such as:\nwget -P workspace https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/warc.paths.gz\ngzip -d workspace/warc.paths.gz\nmv workspace/warc.paths workspace/urls.CC-MAIN-2023-23.txt\n\n# step 2 - download CC warc.\npython submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_download.CC-MAIN-2023-23.json\ncat ./workspace/cc_warcs/CC-MAIN-2023-23/paths.*.txt > ./workspace/cc_warcs/CC-MAIN-2023-23/paths.txt\n\n# step 3 - prefilter CC warc.\npython submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_filter.CC-MAIN-2023-23.json\ncat ./workspace/cc_filtered_warc/CC-MAIN-2023-23/paths.*.txt > ./workspace/cc_filtered_warc/CC-MAIN-2023-23/paths.txt\n\n# step 4 - extract code from html tag.\npython submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_to_wet.code.CC-MAIN-2023-23.json\n\n# step 5 - extract math from html tag.\npython submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_to_wet.math.CC-MAIN-2023-23.json\n\n# --------------------------------------------------------------------------------------------------------------\n# Part 2 - knowledge extraction from text page.\n# extract text doc from CC html doc, filter text doc, and save them to parquet files.\n# please refer to GeneralDomain processing to get the text pages in parquet format, then uncomment the below commands for further processing.\n\n# step 1 - extract math from plain text.\n#python submit.py --run_mode MultiProcess --network_path ./configs/cc_math_filter.CC-MAIN-2023-23.json\n\n# step 2 - extract open questions from plain text.\n#python submit.py --run_mode MultiProcess --network_path 
./configs/cc_openquestion_filter.CC-MAIN-2023-23.json\n"
  },
  {
    "path": "DomainSpecific/submit.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport argparse\n\ndef submit_job(network_path, run_mode, docker_path, computation_path, storage_path):\n    if run_mode in (\"Single\", \"MultiProcess\",):\n        from tools.submit_local_job import submit_local_job as func\n    elif run_mode == \"Batch\":\n        from tools.submit_batch_job import submit_batch_job as func\n    else:\n        assert False\n    func(network_path, run_mode, docker_path, computation_path, storage_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Tool of job submission.\")\n    parser.add_argument(\"--network_path\", type=str, default=\"./configs/network_template.json\", help=\"The config path of data network.\")\n    parser.add_argument('--run_mode', type=str, default=\"Single\", help=\"The running mode: Single, MultiProcess, and Batch.\")\n    parser.add_argument('--docker_path', type=str, default=\"./resources/environment/local.yaml\", help=\"The path of environment (docker) config file.\")\n    parser.add_argument('--computation_path', type=str, default=\"./resources/computation/local.yaml\", help=\"The path of computation config file.\")\n    parser.add_argument('--storage_path', type=str, default=\"./resources/storage/local.yaml\", help=\"The path of storage config file.\")\n    args = parser.parse_args()\n    \n    submit_job(args.network_path, args.run_mode, args.docker_path, args.computation_path, args.storage_path)\n"
  },
  {
    "path": "DomainSpecific/tools/__init__.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nfrom .submit_local_job import submit_local_job\nfrom .submit_batch_job import submit_batch_job\n\n__all__ = [\"submit_local_job\", \"submit_batch_job\"]\n"
  },
  {
    "path": "DomainSpecific/tools/submit_batch_job.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nimport argparse\nos.sys.path.append(\"./core/layers/\")\nimport util\nimport uuid\nimport datetime\nfrom azure.batch import BatchServiceClient\nfrom azure.common.credentials import BasicTokenAuthentication\nfrom azure.identity import DefaultAzureCredential, InteractiveBrowserCredential, AzureCliCredential\nfrom azure.batch.models import JobAddParameter, PoolInformation, TaskAddParameter, UserIdentity\nfrom azure.batch.models import AutoUserSpecification, ElevationLevel, TaskConstraints\nfrom azure.batch.models import EnvironmentSetting, ResourceFile, OnAllTasksComplete, ComputeNodeIdentityReference\n\ndef submit_batch_job(network_path, run_mode, docker_path, computation_path, storage_path):\n    docker_config = util.load_yaml(docker_path)\n    computation_config = util.load_yaml(computation_path)\n    storage_config = util.load_yaml(storage_path)\n\n    workspace_dir = storage_config[\"workspace_dir\"]\n    endpoint = storage_config[\"azstorage\"][\"endpoint\"]\n    container = storage_config[\"azstorage\"][\"container\"]\n    resource_id = storage_config[\"resource_id\"]\n    identity = ComputeNodeIdentityReference(resource_id=resource_id)\n    mount_blob = storage_config.get(\"mount\", True)\n\n    node_num = computation_config[\"batch_node_num\"]\n    process_per_node = computation_config[\"batch_process_per_node\"]\n    batch_url = computation_config[\"batch_url\"]\n    pool_id = computation_config[\"batch_pool_id\"]\n\n    # credential\n    ##########################################\n    try:\n        credential = AzureCliCredential()\n        # Check if given credential can get token successfully.\n        credential.get_token(\"https://management.azure.com/.default\")\n    except Exception as ex:\n        # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work\n        credential = InteractiveBrowserCredential()\n    token = 
credential.get_token(\"https://batch.core.windows.net/.default\")\n    credential2 = BasicTokenAuthentication({\"access_token\": token.token})\n\n    batch_client = BatchServiceClient(credential2, batch_url=batch_url)\n    pool = batch_client.pool.get(pool_id)\n    resource_files = list()\n\n    # upload source code.\n    package_local_path = \"DataNetwork.tar.gz\"\n    package_blob_path = os.path.join(\"yanghuan\", \"package\", os.path.basename(package_local_path))\n    if True:\n        os.system(f\"sudo tar \\\n                    --exclude=env_ready \\\n                    --exclude=workspace \\\n                    --exclude=dependency/models \\\n                    -czf {package_local_path} *\")\n        util.upload_file_to_blob(storage_config, package_local_path, package_blob_path)\n        os.system(f\"sudo rm {package_local_path}\")\n    if True:\n        package_url = f\"{endpoint}/{container}/{package_blob_path}\"\n        package_file = ResourceFile(http_url=package_url, file_path=package_blob_path, identity_reference=identity)\n        package_path = package_file.file_path\n        resource_files.append(package_file)\n    else:\n        package_path = os.path.join(workspace_dir, package_blob_path)\n\n    # upload model files.\n    models_local_path = \"models.tar.gz\"\n    models_blob_path = os.path.join(\"yanghuan\", \"package\", os.path.basename(models_local_path))\n    if True:\n        #            --exclude=dependency/models/math.bin \\\n        #            --exclude=dependency/models/openquestion.bin \\\n        #            --exclude=dependency/models/mcq.pytorch \\\n        #            --exclude=dependency/models/mcq.bin \\\n        os.system(f\"sudo tar \\\n                    -czf {models_local_path} dependency/models/*\")\n        util.upload_file_to_blob(storage_config, models_local_path, models_blob_path)\n        os.system(f\"sudo rm {models_local_path}\")\n    if not mount_blob:\n        model_url = 
f\"{endpoint}/{container}/{models_blob_path}\"\n        models_file = ResourceFile(http_url=model_url, file_path=models_blob_path, identity_reference=identity)\n        models_path = models_file.file_path\n        resource_files.append(models_file)\n    else:\n        models_path = os.path.join(workspace_dir, models_blob_path)\n\n    job_id = uuid.uuid4()\n    job = JobAddParameter(id=job_id, pool_info=PoolInformation(pool_id=pool_id), on_all_tasks_complete=OnAllTasksComplete.terminate_job)\n    batch_client.job.add(job)\n\n    tasks = []\n    for node_id in range(node_num):\n        batch_script_dependency = \"./dependency/install.py\"\n        batch_script_entry = \"./wrapper/runner.py\"\n        batch_commandline = f\"bash -c '\\\n            sudo tar -xzf {package_path} && \\\n            sudo apt install python-is-python3 && \\\n            python {batch_script_dependency} --storage_path={storage_path} && \\\n            sudo tar -xzf {models_path} && \\\n            python {batch_script_entry} --network_path={network_path} --run_mode={run_mode} --worker_num={node_num} --workspace_dir={workspace_dir}\\\n        '\"\n\n        task = TaskAddParameter(\n            id=f'{job_id}_{node_id}',\n            command_line=batch_commandline,\n            resource_files=resource_files,\n            environment_settings=[EnvironmentSetting(name=\"NODE_NUM\", value=str(node_num)), EnvironmentSetting(name=\"NODE_ID\", value=str(node_id)), EnvironmentSetting(name=\"PROCESS_PER_NODE\", value=str(process_per_node))],\n            constraints=TaskConstraints(max_task_retry_count=3, retention_time=datetime.timedelta(days=30)),\n            user_identity=UserIdentity(auto_user=AutoUserSpecification(elevation_level=ElevationLevel.admin))\n        )\n        tasks.append(task)\n\n    batch_client.task.add_collection(job_id, tasks)\n    print(f\"job id: {job.id}\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Tool of job submission in local 
machine.\")\n    parser.add_argument('--network_path', type=str, default=\"./configs/network_template.json\", help=\"The config path of data network.\")\n    parser.add_argument('--run_mode', type=str, default=\"Batch\", help=\"The running mode: Batch.\")\n    parser.add_argument('--docker_path', type=str, default=\"./resources/environment/local.yaml\", help=\"The path of environment (docker) config file.\")\n    parser.add_argument('--computation_path', type=str, default=\"./resources/computation/batch_dca_eastus.yaml\", help=\"The path of computation config file.\")\n    parser.add_argument('--storage_path', type=str, default=\"./resources/storage/llmstore.yaml\", help=\"The path of storage config file.\")\n    args = parser.parse_args()\n    submit_batch_job(args.network_path, args.run_mode, args.docker_path, args.computation_path, args.storage_path)\n"
  },
  {
    "path": "DomainSpecific/tools/submit_local_job.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nimport argparse\nos.sys.path.append(\"./core/layers/\")\nimport util\n\ndef submit_local_job(network_path, run_mode, docker_path, computation_path, storage_path):\n    docker_config = util.load_yaml(docker_path)\n    computation_config = util.load_yaml(computation_path)\n    storage_config = util.load_yaml(storage_path)\n\n    script_entry = \"./wrapper/runner.py\"\n    script_dependency = \"./dependency/install.py\"\n    commandline = f\"python {script_dependency} --storage_path={storage_path} && python {script_entry} --network_path={network_path} --run_mode={run_mode} --workspace_dir={storage_config['workspace_dir']} --worker_num={computation_config['worker_num']}\"\n    os.system(commandline)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Tool of job submission in local machine.\")\n    parser.add_argument('--network_path', type=str, default=\"./configs/network_template.json\", help=\"The config path of data network.\")\n    parser.add_argument('--run_mode', type=str, default=\"Single\", help=\"The running mode: Single, MultiProcess.\")\n    parser.add_argument('--docker_path', type=str, default=\"./resources/environment/local.yaml\", help=\"The path of environment (docker) config file.\")\n    parser.add_argument('--computation_path', type=str, default=\"./resources/computation/local.yaml\", help=\"The path of computation config file.\")\n    parser.add_argument('--storage_path', type=str, default=\"/resources/storage/local.yaml\", help=\"The path of storage config file.\")\n    args = parser.parse_args()\n    submit_local_job(args.network_path, args.run_mode, args.docker_path, args.computation_path, args.storage_path)\n"
  },
  {
    "path": "DomainSpecific/wrapper/__init__.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nfrom .parser import Parser\nfrom .interpreter import Interpreter\nfrom .runner import Runner, RunMode\nfrom .utility import *\n\n__all__ = [\"Parser\", \"Interpreter\", \"Runner\", \"RunMode\"]\n"
  },
  {
    "path": "DomainSpecific/wrapper/interpreter.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport traceback\nimport collections\nfrom core import DataType\nfrom core import Layer, LayerType, JointType, LayerType2Func\nfrom core import Network\nfrom wrapper import Parser\n\nclass Interpreter:\n    def __init__(self):\n        self.fields = (\"name\", \"description\", \"date\", \"version\", \"author\", \"input\", \"output\", \"layer\")\n        self.parser = Parser()\n\n    def check_config(self, config):\n        try:\n            # fileds check.\n            for field in self.fields:\n                assert field in config\n\n            data_name2type = collections.defaultdict(set)\n\n            # check imported modules.\n            module_data2type = dict()\n            module_names = config.get(\"import\", list())\n            for name in module_names:\n                sub_config = self.parser(f\"./configs/{name.replace('.', '/')}.json\")\n                self.check_config(sub_config)\n                for name, data in sub_config[\"input\"].items():\n                    module_data2type[name] = DataType[data[\"type\"]]\n                for name, data in sub_config[\"output\"].items():\n                    module_data2type[name] = DataType[data[\"type\"]]\n\n            # check input.\n            inputs = config.get(\"input\", dict())\n            for name, data in inputs.items():\n                assert data[\"type\"] in DataType.__members__\n                data_type = DataType[data[\"type\"]]\n                data_name2type[name].add(data_type)\n\n            # check output.\n            outputs = config.get(\"output\", dict())\n            for name, data in outputs.items():\n                assert data[\"type\"] in DataType.__members__\n                data_type = DataType[data[\"type\"]]\n                data_name2type[name].add(data_type)\n\n            # check 
layer.\n            layers = config.get(\"layer\", dict())\n            for _, layer in layers.items():\n                assert layer[\"type\"] in LayerType.__members__ or layer[\"type\"] in module_names\n                input_names = layer[\"input\"]\n                output_names = layer[\"output\"]\n                if layer[\"type\"] in LayerType.__members__:\n                    layer_type = LayerType[layer[\"type\"]]\n                    func, input_types, output_types, enabled = LayerType2Func[layer_type]\n                else:\n                    input_types = list(map(lambda input_name: module_data2type[input_name], input_names))\n                    output_types = list(map(lambda output_name: module_data2type[output_name], output_names))\n                assert len(input_names) == len(input_types)\n                assert len(output_names) == len(output_types)\n                assert layer.get(\"joint\", \"Default\") in JointType.__members__\n                joint_type = JointType[layer.get(\"joint\", \"Default\")]\n                for name, data_type in zip(input_names, input_types):\n                    if joint_type in (JointType.Map, JointType.FlatMap):\n                        data_type = DataType(data_type.value + 10)\n                    data_name2type[name].add(data_type)\n                for name, data_type in zip(output_names, output_types):\n                    if joint_type in (JointType.Map,):\n                        data_type = DataType(data_type.value + 10)\n                    data_name2type[name].add(data_type)\n\n            # check joint.\n            for data_name, data_type in data_name2type.items():\n                for t1 in data_type:\n                    for t2 in data_type:\n                        assert DataType.belong(t1, t2) or DataType.belong(t2, t1)\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception as ex:\n            traceback.print_exc()\n            sys.exit()\n\n    def __call__(self, 
config_path):\n        # parse config file.\n        config = self.parser(config_path)\n\n        # interpret network.\n        network = Network()\n        try:\n            assert config is not None and isinstance(config, dict)\n            config[\"base_dir\"] = os.path.dirname(config_path)\n\n            # check config.\n            self.check_config(config)\n\n            # imported modules.\n            name2module = dict()\n            module_names = config.get(\"import\", list())\n            for name in module_names:\n                name2module[name] = self(f\"./configs/{name.replace('.', '/')}.json\")\n\n            # input datas.\n            input_datas = config.get(\"input\", dict())\n            network.set_input_names(list(input_datas.keys()))\n            for name, data in input_datas.items():\n                value = data.get(\"value\")\n                network.add_data(name, value)\n\n            # output datas\n            output_datas = config.get(\"output\", dict())\n            network.set_output_names(list(output_datas.keys()))\n\n            # layers in graph.\n            layers = config.get(\"layer\", dict())\n            for name, layer in layers.items():\n                if layer[\"type\"] in name2module:\n                    value = name2module[layer[\"type\"]]\n                    # set params of sub-network.\n                    for layers_param_name, param_value in layer.get(\"param\", dict()).items():\n                        layers_param_name = layers_param_name.split(\".\")\n                        layers_name = layers_param_name[:-1]\n                        param_name = layers_param_name[-1]\n                        net = value\n                        for layer_name in layers_name:\n                            net = net.layers[layer_name]\n                        net.param[param_name] = param_value\n                else:\n                    value = Layer(\n                        type=layer[\"type\"], \n                       
 joint=layer.get(\"joint\", \"Default\"), \n                        repetition=layer.get(\"repetition\", 1),\n                        param=layer.get(\"param\", dict()),\n                        input_names=layer.get(\"input\", list()),\n                        output_names=layer.get(\"output\", list()),\n                    )\n                network.add_layer(name, value)\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception as ex:\n            traceback.print_exc()\n        return network\n\n\nif __name__ == \"__main__\":\n    config_path = f\"{os.path.dirname(os.path.realpath(__file__))}/../configs/network_template.json\"\n    \n    interpreter = Interpreter()\n    network = interpreter(config_path)\n    \n    # compute in network.\n    outputs = network()\n    #from core import DataType\n    #inputs = [[\"a\", \"b\", \"c\", \"d\", \"e\"]]\n    #outputs = network(inputs)\n    \n    print(outputs[0])\n"
  },
  {
    "path": "DomainSpecific/wrapper/parser.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nimport json\nimport traceback\n\nclass Parser:\n    def __init__(self):\n        pass\n        \n    def __call__(self, config_path):\n        config = None\n        try:\n            if config_path is None or not os.path.exists(config_path):\n                raise Exception(\"Invalid config file path or not exists.\")\n\n            with open(config_path, \"r\") as f:\n                config = json.load(f)\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception as ex:\n            traceback.print_exc()\n        return config\n\n\nif __name__ == \"__main__\":\n    config_path = f\"{os.path.dirname(os.path.realpath(__file__))}/../configs/network_template.json\"\n    parser = Parser()\n    config = parser(config_path)\n    print(config)\n"
  },
  {
    "path": "DomainSpecific/wrapper/runner.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport sys\nos.sys.path.append(f\"{os.path.dirname(os.path.realpath(__file__))}/..\")\nimport argparse\nimport traceback\nfrom enum import Enum\nfrom threading import Thread\nfrom multiprocessing import Process\nfrom wrapper import Interpreter\nfrom wrapper.utility import get_world_rank, get_world_size, get_process_per_node\n\nclass RunMode(Enum):\n    Single = 0\n    MultiProcess = 1\n    Batch = 2\n\nclass Runner:\n    def __init__(self, network_path):\n        interpreter = Interpreter()\n        self.network = interpreter(network_path)\n\n    def __call__(self, run_mode, worker_id, worker_num, workspace_dir):\n        try:\n            input = list()\n            variables = {\"workspace_dir\": workspace_dir}\n            if run_mode == RunMode.Single:\n                for worker_id in range(worker_num):\n                    self.network(input, worker_id, worker_num, variables)\n            elif run_mode == RunMode.MultiProcess:\n                processes = list()\n                for worker_id in range(worker_num):\n                    process = Process(target=self.network, args=(input, worker_id, worker_num, variables))\n                    process.start()\n                    processes.append(process)\n                for process in processes:\n                    process.join()\n            elif run_mode == RunMode.Batch:\n                process_per_node = get_process_per_node()\n                worker_id = process_per_node * get_world_rank()\n                worker_num = process_per_node * get_world_size()\n                processes = list()\n                for worker_id in range(worker_id, worker_id + process_per_node):\n                    process = Process(target=self.network, args=(input, worker_id, worker_num, variables))\n                    process.start()\n                    processes.append(process)\n                for process in processes:\n          
          process.join()\n            else:\n                raise Exception(f\"Unknown running mode: {run_mode}\")\n        except KeyboardInterrupt:\n            sys.exit()\n        except Exception as ex:\n            traceback.print_exc()\n            return False\n        return True\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser(description=\"Runner of Data Network.\")\n    parser.add_argument('--network_path', type=str, default=\"./configs/network_template.json\", help=\"The config path of data network.\")\n    parser.add_argument('--run_mode', type=str, default=\"Single\", help=\"The running mode: Single, MultiProcess, and Batch.\")\n    parser.add_argument('--workspace_dir', type=str, default=\"./workspace/\", help=\"The path of workspace folder.\")\n    parser.add_argument('--worker_id', type=int, default=0, help=\"The id of world worker.\")\n    parser.add_argument('--worker_num', type=int, default=1, help=\"The number of world worker.\")\n    args = parser.parse_args()\n\n    runner = Runner(args.network_path)\n    success = runner(RunMode[args.run_mode], args.worker_id, args.worker_num, args.workspace_dir)\n"
  },
  {
    "path": "DomainSpecific/wrapper/utility/__init__.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nfrom .logger import Logger\nfrom .cpu_count import cpu_count\nfrom .load_yaml import load_yaml\nfrom .save_yaml import save_yaml\nfrom .azure_env import get_local_rank, get_world_rank, get_world_size, get_process_per_node\n\n__all__ = [\"Logger\", \"cpu_count\", \"load_yaml\", \"save_yaml\", \"get_local_rank\", \"get_world_rank\", \"get_world_size\", \"get_process_per_node\"]\n"
  },
  {
    "path": "DomainSpecific/wrapper/utility/azure_env.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\n\ndef get_local_rank():\n    # Azure Singularity.\n    if \"OMPI_COMM_WORLD_LOCAL_RANK\" in os.environ:\n        return int(os.environ[\"OMPI_COMM_WORLD_LOCAL_RANK\"])\n    return None\n\ndef get_world_rank():\n    # Azure Singularity.\n    if \"OMPI_COMM_WORLD_RANK\" in os.environ:\n        return int(os.environ[\"OMPI_COMM_WORLD_RANK\"])\n    # Azure Batch.\n    elif \"NODE_ID\" in os.environ:\n        return int(os.environ[\"NODE_ID\"])\n    return None\n\ndef get_world_size():\n    # Azure Singularity.\n    if \"OMPI_COMM_WORLD_SIZE\" in os.environ:\n        return int(os.environ[\"OMPI_COMM_WORLD_SIZE\"])\n    # Azure Batch.\n    elif \"NODE_NUM\" in os.environ:\n        return int(os.environ[\"NODE_NUM\"])\n    # Azure Spark.\n    elif \"NUM_EXECUTORS\" in os.environ:\n        return int(os.environ[\"NUM_EXECUTORS\"])\n    return None\n\ndef get_process_per_node():\n    # Azure Batch.\n    if \"PROCESS_PER_NODE\" in os.environ:\n        return int(os.environ[\"PROCESS_PER_NODE\"])\n    # Azure Spark.\n    elif \"EXECUTOR_CORES\" in os.environ:\n        return int(os.environ[\"EXECUTOR_CORES\"])\n    return None\n"
  },
  {
    "path": "DomainSpecific/wrapper/utility/cpu_count.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport re\nimport subprocess\n\ndef cpu_count():\n    \"\"\" Number of available virtual or physical CPUs on this system, i.e.\n    user/real as output by time(1) when called with an optimally scaling\n    userspace-only program\"\"\"\n\n    # cpuset\n    # cpuset may restrict the number of *available* processors\n    try:\n        m = re.search(r'(?m)^Cpus_allowed:\\s*(.*)$',\n                      open('/proc/self/status').read())\n        if m:\n            res = bin(int(m.group(1).replace(',', ''), 16)).count('1')\n            if res > 0:\n                return res\n    except IOError:\n        pass\n\n    # Python 2.6+\n    try:\n        import multiprocessing\n        return multiprocessing.cpu_count()\n    except (ImportError, NotImplementedError):\n        pass\n\n    # https://github.com/giampaolo/psutil\n    try:\n        import psutil\n        return psutil.cpu_count()   # psutil.NUM_CPUS on old versions\n    except (ImportError, AttributeError):\n        pass\n\n    # POSIX\n    try:\n        res = int(os.sysconf('SC_NPROCESSORS_ONLN'))\n\n        if res > 0:\n            return res\n    except (AttributeError, ValueError):\n        pass\n\n    # Windows\n    try:\n        res = int(os.environ['NUMBER_OF_PROCESSORS'])\n\n        if res > 0:\n            return res\n    except (KeyError, ValueError):\n        pass\n\n    \"\"\"\n    # jython\n    try:\n        from java.lang import Runtime\n        runtime = Runtime.getRuntime()\n        res = runtime.availableProcessors()\n        if res > 0:\n            return res\n    except ImportError:\n        pass\n    \"\"\"\n\n    # BSD\n    try:\n        sysctl = subprocess.Popen(['sysctl', '-n', 'hw.ncpu'],\n                                  stdout=subprocess.PIPE)\n        scStdout = sysctl.communicate()[0]\n        res = int(scStdout)\n\n        if res > 0:\n            return res\n    except (OSError, 
ValueError):\n        pass\n\n    # Linux\n    try:\n        res = open('/proc/cpuinfo').read().count('processor\\t:')\n\n        if res > 0:\n            return res\n    except IOError:\n        pass\n\n    # Solaris\n    try:\n        pseudoDevices = os.listdir('/devices/pseudo/')\n        res = 0\n        for pd in pseudoDevices:\n            if re.match(r'^cpuid@[0-9]+$', pd):\n                res += 1\n\n        if res > 0:\n            return res\n    except OSError:\n        pass\n\n    # Other UNIXes (heuristic)\n    try:\n        try:\n            dmesg = open('/var/run/dmesg.boot').read()\n        except IOError:\n            dmesgProcess = subprocess.Popen(['dmesg'], stdout=subprocess.PIPE)\n            # communicate() returns bytes on Python 3; decode before the str search below\n            dmesg = dmesgProcess.communicate()[0].decode(errors='replace')\n\n        res = 0\n        while '\\ncpu' + str(res) + ':' in dmesg:\n            res += 1\n\n        if res > 0:\n            return res\n    except OSError:\n        pass\n\n    raise Exception('Cannot determine number of CPUs on this system')\n"
  },
  {
    "path": "DomainSpecific/wrapper/utility/load_yaml.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport yaml\n\ndef load_yaml(config_path):\n    config = None\n    if os.path.exists(config_path):\n        with open(config_path, \"r\") as file:\n            config = yaml.safe_load(file)\n    return config\n"
  },
  {
    "path": "DomainSpecific/wrapper/utility/logger.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport logging\n\nlogger = None\n\nclass Logger:\n    def __init__():\n        pass\n    \n    @staticmethod\n    def init(log_path=None):\n        global logger\n        \n        if log_path is not None:\n            logging.basicConfig(filename=log_path,\n                                format=\"%(asctime)s %(message)s\",\n                                filemode=\"w\")\n        \n        # Creating an object\n        logger = logging.getLogger()\n        \n        # Setting the threshold of logger to DEBUG\n        logger.setLevel(logging.INFO)\n\n    @staticmethod\n    def debug(msg):\n        logger.debug(msg)\n    \n    @staticmethod\n    def info(msg):\n        logger.info(msg)\n    \n    @staticmethod\n    def warning(msg):\n        logger.warning(msg)\n    \n    @staticmethod\n    def error(msg):\n        logger.error(msg)\n\n    @staticmethod\n    def critical(msg):\n        logger.critical(msg)\n\n\nif __name__ == \"__main__\":\n    Logger.init()\n    Logger.debug(\"unit test: debug\")\n    Logger.info(\"unit test: info\")\n    Logger.warning(\"unit test: warning\")\n    Logger.error(\"unit test: error\")\n    Logger.critical(\"unit test: critical\")\n"
  },
  {
    "path": "DomainSpecific/wrapper/utility/save_yaml.py",
    "content": "#\n# Copyright (c) Microsoft Corporation. All rights reserved.\n#\nimport os\nimport yaml\n\ndef save_yaml(config, config_path):\n    if os.path.exists(os.path.dirname(config_path)):\n        with open(config_path, \"w\") as file:\n            yaml.safe_dump(config, file)\n"
  },
  {
    "path": "GeneralDomain/.gitignore",
    "content": "__pycache__/"
  },
  {
    "path": "GeneralDomain/README.md",
    "content": "# Redstone General CC\n\nLibrary for reproducing the general CC part of RedStone dataset from the released index Parquet file.\n\n## How to use\n\n### Install the lib\n\n```bash\npip install \"redstone-cc @ git+https://github.com/microsoft/redstone#subdirectory=general_cc/\"\n```\n\n### From CLI\n\n```bash\npython -m redstone_cc {input_index_path} {output_parquet_path}\n```\n\n### From python\n\n```python3\nfrom redstone_cc import process_file\n\nindex_file_path = '/path/to/index/file'\nitems = process_file(index_file_path)\n\nfor item in items:\n    print(item['uri'], item['text'])\n```\n\n## FAQ\n\n- About trafilatura processing failures\n    - Our original data was processed using `trafilatura` version 1.8.1, which may behave differently from the current version. If you need to reproduce our result exactly, please consider manually pinning the version of trafilatura.\n"
  },
  {
    "path": "GeneralDomain/pyproject.toml",
    "content": "[build-system]\nrequires = [\"flit_core >=3.2, <4\"]\nbuild-backend = \"flit_core.buildapi\"\n\n[project]\nname = \"redstone-cc\"\ndescription = \"Library for reproducing the general CC part of RedStone dataset from the released index Parquet file.\"\nversion='0.0.1'\nrequires-python = \">=3.8\"\nauthors = [\n  { name = \"Tengchao Lv\", email = \"tengchaolv@microsoft.com\" },\n  { name = \"Qinzheng Sun\", email = \"qinsu@microsoft.com\" }\n]\ndependencies = [\n  'numpy == 1.*',\n  'datasketch',\n  'regex',\n  'nltk',\n  'ftfy',\n  'sentence_splitter',\n  'brotlicffi',\n  'faust-cchardet',\n  'lxml',\n  'trafilatura[all]',\n  'warcio',\n  'loguru',\n  'stopit',\n  \"fasttext; platform_system != 'Windows'\",\n  \"fasttext-wheel == 0.9.2; platform_system == 'Windows'\",\n  'pyarrow',\n  'tqdm',\n  'requests',\n]\n\n[project.optional-dependencies]\ndev = [\n  'pytest',\n  'black',\n]\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/__init__.py",
    "content": "from .process import process_file, process_items\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/__main__.py",
    "content": "import argparse\n\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nfrom loguru import logger\n\nfrom .process import process_file\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"index_path\")\n    parser.add_argument(\"output_path\")\n    args = parser.parse_args()\n\n    logger.info(f\"input path: {args.index_path}\")\n    logger.info(f\"output path: {args.output_path}\")\n    logger.info(\"processing...\")\n    res = process_file(args.index_path)\n\n    logger.info(\"writing results...\")\n    table = pa.Table.from_pylist(res)\n    pq.write_table(table, args.output_path)\n    logger.info(\"finished.\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/__init__.py",
    "content": ""
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/deduplication/__init__.py",
    "content": ""
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/deduplication/minhash.py",
    "content": "import hashlib\n\nimport numpy as np\nfrom datasketch.lsh import _optimal_param\n\nDEFAULT_MER = 2**61 - 1\nDEFAULT_SEED = 1\n\n\ndef gen_lsh_param(num_perm, lsh_threshold):\n    return _optimal_param(lsh_threshold, num_perm, 0.5, 0.5)\n\n\nclass CalcMinhash:\n    def __init__(self, num_perm, seed=DEFAULT_SEED, mer=DEFAULT_MER):\n        self.mer = mer\n        self.num_perm = num_perm\n\n        self.gen = np.random.RandomState(seed)\n        self.a = self.gen.randint(1, self.mer, (self.num_perm,), dtype=\"u8\")\n        self.b = self.gen.randint(0, self.mer, (self.num_perm,), dtype=\"u8\")\n\n    def _sha1_hash(self, val):\n        val = int.from_bytes(hashlib.sha1(val).digest()[:8], \"little\")\n        val &= self.mer\n        return np.uint64(val)\n\n    def hash(self, sequence: list[str]) -> np.ndarray:\n        res = np.ones(self.num_perm, dtype=\"u8\") * self.mer\n        for token in sequence:\n            hash0 = self._sha1_hash(token.encode(\"utf8\"))\n            hash_vec = hash0 * self.a + self.b\n            hash_vec %= self.mer\n            res = np.minimum(res, hash_vec)\n        return res\n\n\nclass CalcLsh:\n    def __init__(self, b, r):\n        self.b = b\n        self.r = r\n        self.hashranges = [(i * r, (i + 1) * r) for i in range(b)]\n\n    def gen_lsh(self, minhash) -> list[bytearray]:\n        return [bytearray(minhash[start:end]) for start, end in self.hashranges]\n\n\nclass CalcMinhashLsh:\n    def __init__(self, b, r, seed=DEFAULT_SEED, mer=DEFAULT_MER):\n        num_perm = b * r\n        self.minhash = CalcMinhash(num_perm, seed, mer)\n        self.lsh = CalcLsh(b, r)\n\n    def hash(self, tokens) -> list[bytearray]:\n        minhash = self.minhash.hash(tokens)\n        lsh = self.lsh.gen_lsh(minhash)\n        return lsh\n\n\nclass LocalMinhashLshDedup:\n    def __init__(self, b, r, seed=DEFAULT_SEED, mer=DEFAULT_MER):\n        self.calc_minhash_lsh = CalcMinhashLsh(b, r, seed, mer)\n        self.data = []\n        
self.b = b\n\n    def add(self, id, tokens):\n        hval = self.calc_minhash_lsh.hash(tokens)\n        self.data.append((id, hval))\n\n    def dedup(self):\n        self.data.sort(key=lambda x: x[0])\n        dedup_set = [set() for _ in range(self.b)]\n        exclude = []\n        for line_id, hash_list in self.data:\n            flag_dup = False\n            # hash_list holds one band hash per LSH band; the index selects that band's set\n            for i, hval in enumerate(hash_list):\n                # bytearray is unhashable; use an immutable bytes key for the set\n                hval = bytes(hval)\n                if hval in dedup_set[i]:\n                    flag_dup = True\n                else:\n                    dedup_set[i].add(hval)\n\n            if flag_dup:\n                exclude.append(line_id)\n\n        return exclude\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/deduplication/sha1.py",
    "content": "import hashlib\n\nfrom .utils import ccnet_normalize\n\nDEFAULT_HASH_SIZE = 8\n\n\ndef sha1_hash(line, hash_size=DEFAULT_HASH_SIZE) -> bytes:\n    line = ccnet_normalize(line)\n\n    return hashlib.sha1(bytes(line, encoding=\"utf-8\")).digest()[:hash_size]\n\n\nclass LocalSha1Dedup:\n    def __init__(self, hash_size):\n        self.hash_size = hash_size\n\n        self.data = []\n\n    def add_line(self, line_id, line):\n        hval = sha1_hash(line, self.hash_size)\n        self.data.append((line_id, hval))\n\n    def add_hashes(self, line_id, hval):\n        assert isinstance(hval, bytes) and len(hval) == self.hash_size\n        self.data.append((line_id, hval))\n\n    def dedup(self):\n        self.data.sort(key=lambda item: item[0])\n        dedup_set = set()\n        exclude = []\n        for line_id, hval in self.data:\n            if hval in dedup_set:\n                exclude.append(line_id)\n            else:\n                dedup_set.add(hval)\n        return exclude\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/deduplication/utils.py",
    "content": "import unicodedata\nimport re\nimport string\n\nimport regex\nimport ftfy\nfrom nltk import ngrams\n\nDIGIT_RE = regex.compile(r\"\\d\")\nPUNCT_OR_NON_PRINTING_CHARS_RE = regex.compile(r\"(\\p{P}|\\p{C})\")\n\n\ndef ccnet_normalize(line) -> str:\n    line = line.strip()\n    if not line:\n        return line\n    # normalize\n    line = unicodedata.normalize(\"NFKC\", line)\n    # case\n    line = line.lower()\n    # numbers\n    line = DIGIT_RE.sub(\"0\", line)\n    line = PUNCT_OR_NON_PRINTING_CHARS_RE.sub(\"\", line)\n    return line\n\n\nSLIMPAJAMA_LENGTH_THRESHOLD = 200\n\n\n# https://github.com/Cerebras/modelzoo/blob/de67aaec12ba684ebedc6fb841e0c4d0ff8cd2e8/modelzoo/transformers/data_processing/slimpajama/preprocessing/filter.py#L28\ndef slimpajama_tokenize(text, num_ngrams=13):\n    text = ftfy.fix_text(text, normalization=\"NFC\")\n    text = text.lower()\n    text = text.translate(str.maketrans(\"\", \"\", string.punctuation))\n    text = re.sub(r\"\\s+\", \" \", text.strip())\n    if len(text) < SLIMPAJAMA_LENGTH_THRESHOLD:\n        return\n    tokens = map(lambda x: \"\".join(x), ngrams(text, num_ngrams))\n    return tokens\n\n\ndef spm_tokenize(text, spm_model, num_ngrams=5):\n    text = text.lower()\n    tokens = spm_model.encode(text, out_type=str)\n    tokens = ngrams(tokens, num_ngrams)\n    tokens = {\" \".join(t).strip() for t in tokens}\n    return tokens\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/fasttext_classifier.py",
    "content": "import fasttext\n\nfasttext.FastText.eprint = lambda x: None\n\nFASTTEXT_LID_176_URL = (\n    \"https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin\"\n)\n\n\nclass FastTextClassifier:\n    def __init__(self, model_path):\n        self.model = fasttext.load_model(model_path)\n\n    def predict(self, text):\n        if isinstance(text, list):\n            text = \" \".join(text)\n        text = text.replace(\"\\n\", \" \")\n\n        labels, scores = self.model.predict(text, k=1)\n        label, score = labels[0], scores[0]\n        label = label.replace(\"__label__\", \"\")\n\n        return label, score\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/__init__.py",
    "content": ""
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/func/__init__.py",
    "content": ""
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/func/document.py",
    "content": "import regex\n\n\ndef document_word_count(words):\n    return len(words)\n\n\ndef document_mean_word_length(words):\n    return sum(len(x) for x in words) / len(words)\n\n\nRE_ALPHA = regex.compile(r\"\\p{L}\")\n\n\ndef document_alpha_words(words):\n    return sum(int(RE_ALPHA.search(word) is not None) for word in words)\n\n\nBULLET_POINT_SYMBOLS = (\n    \"\\u2022\",  # bullet point\n    \"\\u2023\",  # triangular bullet point\n    \"\\u25B6\",  # black right pointing triangle\n    \"\\u25C0\",  # black left pointing triangle\n    \"\\u25E6\",  # white bullet point\n    \"\\u25A0\",  # black square\n    \"\\u25A1\",  # white square\n    \"\\u25AA\",  # black small square\n    \"\\u25AB\",  # white small square\n    \"\\u2013\",  # en dash\n)\n\n\ndef document_start_with_bullet(lines):\n    cnt = 0\n    for line in lines:\n        line = line.lstrip()\n        for symbol in BULLET_POINT_SYMBOLS:\n            if line.startswith(symbol):\n                cnt += 1\n                break\n    return cnt\n\n\nELLIPSIS = \"...\"\n\n\ndef document_end_with_ellipsis(lines):\n    return sum(int(x.strip().endswith(ELLIPSIS)) for x in lines)\n\n\nGOPHER_SYMBOLS = (\"#\", \"...\")\n\n\ndef document_gopher_symbols(text):\n    return sum(text.count(x) for x in GOPHER_SYMBOLS)\n\n\nGOPHER_STOPWORDS = {\"the\", \"be\", \"to\", \"of\", \"and\", \"that\", \"have\", \"with\"}\n\n\ndef document_gopher_stopwords(words):\n    return sum(int(word in GOPHER_STOPWORDS) for word in words)\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/func/line.py",
    "content": "import regex\n\nRE_UPPER = regex.compile(r\"\\p{Lu}\")\nRE_LETTER = regex.compile(r\"\\p{L}\")\n\n\ndef line_uppercase_ratio(line):\n    cnt_upper = len(RE_UPPER.findall(line))\n    cnt_letter = len(RE_LETTER.findall(line))\n    if cnt_letter == 0:\n        return 0\n    return cnt_upper / cnt_letter\n\n\nRE_NUMERICAL = regex.compile(r\"^(\\p{N}|\\p{Z}|\\p{C})+$\")\n\n\ndef line_all_numeric(line):\n    return RE_NUMERICAL.fullmatch(line) is not None\n\n\nRE_REFINEDWEB_COUNTER = regex.compile(r\"^\\d+\\s+[a-zA-Z]+$\")\n\n\ndef line_refinedweb_counter(line):\n    return RE_REFINEDWEB_COUNTER.fullmatch(line.strip()) is not None\n\n\ndef line_regex_match(line, patterns):\n    for pattern in patterns:\n        if regex.search(pattern, line) is not None:\n            return True\n    return False\n\n\ndef test_line_uppercase_ratio():\n    line = \"ASDzxczxc a././.,./,/.123\"\n    res = line_uppercase_ratio(line)\n    # ignore number, space and puncts\n    assert res == 3 / 10\n    line = \".,/./././\"\n    res = line_uppercase_ratio(line)\n    assert res == 0\n\n\ndef test_line_all_numeric():\n    line = \"1231    34\\t345345\"\n    assert line_all_numeric(line)\n    line = \"asd1231as\"\n    assert not line_all_numeric(line)\n\n\ndef test_line_refinedweb_counter():\n    line = \"3 emails\"\n    assert line_refinedweb_counter(line)\n    line = \"3 emails emails\"\n    assert not line_refinedweb_counter(line)\n\n\ndef test_line_regex_match():\n    pattern = \"^sign in\"\n    line = \"sign in 123\"\n    assert line_regex_match(line, [pattern])\n    line = \"123 sign in 123\"\n    assert not line_regex_match(line, [pattern])\n\n    pattern = \"read more...$\"\n    line = \"123 read more...\"\n    assert line_regex_match(line, [pattern])\n    line = \"read more....\"\n    assert not line_regex_match(line, [pattern])\n\n    pattern = \"target\"\n    line = \"asdtargetasd\"\n    assert line_regex_match(line, [pattern])\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/func/repetition.py",
    "content": "from collections import Counter\n\nimport numpy as np\nfrom nltk.util import ngrams\n\n\ndef repetition_ngram_top_char_frac(words, n: int):\n    items = list(ngrams(words, n))\n    counter = Counter(items)\n    most_common = counter.most_common(1)\n    if len(most_common) == 0:\n        return 0\n    most_common_ngram, count = most_common[0]\n    if count == 1:\n        return 0\n    total_chars = sum(len(w) for w in words)\n    top_chars = sum(len(w) for w in most_common_ngram) * count\n\n    return top_chars / total_chars\n\n\ndef repetition_ngram_dup_char_frac(words, n: int):\n    items = list(ngrams(words, n))\n    counter = Counter(items)\n\n    flag_dup = np.zeros(len(words), dtype=\"bool\")\n    for i, item in enumerate(items):\n        if counter[item] > 1:\n            flag_dup[i : i + n] = True\n    total_chars = sum(len(w) for w in words)\n    dup_chars = sum(len(w) for i, w in enumerate(words) if flag_dup[i])\n    return dup_chars / total_chars\n\n\ndef repetition_line_dup_frac(lines):\n    if len(lines) == 0:\n        return 0, 0\n\n    dup_lines = 0\n    dup_chars = 0\n    counter = Counter(lines)\n    for line, count in counter.items():\n        if count > 1:\n            dup_lines += count\n            dup_chars += len(line) * count\n    total_chars = sum(len(line) for line in lines)\n    if total_chars == 0:\n        return 0, 0\n\n    return dup_lines / len(lines), dup_chars / total_chars\n\n\ndef test_ngram_top():\n    words = \"a b c a b d a b\".split()\n    res = repetition_ngram_top_char_frac(words, 2)\n    assert res == 6 / len(words)\n\n    # no repetition\n    res = repetition_ngram_top_char_frac(words, 3)\n    assert res == 0\n\n    words = \"a b c a b c a b\".split()\n    res = repetition_ngram_top_char_frac(words, 3)\n    assert res == 6 / len(words)\n\n\ndef test_ngram_dup():\n    words = \"a b c a b d a b\".split()\n    res = repetition_ngram_dup_char_frac(words, 2)\n    assert res == 6 / len(words)\n\n    words = \"a b 
c a b c a b\".split()\n    res = repetition_ngram_dup_char_frac(words, 3)\n    assert res == 1\n\n\ndef test_dup_line():\n    lines = [\"a\", \"b\", \"c\"]\n    frac, char_frac = repetition_line_dup_frac(lines)\n    assert frac == 0 and char_frac == 0\n    lines = []\n    frac, char_frac = repetition_line_dup_frac(lines)\n    assert frac == 0 and char_frac == 0\n    lines = [\"\", \"\", \"\"]\n    frac, char_frac = repetition_line_dup_frac(lines)\n    assert frac == 0 and char_frac == 0\n    lines = [\"abc\", \"de\", \"abc\"]\n    frac, char_frac = repetition_line_dup_frac(lines)\n    assert frac == 2 / 3 and char_frac == 6 / 8\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/model/__init__.py",
    "content": ""
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/model/document.py",
    "content": "import sys\nfrom functools import cached_property\n\nimport stopit\nfrom loguru import logger\nfrom sentence_splitter import split_text_into_sentences\n\nfrom ..utils import normalize\n\n\nif sys.platform == \"posix\":\n    stopit_method = stopit.SignalTimeout\nelse:\n    stopit_method = stopit.ThreadingTimeout\n\n\nclass Document:\n    def __init__(self, text, lang):\n        self.text = text\n        self.lang = lang\n\n    @cached_property\n    def sents(self):\n        with stopit_method(60) as ctx:\n            res = split_text_into_sentences(self.text, self.lang)\n        if ctx:\n            return res\n        else:\n            logger.warning(\"sentence splitter timeout\")\n            return self.text.split(\"\\n\")\n\n    @cached_property\n    def paragraphs(self):\n        return self.text.split(\"\\n\")\n\n    @cached_property\n    def normalized_text(self):\n        return normalize(self.text)\n\n    @cached_property\n    def normalized_sents(self):\n        return [normalize(sent) for sent in self.sents]\n\n    @cached_property\n    def normalized_words(self):\n        return self.normalized_text.split()\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/model/violations.py",
    "content": "from typing import List\n\nfrom .document import Document\n\n\nclass Violations:\n    def __init__(self):\n        self.doc_violations = set()\n        self.line_violations = {}\n        self.excluded_lines = set()\n\n    def doc(self, key):\n        if key in self.doc_violations:\n            raise KeyError(f\"Document violation {key} has already been set\")\n        self.doc_violations.add(key)\n\n    def line(self, key, lines: List[int]):\n        if key in self.line_violations:\n            raise KeyError(f\"Line violation {key} has already been set\")\n        lines = list(set(lines))\n        lines.sort()\n        self.line_violations[key] = lines\n        self.excluded_lines.update(lines)\n\n    def apply_to_doc(self, doc: Document) -> str | None:\n        if len(self.doc_violations) > 0:\n            return None\n\n        res = []\n        for i, line in enumerate(doc.sents):\n            if i not in self.excluded_lines:\n                res.append(line)\n        return \"\\n\".join(res)\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/ruleset/__init__.py",
    "content": ""
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/ruleset/gopher.py",
    "content": "from ..model.document import Document\nfrom ..model.violations import Violations\nfrom ..func.document import (\n    document_alpha_words,\n    document_end_with_ellipsis,\n    document_gopher_stopwords,\n    document_gopher_symbols,\n    document_mean_word_length,\n    document_start_with_bullet,\n    document_word_count,\n)\nfrom ..func.repetition import (\n    repetition_ngram_top_char_frac,\n    repetition_ngram_dup_char_frac,\n    repetition_line_dup_frac,\n)\n\nKEY_PREFIX_TOP_NGRAM = \"rr_ngram_top_\"\nTHRESHOLD_TOP_NGRAM = {2: 0.2, 3: 0.18, 4: 0.16}\nKEY_PREFIX_DUP_NGRAM = \"rr_ngram_dup_\"\nTHRESHOLD_DUP_NGRAM = {5: 0.15, 6: 0.14, 7: 0.13, 8: 0.12, 9: 0.11, 10: 0.10}\n\n\ndef gopher_filter(doc: Document):\n    violations = Violations()\n    # repetition\n    for n, thresh in THRESHOLD_TOP_NGRAM.items():\n        val = repetition_ngram_top_char_frac(doc.normalized_words, n)\n        if val > thresh:\n            violations.doc(KEY_PREFIX_TOP_NGRAM + str(n))\n    for n, thresh in THRESHOLD_DUP_NGRAM.items():\n        val = repetition_ngram_dup_char_frac(doc.normalized_words, n)\n        if val > thresh:\n            violations.doc(KEY_PREFIX_DUP_NGRAM + str(n))\n    sent_frac, sent_char_frac = repetition_line_dup_frac(doc.sents)\n    if sent_frac > 0.3:\n        violations.doc(\"rr_sent_frac\")\n    if sent_char_frac > 0.2:\n        violations.doc(\"rr_sent_char_frac\")\n    para_frac, para_char_frac = repetition_line_dup_frac(doc.paragraphs)\n    if para_frac > 0.3:\n        violations.doc(\"rr_para_frac\")\n    if para_char_frac > 0.2:\n        violations.doc(\"rr_para_char_frac\")\n    # document\n    word_count = document_word_count(doc.normalized_words)\n    if word_count < 50 or word_count > 100_000:\n        violations.doc(\"doc_word_count\")\n    mean_word_len = document_mean_word_length(doc.normalized_words)\n    if mean_word_len < 3 or mean_word_len > 10:\n        violations.doc(\"doc_mean_word_len\")\n    symbol_to_word = 
document_gopher_symbols(doc.normalized_text) / len(\n        doc.normalized_words\n    )\n    if symbol_to_word > 0.1:\n        violations.doc(\"doc_gopher_symbol_to_word\")\n    alpha_word_rate = document_alpha_words(doc.normalized_words) / len(\n        doc.normalized_words\n    )\n    if alpha_word_rate < 0.8:\n        violations.doc(\"doc_alpha_word_rate\")\n    el_end_line_rate = document_end_with_ellipsis(doc.normalized_sents) / len(\n        doc.normalized_sents\n    )\n    if el_end_line_rate > 0.3:\n        violations.doc(\"doc_el_end_line_rate\")\n    bullet_start_line_rate = document_start_with_bullet(doc.normalized_sents) / len(\n        doc.normalized_sents\n    )\n    if bullet_start_line_rate > 0.9:\n        violations.doc(\"doc_bullet_start_line_rate\")\n    stopword_cnt = document_gopher_stopwords(doc.normalized_words)\n    if stopword_cnt < 2:\n        violations.doc(\"doc_gopher_stopword_count\")\n\n    return violations\n\n\ndef apply_gopher_rules(text, lang):\n    doc = Document(text, lang)\n    violations = gopher_filter(doc)\n    filtered_text = violations.apply_to_doc(doc)\n    return filtered_text\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/ruleset/refinedweb.py",
    "content": "import regex\nfrom .gopher import gopher_filter\nfrom ..model.document import Document\nfrom ..func.line import (\n    line_all_numeric,\n    line_uppercase_ratio,\n    line_refinedweb_counter,\n    line_regex_match,\n)\n\nEXCLUDE_PATTERNS = (\n    \"^sign in\",\n    \"^sign-in\",\n    \"^sign up\",\n    \"^sign-up\",\n    \"read more...$\",\n    \"items in cart\",\n)\nEXCLUDE_PATTERNS = (regex.compile(x) for x in EXCLUDE_PATTERNS)\n\n\ndef refinedweb_filter(doc: Document):\n    violations = gopher_filter(doc)\n    # line\n    res = []\n    for i, line in enumerate(doc.sents):\n        upper_ratio = line_uppercase_ratio(line)\n        if upper_ratio > 0.6:\n            res.append(i)\n    violations.line(\"line_upper_ratio\", res)\n\n    res = []\n    for i, line in enumerate(doc.normalized_sents):\n        if line_all_numeric(line):\n            res.append(i)\n    violations.line(\"line_all_numeric\", res)\n\n    res = []\n    for i, line in enumerate(doc.normalized_sents):\n        if line_refinedweb_counter(line):\n            res.append(i)\n    violations.line(\"line_refinedweb_counter\", res)\n\n    res = []\n    for i, line in enumerate(doc.normalized_sents):\n        if len(line.split()) == 1:\n            res.append(i)\n    violations.line(\"line_one_word\", res)\n\n    res = []\n    for i, line in enumerate(doc.normalized_sents):\n        if line_regex_match(line, EXCLUDE_PATTERNS):\n            res.append(i)\n    violations.line(\"line_exclude_patterns\", res)\n\n    total_words = sum(len(line.split()) for line in doc.normalized_sents)\n    excluded_words = sum(\n        len(line.split())\n        for i, line in enumerate(doc.normalized_sents)\n        if i in violations.excluded_lines\n    )\n    if excluded_words / total_words > 0.05:\n        violations.doc(\"line_document_discarded\")\n\n    return violations\n\n\ndef apply_refinedweb_rules(text, lang):\n    doc = Document(text, lang)\n    violations = refinedweb_filter(doc)\n    
filtered_text = violations.apply_to_doc(doc)\n    return filtered_text\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/rule_based_filters/utils.py",
    "content": "import unicodedata\n\nimport regex\n\nRE_PUNCT = regex.compile(r\"\\p{P}\")\nRE_URL = regex.compile(\n    r\"https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)\"\n)\n\nRE_LINE_SEPARATORS = regex.compile(r\"(\\p{Zl}|\\p{Zp})+\")\nRE_SPACE_SEPARATORS = regex.compile(r\"\\p{Zs}+\")\n\n\ndef remove_url(text):\n    return RE_URL.sub(\"\", text)\n\n\ndef remove_consecutive_new_lines(text):\n    return RE_LINE_SEPARATORS.sub(\"\\n\", text)\n\n\ndef remove_punct(text):\n    return RE_PUNCT.sub(\"\", text)\n\n\ndef normalize(text):\n    text = unicodedata.normalize(\"NFKC\", text)\n    text = text.lower()\n    text = text.strip()\n    text = remove_consecutive_new_lines(text)\n    text = RE_SPACE_SEPARATORS.sub(\" \", text)\n    return text\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/algos/trafilatura_process.py",
    "content": "import zlib\nimport re\n\nimport brotlicffi\nimport lxml.etree as ET\nfrom lxml.html import tostring\nfrom trafilatura import bare_extraction\nfrom trafilatura.xml import xmltotxt\nfrom trafilatura.meta import reset_caches as trafilatura_reset_caches\n\nFLAG_TRAFILATURA_RESET_CACHE = False\nZIP_BOMB_SIZE_THRESHOLD = 100 * 1000 * 1000\n\n\nclass EmptyResultException(Exception):\n    pass\n\n\ndef _remove_dup_newline(text):\n    fields = text.split(\"\\n\")\n    for i in range(len(fields)):\n        fields[i] = fields[i].strip()\n\n    text = \"\\n\".join(fields)\n\n    return re.sub(\"\\n{2,}\", \"\\n\\n\", text).strip()\n\n\ndef _normalize_whitespace(tree):\n    def _normalize(text):\n        text = text.replace(\"\\n\", \"\")\n        text = re.sub(r\"[\\t ]+\", \" \", text)\n        return text\n\n    for item in tree.xpath(\n        \"//*[not(ancestor-or-self::pre) and not(ancestor-or-self::textarea)]\"\n    ):\n        if item.text is not None:\n            item.text = _normalize(item.text)\n        for c in item:\n            if c.tail is not None:\n                c.tail = _normalize(c.tail)\n    return tree\n\n\ndef _traf_xml_to_html(tree):\n    # replace tag\n    for elem in tree.iter(\n        \"hi\", \"list\", \"item\", \"head\", \"lb\", \"quote\", \"del\", \"row\", \"cell\", \"ab\"\n    ):\n        if elem.tag == \"hi\":\n            rend = elem.get(\"rend\", \"b\")\n            if rend == \"#i\":\n                elem.tag = \"i\"\n            elif rend == \"#b\":\n                elem.tag = \"b\"\n            elif rend == \"#u\":\n                elem.tag = \"u\"\n            elif rend == \"#t\":\n                elem.tag = \"code\"\n            elif rend == \"#sub\":\n                elem.tag = \"sub\"\n            elif rend == \"#sup\":\n                elem.tag = \"sup\"\n            if \"rend\" in elem.attrib:\n                elem.attrib.pop(\"rend\")\n        elif elem.tag == \"list\":\n            rend = elem.get(\"rend\", 
\"ul\")\n            elem.tag = rend\n            if \"rend\" in elem.attrib:\n                elem.attrib.pop(\"rend\")\n        elif elem.tag == \"item\":\n            rend = elem.get(\"rend\")\n            if not rend:\n                elem.tag = \"li\"\n            else:\n                tag, _idx = rend.split(\"-\", 1)\n                elem.tag = tag\n            if \"rend\" in elem.attrib:\n                elem.attrib.pop(\"rend\")\n        elif elem.tag == \"head\":\n            rend = elem.get(\"rend\", \"h6\")\n            elem.tag = rend\n            if \"rend\" in elem.attrib:\n                elem.attrib.pop(\"rend\")\n        elif elem.tag == \"lb\":\n            elem.tag = \"br\"\n        elif elem.tag == \"quote\":\n            elem.tag = \"pre\"\n        elif elem.tag == \"delete\":\n            elem.tag = \"del\"\n        elif elem.tag == \"row\":\n            elem.tag = \"tr\"\n        elif elem.tag == \"cell\":\n            if \"role\" in elem:\n                if elem[\"role\"] == \"head\":\n                    elem.tag = \"th\"\n                    elem.attrib.pop(\"role\")\n                    continue\n            elem.tag = \"td\"\n        elif elem.tag == \"ab\":\n            if \"type\" in elem:\n                if elem[\"type\"] == \"header\":\n                    elem.tag = \"h6\"\n                    elem.attrib.pop(\"type\")\n                    continue\n            elem.tag = \"p\"\n    return tree\n\n\ndef _build_traf_doc_full(traf_bare_res):\n    title = traf_bare_res.get(\"title\", \"\")\n    main = traf_bare_res[\"body\"]\n    comments = traf_bare_res.get(\"commentsbody\")\n    output = ET.Element(\"body\")\n    if title is not None and len(title) > 0:\n        ele = ET.Element(\"h1\")\n        ele.text = title\n        output.append(ele)\n    main.tag = \"p\"\n    output.append(main)\n    if comments is not None:\n        comments.tag = \"p\"\n        output.append(comments)\n\n    output = _traf_xml_to_html(output)\n    return 
output\n\n\n# no title no comments\ndef _build_traf_doc(traf_bare_res):\n    output = ET.Element(\"body\")\n\n    main = traf_bare_res[\"body\"]\n    main.tag = \"div\"\n    output.append(main)\n\n    output = _traf_xml_to_html(output)\n    return output\n\n\n_RESET_CACHES_INTERVAL = 100\n_reset_caches_counter = 0\n\n\ndef _reset_caches():\n    global _reset_caches_counter, _RESET_CACHES_INTERVAL\n    _reset_caches_counter += 1\n    if _reset_caches_counter >= _RESET_CACHES_INTERVAL:\n        trafilatura_reset_caches()\n        _reset_caches_counter = 0\n\n\ndef _detect_zip_bomb(data):\n    if isinstance(data, bytes):\n        if data[:2] == b\"\\x1f\\x8b\":\n            try:\n                count = 0\n                dec = zlib.decompressobj(32 + zlib.MAX_WBITS)\n                for i in range(0, len(data), 64):\n                    chunk = data[i : i + 64]\n                    rv = dec.decompress(chunk)\n                    count += len(rv)\n                    if count > ZIP_BOMB_SIZE_THRESHOLD:\n                        return True\n            except (EOFError, OSError):\n                pass\n        # try brotli\n        else:\n            try:\n                count = 0\n                dec = brotlicffi.Decompressor()\n                for i in range(0, len(data), 64):\n                    chunk = data[i : i + 64]\n                    rv = dec.decompress(chunk)\n                    count += len(rv)\n                    if count > ZIP_BOMB_SIZE_THRESHOLD:\n                        return True\n            except brotlicffi.error:\n                pass  # logging.debug('invalid Brotli file')\n\n    return False\n\n\n# ref: https://gitlab.gnome.org/GNOME/libxml2/-/blame/master/include/libxml/parserInternals.h#L45\nHTML_LENGTH_THRESHOLD = 10_000_000\n\n\ndef trafilatura_process(html):\n    assert not _detect_zip_bomb(html), \"zip bomb detected\"\n    assert len(html) < HTML_LENGTH_THRESHOLD, \"Skip html that exceed length limit\"\n\n    # article extraction\n    
traf_res = bare_extraction(\n        html,\n        output_format=\"txt\",\n        include_comments=False,\n        favor_precision=True,\n        include_formatting=True,\n        include_tables=True,\n        include_images=False,\n        include_links=False,\n        deduplicate=False,\n    )\n    if traf_res is None:\n        raise EmptyResultException(\"Trafilatura empty result\")\n    traf_html_tree = _build_traf_doc(traf_res)\n    traf_html_tree = _normalize_whitespace(traf_html_tree)\n    traf_html = tostring(traf_html_tree, encoding=\"unicode\")\n    traf_text = xmltotxt(traf_html_tree, False)\n    traf_text = _remove_dup_newline(traf_text)\n\n    if FLAG_TRAFILATURA_RESET_CACHE:\n        _reset_caches()\n\n    return {\"text\": traf_text, \"html\": traf_html}\n\n\n__all__ = [\n    \"trafilatura_process\",\n]\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/download_utils.py",
    "content": "import os\nimport subprocess\nimport shlex\nimport shutil\nfrom functools import lru_cache\nfrom urllib.parse import urlparse\n\nimport requests\nfrom loguru import logger\n\n\ndef _url_basename(url):\n    parse_res = urlparse(url)\n    return os.path.split(parse_res.path)[1]\n\n\ndef _normalize_dst(src, dst):\n    if os.path.isdir(dst):\n        dst = os.path.join(dst, _url_basename(src))\n\n    return dst\n\n\n@lru_cache\ndef detect_aria2():\n    p = subprocess.run([\"aria2c\", \"--version\"], shell=True)\n    return p.returncode == 0\n\n\ndef download_with_aria2(src, dst, num_connections=16, quiet=False, extra_args=None):\n    if not detect_aria2():\n        raise RuntimeError(\"aria2c not detected\")\n\n    dst = _normalize_dst(src, dst)\n    if extra_args is None:\n        extra_args = []\n    elif not isinstance(extra_args, list):\n        raise ValueError(f\"Invalid extra_args type {type(extra_args)}\")\n\n    parts = [\n        \"aria2\",\n        \"-x\",\n        str(num_connections),\n        \"-s\",\n        str(num_connections),\n        \"--retry-after\",\n        \"3\",\n        *extra_args,\n    ]\n    if quiet:\n        parts.append(\"-q\")\n    else:\n        parts.append(\"--console-log-level=error\")\n        parts.append(\"--download-result=hide\")\n        # known issue: tqdm progress bar may still be overided by aria2\n        parts.append(\"--show-console-readout=false\")\n\n    parts.append(src)\n    dst_dir = os.path.dirname(dst)\n    dst_name = os.path.basename(dst)\n    parts.append(\"-d\")\n    parts.append(dst_dir)\n    parts.append(\"-o\")\n    parts.append(dst_name)\n    cmd = shlex.join(parts)\n    subprocess.run(cmd, shell=True, check=True)\n\n    return dst\n\n\ndef download_with_requests(src, dst):\n    dst = _normalize_dst(src, dst)\n    with requests.get(src, stream=True) as r:\n        r.raise_for_status()\n        with open(dst, \"wb\") as f:\n            shutil.copyfileobj(r.raw, f)\n\n    return dst\n\n\ndef 
download(src, dst):\n    if detect_aria2():\n        return download_with_aria2(src, dst)\n    else:\n        logger.info(\"aria2 not found, falling back to requests\")\n        return download_with_requests(src, dst)\n"
  },
  {
    "path": "GeneralDomain/redstone_cc/process.py",
    "content": "import tempfile\nimport os\n\nimport pyarrow.parquet as pq\nfrom tqdm import tqdm\nfrom warcio.archiveiterator import ArchiveIterator\nfrom loguru import logger\n\nfrom .download_utils import download\nfrom .algos.trafilatura_process import trafilatura_process, EmptyResultException\nfrom .algos.fasttext_classifier import FASTTEXT_LID_176_URL, FastTextClassifier\nfrom .algos.rule_based_filters.ruleset.refinedweb import apply_refinedweb_rules\n\nLA_PROB_THRESHOLD = 0.65\n\n\ndef process_items(remote_cc_path, items, disable_tqdm=False):\n    # items to dict\n    uri_to_item = dict()\n    for item in items:\n        assert item[\"cc_path\"] == remote_cc_path\n        uri_to_item[item[\"uri\"]] = item\n\n    # main processing\n    with tempfile.TemporaryDirectory(dir=os.getcwd()) as tmp_dir:\n        logger.info(f\"downloading warc file {remote_cc_path}\")\n        local_cc_file = download(remote_cc_path, tmp_dir)\n        # prepare lid model\n        logger.info(f\"downloading fasttext lid model {FASTTEXT_LID_176_URL}\")\n        local_lid_model = download(FASTTEXT_LID_176_URL, tmp_dir)\n        lid_classfier = FastTextClassifier(local_lid_model)\n\n        results = []\n        with open(local_cc_file, \"rb\") as fd:\n            for record in tqdm(ArchiveIterator(fd), disable=disable_tqdm):\n                warc_type = record.rec_headers.get_header(\"WARC-Type\")\n                if warc_type != \"response\":\n                    continue\n\n                uri = record.rec_headers.get_header(\"WARC-Target-URI\")\n                if uri not in uri_to_item:\n                    continue\n                # article extraction\n                raw_html = record.content_stream().read()\n                try:\n                    traf_res = trafilatura_process(raw_html)\n                except EmptyResultException:\n                    logger.warning(f\"trafilatura: failed to convert record: {uri}\")\n\n                traf_text = traf_res[\"text\"]\n        
        # lid\n                la, la_prob = lid_classfier.predict(traf_text)\n                if la != \"en\" or la_prob < LA_PROB_THRESHOLD:\n                    continue\n                # rule based filter\n                filtered_text = apply_refinedweb_rules(traf_text, la)\n                if filtered_text is None:\n                    continue\n\n                result_item = {\n                    **uri_to_item[uri],\n                    \"text\": filtered_text,\n                }\n\n                results.append(result_item)\n\n    return results\n\n\ndef process_file(index_path):\n    items = pq.read_table(index_path).to_pylist()\n    assert len(items) > 0\n    cc_path = items[0][\"cc_path\"]\n    return process_items(cc_path, items)\n"
  },
  {
    "path": "LICENSE",
    "content": "    MIT License\n\n    Copyright (c) Microsoft Corporation.\n\n    Permission is hereby granted, free of charge, to any person obtaining a copy\n    of this software and associated documentation files (the \"Software\"), to deal\n    in the Software without restriction, including without limitation the rights\n    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n    copies of the Software, and to permit persons to whom the Software is\n    furnished to do so, subject to the following conditions:\n\n    The above copyright notice and this permission notice shall be included in all\n    copies or substantial portions of the Software.\n\n    THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n    SOFTWARE\n"
  },
  {
    "path": "README.md",
    "content": "<p align=\"center\">\n  <img src=\"assets/icon.png\" width=\"150\">\n  <br />\n  <br />\n  <a href=\"https://huggingface.co/datasets/zjsd/RedStone\"><img alt=\"MIT License\" src=\"https://img.shields.io/badge/Hugging%20Face-Dataset-orange?logo=huggingface\" /></a>\n  <a href=\"https://arxiv.org/abs/2412.03398\"><img alt=\"MIT License\" src=\"https://img.shields.io/badge/ArXiv-2412.03398-green.svg\" /></a>\n  <a href=\"https://github.com/microsoft/RedStone/blob/main/LICENSE\"><img alt=\"MIT License\" src=\"https://img.shields.io/badge/license-MIT-blue.svg\" /></a>\n</p>\n\n--------------------------------------------------------------------------------\n\n# [REDSTONE : Curating General, Code, Math, and QA Data for Large Language Models](https://arxiv.org/abs/2412.03398)\n\n**RedStone** is an innovative and scalable pipeline designed to extract and process data from a vast amount of web content, facilitating the creation of diverse and comprehensive pre-training datasets. We demonstrate its capabilities by building pre-training datasets across multiple domains, including general, code, mathematics, and question-answering. REDSTONE's flexibility allows it to easily adapt to various specialized fields.\n\n# Dataset\n| Datasets        | Tokens (B) | Link |\n|-----------------|------------| ---- |\n| REDSTONE-Web    | 3,170.2    | [REDSTONE-Web](https://huggingface.co/datasets/zjsd/RedStone) |\n| REDSTONE-Code   | 250.2      | [REDSTONE-Code-python (Python Only)](https://huggingface.co/datasets/zjsd/RedStone-Code-python) |\n| REDSTONE-Math   | 15.9       | [REDSTONE-Math](https://huggingface.co/datasets/zjsd/RedStone-Math) |\n| REDSTONE-QA     | 51.4       | [REDSTONE-OpenQuestion](https://huggingface.co/datasets/zjsd/RedStone-QA-oq) [REDSTONE-MultiChoiceQuestion](https://huggingface.co/datasets/zjsd/RedStone-QA-mcq) |\n\n**UPDATE [2/10/2025]**: All open-source datasets are reproduced by [@zjsd](https://huggingface.co/zjsd) based on our open-source code. 
We have verified the scale of these datasets and manually reviewed some samples; they are comparable to our internal datasets in both size and quality.\n\n**Note [12/08/2024]:** Since **we do not have permission to open-source the processed data**, we provide all the code for RedStone to process both general and domain-specific data, along with an [index](https://huggingface.co/datasets/microsoft/RedStone) for high-quality data from Common Crawl after filtering. You can download the raw Common Crawl data, use the provided index to find high-quality pages, and process them with RedStone's scripts.\n\nIf you have the appropriate licenses, **we encourage you to use these scripts to reproduce the dataset and contribute it to the open-source community**. We will reference the data here for easy access. Additionally, we welcome you to use RedStone to expand domain-specific categories beyond just code, math, and QA.\n\n# Performance\n### General Domain Data\n| Datasets      | ARC-c | ARC-e | HellaSwag | OpenBookQA | PIQA  | Winogrande | AVERAGE |\n|---------------|-------|-------|-----------|------------|-------|------------|---------|\n| RedPajama     | 0.2270| 0.4386| 0.3171    | 0.1900     | 0.5968| **0.5296** | 0.3832  |\n| FineWeb       | 0.1928| 0.4428| 0.3506    | 0.1740     | 0.6681| 0.5288     | 0.3929  |\n| RefinedWeb    | 0.2125| 0.4369| 0.3380    | 0.2100     | 0.6491| 0.5264     | 0.3955  |\n| DCLM          | 0.2159| 0.4848| 0.3614    | 0.1760     | 0.6615| 0.5082     | 0.4013  |\n| FineWeb-Edu   | **0.2722**| **0.5648**| 0.3637    | 0.1940     | 0.6676| 0.5051     | 0.4279  |\n| **REDSTONE-Web**  | 0.2662| 0.5181| **0.3722**| **0.2340** | **0.6795**| 0.5162     | **0.4310** |\n\n<sub>**The results are based on models trained with 1.3 billion parameters on 50 billion tokens.**</sub>\n\n### Domain-specific Data\n#### REDSTONE-Code\n| Dataset         | HumanEval pass@1 | HumanEval pass@10 | MBPP pass@1 | MBPP pass@10 
|\n|-----------------|------------------|-------------------|-------------|--------------|\n| REDSTONE-Web    | 0.0125           | 0.0168            | 0.0751      | 0.1566       |\n| + **REDSTONE-Code** | **0.0555**       | **0.1035**        | **0.1311**  | **0.2458**   |\n\n#### REDSTONE-Math\n| Dataset                    | GSM8k  | MATH   |\n|----------------------------|--------|--------|\n| OpenWebMath       | 3.2503 | 3.1288 |\n| **REDSTONE-Math**              | **3.1125** | **3.0557** |\n\n#### REDSTONE-QA\n| Model               | MMLU  | Arc Challenge | Arc Easy | OpenbookQA | Winogrande | AVERAGE |\n|---------------------|-------|---------------|----------|------------|------------|---------|\n| StableLM-2-1.6B     | 0.3135| 0.3481        | **0.6860**| 0.2780     | 0.6354     | 0.4522  |\n| + FLAN v2           | 0.3525| 0.3601        | 0.6406   | **0.2860** | 0.6125     | 0.4503  |\n| + Open Orca         | 0.3569| 0.3089        | 0.5821   | 0.2660     | 0.5675     | 0.4163  |\n| + **REDSTONE-QA**       | **0.4582**| **0.3643**| 0.6839   | 0.2760     | **0.6377** | **0.4840** |\n\n**<sub>For evaluations on the domain-specific datasets, we used the same architecture as StableLM-2-1.6B.</sub>**\n\n# Getting Started\n\n| Domain | Link |\n|----------------------|--------------------------------------------------------------------------------------------|\n| General Domain Data  |[Getting Started](https://github.com/microsoft/RedStone/blob/main/GeneralDomain/README.md)  | \n| Domain-specific Data |[Getting Started](https://github.com/microsoft/RedStone/blob/main/DomainSpecific/readme.md) |\n\n# Responsible AI FAQ\n- **What is RedStone Source Code?**\n    - RedStone is a pipeline designed to extract a wide range of specified knowledge from Common Crawl on a large scale. It is composed of three modules: Collection, Filtering, and Extraction. 
As an example, we use RedStone to build extensive domain-specific datasets in the fields of code, mathematics, question answering (QA), and general data. With RedStone, valuable knowledge can also be acquired from many other domains within Common Crawl.\n- **What can RedStone Source Code do?**\n    - RedStone Source Code provides sample code for the pipeline’s components and workflow, together with an index of source locations, enabling anyone to construct large-scale datasets for various domains from Common Crawl, including general web content, web code, web mathematics, and web QA data.\n- **What is/are RedStone Source Code’s intended use(s)?**\n    - We release RedStone to give the research community a resource that accelerates the development of large language models and demonstrates a novel method of constructing training datasets. Given the research nature of this work, production or commercial uses are out of scope without further testing and mitigation.\n- **How was RedStone Source Code evaluated? What metrics are used to measure performance?**\n    - We use RedStone to build domain-specific datasets in the fields of code, mathematics, question answering (QA), and general datasets as examples. We evaluate the performance of the datasets across multiple benchmarks, demonstrating that RedStone significantly enhances model performance in mathematics, code, and QA tasks.\n- **What are the limitations of RedStone Source Code? How can users minimize the impact of RedStone dataset’s limitations when using the system?**\n    - RedStone takes several domains as examples to verify the methodologies and pipelines. We believe the same approach should work for other fields. However, the source code repo is customized for these domains and English materials only. 
Obtaining data for other domains or languages, or running in a different environment, requires extra effort to adapt the code to your tasks and settings.\n    - RedStone employs quality filters to get content with correct grammar, logical consistency, and factual accuracy. Despite our efforts to remove toxic content, some harmful content may be present.\n    - RedStone experimented with the scope of deduplication, and the results indicate that narrowing the scope of deduplication yields the highest scores. A possible explanation is that a narrower deduplication scope results in a data distribution that more closely mirrors the real world, where frequently occurring data in real life also appears multiple times in the dataset. However, we are currently unable to verify this hypothesis and will investigate it.\n    - The raw data might contain incorrect data that could not be filtered out, which may result in inaccurate answers for some questions.\n    - Common Crawl data may not be suitable for all downstream uses due to copyright or other legal reasons. 
Users are responsible for verifying the legal right to use Common Crawl data for their intended purpose.\n- **What operational factors and settings allow for effective and responsible use of RedStone Source Code?**\n    - The user is responsible for validating the safety and accuracy of any datasets developed using RedStone Source Code, or any model developed using a dataset constructed using our methods.\n\n# Citation\nIf you find this repository useful, please consider citing our work:\n```\n@article{redstone,\n  title={{RedStone}: {Curating} General, Code, Math, and {QA} Data for Large Language Models},\n  author={Chang, Yaoyao and Cui, Lei and Dong, Li and Huang, Shaohan and Huang, Yangyu and Huang, Yupan and Li, Scarlett and Lv, Tengchao and Ma, Shuming and Sun, Qinzheng and others},\n  journal={arXiv preprint arXiv:2412.03398},\n  year={2024}\n}\n```\n\n# License\nThe content of this project itself is licensed under the [MIT](./LICENSE)\n\n[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)\n\n# Contact\nFor help or issues using RedStone, please submit a GitHub issue.\n\nFor other communications related to RedStone, please contact [Lei Cui](mailto:lecu@microsoft.com) or [Furu Wei](mailto:fuwei@microsoft.com).\n\n\n"
  },
  {
    "path": "SECURITY.md",
    "content": "<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->\n\n## Security\n\nMicrosoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).\n\nIf you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.\n\n## Reporting Security Issues\n\n**Please do not report security vulnerabilities through public GitHub issues.**\n\nInstead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).\n\nIf you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com).  If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).\n\nYou should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). \n\nPlease include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:\n\n  * Type of issue (e.g. 
buffer overflow, SQL injection, cross-site scripting, etc.)\n  * Full paths of source file(s) related to the manifestation of the issue\n  * The location of the affected source code (tag/branch/commit or direct URL)\n  * Any special configuration required to reproduce the issue\n  * Step-by-step instructions to reproduce the issue\n  * Proof-of-concept or exploit code (if possible)\n  * Impact of the issue, including how an attacker might exploit the issue\n\nThis information will help us triage your report more quickly.\n\nIf you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.\n\n## Preferred Languages\n\nWe prefer all communications to be in English.\n\n## Policy\n\nMicrosoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).\n\n<!-- END MICROSOFT SECURITY.MD BLOCK -->\n"
  },
  {
    "path": "SUPPORT.md",
    "content": "# TODO: The maintainer of this repo has not yet edited this file\r\n\r\n**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?\r\n\r\n- **No CSS support:** Fill out this template with information about how to file issues and get help.\r\n- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.\r\n- **Not sure?** Fill out an intake as though the answer were \"Yes\". CSS will help you decide.\r\n\r\n*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*\r\n\r\n# Support\r\n\r\n## How to file issues and get help  \r\n\r\nThis project uses GitHub Issues to track bugs and feature requests. Please search the existing \r\nissues before filing new issues to avoid duplicates.  For new issues, file your bug or \r\nfeature request as a new Issue.\r\n\r\nFor help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE \r\nFOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER\r\nCHANNEL. WHERE WILL YOU HELP PEOPLE?**.\r\n\r\n## Microsoft Support Policy  \r\n\r\nSupport for this **PROJECT or PRODUCT** is limited to the resources listed above.\r\n"
  }
]