Repository: microsoft/RedStone
Branch: main
Commit: 50b3bd9dcc6f
Files: 130
Total size: 57.7 MB

Directory structure:
gitextract_ayw6h_qv/

├── .github/
│   └── workflows/
│       └── codeql.yml
├── CODE_OF_CONDUCT.md
├── DomainSpecific/
│   ├── .gitignore
│   ├── configs/
│   │   ├── cc_math_filter.CC-MAIN-2023-23.json
│   │   ├── cc_openquestion_filter.CC-MAIN-2023-23.json
│   │   ├── cc_warc_download.CC-MAIN-2023-23.json
│   │   ├── cc_warc_filter.CC-MAIN-2023-23.json
│   │   ├── cc_warc_to_wet.code.CC-MAIN-2023-23.json
│   │   ├── cc_warc_to_wet.math.CC-MAIN-2023-23.json
│   │   └── network_template.json
│   ├── core/
│   │   ├── __init__.py
│   │   ├── data.py
│   │   ├── layer.py
│   │   ├── layers/
│   │   │   ├── __init__.py
│   │   │   ├── control/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── data_concat_layer.py
│   │   │   │   ├── data_filter_layer.py
│   │   │   │   ├── data_order_layer.py
│   │   │   │   ├── data_partition_layer.py
│   │   │   │   ├── data_sample_layer.py
│   │   │   │   └── data_shuffle_layer.py
│   │   │   ├── extract/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── build_index_layer.py
│   │   │   │   ├── extract_article_layer.py
│   │   │   │   └── search_index_layer.py
│   │   │   ├── global_var.py
│   │   │   ├── io/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── from_binary_file_layer.py
│   │   │   │   ├── from_index_file_layer.py
│   │   │   │   ├── from_jsonl_file_layer.py
│   │   │   │   ├── from_line_file_layer.py
│   │   │   │   ├── from_parquet_file_layer.py
│   │   │   │   ├── from_warc_file_layer.py
│   │   │   │   ├── from_wet_file_layer.py
│   │   │   │   ├── to_binary_file_layer.py
│   │   │   │   ├── to_index_file_layer.py
│   │   │   │   ├── to_jsonl_file_layer.py
│   │   │   │   ├── to_line_file_layer.py
│   │   │   │   └── to_parquet_file_layer.py
│   │   │   ├── network/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── download_bytes_from_blob_layer.py
│   │   │   │   ├── download_bytes_from_internet_layer.py
│   │   │   │   ├── download_file_from_blob_layer.py
│   │   │   │   ├── download_file_from_internet_layer.py
│   │   │   │   ├── download_starcoder_layer.py
│   │   │   │   ├── download_url_list_layer.py
│   │   │   │   ├── download_urls_from_website_layer.py
│   │   │   │   ├── download_warc_file_layer.py
│   │   │   │   ├── download_warc_indice_layer.py
│   │   │   │   ├── upload_bytes_to_blob_layer.py
│   │   │   │   └── upload_file_to_blob_layer.py
│   │   │   ├── template_layer.py
│   │   │   ├── transform/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── lsh_minhash_layer.py
│   │   │   │   ├── math_filter_layer.py
│   │   │   │   ├── mcq_filter_layer.py
│   │   │   │   ├── minhash_tokens_layer.py
│   │   │   │   ├── ngrams_layer.py
│   │   │   │   ├── openquestion_filter_layer.py
│   │   │   │   ├── tokenize_article_layer.py
│   │   │   │   ├── warc_encode_layer.py
│   │   │   │   ├── warc_filter_layer.py
│   │   │   │   ├── warc_to_wet_layer.py
│   │   │   │   └── wet_decode_layer.py
│   │   │   └── util.py
│   │   └── network.py
│   ├── dependency/
│   │   ├── gpt_api.py
│   │   ├── ia-hadoop-tools-jar-with-dependencies.jar
│   │   ├── install.py
│   │   ├── requirements.txt
│   │   └── xsltml_2.0/
│   │       ├── cmarkup.xsl
│   │       ├── entities.xsl
│   │       ├── glayout.xsl
│   │       ├── mmltex.xsl
│   │       ├── scripts.xsl
│   │       ├── tables.xsl
│   │       └── tokens.xsl
│   ├── readme.md
│   ├── requirements.txt
│   ├── resources/
│   │   ├── computation/
│   │   │   ├── batch_dca_eastus.yaml
│   │   │   └── local.yaml
│   │   ├── environment/
│   │   │   ├── amlt_sing.yaml
│   │   │   └── local.yaml
│   │   └── storage/
│   │       ├── llmstore.yaml
│   │       └── local.yaml
│   ├── sample_run.sh
│   ├── submit.py
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── submit_batch_job.py
│   │   └── submit_local_job.py
│   └── wrapper/
│       ├── __init__.py
│       ├── interpreter.py
│       ├── parser.py
│       ├── runner.py
│       └── utility/
│           ├── __init__.py
│           ├── azure_env.py
│           ├── cpu_count.py
│           ├── load_yaml.py
│           ├── logger.py
│           └── save_yaml.py
├── GeneralDomain/
│   ├── .gitignore
│   ├── README.md
│   ├── pyproject.toml
│   └── redstone_cc/
│       ├── __init__.py
│       ├── __main__.py
│       ├── algos/
│       │   ├── __init__.py
│       │   ├── deduplication/
│       │   │   ├── __init__.py
│       │   │   ├── minhash.py
│       │   │   ├── sha1.py
│       │   │   └── utils.py
│       │   ├── fasttext_classifier.py
│       │   ├── rule_based_filters/
│       │   │   ├── __init__.py
│       │   │   ├── func/
│       │   │   │   ├── __init__.py
│       │   │   │   ├── document.py
│       │   │   │   ├── line.py
│       │   │   │   └── repetition.py
│       │   │   ├── model/
│       │   │   │   ├── __init__.py
│       │   │   │   ├── document.py
│       │   │   │   └── violations.py
│       │   │   ├── ruleset/
│       │   │   │   ├── __init__.py
│       │   │   │   ├── gopher.py
│       │   │   │   └── refinedweb.py
│       │   │   └── utils.py
│       │   └── trafilatura_process.py
│       ├── download_utils.py
│       └── process.py
├── LICENSE
├── README.md
├── SECURITY.md
└── SUPPORT.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/codeql.yml
================================================
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL Advanced"

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
  schedule:
    - cron: '24 3 * * 5'

jobs:
  analyze:
    name: Analyze (${{ matrix.language }})
    # Runner size impacts CodeQL analysis time. To learn more, please see:
    #   - https://gh.io/recommended-hardware-resources-for-running-codeql
    #   - https://gh.io/supported-runners-and-hardware-resources
    #   - https://gh.io/using-larger-runners (GitHub.com only)
    # Consider using larger runners or machines with greater resources for possible analysis time improvements.
    runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}
    permissions:
      # required for all workflows
      security-events: write

      # required to fetch internal or private CodeQL packs
      packages: read

      # only required for workflows in private repositories
      actions: read
      contents: read

    strategy:
      fail-fast: false
      matrix:
        include:
        - language: python
          build-mode: none
        # CodeQL supports the following values for 'language': 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'swift'
        # Use `c-cpp` to analyze code written in C, C++ or both
        # Use 'java-kotlin' to analyze code written in Java, Kotlin or both
        # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both
        # To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,
        # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.
        # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how
        # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    # Initializes the CodeQL tools for scanning.
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v3
      with:
        languages: ${{ matrix.language }}
        build-mode: ${{ matrix.build-mode }}
        # If you wish to specify custom queries, you can do so here or in a config file.
        # By default, queries listed here will override any specified in a config file.
        # Prefix the list here with "+" to use these queries and those in the config file.

        # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
        # queries: security-extended,security-and-quality

    # If the analyze step fails for one of the languages you are analyzing with
    # "We were unable to automatically build your code", modify the matrix above
    # to set the build mode to "manual" for that language. Then modify this step
    # to build your code.
    # ℹ️ Command-line programs to run using the OS shell.
    # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
    - if: matrix.build-mode == 'manual'
      shell: bash
      run: |
        echo 'If you are using a "manual" build mode for one or more of the' \
          'languages you are analyzing, replace this with the commands to build' \
          'your code, for example:'
        echo '  make bootstrap'
        echo '  make release'
        exit 1

    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v3
      with:
        category: "/language:${{matrix.language}}"


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns


================================================
FILE: DomainSpecific/.gitignore
================================================
__pycache__/
dependency/models/
env_ready
workspace


================================================
FILE: DomainSpecific/configs/cc_math_filter.CC-MAIN-2023-23.json
================================================
{
    "name": "cc_math_extraction",
    "description": "math extraction from cc parquet file - 202323.",
    "date": "20240513",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "pq_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/pqs.CC-MAIN-2023-23.txt"
        },
        "filtered_pq_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_pqs/math/CC-MAIN-2023-23/paths.{worker_id}.txt"
        }
    },
    
    "output":
    {
        "filtered_pq_name_list_file_path":
        {
            "type": "Mem_Str"
        }
    },
    
    "layer":
    {
        "layer01":
        {
            "type": "From_Line_File",
            "joint": "Default",
            "input": ["pq_name_list_file_path"],
            "output": ["pq_names"]
        },
        "layer01_par":
        {
            "type": "Data_Partition",
            "joint": "Default",
            "input": ["pq_names"],
            "output": ["pq_names"]
        },
        "layer01_sam":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": -1
            },
            "input": ["pq_names"],
            "output": ["pq_names"]
        },
        "layer02":
        {
            "type": "Math_Filter",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_pqs/math/CC-MAIN-2023-23/"
            },
            "input": ["pq_names"],
            "output": ["filtered_pq_names"]
        },
        "layer03":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["filtered_pq_names", "filtered_pq_name_list_file_path"],
            "output": ["filtered_pq_name_list_file_path"]
        }
    }
}


================================================
FILE: DomainSpecific/configs/cc_openquestion_filter.CC-MAIN-2023-23.json
================================================
{
    "name": "cc_openquestion_extraction",
    "description": "open question extraction from cc parquet file - 202323.",
    "date": "20240527",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "pq_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/pqs.CC-MAIN-2023-23.txt"
        },
        "filtered_pq_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_pqs/openquestion/CC-MAIN-2023-23/paths.{worker_id}.txt"
        }
    },
    
    "output":
    {
        "filtered_pq_name_list_file_path":
        {
            "type": "Mem_Str"
        }
    },
    
    "layer":
    {
        "layer01":
        {
            "type": "From_Line_File",
            "joint": "Default",
            "input": ["pq_name_list_file_path"],
            "output": ["pq_names"]
        },
        "layer01_par":
        {
            "type": "Data_Partition",
            "joint": "Default",
            "input": ["pq_names"],
            "output": ["pq_names"]
        },
        "layer01_sam":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": -1
            },
            "input": ["pq_names"],
            "output": ["pq_names"]
        },
        "layer02":
        {
            "type": "OpenQuestion_Filter",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_pqs/raw/CC-MAIN-2023-23/",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_pqs/openquestion/CC-MAIN-2023-23/"
            },
            "input": ["pq_names"],
            "output": ["filtered_pq_names"]
        },
        "layer03":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["filtered_pq_names", "filtered_pq_name_list_file_path"],
            "output": ["filtered_pq_name_list_file_path"]
        }
    }
}


================================================
FILE: DomainSpecific/configs/cc_warc_download.CC-MAIN-2023-23.json
================================================
{
    "name": "cc_warc_download",
    "description": "download warc files for a specific cc snapshot - CC-MAIN-2023-23.",
    "date": "20231011",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "warc_url_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/urls.CC-MAIN-2023-23.txt"
        },
        "success_warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_warcs/CC-MAIN-2023-23/paths.{worker_id}.txt"
        },
        "fail_warc_url_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_warcs/CC-MAIN-2023-23/fail_urls.{worker_id}.txt"
        }
    },
    
    "output":
    {
        "success_warc_name_list_file_path":
        {
            "type": "Mem_Str"
        },
        "fail_warc_url_list_file_path":
        {
            "type": "Mem_Str"
        }
    },
    
    "layer":
    {
        "layer01":
        {
            "type": "From_Line_File",
            "joint": "Default",
            "input": ["warc_url_list_file_path"],
            "output": ["warc_urls"]
        },
        "layer01_par":
        {
            "type": "Data_Partition",
            "joint": "Default",
            "input": ["warc_urls"],
            "output": ["warc_urls"]
        },
        "layer01_sam":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": 1
            },
            "input": ["warc_urls"],
            "output": ["warc_urls"]
        },
        "layer02":
        {
            "type": "Download_Warc_File",
            "joint": "Map",
            "param":
            {
                "DOWNLOAD_FOLDER": "{workspace_dir}/cc_warcs/CC-MAIN-2023-23",
                "CONNECTS": 16,
                "TRIES": 3
            },
            "input": ["warc_urls"],
            "output": ["success_warc_names", "fail_warc_urls"]
        },
        "layer03":
        {
            "type": "Data_Filter",
            "param":
            {
                "FILTERS": [null]
            },
            "input": ["success_warc_names"],
            "output": ["success_warc_names"]
        },
        "layer04":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["success_warc_names", "success_warc_name_list_file_path"],
            "output": ["success_warc_name_list_file_path"]
        },
        "layer05":
        {
            "type": "Data_Filter",
            "param":
            {
                "FILTERS": [null]
            },
            "input": ["fail_warc_urls"],
            "output": ["fail_warc_urls"]
        },
        "layer06":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["fail_warc_urls", "fail_warc_url_list_file_path"],
            "output": ["fail_warc_url_list_file_path"]
        }
    }
}


================================================
FILE: DomainSpecific/configs/cc_warc_filter.CC-MAIN-2023-23.json
================================================
{
    "name": "cc_warc_filter",
    "description": "filter html containing specific tags on warc files - CC-MAIN-2023-23.",
    "date": "20230825",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_warcs/CC-MAIN-2023-23/paths.txt"
        },
        "filtered_warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/paths.{worker_id}.txt"
        }
    },
    
    "output":
    {
        "filtered_warc_name_list_file_path":
        {
            "type": "Mem_Str"
        }
    },
    
    "layer":
    {
        "layer01":
        {
            "type": "From_Line_File",
            "joint": "Default",
            "input": ["warc_name_list_file_path"],
            "output": ["warc_names"]
        },
        "layer01_par":
        {
            "type": "Data_Partition",
            "joint": "Default",
            "input": ["warc_names"],
            "output": ["warc_names"]
        },
        "layer01_sam":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": -1
            },
            "input": ["warc_names"],
            "output": ["warc_names"]
        },
        "layer02":
        {
            "type": "Warc_Filter",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_warcs/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/",
                "TAGS": ["<math", "<annotation", "=\"math", "athjax", "math-container", "class=\"tex\"", "tex.cgi", "latex.php", "katex.min.css", "\\frac", "codecogs", "<code", "<pre"]
            },
            "input": ["warc_names"],
            "output": ["filtered_warc_names"]
        },
        "layer03":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["filtered_warc_names", "filtered_warc_name_list_file_path"],
            "output": ["filtered_warc_name_list_file_path"]
        }
    }
}


================================================
FILE: DomainSpecific/configs/cc_warc_to_wet.code.CC-MAIN-2023-23.json
================================================
{
    "name": "cc_warc_to_wet",
    "description": "convert cc warc to wet and keep code blocks - CC-MAIN-2023-23.",
    "date": "20230825",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "filter_warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/paths.txt"
        },
        "encode_warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_wets/encode_warc_code/CC-MAIN-2023-23/paths.{worker_id}.txt"
        },
        "filter_wet_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_wets/filter_wet_code/CC-MAIN-2023-23/paths.{worker_id}.txt"
        },
        "decode_wet_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_wets/decode_wet_code/CC-MAIN-2023-23/paths.{worker_id}.txt"
        }
    },
    
    "output":
    {
        "decode_wet_name_list_file_path":
        {
            "type": "Mem_Str"
        }
    },
    
    "layer":
    {
        "layer01":
        {
            "type": "From_Line_File",
            "joint": "Default",
            "input": ["filter_warc_name_list_file_path"],
            "output": ["filter_warc_names"]
        },
        "layer01_par":
        {
            "type": "Data_Partition",
            "joint": "Default",
            "input": ["filter_warc_names"],
            "output": ["filter_warc_names"]
        },
        "layer01_sam":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": -1
            },
            "input": ["filter_warc_names"],
            "output": ["filter_warc_names"]
        },
        "layer02":
        {
            "type": "Warc_Encode",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_wets/encode_warc_code/CC-MAIN-2023-23",
                "TAG": "code"
            },
            "input": ["filter_warc_names"],
            "output": ["encode_warc_names"]
        },
        "layer02_out":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["encode_warc_names", "encode_warc_name_list_file_path"],
            "output": ["encode_warc_name_list_file_path"]
        },
        "layer03":
        {
            "type": "Warc_To_Wet",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_wets/encode_warc_code/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_wets/filter_wet_code/CC-MAIN-2023-23"
            },
            "input": ["encode_warc_names"],
            "output": ["filter_wet_names"]
        },
        "layer03_out":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["filter_wet_names", "filter_wet_name_list_file_path"],
            "output": ["filter_wet_name_list_file_path"]
        },
        "layer04":
        {
            "type": "Wet_Decode",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_wets/filter_wet_code/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_wets/decode_wet_code/CC-MAIN-2023-23",
                "TAG": "code"
            },
            "input": ["filter_wet_names"],
            "output": ["decode_wet_names"]
        },
        "layer04_out":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["decode_wet_names", "decode_wet_name_list_file_path"],
            "output": ["decode_wet_name_list_file_path"]
        }
    }
}


================================================
FILE: DomainSpecific/configs/cc_warc_to_wet.math.CC-MAIN-2023-23.json
================================================
{
    "name": "cc_warc_to_wet",
    "description": "convert cc warc to wet and keep math formula - CC-MAIN-2023-23.",
    "date": "20230825",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "filter_warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23/paths.txt"
        },
        "encode_warc_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_wets/encode_warc_math/CC-MAIN-2023-23/paths.{worker_id}.txt"
        },
        "filter_wet_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_wets/filter_wet_math/CC-MAIN-2023-23/paths.{worker_id}.txt"
        },
        "decode_wet_name_list_file_path":
        {
            "type": "Mem_Str",
            "value": "{workspace_dir}/cc_wets/decode_wet_math/CC-MAIN-2023-23/paths.{worker_id}.txt"
        }
    },
    
    "output":
    {
        "decode_wet_name_list_file_path":
        {
            "type": "Mem_Str"
        }
    },
    
    "layer":
    {
        "layer01":
        {
            "type": "From_Line_File",
            "joint": "Default",
            "input": ["filter_warc_name_list_file_path"],
            "output": ["filter_warc_names"]
        },
        "layer01_par":
        {
            "type": "Data_Partition",
            "joint": "Default",
            "input": ["filter_warc_names"],
            "output": ["filter_warc_names"]
        },
        "layer01_sam":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": -1
            },
            "input": ["filter_warc_names"],
            "output": ["filter_warc_names"]
        },
        "layer02":
        {
            "type": "Warc_Encode",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_filtered_warc/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_wets/encode_warc_math/CC-MAIN-2023-23",
                "TAG": "math"
            },
            "input": ["filter_warc_names"],
            "output": ["encode_warc_names"]
        },
        "layer02_out":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["encode_warc_names", "encode_warc_name_list_file_path"],
            "output": ["encode_warc_name_list_file_path"]
        },
        "layer03":
        {
            "type": "Warc_To_Wet",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_wets/encode_warc_math/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_wets/filter_wet_math/CC-MAIN-2023-23"
            },
            "input": ["encode_warc_names"],
            "output": ["filter_wet_names"]
        },
        "layer03_out":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["filter_wet_names", "filter_wet_name_list_file_path"],
            "output": ["filter_wet_name_list_file_path"]
        },
        "layer04":
        {
            "type": "Wet_Decode",
            "joint": "FlatMap",
            "param":
            {
                "INPUT_FOLDER": "{workspace_dir}/cc_wets/filter_wet_math/CC-MAIN-2023-23",
                "OUTPUT_FOLDER": "{workspace_dir}/cc_wets/decode_wet_math/CC-MAIN-2023-23",
                "TAG": "math"
            },
            "input": ["filter_wet_names"],
            "output": ["decode_wet_names"]
        },
        "layer04_out":
        {
            "type": "To_Line_File",
            "joint": "Default",
            "input": ["decode_wet_names", "decode_wet_name_list_file_path"],
            "output": ["decode_wet_name_list_file_path"]
        }
    }
}


================================================
FILE: DomainSpecific/configs/network_template.json
================================================
{
    "name": "template_network",
    "description": "Toy example of network.",
    "date": "20230713",
    "version": "1.0.0",
    "author": "yanghuan",
    "backend": "Native",
    
    "input":
    {
        "data1":
        {
            "type": "Mem_StrList",
            "value": ["1", "2", "3", "4", "5"]
        }
    },
    
    "output":
    {
        "data2":
        {
            "type": "Mem_StrList"
        }
    },
    
    "layer":
    {
        "layer1":
        {
            "type": "Data_Sample",
            "joint": "Default",
            "param":
            {
                "N": 2
            },
            "input": ["data1"],
            "output": ["data2"]
        }
    }
}


================================================
FILE: DomainSpecific/core/__init__.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
from .data import DataType
from .layer import Layer, JointType
from .layers import LayerType, LayerType2Func
from .network import Network

__all__ = ["DataType", "Layer", "JointType", "LayerType", "LayerType2Func", "Network"]


================================================
FILE: DomainSpecific/core/data.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
from enum import Enum

class DataType(Enum):
    # Memory Data
    Mem_Any          = 0
    Mem_Binary       = 1
    Mem_Int          = 2
    Mem_Float        = 3
    Mem_Str          = 4
    Mem_Warc         = 5
    Mem_Dict         = 6
    Mem_Index        = 7
    Mem_Vector       = 8
    Mem_Record       = 9
    Mem_List         = 10
    Mem_BinaryList   = 11
    Mem_IntList      = 12
    Mem_FloatList    = 13
    Mem_StrList      = 14
    Mem_WarcList     = 15
    Mem_DictList     = 16
    Mem_IndexList    = 17
    Mem_VectorList   = 18
    Mem_RecordList   = 19

    # Disk Data (Deprecated)
    File_Any         = 100
    File_Binary      = 101
    File_Text        = 102
    File_Warc        = 103
    File_Parquet     = 104
    File_Json        = 105
    File_Index       = 106
    File_Vector      = 107
    File_AnyLines    = 110
    File_TextLines   = 111
    File_JsonLines   = 112
    File_VectorLines = 113

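    @staticmethod
    def belong(a, b):
        """
        Return True if type `a` is `b` or falls under the broader type `b`.

        Values that are multiples of 10 act as group heads covering the next nine
        values (e.g. Mem_List covers Mem_BinaryList through Mem_RecordList);
        Mem_Any covers every value below 100 and File_Any every value from 100 up.
        """
        if not isinstance(a, DataType) or not isinstance(b, DataType):
            return False
        return a == b or \
               (b.value % 10 == 0 and 0 <= a.value - b.value < 10) or \
               (b == DataType.Mem_Any and a.value < 100) or \
               (b == DataType.File_Any and a.value >= 100)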

class Data:
    """
    Data class (Deprecated).
    """
    def __init__(self, type=DataType.Mem_Any, value=None):
        self.type = type if isinstance(type, DataType) else DataType[type]
        self.value = value


if __name__ == "__main__":
    data = Data()
    print(data)
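

# Illustrative sketch (not part of the original module): the membership rule used by
# DataType.belong, restated on plain integers so the numeric encoding is easier to
# follow. The helper name `belongs_by_value` is hypothetical; the sample values in
# the usage notes (4 = Mem_Str, 10 = Mem_List, 14 = Mem_StrList) are taken from the
# enum defined above.
def belongs_by_value(a_value, b_value):
    """Hypothetical helper: True if value `a_value` falls under `b_value`."""
    return (
        a_value == b_value
        # A multiple of 10 is a group head covering the next nine values.
        or (b_value % 10 == 0 and 0 <= a_value - b_value < 10)
        # 0 (Mem_Any) covers every in-memory type, i.e. every value below 100.
        or (b_value == 0 and a_value < 100)
        # 100 (File_Any) covers every file type, i.e. every value from 100 up.
        or (b_value == 100 and a_value >= 100)
    )

# Usage notes:
#   belongs_by_value(14, 10) is True  (Mem_StrList falls under Mem_List)
#   belongs_by_value(4, 10) is False  (Mem_Str is not a list type)
#   belongs_by_value(104, 0) is False (a file type never falls under Mem_Any)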


================================================
FILE: DomainSpecific/core/layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
from enum import Enum
from tqdm import tqdm
from core.layers import LayerType, LayerType2Func

class JointType(Enum):
    Default = 0 # Only process data as whole (frequently used in data IO and control layers).
    Map     = 1 # First split the data list into units, then process each unit into any type, and finally return the list of processed units.
    FlatMap = 2 # First split the data list into units, then process each unit into a list, and finally return the concatenation of those lists.

class Layer:
    def __init__(self, type, joint=JointType.Default, repetition=1, param=None, input_names=None, output_names=None):
        self.type = type if isinstance(type, LayerType) else LayerType[type]
        self.func, self.input_types, self.output_types, self.enabled = LayerType2Func[self.type]
        self.joint = joint if isinstance(joint, JointType) else JointType[joint]
        self.repetition = repetition
        # Build fresh containers per instance instead of sharing mutable default arguments.
        self.param = dict() if param is None else param
        self.input_names = list() if input_names is None else input_names
        self.output_names = list() if output_names is None else output_names

    def __call__(self, inputs, worker_id=0, worker_num=1, variables=None):
        outputs = list()
        try:
            # Use a fresh dict per call so state never leaks through a shared default argument.
            if variables is None:
                variables = dict()
            variables["worker_id"] = worker_id
            variables["worker_num"] = worker_num

            if not isinstance(inputs, list):
                raise Exception("The inputs of a layer should be a list.")
            if len(inputs) != len(self.input_types):
                raise Exception(f"The number of inputs is not {len(self.input_types)}.")
            for i, (data, input_type) in enumerate(zip(inputs, self.input_types)):
                # TODO: add the check of input type.
                # check the data type of input.
                #if data.type != DataType[input_type]:
                #    raise Exception(f"The {i}th data, whose type is {data.type.name}, does not match the input type {input_type}")
                # Condition of empty input.
                if data is None:
                    outputs = [None for _ in self.output_types]
                    return outputs

            # TODO: to address the situation of repetition > 1.
            for i in range(self.repetition):
                if self.joint == JointType.Default:
                    values = list(self.func(*inputs, variables, **self.param))
                else:
                    n = min([len(data) for data in inputs])
                    if n != max([len(data) for data in inputs]):
                        raise Exception("The inputs do not all contain the same number of elements.")

                    values = [[] for _ in self.output_types]
                    # Use a distinct index name so the repetition counter above is not shadowed.
                    for j in tqdm(range(n), desc=f"Layer: {self.type.name}, worker_id: {worker_id}/{worker_num}"):
                        _values = self.func(*[data[j] for data in inputs], variables, **self.param)
                        for value, _value in zip(values, _values):
                            if _value is None:
                                continue
                            if self.joint == JointType.Map:
                                value.append(_value)
                            elif self.joint == JointType.FlatMap:
                                if not isinstance(_value, list):
                                    raise Exception("Each output of a FlatMap layer should be a list.")
                                value.extend(_value)
                            else:
                                raise Exception(f"Using unsupported joint type for {self.type.name} layer.")

                outputs = values
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
        return outputs
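
# Illustrative sketch (not part of the original module): how the Map and FlatMap
# joints in Layer.__call__ combine per-element results. `_join_outputs` is a
# hypothetical stand-alone helper that assumes, as the loop above does, that
# `func` returns one value per output slot and that None values are skipped.
def _join_outputs(func, items, output_num=1, flat=False):
    values = [[] for _ in range(output_num)]
    for item in items:
        for value, _value in zip(values, func(item)):
            if _value is None:
                continue
            if flat:
                # FlatMap: each per-element result must be a list and is spliced in.
                value.extend(_value)
            else:
                # Map: each per-element result is kept as a single entry.
                value.append(_value)
    return values

# Usage, assuming a toy function that splits a string into words:
#     _join_outputs(lambda s: (s.split(),), ["a b", "c"])             # Map
#     _join_outputs(lambda s: (s.split(),), ["a b", "c"], flat=True)  # FlatMap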


if __name__ == "__main__":
    inputs = [["a", "b", "c", "d", "e"]]
    layer = Layer(LayerType.Data_Sample, param={"N": 2})
    outputs = layer(inputs)
    print(outputs)


================================================
FILE: DomainSpecific/core/layers/__init__.py
================================================
from enum import Enum
from ..data import DataType

from .template_layer import template_layer

# Control layers
from .control import *

# Network (download/upload) layers
from .network import *

# IO (read/write) layers
from .io import *

# Extract layers
from .extract import *

# Transform layers
from .transform import *

class LayerType(Enum):
    Template                     = 0

    # Control
    Data_Sample                  = 1
    Data_Concat                  = 2
    Data_Order                   = 3
    Data_Partition               = 4
    Data_Filter                  = 5
    Data_Shuffle                 = 6

    # Network - download/upload
    Upload_File_To_Blob          = 101
    Upload_Bytes_To_Blob         = 102
    Download_File_From_Blob      = 103
    Download_Bytes_From_Blob     = 104
    Download_File_From_Internet  = 105
    Download_Bytes_From_Internet = 106
    Download_Url_List            = 107
    Download_Warc_Indice         = 108
    Download_Warc_File           = 109
    Download_Urls_From_Website   = 110
    Download_Image_From_Jsonl    = 111
    Download_StarCoder           = 112

    # IO - read/write
    To_Binary_File               = 201
    To_Line_File                 = 202
    To_Jsonl_File                = 203
    To_Parquet_File              = 204
    To_Index_File                = 205
    To_Warc_File                 = 206
    From_Binary_File             = 207
    From_Line_File               = 208
    From_Jsonl_File              = 209
    From_Parquet_File            = 210
    From_Index_File              = 211
    From_Wet_File                = 212
    From_Warc_File               = 213

    # Extract
    Extract_Article              = 301
    Build_Index                  = 302
    Search_Index                 = 303
    
    # Transform
    Tokenize_Article             = 401
    Ngrams                       = 402
    Minhash_Tokens               = 403
    LSH_Minhash                  = 404
    Warc_Filter                  = 405
    Warc_Encode                  = 406
    Warc_To_Wet                  = 407
    Wet_Decode                   = 408
    Text_Embedding               = 409
    Sentence_Embedding           = 410
    Sentence_Filter              = 411
    Code_Generation              = 412
    Url_To_Record                = 413
    Extract_Link_From_Warc       = 414
    Wet_To_Imageinfos            = 415
    Warc_To_Screenshot_MD        = 416
    MCQ_Filter                   = 417
    OpenQuestion_Filter          = 418
    Convert_PDF                  = 419
    Extract_HTML                 = 420
    MD_Filter                    = 421
    Cascaded_Filter              = 422
    Math_Filter                  = 423


LayerType2Func = \
{
    LayerType.Template                     : (template_layer, [DataType.Mem_Any], [DataType.Mem_Any], True),

    # Control
    LayerType.Data_Sample                  : (data_sample_layer, [DataType.Mem_List], [DataType.Mem_List], True),
    LayerType.Data_Concat                  : (data_concat_layer, [DataType.Mem_List], [DataType.Mem_List], True),
    LayerType.Data_Order                   : (data_order_layer, [DataType.Mem_List], [DataType.Mem_List], True),
    LayerType.Data_Filter                  : (data_filter_layer, [DataType.Mem_List], [DataType.Mem_List], True),
    LayerType.Data_Partition               : (data_partition_layer, [DataType.Mem_List], [DataType.Mem_List], True),
    LayerType.Data_Shuffle                 : (data_shuffle_layer, [DataType.Mem_List], [DataType.Mem_List], True),

    # Network - download/upload
    LayerType.Upload_File_To_Blob          : (upload_file_to_blob_layer, [DataType.Mem_Str, DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),
    LayerType.Upload_Bytes_To_Blob         : (upload_bytes_to_blob_layer, [DataType.Mem_Binary, DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),
    LayerType.Download_File_From_Blob      : (download_file_from_blob_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),
    LayerType.Download_Bytes_From_Blob     : (download_bytes_from_blob_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Binary, DataType.Mem_Str], True),
    LayerType.Download_File_From_Internet  : (download_file_from_internet_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),
    LayerType.Download_Bytes_From_Internet : (download_bytes_from_internet_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Binary, DataType.Mem_Str], True),
    LayerType.Download_Url_List            : (download_url_list_layer, [DataType.Mem_Str], [DataType.Mem_StrList, DataType.Mem_StrList], True),
    LayerType.Download_Warc_File           : (download_warc_file_layer, [DataType.Mem_Str], [DataType.Mem_Str, DataType.Mem_Str], True),
    LayerType.Download_Warc_Indice         : (download_warc_indice_layer, [DataType.Mem_Str], [DataType.Mem_StrList, DataType.Mem_StrList], True),
    LayerType.Download_Urls_From_Website   : (download_urls_from_website_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.Download_StarCoder           : (download_starcoder_layer, [DataType.Mem_Str], [DataType.Mem_Int], True),

    # IO - read/write
    LayerType.To_Binary_File               : (to_binary_file_layer, [DataType.Mem_Binary, DataType.Mem_Str], [DataType.Mem_Str], True),
    LayerType.To_Line_File                 : (to_line_file_layer, [DataType.Mem_StrList, DataType.Mem_Str], [DataType.Mem_Str], True),
    LayerType.To_Jsonl_File                : (to_jsonl_file_layer, [DataType.Mem_DictList, DataType.Mem_Str], [DataType.Mem_Str], True),
    LayerType.To_Parquet_File              : (to_parquet_file_layer, [DataType.Mem_DictList, DataType.Mem_Str], [DataType.Mem_Str], True),
    LayerType.To_Index_File                : (to_index_file_layer, [DataType.Mem_Index, DataType.Mem_Str], [DataType.Mem_Str], True),
    LayerType.From_Binary_File             : (from_binary_file_layer, [DataType.Mem_Str], [DataType.Mem_Binary], True),
    LayerType.From_Line_File               : (from_line_file_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.From_Jsonl_File              : (from_jsonl_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),
    LayerType.From_Parquet_File            : (from_parquet_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),
    LayerType.From_Index_File              : (from_index_file_layer, [DataType.Mem_Str], [DataType.Mem_Index], True),
    LayerType.From_Wet_File                : (from_wet_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),
    LayerType.From_Warc_File               : (from_warc_file_layer, [DataType.Mem_Str], [DataType.Mem_DictList], True),

    # Extract
    LayerType.Extract_Article              : (extract_article_layer, [DataType.Mem_Warc], [DataType.Mem_Dict], True),
    LayerType.Build_Index                  : (build_index_layer, [DataType.Mem_VectorList], [DataType.Mem_Index], True),
    LayerType.Search_Index                 : (search_index_layer, [DataType.Mem_Index, DataType.Mem_VectorList], [DataType.Mem_VectorList, DataType.Mem_VectorList], True),
    
    # Transform
    LayerType.Tokenize_Article             : (tokenize_article_layer, [DataType.Mem_Dict], [DataType.Mem_StrList], True),
    LayerType.Ngrams                       : (ngrams_layer, [DataType.Mem_StrList], [DataType.Mem_StrList], True),
    LayerType.Minhash_Tokens               : (minhash_tokens_layer, [DataType.Mem_StrList], [DataType.Mem_StrList], True),
    LayerType.LSH_Minhash                  : (lsh_minhash_layer, [DataType.Mem_StrList], [DataType.Mem_StrList], True),
    LayerType.Warc_Filter                  : (warc_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.Warc_Encode                  : (warc_encode_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.Warc_To_Wet                  : (warc_to_wet_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.Wet_Decode                   : (wet_decode_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.Math_Filter                  : (math_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.OpenQuestion_Filter          : (openquestion_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
    LayerType.MCQ_Filter                   : (mcq_filter_layer, [DataType.Mem_Str], [DataType.Mem_StrList], True),
}


__all__ = [
    "LayerType", 
    "LayerType2Func", 
    "template_layer", 
    "data_sample_layer", 
    "data_concat_layer", 
    "data_order_layer", 
    "data_partition_layer", 
    "data_filter_layer", 
    "data_shuffle_layer", 
    "upload_file_to_blob_layer", 
    "upload_bytes_to_blob_layer", 
    "download_file_from_blob_layer", 
    "download_bytes_from_blob_layer", 
    "download_file_from_internet_layer", 
    "download_bytes_from_internet_layer", 
    "download_url_list_layer", 
    "download_warc_file_layer", 
    "download_warc_indice_layer", 
    "download_urls_from_website_layer", 
    "download_starcoder_layer", 
    "to_binary_file_layer", 
    "to_line_file_layer", 
    "to_jsonl_file_layer", 
    "to_parquet_file_layer", 
    "to_index_file_layer", 
    "from_binary_file_layer", 
    "from_line_file_layer", 
    "from_jsonl_file_layer", 
    "from_parquet_file_layer", 
    "from_index_file_layer", 
    "from_wet_file_layer", 
    "from_warc_file_layer", 
    "extract_article_layer", 
    "build_index_layer", 
    "search_index_layer", 
    "tokenize_article_layer", 
    "ngrams_layer", 
    "minhash_tokens_layer", 
    "lsh_minhash_layer", 
    "warc_filter_layer", 
    "warc_encode_layer", 
    "warc_to_wet_layer", 
    "wet_decode_layer", 
    "math_filter_layer", 
    "openquestion_filter_layer", 
    "mcq_filter_layer", 
]


================================================
FILE: DomainSpecific/core/layers/control/__init__.py
================================================
# Control
from .data_sample_layer import data_sample_layer
from .data_filter_layer import data_filter_layer
from .data_order_layer import data_order_layer
from .data_partition_layer import data_partition_layer
from .data_shuffle_layer import data_shuffle_layer
from .data_concat_layer import data_concat_layer

__all__ = [
    "data_sample_layer", 
    "data_filter_layer",
    "data_order_layer",
    "data_partition_layer",
    "data_shuffle_layer", 
    "data_concat_layer", 
]


================================================
FILE: DomainSpecific/core/layers/control/data_concat_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback

def data_concat_layer(lists, variables=dict()):
    ret = list()
    try:
        # Append the lists in order, skipping None placeholders.
        for a_list in lists:
            if a_list is not None:
                ret.extend(a_list)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    lists = [["a"], ["b", "c"], None, ["d", "e", "f"]]
    lines = data_concat_layer(lists)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/control/data_filter_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback

def data_filter_layer(lines, variables=dict(), IN=False, FILTERS=(None,)):
    ret = list()
    try:
        ret = list(filter(lambda line: line in FILTERS if IN else line not in FILTERS, lines))
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    lines = ["a", None, "b"]
    FILTERS = (None,)
    lines = data_filter_layer(lines, FILTERS=FILTERS)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/control/data_order_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback

def data_order_layer(lines, variables=dict(), REVERSE=False):
    ret = list()
    try:
        ret = sorted(lines, reverse=REVERSE)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    lines = [1, 3, 2]
    lines = data_order_layer(lines)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/control/data_partition_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback

def data_partition_layer(lines, variables=dict(), WORKER_ID=-1):
    ret = list()
    try:
        worker_id = variables.get("worker_id", 0)
        worker_num = variables.get("worker_num", 1)
        n = len(lines)
        if WORKER_ID == -1:
            ret = [lines[i] for i in range(worker_id, n, worker_num)]
        else:
            ret = lines if WORKER_ID == worker_id else list()
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)
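
# Illustrative sketch (not part of the original module): the round-robin split
# used when WORKER_ID == -1. `_round_robin` is a hypothetical helper; it assumes
# worker ids run from 0 to worker_num - 1, in which case every element is
# assigned to exactly one worker.
def _round_robin(lines, worker_id, worker_num):
    return [lines[i] for i in range(worker_id, len(lines), worker_num)]

# Usage: with nine elements and two workers, worker 0 takes the even positions
# and worker 1 the odd ones.
#     _round_robin(list(range(9)), 0, 2)
#     _round_robin(list(range(9)), 1, 2)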


if __name__ == "__main__":
    lines = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    variables = {"worker_id": 0, "worker_num": 2}
    lines = data_partition_layer(lines, variables=variables)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/control/data_sample_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import random
import traceback

def data_sample_layer(lines, variables=dict(), N=-1, SEED=1):
    ret = list()
    try:
        random.seed(SEED)
        N = min(N, len(lines))
        if N >= 0:
            ret = random.sample(lines, N)
        else:
            ret = lines
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)
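
# Illustrative sketch (not part of the original module): seeded sampling with a
# private random.Random instance, so the module-level RNG state that other code
# may rely on is left untouched. `_sample_with_local_rng` is a hypothetical
# helper and is not presented as a drop-in replacement for data_sample_layer.
import random

def _sample_with_local_rng(lines, n=-1, seed=1):
    # A negative n means "keep everything", matching the N=-1 default above.
    if n < 0:
        return list(lines)
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))

# Usage: repeated calls with the same seed return the same sample.
#     _sample_with_local_rng(["a", "b", "c", "d"], n=2, seed=1)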


if __name__ == "__main__":
    lines = ["a", "b"]
    N = 1
    lines = data_sample_layer(lines, N=N)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/control/data_shuffle_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import random
import traceback

def data_shuffle_layer(lines, variables=dict(), SEED=1):
    ret = list()
    try:
        random.seed(SEED)
        random.shuffle(lines)
        ret = lines
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    lines = ["a", "b"]
    lines = data_shuffle_layer(lines)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/extract/__init__.py
================================================
# Extract
from .extract_article_layer import extract_article_layer
from .build_index_layer import build_index_layer
from .search_index_layer import search_index_layer

__all__ = [
    "extract_article_layer", 
    "build_index_layer", 
    "search_index_layer", 
]


================================================
FILE: DomainSpecific/core/layers/extract/build_index_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import sys
import faiss
import numpy as np
import traceback

def build_index_layer(base_vectors, variables=dict(), SEED=1, DIM=4096, CLUSTERS=100):
    ret = None
    try:
        np.random.seed(SEED)

        quantizer = faiss.IndexFlatL2(DIM)
        index = faiss.IndexIVFFlat(quantizer, DIM, CLUSTERS, faiss.METRIC_L2)

        assert not index.is_trained
        base_vectors = np.array(base_vectors)
        index.train(base_vectors)
        assert index.is_trained

        index.add(base_vectors)
        ret = index
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == '__main__':
    D = 64
    base_vectors = np.random.random((100000, D)).astype('float32')
    base_vectors[:, 0] += np.arange(100000) / 1000.
    index = build_index_layer(base_vectors, DIM=D)
    print(index)


================================================
FILE: DomainSpecific/core/layers/extract/extract_article_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import re
import fasttext
import traceback
from unittest.mock import patch
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter, chomp
from newspaper import Article
import global_var

def filter_tags_in_html(soup):
    def del_tags(soup):
        # Drop tags that carry no article text.
        for name in ['style', 'script', 'img']:
            for tag in soup.find_all(name):
                tag.decompose()

        # Drop only the tables that contain no text.
        for table in soup.find_all('table'):
            if len(table.text.strip()) == 0:
                table.decompose()

    def modify_text(soup):
        modify_tags = ['a']
        for i in range(len(modify_tags)):
            for tag in soup.find_all(modify_tags[i]):
                tag_text = tag.text
                new_tag_text = tag_text.replace('\n', '')
                if len(new_tag_text) != len(tag_text):
                    tag.string = new_tag_text
    del_tags(soup)
    modify_text(soup)

    return soup

def lid(soup, model):
    LID_WIN_SIZE=256
    text = ''.join(soup.text.split())
    span_start, span_end = 0, len(text)
    if len(text) > LID_WIN_SIZE:
        mid = len(text) // 2
        mid_win = LID_WIN_SIZE // 2
        span_start = max(0, int(mid - mid_win))
        span_end = min(len(text), int(mid + mid_win))

    det_text = text[span_start: span_end]
    res = model.predict(det_text)
    la = res[0][0].replace("__label__", "")
    prob = float(res[1][0])
    return la, prob
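
# Illustrative sketch (not part of the original module): the text window that
# lid() hands to the language-identification model. `_center_window` is a
# hypothetical helper that mirrors the span arithmetic above: whitespace is
# removed, and texts longer than `win_size` are cut to the `win_size`
# characters around the midpoint.
def _center_window(text, win_size=256):
    text = ''.join(text.split())
    if len(text) <= win_size:
        return text
    mid = len(text) // 2
    half = win_size // 2
    return text[max(0, mid - half): min(len(text), mid + half)]

# Usage: short inputs pass through with whitespace removed, long inputs are trimmed.
#     _center_window("a b\nc")
#     _center_window("0123456789", win_size=4)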

def get_main_text_html(soup):
    article = Article("padding_url", fetch_images=False, keep_article_html=True)
    article.download(input_html=str(soup))
    article.parse()
    # assert len(article.text.strip()) >= 128
    main_html = article.article_html
    main_text = article.text
    return main_html, main_text

def remove_dup_newline(text):
    fields = text.split('\n')
    for i in range(len(fields)):
        fields[i] = fields[i].strip()
    return re.sub('\n{2,}', '\n\n', '\n'.join(fields)).strip()

class User_MarkdownConverter(MarkdownConverter):
    def convert_tr(self, el, text, convert_as_inline):
        cells = el.find_all(['td', 'th'])
        is_headrow = all([cell.name == 'th' for cell in cells])
        overline = ''
        underline = ''
        if is_headrow and not el.previous_sibling:
            # first row and is headline: print headline underline
            underline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
        elif (not el.previous_sibling
            and (el.parent.name == 'table'
                or (el.parent.name == 'tbody'
                    and not el.parent.previous_sibling))):
            # first row, not headline, and:
            # - the parent is table or
            # - the parent is tbody at the beginning of a table.
            # print empty headline above this row
            overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
            overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
        if len(text.replace('|', ' ').strip()) == 0:
            return overline + underline
        else:
            return overline + '|' + text.replace('\n', ' ') + '\n' + underline

    def convert_a(self, el, text, convert_as_inline):
        prefix, suffix, text = chomp(text)
        if not text:
            return ''
        href = el.get('href')
        title = el.get('title')
        # For the replacement see #29: text nodes underscores are escaped
        if (self.options['autolinks']
                and text.replace(r'\_', '_') == href
                and not title
                and not self.options['default_title']):
            # Shortcut syntax
            return '<%s>' % href
        if self.options['default_title'] and not title:
            title = href
        title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
        # return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text
        return '%s %s %s' % (prefix, text.replace('\n', ' '), suffix) if href else text

    def convert_pre(self, el, text, convert_as_inline):
        if not text:
            return ''
        code_language = self.options['code_language']

        if self.options['code_language_callback']:
            code_language = self.options['code_language_callback'](el) or code_language

        return '\n```%s\n%s\n```\n' % (code_language, text)

def html2text(soup, **options):
    def clean_markdown(md):
        fields = md.split('\n')
        for i in range(len(fields)):
            fields[i] = fields[i].strip()

        new_fields = []
        for i in range(len(fields)):
            field_set = list(set(fields[i]))
            if len(field_set) == 1 and field_set[0] in ['#', '*', '+', '-']:
                continue
            new_fields.append(fields[i])

        fields = new_fields
        md = '\n'.join(fields)

        return re.sub('\n{2,}', '\n\n', md).strip()

    return clean_markdown(User_MarkdownConverter(**options).convert_soup(soup))

def trans2md(html):
    soup = BeautifulSoup(html, 'html5lib')
    markdown_text = html2text(soup)
    # assert len(markdown_text) > 50 and len(markdown_text.split('\n')) != 1
    if markdown_text.startswith('.') and markdown_text.endswith('.'):
        markdown_text = markdown_text[1:-1]
    main_text = remove_dup_newline(soup.text)
    return markdown_text, main_text

@classmethod
def _patch_newspaper_parser_clean(cls, node):
    return node

@patch('newspaper.parsers.Parser.clean_article_html', new=_patch_newspaper_parser_clean)
def extract(soup):
    main_html, main_text = get_main_text_html(soup)
    markdown_text, _new_main_text = trans2md(main_html)
    return markdown_text, main_text

def extract_article_layer(id_html, variables=dict()):
    ret = None
    try:
        LA_TIER1 = ["en", "es", "ja", "fr", "de", "pt", "it", "zh"]
        LA_TIER2 = ["nl", "sv", "da", "fi", "ru", "no", "ko", "zh", "pl", "tr", "ar", "he", "pt", "cs", "hu", "th", "hi"]
        LA_TIER = LA_TIER1 + LA_TIER2
        article_id, html = id_html
        
        soup = BeautifulSoup(html, 'html5lib')
        soup = filter_tags_in_html(soup)
        la, la_prob = lid(soup, global_var.lid_model)
        if la in LA_TIER:
            main_md, main_text = extract(soup)
            if len(main_text) >= 128:
                ret = {"id": article_id, "text": main_text, "lang": la, "lang_prob": la_prob}
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == '__main__':
    id_html = (None, None)
    id_text_la = extract_article_layer(id_html)
    print(id_text_la)


================================================
FILE: DomainSpecific/core/layers/extract/search_index_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
import faiss
import numpy as np
import traceback

def search_index_layer(index, query_vectors, variables=dict(), TOPK=1):
    ret = (None, None)
    try:
        query_vectors = np.array(query_vectors)
        D, I = index.search(query_vectors, TOPK)
        ret = (I, D)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    DIM = 4096
    CLUSTERS = 2
    base_vectors = np.random.random((100000, DIM)).astype('float32')
    base_vectors[:, 0] += np.arange(100000) / 1000.
    
    quantizer = faiss.IndexFlatL2(DIM)
    index = faiss.IndexIVFFlat(quantizer, DIM, CLUSTERS, faiss.METRIC_L2)

    assert not index.is_trained
    index.train(base_vectors)
    assert index.is_trained
    index.add(base_vectors)

    query_vectors = np.random.random((10000, DIM)).astype('float32')
    query_vectors[:, 0] += np.arange(10000) / 1000.

    I, D = search_index_layer(index, query_vectors, TOPK=1)
    print(D[:1])


================================================
FILE: DomainSpecific/core/layers/global_var.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
import traceback
#import torch
import fasttext
from transformers import AutoTokenizer, RobertaForSequenceClassification
from dependency.gpt_api import GPTAPI

try:
    # silences warnings as the package does not properly use the python 'warnings' package
    # see https://github.com/facebookresearch/fastText/issues/1056
    fasttext.FastText.eprint = lambda *args,**kwargs: None
except Exception:
    pass

"""
class OpenQuestionModel:
    def __init__(self, pretrained_model_path, token_model_path="cardiffnlp/twitter-roberta-base-emotion", local_files_only=False):
        # load tokenizer model.
        self.tokenizer = AutoTokenizer.from_pretrained(token_model_path)

        # load trained model.
        self.model = RobertaForSequenceClassification.from_pretrained(pretrained_model_path, local_files_only=local_files_only)

    def run(self, text, thred=0.5, max_length=512):
        # tokenization.
        inputs = self.tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=max_length)

        # inference.
        with torch.no_grad():
            logits = self.model(**inputs).logits
        logits = logits.softmax(dim=1)[0]
        predicted_idx = logits.argmax().item()
        predicted_label = self.model.config.id2label[predicted_idx]
        predicted_conf = logits[predicted_idx].item()
        if predicted_label == "LABEL_0" and predicted_conf < thred:
            predicted_idx = 1
            predicted_label = "LABEL_1"
        #return predicted_idx, predicted_label, predicted_conf
        return predicted_label
"""

# language detection by fasttext.
LID_MODEL_PATH = "./dependency/models/lid.176.bin"
if os.path.exists(LID_MODEL_PATH):
    lid_model = fasttext.load_model(LID_MODEL_PATH)
else:
    lid_model = None

# math detection by fasttext.
MATH_FT_MODEL_PATH = "./dependency/models/math.bin"
if os.path.exists(MATH_FT_MODEL_PATH):
    ft_math_model = fasttext.load_model(MATH_FT_MODEL_PATH)
else:
    ft_math_model = None

# openquestion detection by fasttext.
OPENQUESTION_MODEL_PATH = "./dependency/models/openquestion.bin"
if os.path.exists(OPENQUESTION_MODEL_PATH):
    ft_openquestion_model = fasttext.load_model(OPENQUESTION_MODEL_PATH)
else:
    ft_openquestion_model = None

# multiple-choice question detection by fasttext.
MCQ_MODEL_PATH = "./dependency/models/mcq.bin"
if os.path.exists(MCQ_MODEL_PATH):
    ft_mcq_model = fasttext.load_model(MCQ_MODEL_PATH)
else:
    ft_mcq_model = None

"""
# multiple-choice question detection by pytorch.
MCQ_PT_MODEL_PATH = "./dependency/models/mcq.pytorch"
if os.path.exists(MCQ_PT_MODEL_PATH):
    py_mcq_model = OpenQuestionModel(MCQ_PT_MODEL_PATH, local_files_only=True)
else:
    py_mcq_model = None
"""

# gpt agent.
gpt_api = GPTAPI()


================================================
FILE: DomainSpecific/core/layers/io/__init__.py
================================================
# IO - read/write
from .to_binary_file_layer import to_binary_file_layer
from .to_line_file_layer import to_line_file_layer
from .to_jsonl_file_layer import to_jsonl_file_layer
from .to_parquet_file_layer import to_parquet_file_layer
from .to_index_file_layer import to_index_file_layer
from .from_binary_file_layer import from_binary_file_layer
from .from_line_file_layer import from_line_file_layer
from .from_jsonl_file_layer import from_jsonl_file_layer
from .from_parquet_file_layer import from_parquet_file_layer
from .from_index_file_layer import from_index_file_layer
from .from_wet_file_layer import from_wet_file_layer
from .from_warc_file_layer import from_warc_file_layer

__all__ = [
    "to_binary_file_layer", 
    "to_line_file_layer", 
    "to_jsonl_file_layer", 
    "to_parquet_file_layer", 
    "to_index_file_layer",
    "from_binary_file_layer", 
    "from_line_file_layer", 
    "from_jsonl_file_layer", 
    "from_parquet_file_layer",
    "from_index_file_layer",
    "from_wet_file_layer", 
    "from_warc_file_layer",
]


================================================
FILE: DomainSpecific/core/layers/io/from_binary_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import util

def from_binary_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        with open(file_path, "rb") as f:
            data = f.read()
        ret = data
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    file_path = "test.binary"
    data = from_binary_file_layer(file_path)
    print(data)


================================================
FILE: DomainSpecific/core/layers/io/from_index_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import faiss
import traceback
import util

def from_index_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        index = faiss.read_index(file_path)
        ret = index
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == '__main__':
    file_path = "index.faiss"
    index = from_index_file_layer(file_path)
    print(index)


================================================
FILE: DomainSpecific/core/layers/io/from_jsonl_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import jsonlines
import util

def from_jsonl_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = list()
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        with jsonlines.open(file_path) as reader:
            for line in reader:
                ret.append(line)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    file_path = "test.jsonl"
    data = from_jsonl_file_layer(file_path)
    print(data)


================================================
FILE: DomainSpecific/core/layers/io/from_line_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import util

def from_line_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = list()
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        # use a context manager so the file handle is closed deterministically.
        with open(file_path, "r") as f:
            for line in f:
                line = line.strip()
                ret.append(line)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    file_path = "test.line"
    lines = from_line_file_layer(file_path)
    print(lines)


================================================
FILE: DomainSpecific/core/layers/io/from_parquet_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import pyarrow as pa
import pyarrow.parquet as pq
import util

def from_parquet_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        table = pq.read_table(file_path)
        ret = table.to_pylist()
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    file_path = "test.parquet"
    data = from_parquet_file_layer(file_path)
    print(data)


================================================
FILE: DomainSpecific/core/layers/io/from_warc_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
from warcio.archiveiterator import ArchiveIterator
import util

def from_warc_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        if os.path.exists(file_path):
            items = list()
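            with open(file_path, "rb") as stream:
                records = ArchiveIterator(stream, arc2warc=True)
                for record in records:
                    # skip records without parsed HTTP headers instead of failing the whole file.
                    is_html = (
                        record.rec_type == "response"
                        and record.http_headers is not None
                        and record.http_headers.get_header("Content-Type", "").startswith("text/html")
                    )
                    if is_html: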
                        item = dict()
                        item["uri"] = record.rec_headers.get("WARC-Target-URI")
                        item["lang"] = record.rec_headers.get("Detected-Language")
                        item["content_length"] = record.rec_headers["Content-Length"]
                        item["html"] = record.content_stream().read()
                        items.append(item)
            ret = items
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    file_path = "test.warc.gz"
    data = from_warc_file_layer(file_path)
    print(data)


================================================
FILE: DomainSpecific/core/layers/io/from_wet_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
from warcio.archiveiterator import ArchiveIterator
import util

def from_wet_file_layer(file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        if STORAGE_PATH is not None:
            util.download_file_from_blob(STORAGE_PATH, file_path, file_path)

        if os.path.exists(file_path):
            items = list()
            with open(file_path, "rb") as stream:
                records = ArchiveIterator(stream, arc2warc=False)
                for record in records:
                    if record.rec_type == "conversion":
                        item = dict()
                        item["uri"] = record.rec_headers.get("WARC-Target-URI")
                        item["lang"] = record.rec_headers.get("Detected-Language")
                        item["content_length"] = record.rec_headers["Content-Length"]
                        item["text"] = record.content_stream().read()
                        items.append(item)
            ret = items
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    file_path = "test.warc.wet.gz"
    data = from_wet_file_layer(file_path)
    print(data)


================================================
FILE: DomainSpecific/core/layers/io/to_binary_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import util

def to_binary_file_layer(bytes, file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        util.create_folder_by_file_path(file_path)

        with open(file_path, "wb") as f:
            f.write(bytes)

        if STORAGE_PATH is not None:
            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)

        ret = file_path
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    bytes = b"hello"
    file_path = "test.binary"
    file_path = to_binary_file_layer(bytes, file_path)
    print(file_path)


================================================
FILE: DomainSpecific/core/layers/io/to_index_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import faiss
import traceback
import util

def to_index_file_layer(index, file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        util.create_folder_by_file_path(file_path)

        faiss.write_index(index, file_path)

        if STORAGE_PATH is not None:
            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)

        ret = file_path
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == '__main__':
    import numpy as np  # needed only by this demo; the layer itself does not use numpy.

    D = 64
    NLIST = 100
    base_vectors = np.random.random((100000, D)).astype('float32')
    base_vectors[:, 0] += np.arange(100000) / 1000.
    
    quantizer = faiss.IndexFlatL2(D)
    index = faiss.IndexIVFFlat(quantizer, D, NLIST, faiss.METRIC_L2)

    assert not index.is_trained
    index.train(base_vectors)
    assert index.is_trained
    index.add(base_vectors)

    file_path = "index.faiss"
    file_path = to_index_file_layer(index, file_path)
    print(file_path)


================================================
FILE: DomainSpecific/core/layers/io/to_jsonl_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import jsonlines
import util

def to_jsonl_file_layer(data, file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        util.create_folder_by_file_path(file_path)

        with jsonlines.open(file_path, "w") as writer:
            writer.write_all(data)

        if STORAGE_PATH is not None:
            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)

        ret = file_path
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    data = [{'id': "1", 'html': "hello"}, {'id': "2", 'html': "hi"}]
    file_path = "test.jsonl"
    file_path = to_jsonl_file_layer(data, file_path)
    print(file_path)


================================================
FILE: DomainSpecific/core/layers/io/to_line_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import util

def to_line_file_layer(lines, file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        util.create_folder_by_file_path(file_path)

        with open(file_path, "w") as f:
            for line in lines:
                f.write(line + "\n")

        if STORAGE_PATH is not None:
            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)

        ret = file_path
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    lines = ["line1", "line2"]
    file_path = "test.line"
    file_path = to_line_file_layer(lines, file_path)
    print(file_path)


================================================
FILE: DomainSpecific/core/layers/io/to_parquet_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import pyarrow as pa
import pyarrow.parquet as pq
import util

def to_parquet_file_layer(data, file_path, variables=dict(), STORAGE_PATH=None):
    ret = None
    try:
        file_path = util.to_real_path(file_path, variables)
        util.create_folder_by_file_path(file_path)

        table = pa.Table.from_pylist(data)
        pq.write_table(table, file_path)

        if STORAGE_PATH is not None:
            util.upload_file_to_blob(STORAGE_PATH, file_path, file_path)

        ret = file_path
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    data = [{'id': "1", 'html': "hello"}, {'id': "2", 'html': "hi"}]
    file_path = "test.parquet"
    file_path = to_parquet_file_layer(data, file_path)
    print(file_path)


================================================
FILE: DomainSpecific/core/layers/network/__init__.py
================================================
# Network - download/upload
from .upload_file_to_blob_layer import upload_file_to_blob_layer
from .upload_bytes_to_blob_layer import upload_bytes_to_blob_layer
from .download_file_from_blob_layer import download_file_from_blob_layer
from .download_bytes_from_blob_layer import download_bytes_from_blob_layer
from .download_file_from_internet_layer import download_file_from_internet_layer
from .download_bytes_from_internet_layer import download_bytes_from_internet_layer
from .download_url_list_layer import download_url_list_layer
from .download_warc_file_layer import download_warc_file_layer
from .download_warc_indice_layer import download_warc_indice_layer
from .download_urls_from_website_layer import download_urls_from_website_layer
from .download_starcoder_layer import download_starcoder_layer

__all__ = [
    "upload_file_to_blob_layer",
    "upload_bytes_to_blob_layer",
    "download_file_from_blob_layer", 
    "download_bytes_from_blob_layer", 
    "download_file_from_internet_layer", 
    "download_bytes_from_internet_layer", 
    "download_url_list_layer", 
    "download_warc_file_layer", 
    "download_warc_indice_layer", 
    "download_urls_from_website_layer", 
    "download_starcoder_layer", 
]


================================================
FILE: DomainSpecific/core/layers/network/download_bytes_from_blob_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def download_bytes_from_blob_layer(blob_path, variables=dict(), STORAGE_PATH=None, TRIES=1):
    ret = (None, None, blob_path)
    try:
        for _ in range(TRIES):
            try:
                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)
                storage_config = util.load_yaml(STORAGE_PATH)
                blob_path = util.to_real_path(blob_path, variables)
                file_name = util.md5(blob_path) + util.suffix(blob_path)
                bytes = util.download_bytes_from_blob(storage_config, blob_path)
                ret = (file_name, bytes, None)
                break
            except Exception:  # let KeyboardInterrupt reach the outer handler.
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    blob_path = "$(azure_blob_path)"
    STORAGE_PATH = "resources/environment/llmstore.yaml"
    bytes = download_bytes_from_blob_layer(blob_path, STORAGE_PATH=STORAGE_PATH)
    print(bytes)


================================================
FILE: DomainSpecific/core/layers/network/download_bytes_from_internet_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def download_bytes_from_internet_layer(url, variables=dict(), TRIES=1):
    ret = (None, None, url)
    try:
        for _ in range(TRIES):
            try:
                url = util.to_real_path(url, variables)
                file_name = util.md5(url) + util.suffix(url)
                bytes = util.download_bytes_from_internet(url)
                ret = (file_name, bytes, None)
                break
            except Exception:  # let KeyboardInterrupt reach the outer handler.
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    url = "https://upload.wikimedia.org/wikipedia/commons/4/4f/SVG_Logo.svg"
    bytes = download_bytes_from_internet_layer(url)
    print(bytes)


================================================
FILE: DomainSpecific/core/layers/network/download_file_from_blob_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def download_file_from_blob_layer(blob_path, variables=dict(), DOWNLOAD_PATH=".", STORAGE_PATH=None, TRIES=1):
    ret = (None, blob_path)
    try:
        for _ in range(TRIES):
            try:
                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)
                storage_config = util.load_yaml(STORAGE_PATH)
                blob_path = util.to_real_path(blob_path, variables)
                file_name = util.md5(blob_path) + util.suffix(blob_path)
                file_path = os.path.join(DOWNLOAD_PATH, file_name)
                file_path = util.to_real_path(file_path, variables)
                util.download_file_from_blob(storage_config, blob_path, file_path)
                ret = (file_path, None)
                break
            except Exception:  # let KeyboardInterrupt reach the outer handler.
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    blob_path = "$(azure_blob_path)"
    DOWNLOAD_PATH = "$(local_folder_path)"
    STORAGE_PATH = "resources/environment/llmstore.yaml"
    path = download_file_from_blob_layer(blob_path, DOWNLOAD_PATH=DOWNLOAD_PATH, STORAGE_PATH=STORAGE_PATH)
    print(path)


================================================
FILE: DomainSpecific/core/layers/network/download_file_from_internet_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def download_file_from_internet_layer(url, variables=dict(), DOWNLOAD_PATH=".", TRIES=1):
    ret = (None, url)
    try:
        for _ in range(TRIES):
            try:
                url = util.to_real_path(url, variables)
                file_name = util.md5(url) + util.suffix(url)
                file_path = os.path.join(DOWNLOAD_PATH, file_name)
                file_path = util.to_real_path(file_path, variables)
                util.download_file_from_internet(url, file_path)
                #bytes = util.download_bytes_from_internet(url)
                #util.upload_bytes_to_blob(variables["storage_config"], bytes, file_path)
                ret = (file_path, None)
                break
            except Exception:  # let KeyboardInterrupt reach the outer handler.
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    url = "https://upload.wikimedia.org/wikipedia/commons/4/4f/SVG_Logo.svg"
    DOWNLOAD_PATH = "$(local_folder_path)"
    path = download_file_from_internet_layer(url, DOWNLOAD_PATH=DOWNLOAD_PATH)
    print(path)


================================================
FILE: DomainSpecific/core/layers/network/download_starcoder_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import json
from datetime import datetime
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import smart_open
from datasets import load_dataset
import util

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def download_contents(blob_id, src_encoding):
    s3_url = f"s3://softwareheritage/content/{blob_id}"
    with smart_open.open(s3_url, "rb", compression=".gz", transport_params={"client": s3}) as fin:
        content = fin.read().decode(src_encoding)
    return content

def download_starcoder_layer(data_repo, variables=dict(), OUTPUT_FOLDER="./", STORAGE_PATH=None, HUGGINGFACE_TOKEN=None):
    ret = 0
    try:
        worker_id = variables["worker_id"]
        worker_num = variables["worker_num"]
        data_repo = util.to_real_path(data_repo, variables)
        output_folder = util.to_real_path(OUTPUT_FOLDER, variables)
        if STORAGE_PATH is not None:
            storage_config = util.load_yaml(STORAGE_PATH)

        ds = load_dataset(data_repo, split="train", streaming=True, token=HUGGINGFACE_TOKEN, cache_dir=f"./cache.{worker_id}/")
        ds = ds.filter(lambda row, idx: idx % worker_num == worker_id, with_indices=True)

        item_count = 0
        for i, row in enumerate(ds):
            for key in row.keys():
                if isinstance(row[key], datetime):
                    row[key] = datetime.timestamp(row[key])

            blob_id = row["blob_id"]
            src_encoding = row["src_encoding"]

            snapshot_prefix = row["snapshot_id"][:4]
            repo_name = row["repo_name"].replace("/", "@")
            branch_name = row["branch_name"].replace("/", "@")
            language = row["language"].replace(" ", "_")
            path = row["path"].lstrip("/")
            filename = row["filename"].strip()
            filename = path
            extension = row["extension"].strip()

            content = download_contents(blob_id, src_encoding)

            code_path = os.path.join(output_folder, snapshot_prefix, repo_name, branch_name, blob_id)
            metadata_path = os.path.join(output_folder, snapshot_prefix, repo_name, branch_name, blob_id + ".json")

            try:
                util.create_folder_by_file_path(code_path)
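                # content is already decoded text; write it as UTF-8 regardless of the locale default.
                with open(code_path, "w", encoding="utf-8") as f: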
                    f.write(content)
                if STORAGE_PATH is not None:
                    util.upload_file_to_blob(storage_config, code_path, code_path)

                util.create_folder_by_file_path(metadata_path)
                with open(metadata_path, "w") as f:
                    f.write(json.dumps(row, indent=4) + "\n")
                if STORAGE_PATH is not None:
                    util.upload_file_to_blob(storage_config, metadata_path, metadata_path)

                if STORAGE_PATH is not None:
                    try:
                        os.remove(code_path)
                        os.remove(metadata_path)
                    except OSError:
                        pass
            except Exception:  # let KeyboardInterrupt reach the outer handler.
                traceback.print_exc()
            
            item_count += 1

        ret = item_count
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == '__main__':
    data_repo = "$(local_the_stack_v2_dedup_metadata_path)"
    variables = {"workspace_dir": r"workspace", "worker_id": 0, "worker_num": 1}
    OUTPUT_FOLDER = "$(local_the_stack_v2_dedup_data_path)"
    STORAGE_PATH = "resources/storage/llmstore.yaml"
    HUGGINGFACE_TOKEN = None
    item_count = download_starcoder_layer(data_repo, variables=variables, OUTPUT_FOLDER=OUTPUT_FOLDER, STORAGE_PATH=STORAGE_PATH, HUGGINGFACE_TOKEN=HUGGINGFACE_TOKEN)
    print(item_count)


================================================
FILE: DomainSpecific/core/layers/network/download_url_list_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import gzip
import json
import requests
import traceback

def download_url_list_layer(index_url, variables=dict(), FILTER_SUFFIXES=(), TRIES=1):
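    ret = list()
    succeeded = False
    try:
        for _ in range(TRIES):
            try:
                resp = requests.get(index_url, stream=True)
                urls = list()
                with gzip.open(resp.raw, 'rt') as f:
                    for line in f:
                        # the JSON payload follows the first " {"; split only once so that
                        # a later " {" inside the payload does not truncate it.
                        text = "{" + line.strip().split(" {", 1)[1]
                        item = json.loads(text)
                        url = item["url"]
                        suffix = os.path.splitext(url)[1]
                        # keep only urls whose suffix is listed in FILTER_SUFFIXES.
                        if suffix in FILTER_SUFFIXES:
                            urls.append(url)
                ret = urls
                succeeded = True
                break
            except Exception:  # let KeyboardInterrupt reach the outer handler.
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    # report the index as failed only when it could not be fetched or parsed;
    # an index without any matching url is a valid, empty result.
    return (ret, [] if succeeded else [index_url])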


if __name__ == '__main__':
    index_url = "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-23/indexes/cdx-00000.gz"
    FILTER_SUFFIXES = (".svg",)
    urls = download_url_list_layer(index_url, FILTER_SUFFIXES=FILTER_SUFFIXES)
    print(urls)


================================================
FILE: DomainSpecific/core/layers/network/download_urls_from_website_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import requests
import logging
import traceback
import xml.etree.ElementTree as ET

def download_urls_from_website_layer(website_url, variables=dict(), FILTER=None):
    ret = list()
    try:
        # strip a trailing slash so the result is not ".../​/robots.txt".
        robot_url = website_url.rstrip("/") + "/robots.txt"
        logging.disable(logging.WARNING)

        # get sitemap.
        xml_urls = list()
        white_url_prefixes = list()
        black_url_prefixes = list()
        resp = requests.get(robot_url)
        crawler = None
        for line in resp.text.split("\n"):
            line = line.strip()
            if len(line) == 0:
                continue
            if line.startswith("#"):
                continue

            # Sitemap directives do not belong to a User-agent group, so handle them first.
            if line.startswith("Sitemap:"):
                xml_url = line.replace("Sitemap:", "", 1).strip()
                if (FILTER is None or FILTER in xml_url) and xml_url.endswith(".xml"):
                    xml_urls.append(xml_url)
                continue

            if line.startswith("User-agent:"):
                crawler = line.split(":")[-1].strip()
                continue

            # only the rules of the wildcard group are collected.
            if crawler != "*":
                continue
            if line.startswith("Disallow:"):
                url_prefix = line.replace("Disallow:", "", 1).strip()
                black_url_prefixes.append(url_prefix)
                continue
            if line.startswith("Allow:"):
                url_prefix = line.replace("Allow:", "", 1).strip()
                white_url_prefixes.append(url_prefix)
                continue

        # get urls.
        html_urls = set()
        for xml_url in xml_urls:
            try:
                resp = requests.get(xml_url)
                root = ET.fromstring(resp.content)
                for sitemap in root:
                    html_url = list(sitemap)[0].text
                    html_urls.add(html_url)
                #nodes = tree.xpath('//a/@href')
                #nodes = tree.xpath("//loc")
            except Exception:  # skip an unreadable sitemap, but let KeyboardInterrupt propagate.
                pass

        ret = list(html_urls)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == '__main__':
    website_url = "https://byjus.com/"
    FILTER = "math"
    urls = download_urls_from_website_layer(website_url, FILTER=FILTER)
    print(urls[0][0])


================================================
FILE: DomainSpecific/core/layers/network/download_warc_file_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def download_warc_file_layer(warc_url, variables=dict(), DOWNLOAD_FOLDER="./", CONNECTS=16, TRIES=1, OVERWRITE=False):
    ret = (None, warc_url)
    try:
        if not warc_url.startswith("https://"):
            warc_url = "https://data.commoncrawl.org/" + warc_url
        #warc_url = warc_url.replace("https://data.commoncrawl.org/", "https://ds5q9oxwqwsfj.cloudfront.net/")# debug
        warc_name = warc_url.split("/")[-3] + "_" + os.path.basename(warc_url)
        warc_path = os.path.join(DOWNLOAD_FOLDER, warc_name)
        warc_path = util.to_real_path(warc_path, variables)

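        # decide once whether a download is needed: checking inside the retry loop could
        # mistake a partial file left by a failed attempt for a finished download.
        exit_status = 0
        if OVERWRITE or not os.path.exists(warc_path):
            exit_status = 1
            util.create_folder_by_file_path(warc_path)
            commandline = f"axel -q -n {CONNECTS} -o {warc_path} {warc_url}"
            for _ in range(TRIES):
                exit_status = os.system(commandline)
                if exit_status == 0:
                    break
                time.sleep(1)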

        if exit_status == 0:
            ret = (warc_name, None)
        else:
            ret = (None, warc_url)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00000.warc.gz"
    DOWNLOAD_FOLDER = "$(local_folder_path)"
    (success_warc_url, failed_warc_url) = download_warc_file_layer(warc_url, DOWNLOAD_FOLDER=DOWNLOAD_FOLDER)
    print(success_warc_url, failed_warc_url)


================================================
FILE: DomainSpecific/core/layers/network/download_warc_indice_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import gzip
import requests
import traceback

def download_warc_indice_layer(index_url, variables=dict(), TRIES=1, URL_PREFIX="https://data.commoncrawl.org/"):
    ret = list()
    try:
        for _ in range(TRIES):
            try:
                resp = requests.get(index_url, stream=True)
                urls = list()
                with gzip.open(resp.raw, 'rt') as f:
                    for line in f.readlines():
                        warc_url = URL_PREFIX + line.strip()
                        urls.append(warc_url)
                ret = urls
                break
            except Exception:
                # retry on request/decompression errors; KeyboardInterrupt reaches the outer handler.
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, [index_url] if len(ret) == 0 else [])


if __name__ == '__main__':
    index_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-49/warc.paths.gz"
    warc_urls = download_warc_indice_layer(index_url)
    print(warc_urls[0][0])
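
# A minimal offline sketch of the parsing step used above: decompress a gzipped
# path listing and prefix each line with the download host. The helper name
# parse_warc_paths and the sample paths are hypothetical; unlike the layer above,
# this sketch skips empty lines and performs no network request.

```python
import gzip
import io


def parse_warc_paths(gz_bytes, url_prefix="https://data.commoncrawl.org/"):
    """Decompress a gzipped path listing and prefix each non-empty line."""
    urls = []
    with gzip.open(io.BytesIO(gz_bytes), "rt") as f:
        for line in f:
            path = line.strip()
            if path:
                urls.append(url_prefix + path)
    return urls


# build a tiny in-memory listing instead of downloading warc.paths.gz.
sample = gzip.compress(b"crawl-data/a.warc.gz\ncrawl-data/b.warc.gz\n")
print(parse_warc_paths(sample))
```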


================================================
FILE: DomainSpecific/core/layers/network/upload_bytes_to_blob_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def upload_bytes_to_blob_layer(bytes, blob_path, variables=dict(), STORAGE_PATH=None, BLOB_PREFIX="", TRIES=1):
    ret = (None, blob_path)
    try:
        for _ in range(TRIES):
            try:
                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)
                storage_config = util.load_yaml(STORAGE_PATH)
                # keep blob_path unchanged so a retry does not prepend BLOB_PREFIX twice.
                real_blob_path = util.to_real_path(os.path.join(BLOB_PREFIX, blob_path), variables)
                util.upload_bytes_to_blob(storage_config, bytes, real_blob_path)
                ret = (real_blob_path, None)
                break
            except Exception:
                time.sleep(1)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    bytes = b"hello"
    blob_path = "$(azure_blob_path)"
    STORAGE_PATH = "resources/environment/llmstore.yaml"
    path = upload_bytes_to_blob_layer(bytes, blob_path, STORAGE_PATH=STORAGE_PATH)
    print(path)


================================================
FILE: DomainSpecific/core/layers/network/upload_file_to_blob_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import util

def upload_file_to_blob_layer(file_path, blob_path, variables=dict(), STORAGE_PATH=None, BLOB_PREFIX="", TRIES=1):
    ret = (None, blob_path)
    try:
        for _ in range(TRIES):
            try:
                assert STORAGE_PATH is not None and os.path.exists(STORAGE_PATH)
                storage_config = util.load_yaml(STORAGE_PATH)
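                # keep file_path and blob_path unchanged so a retry does not prepend BLOB_PREFIX twice.
                real_file_path = util.to_real_path(file_path, variables)
                real_blob_path = util.to_real_path(os.path.join(BLOB_PREFIX, blob_path), variables)
                util.upload_file_to_blob(storage_config, real_file_path, real_blob_path)
                ret = (real_blob_path, None)
                break
            except Exception:
                time.sleep(1)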
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == '__main__':
    blob_path = "$(azure_blob_path)"
    file_path = "$(local_file_path)"
    STORAGE_PATH = "resources/environment/llmstore.yaml"
    path = upload_file_to_blob_layer(file_path, blob_path, STORAGE_PATH=STORAGE_PATH)
    print(path)


================================================
FILE: DomainSpecific/core/layers/template_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import sys
import traceback

# Spec for adding a new layer:
# 1. The layer function must be registered in the __init__.py file of its folder.
# 2. The layer function must return a tuple, even when the return value is empty.
# 3. The layer function must accept a "variables" dictionary parameter for access to globally shared variables.
# 4. Prefer implementing a unit test and placing it in the "__main__" block.
# 5. Prefer wrapping the function logic in exception handling.
# 6. Prefer ending the function name with "_layer".
# 7. Prefer documenting the function's purpose, inputs, and outputs in comments.
# 8. Prefer lowercase names for input data.
# 9. Prefer uppercase names for input parameters.

def template_layer(input, variables=dict(), PARAM=None):
    ret = None
    try:
        ret = input
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    input = None
    output = template_layer(input)
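
# A minimal sketch of a layer that follows the spec listed in this file. The
# function strip_text_layer and its LOWERCASE parameter are hypothetical and are
# not part of the repository; registration in __init__.py is omitted here.

```python
import sys
import traceback


def strip_text_layer(text, variables=None, LOWERCASE=False):
    """Hypothetical layer: trim whitespace and optionally lowercase the input.

    Input data is lowercase (text), the parameter is uppercase (LOWERCASE),
    and the result is always returned as a tuple.
    """
    variables = {} if variables is None else variables
    ret = None
    try:
        ret = text.strip()
        if LOWERCASE:
            ret = ret.lower()
    except KeyboardInterrupt:
        sys.exit()
    except Exception:
        # on failure the layer still returns a tuple, with None as the value.
        traceback.print_exc()
    return (ret,)


if __name__ == "__main__":
    print(strip_text_layer("  Hello Layer  ", LOWERCASE=True))
```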


================================================
FILE: DomainSpecific/core/layers/transform/__init__.py
================================================
# Transform
from .tokenize_article_layer import tokenize_article_layer
from .ngrams_layer import ngrams_layer
from .minhash_tokens_layer import minhash_tokens_layer
from .lsh_minhash_layer import lsh_minhash_layer
from .warc_filter_layer import warc_filter_layer
from .warc_encode_layer import warc_encode_layer
from .warc_to_wet_layer import warc_to_wet_layer
from .wet_decode_layer import wet_decode_layer
from .math_filter_layer import math_filter_layer
from .openquestion_filter_layer import openquestion_filter_layer
from .mcq_filter_layer import mcq_filter_layer

__all__ = [
    "tokenize_article_layer", 
    "ngrams_layer", 
    "minhash_tokens_layer", 
    "lsh_minhash_layer", 
    "warc_filter_layer", 
    "warc_encode_layer", 
    "warc_to_wet_layer", 
    "wet_decode_layer", 
    "math_filter_layer",
    "openquestion_filter_layer",
    "mcq_filter_layer",
]


================================================
FILE: DomainSpecific/core/layers/transform/lsh_minhash_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import time
import traceback
import numpy as np
from scipy.integrate import quad as integrate

# Unlike datasketch's implementation, hash values are bounded by 2^61-1 instead of 2^32-1.
NUM_PERM = 256
LSH_THRESHOLD = 0.8

class LSH:
    def __init__(self):
        # gen lsh range
        b, r = self.optimal_param(LSH_THRESHOLD, NUM_PERM, 0.5, 0.5)
        self.hashranges = [(i*r, (i+1)*r) for i in range(b)]
        
    # gen lsh param
    # https://github.com/ekzhu/datasketch/blob/44077457d32887a91297f15c3efee2c1982f690e/datasketch/lsh.py
    def false_positive_probability(self, threshold, b, r):
        _probability = lambda s : 1 - (1 - s**float(r))**float(b)
        a, err = integrate(_probability, 0.0, threshold)
        return a

    def false_negative_probability(self, threshold, b, r):
        _probability = lambda s : 1 - (1 - (1 - s**float(r))**float(b))
        a, err = integrate(_probability, threshold, 1.0)
        return a

    def optimal_param(self, threshold, num_perm, false_positive_weight,
            false_negative_weight):
        '''
        Compute the optimal `MinHashLSH` parameter that minimizes the weighted sum
        of probabilities of false positive and false negative.
        '''
        min_error = float("inf")
        opt = (0, 0)
        for b in range(1, num_perm+1):
            max_r = int(num_perm / b)
            for r in range(1, max_r+1):
                fp = self.false_positive_probability(threshold, b, r)
                fn = self.false_negative_probability(threshold, b, r)
                error = fp*false_positive_weight + fn*false_negative_weight
                if error < min_error:
                    min_error = error
                    opt = (b, r)
        return opt

    def gen_lsh(self, minhash):
        return [bytearray(minhash[start:end]) for start, end in self.hashranges]

lsh = LSH()

def lsh_minhash_layer(minhash, variables=dict()):
    ret = list()
    try:
        minhash = np.array(minhash, dtype=np.uint64)
        #assert minhash.dtype == np.uint64 and minhash.shape == (NUM_PERM,)
        lshvalues = lsh.gen_lsh(minhash)
        for i, value in enumerate(lshvalues):
            key = f'{i}_'.encode() + value
            ret.append(key)
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == "__main__":
    minhash = [2170239837623632,1287605064391826,7877338491737559,1522708576701298,1959803855170230,136353893425081,3067530819312822,19822079906565762,14191953696745176,371933081470560,2359093478290026,24211742396711177,5207401883495830,3386445753675098,6482843287028185,14956790165792002,7760994632330526,3801562091963312,654119844389846,6118541550243605,1058268864309841,19648312785892006,5519054639081138,17769255728697304,1326859272534844,6541616202650748,11131462447891679,11540424367241221,6416091255362971,1178274890175074,9516296843449206,5019313649584786,556043434180166,3170749841321737,788403856226243,16256424180717928,11536645058081246,13331271075979702,5603975614240490,11332978618315755,49833277925775,28529817665769800,5399529123965416,5804862109442032,10516842515700528,1383775130067327,9593857895450592,344120332429946,3650720026287843,4927677784872807,3114522307389328,1054088699310940,11453703275676121,17145094372333782,11943406601641085,429519913626747,3559765888081715,6380853683568781,13142954055708448,1122751140539670,7679037943867431,23532369906879837,4460946791673399,6284691595180437,5534632051525650,4326069154983305,6645880540672905,1199004738171304,2741143312089611,3315947713975755,33325056362165,17905224452748795,11081894870845940,2429362824597352,8796539339687473,17606225237179401,2406479086961618,25285711888782525,1847958183256316,4198878926995358,5057832224878357,10146090240130753,2413082792037196,3530471135853536,7672611456084586,2230458118023706,9790058494528486,3351632677682193,6902744571969727,4063006572456150,2761280786272613,6242978327908865,26924233559187524,2214283527827093,951652422014210,1577851399523074,282734099627651,4284321096276342,1571021659718705,2064444079057042,25995837896147107,3642452037001290,615591136529782,2579917399379439,10350113780305730,141093940432428,9292013714641581,16926413460125,4351013271280123,4492914008491347,3885988895709230,3643655265951773,4028855757933683,10480484972551973,2399277677842610,391439629014342,451
1050103292841,13930059233224697,10142483490268814,10209387364437517,10291028774837120,1963510243393060,6698235608219585,10249974506598137,2090329927024291,19452257405817527,5395347850501660,1466647506773938,18271233688875585,17909487123073655,22732716574954981,28208124344155426,16118266291737203,6436198404802809,935143955767639,4692764892567773,8853071216371112,1600664618209927,39702070969452097,7552579352900360,2729546584440357,12309935356310386,426760114692333,1297488733224877,153415463561661,18948566290952420,8432980683248649,21321844297374743,8265174613176795,905258690673816,705406607744747,9105597597214747,517772088040257,1591136193162784,27511729624229236,3634922285407283,1831578225426174,13255266977668852,15312685554649660,722931468693513,1049089865098577,3498618026981595,4820015824926872,21126162808808528,27814106051492575,4822875592156961,14999120736412943,10825146296544249,6314954554132894,937945964737656,5459760788750366,3819227047549912,6591064604768721,7907494363943122,3486632627636937,9384132089104933,22104346516322826,6658745931891482,34093012584282609,4995951742943174,3517485897161771,135044219482780,7630383357514628,5162177136386332,10728488430543051,5828055747100055,6893511170015442,11011121196423559,2528283999013590,5080079240873515,19593423843180365,6822359610856040,191087978655560,8846708703413576,33146998994366094,3940701969864300,3507581990705859,6201879648552385,27956522101531374,10178358282977630,2205391899838384,2614926987404300,1090899715885363,6945147978151211,5432157012678156,1250518799355535,3948407147690489,10306927288370802,4580562167416191,8475303907451120,2243101892749971,2451601302451002,2180238663422921,3834240093757495,12119880871693653,12134080723101916,1805202361835209,31781168568203930,42987808989068825,41914343122681270,7985132073155851,16763654385115268,1387995454655588,2351466328427087,3139781779642664,27792958762616566,11961004800461011,6612181571493100,22715857059525182,689087660337260,244785061275028,11511948953811059,82
37401627755449,8214914423544509,5470929524034644,9110614658125771,17166417582628999,18571246019891132,3766276759071421,1226388404627669,9965671498507403,1214978610204088,7808074359603991,1313444080667563,9031456783378283,3783393382666945,34163041205217466,3314866608200743,3451870308271748,11716681494447625,1667361573332888,13859255454740261,7299000064706400,6085019581018810,4996856251238621,5666642298303467]
    lsh_values = lsh_minhash_layer(minhash)
    print(lsh_values)
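
# A small sketch of the quantity integrated by false_positive_probability above:
# with b bands of r rows each, two signatures with Jaccard similarity s share at
# least one band with probability 1 - (1 - s**r)**b. The values b=8 and r=16 are
# illustrative assumptions only, not the parameters chosen by optimal_param.

```python
def band_collision_probability(s, b, r):
    """Probability that two MinHash signatures collide in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


if __name__ == "__main__":
    # the curve rises steeply around the similarity threshold.
    for s in (0.5, 0.7, 0.8, 0.9, 0.95):
        print(s, band_collision_probability(s, b=8, r=16))
```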


================================================
FILE: DomainSpecific/core/layers/transform/math_filter_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import re
import requests
import fasttext
from gensim.utils import simple_preprocess
import pyarrow as pa
import pyarrow.parquet as pq
import util
import global_var

whilte_list = {r"\\displaystyle", r"\\alpha", r"\\beta", r"\\gamma", r"\\delta", r"\\zeta", r"\\eta", r"\\iota", r"\\kappa", r"\\mu", r"\\nu", r"\\xi", r"\\rho", r"\\tau", r"\\phi", r"\\chi", r"\\psi", r"\\omicron", r"\\epsilon", r"\\pi", r"\\lambda", r"\\omega", r"\\sigma", r"\\theta", r"\\vartheta", r"\\times", r"\\cdot", r"\\dot", r"\\div", r"\\frac", r"\\log", r"\\exp", r"\\poly", r"\\eq", r"\\neq", r"\\leq", r"\\geq", r"\\approx", r"\\infty", r"\\int", r"\\sum", r"\\lim", r"\\begin", r"\\subset", r"\\supset", r"\\top", r"\\star", r"\\sim", r"\\simeq", r"\\ne", r"\\ll", r"\\gg", r"\\pm", r"\\mp", r"\\triangleleft", r"\\triangleright", r"\\ast", r"\\circ", r"\\bullet", r"\\oplus", r"\\odot", r"\\otimes", r"\\ominus", r"\\oslash", r"\\bigcirc", r"\\wr", r"\\dagger", r"\\bigtriangleup", r"\\bigtriangledown", r"\\setminus", r"\\sqcup", r"\\wedge", r"\\dotplus", r"\\centerdot", r"\\ltimes", r"\\rtimes", r"\\prod", r"\\coprod", r"\\iint", r"\\iiint", r"\\iiiint", r"\\idotsint", r"\\bigoplus", r"\\big", r"\\oint", r"\\rightarrow", r"\\to", r"\\leftarrow", r"\\gets", r"\\uparrow", r"\\downarrow", r"\\forall", r"\\exists", r"\\pmod", r"\\cup", r"\\cap", r"\\hat", r"\\acute", r"\\check", r"\\grave", r"\\vec", r"\\ddot", r"\\tilde", r"\\breve", r"\\mathring", r"\\land", r"\\lor", r"\\lnot", r"\\in", r"\\smile", r"\\frown", r"\\infty", r"\\mid", r"\\sin", r"\\cos", r"\\tan", r"\\equiv", r"\\circ", r"\\dfrac", r"\\prec", r"\\preccurlyeq", r"\\sqrt",}
black_list = {r"\\text", r"\\if", r"\\local", r"\\usr", r"\\include", r"\\lib", r"\\bin", r"\\url", r"\\program", r"\\microsoft", r"\\temp", r"\\windows", r"\\documents", r"\\users", r"\\my", r"\\the",}
keywords1 = whilte_list - black_list
keywords1 = set(map(lambda x: x + "[^a-zA-Z]", keywords1))

keywords2 = {r"\+", r"\-", r"\*", r"\/", r"\%", r"\=", r"\!\=", r"\<", r"\>", r"\^", r"\_", r"\(", r"\)", r"\[", r"\]", r"\{", r"\}", r"\|\|", r"\&\&", r"sqrt", r"sum", r"int", r"\$", r"\<math\>", r"\[math\]", }

pattern0 = re.compile(r"\\[A-Z]{0,9}[a-z]{2,9}")
pattern1 = re.compile("|".join(keywords1))
pattern2 = re.compile("|".join(keywords2))

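def ismath_by_model(text, model, thred=0.5):
    # classify text with the fastText model; any label other than "__label__0" counts as math.
    # note: thred is currently unused, the predicted probability is not compared against it.
    if model is None:
        return False
    if not isinstance(text, str) or len(text.strip()) == 0:
        return False
    try:
        x = " ".join(simple_preprocess(text))
        ret = model.predict(x)
        label, prob = ret[0][0], ret[1][0]
        return label != "__label__0"
    except Exception:
        traceback.print_exc()
        return False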

def math_filter_layer(pq_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", OVERWRITE=False):
    ret = list()
    try:
        in_pq_path = os.path.join(INPUT_FOLDER, pq_name)
        in_pq_path = util.to_real_path(in_pq_path, variables)
        out_pq_path = os.path.join(OUTPUT_FOLDER, pq_name)
        out_pq_path = util.to_real_path(out_pq_path, variables)

        if os.path.exists(in_pq_path) and (OVERWRITE or not os.path.exists(out_pq_path)):
            util.create_folder_by_file_path(out_pq_path)

            # read parquet file.
            try:
                table = pq.read_table(in_pq_path)
            except:
                traceback.print_exc()
            
            # filter records containing math.
            records = list()
            for record in table.to_pylist():
                try:
                    text = record["text"]

                    if record["la"] != "en":
                        continue

                    #if item["la_prob"] < 0.65:
                    #    continue
                    #if text is None or len(text) < 64:
                    #    continue
                    #if text.count("\\u") >= 10:
                    #    continue

                    #if not check_quality(record):
                    #    continue

                    symbols0 = set(pattern0.findall(text))
                    if len(symbols0) <= 0:
                        continue

                    symbols1 = set(pattern1.findall(text.lower()))
                    symbols1 = set(map(lambda sym: sym[:-1], symbols1))
                    if len(symbols1) <= 0:
                        continue

                    symbols2 = set(pattern2.findall(text.lower()))
                    if len(symbols1) == 1 and len(symbols2) <= 0:
                        continue

                    ismath = len(symbols1) >= 5 or ismath_by_model(text, global_var.ft_math_model)
                    if not ismath:
                        continue

                    records.append(record)
                except:
                    traceback.print_exc()

            # write parquet file.
            try:
                table = pa.Table.from_pylist(records)
                pq.write_table(table, out_pq_path)
            except:
                traceback.print_exc()
            
            ret = [out_pq_path]
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )


if __name__ == '__main__':
    snapshot = "CC-MAIN-2022-49"
    variables = {"workspace_dir": r"workspace", "worker_id": 0, "worker_num": 1}
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    # math_filter_layer has no STORAGE_PATH parameter, so it is not passed here.
    ret = math_filter_layer(snapshot, variables=variables, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER)
    print(ret)
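
# A minimal sketch of the keyword matching used by the filter above, with a small
# illustrative subset of commands. The names COMMANDS and latex_commands are
# hypothetical. As in pattern1, each command must be followed by a non-letter
# character, so "\summary" does not count as "\sum" and a command at the very end
# of the text is not matched.

```python
import re

COMMANDS = ["alpha", "frac", "sum", "sqrt"]  # illustrative subset only
pattern = re.compile("|".join(r"\\" + c + r"[^a-zA-Z]" for c in COMMANDS))


def latex_commands(text):
    """Return the distinct LaTeX commands found, without the trailing character."""
    return {match[:-1] for match in pattern.findall(text.lower())}


if __name__ == "__main__":
    print(latex_commands(r"Let \alpha = \frac{1}{2} and \summary"))
```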


================================================
FILE: DomainSpecific/core/layers/transform/mcq_filter_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import re
import json
import fasttext
import requests
from io import BytesIO
from gensim.utils import simple_preprocess
from warcio.limitreader import LimitReader
from warcio.warcwriter import WARCWriter
from warcio.archiveiterator import ArchiveIterator
import util
import global_var


def detect_lang(text):
    try:
        LID_WIN_SIZE = 256
        text = ''.join(text.split())
        span_start, span_end = 0, len(text)
        if len(text) > LID_WIN_SIZE:
            mid = len(text) // 2
            mid_win = LID_WIN_SIZE // 2
            span_start = max(0, int(mid - mid_win))
            span_end = min(len(text), int(mid + mid_win))
        det_text = text[span_start: span_end]
        res = global_var.lid_model.predict(det_text)
        lang = res[0][0].replace("__label__", "")
        prob = float(res[1][0])
        return lang
    except Exception:
        return "unknown"


def detect_choice_exercise_by_rule(uri, html):
    uri = uri.lower()
    html = html.lower()
    contain_cnt = 0

    keywords_in_text = [b"choice question"]
    for keyword in keywords_in_text:
        if keyword in html:
            contain_cnt += 1
            break

    combo_keywords_in_text = [
        (b"a.",   b"b.",   b"c.",   b"d."),
        (b"a)",   b"b)",   b"c)",   b"d)"),
        (b"\na ", b"\nb ", b"\nc ", b"\nd "),
        (b">a<",  b">b<",  b">c<",  b">d<"),

        (b"1.",   b"2.",   b"3.",   b"4."),
        (b"1)",   b"2)",   b"3)",   b"4)"),
        (b"\n1 ", b"\n2 ", b"\n3 ", b"\n4 "),
        (b">1<",  b">2<",  b">3<",  b">4<"),

        (b"i.",   b"ii.",   b"iii.",   b"iv."),
        (b"i)",   b"ii)",   b"iii)",   b"iv)"),
        (b"\ni ", b"\nii ", b"\niii ", b"\niv "),
        (b">i<",  b">ii<",  b">iii<",  b">iv<"),
    ]

    for combo_keyword in combo_keywords_in_text:
        if combo_keyword[0] in html and combo_keyword[1] in html and combo_keyword[2] in html and combo_keyword[3] in html:
            contain_cnt += 1
            break

    return contain_cnt == 2


def detect_choice_exercise_by_ft_model(uri, text, thred=0.5):
    try:
        if not isinstance(text, str) or len(text.strip()) == 0:
            return False
        x = " ".join(simple_preprocess(text))
        ret = global_var.ft_mcq_model.predict(x)
        label, prob = ret[0][0], ret[1][0]
        # keep the page when the negative prediction is not confident enough.
        if label == "__label__0" and prob < thred:
            return True
        return label == "__label__1"
    except Exception:
        return False

"""
def detect_choice_exercise_by_pt_model(uri, text, thred=0.5):
    try:
        if not isinstance(text, str) or len(text.strip()) == 0:
            return False
        label = global_var.py_mcq_model.run(text, thred)
        return label == "LABEL_1"
    except:
        return False
"""


def detect_choice_exercise_by_LLM(text, engine=None):
    system = '''
You will be given a text converted from a webpage. Your task is to detect whether it contains choice question by responding with 'yes' or 'no'.
'''
    answer = global_var.gpt_api.run(system=system, question=text, engine=engine)
    answer = answer.lower().strip()
    if answer.startswith("yes"):
        return True
    elif answer.startswith("no"):
        return False
    else:
        return False


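def LCS(str1, str2):
    # ratio of the longest common subsequence length to len(str2), rounded to 6 decimals.
    m = len(str1)
    n = len(str2)
    if n == 0:
        # avoid division by zero when str2 is empty.
        return 0.0

    dp = [[0 for _ in range(n+1)] for _ in range(m+1)]

    for i in range(1, m+1):
        for j in range(1, n+1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    return round(1.0 * dp[m][n] / n, 6)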


def localize_choice_exercise_by_LLM(text, engine=None):
    system = '''
Purpose:
Create a multiple-choice question dataset.

Task:
Extract all multiple-choice questions from the provided text.

Requirements:
1. If the given text does not contain multiple-choice questions, respond only with "No multiple-choice questions found".
2. Do not modify the original multiple-choice questions.
3. Ensure all multiple-choice questions are copied without omissions.
4. Ensure all multiple-choice questions are copied in order.
5. Ensure all multiple-choice questions are copied under the original layout.
6. Copy the questions along with their options.
7. If answers and explanations are provided, copy them as well.
8. If source materials or reading passage is provided, copy it as well.
9. Don't add content not from original given text.

Please strictly adhere to these requirements while performing the task.
'''
    exercises = global_var.gpt_api.run(system=system, question=text, engine=engine)
    exercises = exercises.strip()
    if len(exercises) == 0 or "no multiple-choice question" in exercises.lower():
        return None
    else:
        exercises = exercises.replace("Multiple Choice Questions\n", "")
        exercises = exercises.replace("Multiple-choice questions:\n", "")
        exercises = exercises.replace("No other multiple-choice questions found.", "")
        exercises = exercises.replace("No other multiple-choice questions found in the text.", "")
        exercises = exercises.replace("No multiple-choice questions found.", "")
        exercises = exercises.replace("No more multiple-choice questions found.", "")

        sim = LCS(text, exercises)
        if sim < 0.9:
            return None
        else:
            return exercises


# rule + model + GPT-3.5 Turbo.
def mcq_filter_layer(wet_file_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", OVERWRITE=False):
    ret = list()
    try:
        src_wet_file_path = os.path.join(INPUT_FOLDER, wet_file_name)
        src_wet_file_path = util.to_real_path(src_wet_file_path, variables)
        jsonl_file_name = wet_file_name.replace(".warc.wet.gz", ".jsonl")
        dst_jsonl_file_path = os.path.join(OUTPUT_FOLDER, jsonl_file_name)
        dst_jsonl_file_path = util.to_real_path(dst_jsonl_file_path, variables)

        if os.path.exists(src_wet_file_path) and (OVERWRITE or not os.path.exists(dst_jsonl_file_path)):
            items = list()
            with open(src_wet_file_path, "rb") as input:
                records = ArchiveIterator(input, arc2warc=False)
                for id, record in enumerate(records):
                    if record.rec_type == "conversion":
                        try:
                            # read the extracted page text from the WET record.
                            uri = record.rec_headers["WARC-Target-URI"]
                            bs = record.content_stream().read()
                            if bs is None:
                                continue

                            text = str(bs, "utf-8")
                            if text is None:
                                continue

                            # 1st round filter.
                            round1_contain_exercise = detect_choice_exercise_by_rule(uri, bs)
                            if not round1_contain_exercise:
                                continue

                            # 2nd round filter.
                            round2_contain_exercise = detect_choice_exercise_by_ft_model(uri, text, thred=0.825)
                            if not round2_contain_exercise:
                                continue
                            #round2_contain_exercise = detect_choice_exercise_by_pt_model(uri, text, thred=0.99)
                            #if not round2_contain_exercise:
                            #    continue

                            """
                            # 3rd round filter.
                            round3_contain_exercise = detect_choice_exercise_by_LLM(text, "gpt-35-turbo")
                            if not round3_contain_exercise:
                                continue
                            """

                            item = dict()
                            item["uri"] = uri
                            item["text"] = text
                            lang = detect_lang(text)
                            item["lang"] = lang
                            #exercises = localize_choice_exercise_by_LLM(text, "gpt-35-turbo")
                            #item["exercises"] = exercises
                            items.append(item)
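                        except Exception:
                            traceback.print_exc()
            # make sure the output folder exists before writing.
            util.create_folder_by_file_path(dst_jsonl_file_path)
            with open(dst_jsonl_file_path, "w") as output: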
                for item in items:
                    output.write(json.dumps(item) + "\n")
            ret = [dst_jsonl_file_path]
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )


if __name__ == '__main__':
    wet_file_name = "CC-MAIN-20210115134101-20210115164101-00005_5.warc.wet.gz"
    variables = {"workspace_dir": r"workspace", "worker_id": 0, "worker_num": 1}
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    ret = mcq_filter_layer(wet_file_name, variables=variables, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, OVERWRITE=True)
    print(ret)
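
# A memory-lighter sketch of the similarity check that localize_choice_exercise_by_LLM
# applies through LCS above: the longest-common-subsequence length divided by the
# length of the extracted text. The name lcs_ratio is hypothetical. It keeps only
# two rows of the dynamic-programming table, which may matter for long pages, and
# returns 0.0 for an empty extraction.

```python
def lcs_ratio(source, extracted):
    """Longest common subsequence length divided by len(extracted)."""
    if not extracted:
        return 0.0
    prev = [0] * (len(extracted) + 1)
    for ch in source:
        cur = [0]
        for j, other in enumerate(extracted, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return round(prev[-1] / len(extracted), 6)


if __name__ == "__main__":
    # an extraction copied verbatim from the source scores 1.0.
    print(lcs_ratio("Q1. Pick one: a) x b) y", "a) x b) y"))
```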


================================================
FILE: DomainSpecific/core/layers/transform/minhash_tokens_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import hashlib
import traceback
import numpy as np
from itertools import tee

MER = 2**61 - 1  # Mersenne prime used as the hash mask and modulus.
NUM_PERM = 256
SEED = 1

class MinHasher:
    def __init__(self):
        np.random.seed(1)
        self.gen = np.random.RandomState(SEED)
        self.a = self.gen.randint(1, MER, (NUM_PERM,), dtype='u8')
        self.b = self.gen.randint(0, MER, (NUM_PERM,), dtype='u8')

    def _sha1_hash(self, val):
        val = int.from_bytes(hashlib.sha1(val).digest()[:8], 'little')
        val &= MER
        return np.uint64(val)
    
    def hash(self, sequence):
        res = np.ones(NUM_PERM, dtype='u8') * MER
        for token in sequence:
            hash0 = self._sha1_hash(token.encode('utf8'))
            hash_vec = hash0 * self.a + self.b
            hash_vec %= MER
            res = np.minimum(res, hash_vec)
        return res

minhasher = MinHasher()

def minhash_tokens_layer(tokens, variables=dict()):
    ret = None
    try:
        minhash = minhasher.hash(tokens)
        ret = minhash
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == "__main__":
    tokens = {'产权 份额 为 土地 出让', '商品 住房 市场 价格 合理', '确定 , 在 售 房', ', 可 向 代 持', '住房 , 划 拨 土地', '增 购 政府 份额 的', '向社会公布 。 划 拨 土地', '为 商品 住房 , 划', '▁来源 : 中国 网 地产', '出让 土地 共有 产权 保障', '的 , 可 向 代', '售 房 阶段 向社会公布 。', '商品 住房 , 划 拨', '以及 累计 缴纳 社保 或', '性质 转 为 商品 住房', '的 非 市区 户籍 家庭', '购房 款 。 ▁在 使用', '地产 ▁ 杭州市 1 日', '《 杭州市 共有 产权 保障', '住房 享有 与 购买 商品', '类型 商品 住房 市场 价格', '的 申请 , 增 购', '价 按 同 地段 、', '款 。 ▁在 使用 管理', '可根据 支付 能力 在 50%', '按照 单 套 销售 价格', '方可 通过 买卖 等方式 上市', '年限 等相关 条件 。 ▁', '10 年后 , 方可 通过', '市场 价格 合理 优惠 后', '拨 土地 共有 产权 保障', '杭州市 共有 产权 保障 住房', '销售 基准 价 按 同', '能力 在 50% 至 80%', '等相关 条件 。 ▁ 办法', '年 的 , 可 向', '至 80% 范围内 选择 产权', '共有 产权 保障 住房 销售', '符合 限购 政策 前提 下', '购房 家庭 可根据 支付 能力', '提出 共有 产权 保障 住房', '住房 , 购房 家庭 可根据', '。 ▁在 使用 管理 方面', '-12- 03 ▁记者 : ▁来源', '保障 住房 面向 符合条件的 市区', '住房 以及 累计 缴纳 社保', '。 ▁ 办法 明确 ,', '购房 家庭 产权 份额 为', '社保 或 个 税 年限', '价 及其 浮动 幅度 确定', '非 市区 户籍 家庭 供应', '购房 款 。 出让 土地', ', 购房 家庭 可根据 支付', '单 套 销售 价格 对应的', '权利 性质 调整为 出让 。', '03 ▁记者 : ▁来源 :', '▁2021 -12- 03 ▁记者 :', '产权 保障 住房 面向 符合条件的', '日 对外 发布 《 杭州市', '就业 的 非 市区 户籍', '增 购 后 住房 性质', ', 购买 共有 产权 保障', '、 同 类型 商品 住房', '同等 的 公共服务 权益 。', '对应的 不同 比例 支付 购房', '的 公共服务 权益 。 ▁根据', '网 地产 ▁ 杭州市 1', '款 。 出让 土地 共有', '套 销售 价格 对应的 产权', '管理 方面 , 杭州 提出', '住房 , 购房 家庭 产权', '和 稳定 就业 的 非', '土地 权利 性质 调整为 出让', '浮动 幅度 确定 , 在', '不动产 权 证 满 10', '▁ 办法 明确 , 共有', '机构 提出 一次性 增 购', '》 , 其中 明确 ,', '权 证 满 10 年后', '在 50% 至 80% 范围内', '方面 , 杭州 提出 共有', '满 10 年后 , 方可', '基准 价 按 同 地段', '产权 份额 比例 , 按照', '保障 住房 管理办法 》 ,', '居住证 、 住房 以及 累计', '销售 价格 对应的 产权 比例', '住房 面向 符合条件的 市区 户籍', '。 ▁根据 办法 , 市区', '单 套 销售 价格 按照', '销售 基准 价 及其 浮动', ': 中国 网 地产 ▁', '持 机构 提出 一次性 增', '价格 按照 销售 基准 价', '家庭 供应 , 购买 共有', '购买 共有 产权 保障 住房', '稳定 就业 的 非 市区', '购买 商品 住房 同等 的', '其中 明确 , 共有 产权', '▁记者 : ▁来源 : 中国', '价格 对应的 不同 比例 支付', '与 购买 商品 住房 同等', '、 住房 等相关 条件 ,', '条件 。 ▁ 办法 明确', '证 满 5 年 的', '满 5 年 的 ,', '管理办法 》 , 其中 明确', '市区 户籍 家庭 需 满足', '份额 的 申请 , 增', '商品 住房 同等 的 公共服务', '支付 能力 在 50% 至', '权 证 满 5 年', '户籍 家庭 需 满足 居住证', ', 方可 通过 买卖 等方式', ', 在 售 房 阶段', '对应的 产权 比例 支付 购房', '产权 保障 住房 购房 家庭', '家庭 需 
满足 居住证 、', '杭州 提出 共有 产权 保障', '1 日 对外 发布 《', ', 其中 明确 , 共有', '满足 居住证 、 住房 以及', '选择 产权 份额 比例 ,', '同时 满足 户籍 、 住房', ', 市区 户籍 家庭 要在', '销售 价格 对应的 不同 比例', '个 税 年限 等相关 条件', '住房 市场 价格 合理 优惠', '产权 保障 住房 , 购房', '、 住房 以及 累计 缴纳', '产权 保障 住房 销售 基准', '后 住房 性质 转 为', '土地 出让 时 已 确定的', '比例 , 按照 单 套', '发布 《 杭州市 共有 产权', '住房 性质 转 为 商品', '累计 缴纳 社保 或 个', '份额 比例 , 按照 单', '时 已 确定的 份额 比例', '划 拨 土地 权利 性质', '基准 价 及其 浮动 幅度', '。 出让 土地 共有 产权', '为 土地 出让 时 已', ', 购房 家庭 产权 份额', '等相关 条件 , 非 市区', '按 同 地段 、 同', '按照 销售 基准 价 及其', '不同 比例 支付 购房 款', '住房 销售 基准 价 按', '家庭 产权 份额 为 土地', '可 向 代 持 机构', '▁在 使用 管理 方面 ,', '家庭 取得 不动产 权 证', '性质 调整为 出让 。 取得', '取得 不动产 权 证 满', '市区 户籍 家庭 要在 符合', ', 杭州 提出 共有 产权', '政策 前提 下 同时 满足', '▁根据 办法 , 市区 户籍', '办法 , 市区 户籍 家庭', '缴纳 社保 或 个 税', '。 划 拨 土地 共有', '家庭 可根据 支付 能力 在', '满足 户籍 、 住房 等相关', '一次性 增 购 政府 份额', '购 政府 份额 的 申请', '需 满足 居住证 、 住房', '同 地段 、 同 类型', '供应 , 购买 共有 产权', '使用 管理 方面 , 杭州', '保障 住房 享有 与 购买', '共有 产权 保障 住房 享有', '限购 政策 前提 下 同时', '套 销售 价格 按照 销售', '户籍 和 稳定 就业 的', '优惠 后 确定 。 单', '住房 管理办法 》 , 其中', '市区 户籍 和 稳定 就业', '支付 购房 款 。 ▁在', '户籍 家庭 供应 , 购买', '同 类型 商品 住房 市场', '保障 住房 购房 家庭 取得', '及其 浮动 幅度 确定 ,', '共有 产权 保障 住房 管理办法', '共有 产权 保障 住房 面向', '在 售 房 阶段 向社会公布', '共有 产权 保障 住房 ,', '政府 份额 的 申请 ,', '买卖 等方式 上市 交易 。', '市区 户籍 家庭 供应 ,', '出让 时 已 确定的 份额', '家庭 要在 符合 限购 政策', '申请 , 增 购 后', ', 非 市区 户籍 家庭', '前提 下 同时 满足 户籍', '划 拨 土地 共有 产权', ', 划 拨 土地 权利', '产权 保障 住房 管理办法 》', '阶段 向社会公布 。 划 拨', '明确 , 共有 产权 保障', '确定的 份额 比例 , 按照', '证 满 10 年后 ,', '通过 买卖 等方式 上市 交易', '已 确定的 份额 比例 ,', '不动产 权 证 满 5', '提出 一次性 增 购 政府', '对外 发布 《 杭州市 共有', '价格 合理 优惠 后 确定', '。 取得 不动产 权 证', '范围内 选择 产权 份额 比例', '房 阶段 向社会公布 。 划', '▁ 杭州市 1 日 对外', '份额 为 土地 出让 时', ', 增 购 后 住房', '地段 、 同 类型 商品', '杭州市 1 日 对外 发布', '户籍 家庭 要在 符合 限购', '保障 住房 销售 基准 价', '调整为 出让 。 取得 不动产', ', 共有 产权 保障 住房', '权益 。 ▁根据 办法 ,', '比例 支付 购房 款 。', '保障 住房 , 购房 家庭', '或 个 税 年限 等相关', '年后 , 方可 通过 买卖', '出让 。 取得 不动产 权', '价格 对应的 产权 比例 支付', '购 后 住房 性质 转', '确定 。 单 套 销售', '支付 购房 款 。 出让', '要在 符合 限购 政策 前提', '拨 土地 权利 性质 调整为', '转 为 商品 住房 ,', '享有 与 购买 商品 住房', '公共服务 权益 。 ▁根据 办法', '中国 网 地产 ▁ 杭州市', 
'5 年 的 , 可', '合理 优惠 后 确定 。', '办法 明确 , 共有 产权', '共有 产权 保障 住房 购房', '套 销售 价格 对应的 不同', '户籍 、 住房 等相关 条件', '下 同时 满足 户籍 、', '产权 保障 住房 享有 与', '面向 符合条件的 市区 户籍 和', '购房 家庭 取得 不动产 权', '条件 , 非 市区 户籍', '幅度 确定 , 在 售', ': ▁来源 : 中国 网', '代 持 机构 提出 一次性', '产权 比例 支付 购房 款', '80% 范围内 选择 产权 份额', '向 代 持 机构 提出', '住房 同等 的 公共服务 权益', '税 年限 等相关 条件 。', '土地 共有 产权 保障 住房', ', 按照 单 套 销售', '非 市区 户籍 家庭 需', '。 单 套 销售 价格', '符合条件的 市区 户籍 和 稳定', '住房 等相关 条件 , 非', '50% 至 80% 范围内 选择', '后 确定 。 单 套', '住房 购房 家庭 取得 不动产', '销售 价格 按照 销售 基准'}
    minhash = minhash_tokens_layer(tokens)
    print(minhash)


================================================
FILE: DomainSpecific/core/layers/transform/ngrams_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
from itertools import tee

NGRAM_SIZE = 5

def ngrams_layer(sequence, variables=dict()):
    ret = None
    try:
        # https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/utils/tokenization.py
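        # A sequence shorter than one n-gram yields a single joined token, so callers
        # always receive a set of strings (an empty sequence yields an empty set).
        if len(sequence) < NGRAM_SIZE:
            return {" ".join(sequence).strip()} if sequence else set()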
        iterables = tee(iter(sequence), NGRAM_SIZE)
        for i, sub_iterable in enumerate(iterables):
            for _ in range(i):
                next(sub_iterable, None)
        tokens = zip(*iterables)
        tokens = {" ".join(t).strip() for t in tokens}
        #tokens = list(tokens)
        ret = tokens
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == "__main__":
    tokens = ['▁2021', '-12-', '03', '▁记者', ':', '▁来源', ':', '中国', '网', '地产', '▁', '杭州市', '1', '日', '对外', '发布', '《', '杭州市', '共有', '产权', '保障', '住房', '管理办法', '》', ',', '其中', '明确', ',', '共有', '产权', '保障', '住房', '面向', '符合条件的', '市区', '户籍', '和', '稳定', '就业', '的', '非', '市区', '户籍', '家庭', '供应', ',', '购买', '共有', '产权', '保障', '住房', '享有', '与', '购买', '商品', '住房', '同等', '的', '公共服务', '权益', '。', '▁根据', '办法', ',', '市区', '户籍', '家庭', '要在', '符合', '限购', '政策', '前提', '下', '同时', '满足', '户籍', '、', '住房', '等相关', '条件', ',', '非', '市区', '户籍', '家庭', '需', '满足', '居住证', '、', '住房', '以及', '累计', '缴纳', '社保', '或', '个', '税', '年限', '等相关', '条件', '。', '▁', '办法', '明确', ',', '共有', '产权', '保障', '住房', '销售', '基准', '价', '按', '同', '地段', '、', '同', '类型', '商品', '住房', '市场', '价格', '合理', '优惠', '后', '确定', '。', '单', '套', '销售', '价格', '按照', '销售', '基准', '价', '及其', '浮动', '幅度', '确定', ',', '在', '售', '房', '阶段', '向社会公布', '。', '划', '拨', '土地', '共有', '产权', '保障', '住房', ',', '购房', '家庭', '可根据', '支付', '能力', '在', '50%', '至', '80%', '范围内', '选择', '产权', '份额', '比例', ',', '按照', '单', '套', '销售', '价格', '对应的', '不同', '比例', '支付', '购房', '款', '。', '出让', '土地', '共有', '产权', '保障', '住房', ',', '购房', '家庭', '产权', '份额', '为', '土地', '出让', '时', '已', '确定的', '份额', '比例', ',', '按照', '单', '套', '销售', '价格', '对应的', '产权', '比例', '支付', '购房', '款', '。', '▁在', '使用', '管理', '方面', ',', '杭州', '提出', '共有', '产权', '保障', '住房', '购房', '家庭', '取得', '不动产', '权', '证', '满', '5', '年', '的', ',', '可', '向', '代', '持', '机构', '提出', '一次性', '增', '购', '政府', '份额', '的', '申请', ',', '增', '购', '后', '住房', '性质', '转', '为', '商品', '住房', ',', '划', '拨', '土地', '权利', '性质', '调整为', '出让', '。', '取得', '不动产', '权', '证', '满', '10', '年后', ',', '方可', '通过', '买卖', '等方式', '上市', '交易', '。']
    tokens = ngrams_layer(tokens)
    print(tokens)


================================================
FILE: DomainSpecific/core/layers/transform/openquestion_filter_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import re
import gc
import requests
import fasttext
from gensim.utils import simple_preprocess
import pyarrow as pa
import pyarrow.parquet as pq
sys.path.append(".")
import util
import global_var

question_keywords = ("q&a", "q & a", "q:", "que:", "question:", "quiz:", "exam:", "examination:", "probe:", "request:", "challenge:", "test:", "query:", "survey:")
#question_keywords2 = ("what ", "where ", "why ", "when ", "who ", "whose ", "how ", "\?")
question_keywords2 = ("what", "where", "why", "when", "who", "whose", "how")
question_keywords += question_keywords2
question_keywords = set(map(lambda x: "[^a-zA-Z]" + x + "[^a-zA-Z]", question_keywords))
question_pattern = re.compile("|".join(question_keywords))

answer_keywords = ("q&a", "q & a", "a:", "ans:", "answer:", "solution:", "reply:", "response:", "result:", "outcome:", "explanation:", "conclusion:", "finding:", "assertion:", "statement:", "clarification:")
answer_keywords = set(map(lambda x: "[^a-zA-Z]" + x + "[^a-zA-Z]", answer_keywords))
answer_pattern = re.compile("|".join(answer_keywords))


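def is_openquestion_by_model(text, model, thred=0.5):
    # Returns True when the fastText model predicts any label other than "__label__0".
    # NOTE: `thred` and the predicted probability are currently not used in the decision.
    if model is None:
        return False
    if not isinstance(text, str) or len(text.strip()) == 0:
        return False
    try:
        x = " ".join(simple_preprocess(text))
        ret = model.predict(x)
        label, prob = ret[0][0], ret[1][0]
        return label != "__label__0"
    except Exception:
        traceback.print_exc()
        return False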

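def check_yes_no_question(text_before, text_after):
    # True if the answer text starts with yes/y/no/n as a whole word.
    text_after = text_after.lower().strip()
    keywords = ("yes", "y", "no", "n")
    for keyword in keywords:
        if text_after.startswith(keyword):
            rest = text_after[len(keyword):]
            # the keyword must end the text or be followed by a non-alphanumeric character.
            if not rest or not rest[0].isalnum():
                return True
    return False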

def check_multiple_choise_question(text_before, text_after):
    combo_keywords_list = [
        ("a.",   "b.",   "c.",   "d."),
        ("a)",   "b)",   "c)",   "d)"),
        ("\na ", "\nb ", "\nc ", "\nd "),
        (">a<",  ">b<",  ">c<",  ">d<"),

        ("1.",   "2.",   "3.",   "4."),
        ("1)",   "2)",   "3)",   "4)"),
        ("\n1 ", "\n2 ", "\n3 ", "\n4 "),
        (">1<",  ">2<",  ">3<",  ">4<"),

        ("i.",   "ii.",   "iii.",   "iv."),
        ("i)",   "ii)",   "iii)",   "iv)"),
        ("\ni ", "\nii ", "\niii ", "\niv "),
        (">i<",  ">ii<",  ">iii<",  ">iv<"),
    ]
    text_before = text_before.lower().strip()
    for combo_keywords in combo_keywords_list:
        t = 0
        for combo_keyword in combo_keywords:
            t = text_before.find(combo_keyword, t)
            if t == -1:
                break
        if t != -1:
            return True
        #if combo_keywords[0] in text_before and combo_keywords[1] in text_before and combo_keywords[2] in text_before:
        #    return True
    return False

def check_fill_in_question(text_before, text_after):
    text_before = text_before.lower().strip()
    if "___" in text_before or "()" in text_before or "..." in text_before:
        return True
    return False

def check_quality(item):
    text = item["text"]
    lines = text.split("\n")
    lens = list(map(lambda l: len(l.strip()), lines))
    max_len = max(lens)

    #if max_len > 1024:
    if max_len > 2048:
        return False
    if max_len <= 128:
        return False

    if len(lens) <= 3:
        return False
    if len(lens) > 256:
        return False

    if len(text) < 256:
        return False
    if len(text) > 1024 * 16:
        return False

    if 1.0 * text.count(" ") / len(text) > 0.33:
        return False

    if 1.0 * text.count("  ") / len(text) > 0.1:
        return False

    if 1.0 * text.count("\t") / len(text) > 0.1:
        return False

    if 1.0 * text.count(".") / len(text) > 0.1:
        return False

    if 1.0 * text.count("-") / len(text) > 0.1:
        return False

    if 1.0 * text.count("#") / len(text) > 0.1:
        return False

    if 1.0 * text.count("|") / len(text) > 0.1:
        return False

    if 1.0 * text.count(",") / len(text) > 0.1:
        return False

    sl_cnt = 1.0 * len(list(filter(lambda x: len(x.strip()) <= 32, lines))) / len(lines)
    if sl_cnt > 0.67:
        return False

    return True

def openquestion_filter_layer(pq_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", OVERWRITE=False):
    ret = list()
    try:
        in_pq_path = os.path.join(INPUT_FOLDER, pq_name)
        in_pq_path = util.to_real_path(in_pq_path, variables)
        out_pq_path = os.path.join(OUTPUT_FOLDER, pq_name)
        out_pq_path = util.to_real_path(out_pq_path, variables)

        if os.path.exists(in_pq_path) and (OVERWRITE or not os.path.exists(out_pq_path)):
            util.create_folder_by_file_path(out_pq_path)

            # read parquet file.
            try:
                table = pq.read_table(in_pq_path)
                records = table.to_pylist()
            except Exception:
                # without records there is nothing to filter; report no output.
                traceback.print_exc()
                return (ret, )
            
            # filter records containing open question.
            openquestion_records = list()
            for record_idx, record in enumerate(records):
                try:
                    text = record["text"]
                    text_low = text.lower()

                    if record["la"] != "en":
                        continue

                    #if item["la_prob"] < 0.65:
                    #    continue
                    #if text is None or len(text) < 64:
                    #    continue
                    #if text.count("\\u") >= 10:
                    #    continue

                    #if not check_quality(record):
                    #    continue

                    contain_question = len(question_pattern.findall(text_low)) >= 2
                    if not contain_question:
                        continue
                    
                    contain_answer = len(answer_pattern.findall(text_low)) >= 2
                    if not contain_answer:
                        continue

                    contain_openquestion = is_openquestion_by_model(text, global_var.ft_openquestion_model)
                    if not contain_openquestion:
                        continue

                    openquestion_records.append(record)
                except Exception:
                    traceback.print_exc()

            # write parquet file.
            try:
                openquestion_table = pa.Table.from_pylist(openquestion_records)
                pq.write_table(openquestion_table, out_pq_path)
                # report the output path only after it has been written successfully.
                ret = [out_pq_path]
            except Exception:
                traceback.print_exc()
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )


if __name__ == '__main__':
    snapshot = "CC-MAIN-2022-49"
    variables = {"workspace_dir": r"workspace", "worker_id": 0, "worker_num": 1}
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    ret = openquestion_filter_layer(snapshot, variables=variables, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER)
    print(ret)


================================================
FILE: DomainSpecific/core/layers/transform/tokenize_article_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import sentencepiece as spm


tokenizer = None

def tokenize_article_layer(article, variables=dict(), SPM_MODEL_PATH="./dependency/models/sentencepiece.bpe.model"):
    ret = None
    try:
        global tokenizer
        if tokenizer is None:
            tokenizer = spm.SentencePieceProcessor(SPM_MODEL_PATH)
        tokens = tokenizer.encode(article, out_type=str)
        ret = tokens
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return ret


if __name__ == "__main__":
    article = "2021-12-03 记者： 来源：中国网地产\n\n杭州市1日对外发布《杭州市共有产权保障住房管理办法》，其中明确，共有产权保障住房面向符合条件的市区户籍和稳定就业的非市区户籍家庭供应，购买共有产权保障住房享有与购买商品住房同等的公共服务权益。\n\n根据办法，市区户籍家庭要在符合限购政策前提下同时满足户籍、住房等相关条件，非市区户籍家庭需满足居住证、住房以及累计缴纳社保或个税年限等相关条件。\n\n办法明确，共有产权保障住房销售基准价按同地段、同类型商品住房市场价格合理优惠后确定。单套销售价格按照销售基准价及其浮动幅度确定，在售房阶段向社会公布。划拨土地共有产权保障住房，购房家庭可根据支付能力在50%至80%范围内选择产权份额比例，按照单套销售价格对应的不同比例支付购房款。出让土地共有产权保障住房，购房家庭产权份额为土地出让时已确定的份额比例，按照单套销售价格对应的产权比例支付购房款。\n\n在使用管理方面，杭州提出共有产权保障住房购房家庭取得不动产权证满5年的，可向代持机构提出一次性增购政府份额的申请，增购后住房性质转为商品住房，划拨土地权利性质调整为出让。取得不动产权证满10年后，方可通过买卖等方式上市交易。"
    tokens = tokenize_article_layer(article)
    print(tokens)


================================================
FILE: DomainSpecific/core/layers/transform/warc_encode_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
# coding=utf-8
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import re
import codecs
import logging
import traceback
import requests
from pathlib import Path
from urllib.parse import urlparse
from io import BytesIO
from warcio.limitreader import LimitReader
from warcio.warcwriter import WARCWriter
from warcio.archiveiterator import ArchiveIterator
import lxml.etree as ET
import lxml.html as HT
from py_asciimath.translator.translator import MathML2Tex
from pylatexenc.latexwalker import LatexWalker
from charset_normalizer import detect
import util

def tex_in_script_tag(text):
    return text.startswith('<script type="math/tex"') or \
           text.startswith("<script type='math/tex'") or \
           text.startswith('<script type="math/latex"') or \
           text.startswith("<script type='math/latex'") or \
           text.startswith('<script type="math/asciimath"') or \
           text.startswith("<script type='math/asciimath'") or \
           text.startswith('<span class="math-formula">') or \
           text.startswith("<span class='math-formula'>")

def tex_in_math_tag(text):
    return text.startswith("<annotation encoding='application/x-tex'>") or \
           text.startswith('<annotation encoding="application/x-tex">')

def tex_in_math_tag2(text):
    return text.startswith("<math") and "</annotation>" in text

def mathml_in_script_tag(text):
    return text.startswith('<script type="math/mml"') or \
           text.startswith("<script type='math/mml'")

def mathml_in_math_tag(text):
    return text.startswith("<math ") and 'xmlns="http://www.w3.org/1998/Math/MathML"' in text
    #return text.startswith('<math xmlns="http://www.w3.org/1998/Math/MathML"') or \
    #       text.startswith("<math xmlns='http://www.w3.org/1998/Math/MathML'")
    #return text.startswith("<math ")

def is_tex(text):
    return re.match(r"(\$\$.*?\$\$)", text) is not None

def contain_tex(text):
    return re.search(r"(\$\$.*?\$\$)", text) is not None

def check_latex(latex):
    try:
        w = LatexWalker(latex, tolerant_parsing=False)
        (nodelist, pos, len_) = w.get_latex_nodes(pos=0)
        return True
    except Exception:
        return False

def remove_hidden_content(html):
    text = html
    root = HT.document_fromstring(text)

    hidden_nodes = root.xpath('//*[@aria-hidden="true"]')
    for hidden_node in hidden_nodes:
        hidden_node.drop_tree()

    doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
    if html.strip().startswith(b'<!DOCTYPE'):
        index = html.find(b"<html")
        if index != -1:
            doctype = html[:index].strip()
    new_text = HT.tostring(root, method="html", doctype=doctype)
    new_html = new_text
    return new_html

def remove_attr(text, attr):
    index = text.find(attr)
    if index == -1:
        return text, False
    before = text[:index-1]
    text = text[index:]
    index = len(attr) + 1
    index = text.find(text[index:index+1], index+1) + 1
    after = text[index:]
    text = text[:index]
    text = before + after
    return text, True

def mathml_to_latex1(text):
    mml_dom = ET.fromstring(text)
    xslt = ET.parse("./dependency/xsltml_2.0/mmltex.xsl")
    transform = ET.XSLT(xslt)
    mmldom = transform(mml_dom)
    text = str(mmldom)
    return text

def mathml_to_latex2(text):
    symbol_mappings = {
        "&alpha;": "α",
        "&Alpha;": "A",
        "&beta;": "β",
        "&Beta;": "B",
        "&epsilon;": "ε",
        "&Epsilon;": "Ε",
        "&Mu;": "M",
        "&Nu;": "N",
        "&omicron;": "o",
        "&Omicron;": "O",
        "&iot;": "ι",
        "&conjugate0;": "&#x2015;",
    }
    for key1, key2 in symbol_mappings.items():
        text = text.replace(key1, key2)

    # add xml head.
    head = "<?xml version='1.0' encoding='UTF-8'?>\n" + \
           '<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/mathml2.dtd">'
    text = head + text

    # remove unrecognized attributes.
    attrs = ("fontstyle", "ignorefont", "mathcolor", "rtableid", "altimg-valign", "dspmath", "xmlns:md", "specific-use")
    for attr in attrs:
        find = True
        while find:
            text, find = remove_attr(text, attr)
    text = text.replace(' xmlns=""', '')

    # silence py_asciimath warnings during translation, then restore logging.
    logging.disable(logging.WARNING)
    try:
        mathml2tex = MathML2Tex()
        text = mathml2tex.translate(text, network=False, from_file=False,)
    finally:
        logging.disable(logging.NOTSET)
    return text

def separate_content_and_tag(html, start_str, end_str, s=0):
    index = html.find(start_str, s)
    before = html[:index]
    html = html[index:]
    index = html.find(end_str) + len(end_str)
    content = html[:index]
    after = html[index:]
    return content, before, after

def detect_code(text):
    keywords = (
        'if', 'else', 'for', 'while', 'def', 'class', 'include', 'switch', 'case', 
        'default', 'const', 'static', 'try', 'catch', 'exception', 'continue', 'open', 
        'close', 'import', 'var', 'None', 'null', 'true', 'True', 'false', 'False', 'print', 'return',
        'sudo', 'apt-get', 'wget',
        r'\+', '-', r'\*', '/', '=',
        #'//', '#', '/*', '*/',
    )
    patterns = [
        rf'\b(?:{"|".join(keywords)})\b', # keywords
        r'[{};]', # code indicators (curly braces, semicolon)
        r'\w+\s*\(.*\)', # function calls or declarations
        r'\w+\s*=\s*\w+', # variable assignments
    ]

    for pattern in patterns:
        if re.search(pattern, text):
            return True

    return False

def encode_code(node, code_tag, not_code_tag):
    # situation 1. <pre><code>
    # situation 2. <pre><span>
    # situation 3. <pre><code><span>
    # situation 4. <table><tbody>
    # situation 5. <table><tbody><pre>...

    if node.tag == "code":
        parent_node = node.getparent()
        parent_tag = parent_node.tag

        if parent_tag == "tbody":
            code_node = parent_node
        elif parent_tag == "pre":
            code_node = parent_node
            # the ancestor lookup below is optional and could be commented out.
            while parent_node is not None:
                parent_node = parent_node.getparent()
                if parent_node is not None and parent_node.tag == "tbody":
                    code_node = parent_node
                    break
        else:
            #code_node = node
            code_node = None

        if code_node is not None:
            text = code_node.text_content()

            # delete all attributes.
            code_node.attrib.clear()
            if detect_code(text):
                code_node.tag = code_tag# + "-" + lang
                return True
            else:
                #code_node.tag = not_code_tag# debug
                return False

    child_nodes = node.getchildren()
    contain = False
    for child_node in child_nodes:
        if encode_code(child_node, code_tag, not_code_tag):
            contain = True
    return contain

def filter_code(html, code_tag, not_code_tag):
    root = HT.document_fromstring(html)

    contain = encode_code(root, code_tag, not_code_tag)

    doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
    if html.strip().startswith(b'<!DOCTYPE'):
        index = html.find(b"<html")
        if index != -1:
            doctype = html[:index].strip()
    new_text = HT.tostring(root, method="html", doctype=doctype)
    new_html = new_text

    return new_html, contain

def encode_image(uri, node, image_tag):
    if node.tag == "img":
        node.tag = image_tag

        link = node.attrib.get("src")
        if link is not None:
            link = util.relative2absolute_path(uri, link)
        alt = node.attrib.get("alt")
        width = node.attrib.get("width")
        height = node.attrib.get("height")
        name = util.md5(link) + Path(urlparse(link).path).suffix if link is not None else None
        attrs = {"link": link, "alt": alt, "width": width, "height": height, "name": name}
        node.text = str(attrs)

        # delete all attributes.
        node.attrib.clear()
        return True

    child_nodes = node.getchildren()
    contain = False
    for child_node in child_nodes:
        if encode_image(uri, child_node, image_tag):
            contain = True
    return contain

def filter_image(uri, html, image_tag):
    root = HT.document_fromstring(html)

    contain = encode_image(uri, root, image_tag)

    doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
    if html.strip().startswith(b'<!DOCTYPE'):
        index = html.find(b"<html")
        if index != -1:
            doctype = html[:index].strip()
    new_text = HT.tostring(root, method="html", doctype=doctype)
    new_html = new_text

    return new_html, contain

def encode_video(uri, node, video_tag):
    if node.tag == "video":
        node.tag = video_tag

        link = node.attrib.get("src")
        if link is not None:
            link = util.relative2absolute_path(uri, link)
        alt = node.attrib.get("alt")
        width = node.attrib.get("width")
        height = node.attrib.get("height")
        name = util.md5(link) + Path(urlparse(link).path).suffix if link is not None else None
        attrs = {"link": link, "alt": alt, "width": width, "height": height, "name": name}
        node.text = str(attrs)

        # delete all attributes.
        node.attrib.clear()
        return True

    child_nodes = node.getchildren()
    contain = False
    for child_node in child_nodes:
        if encode_video(uri, child_node, video_tag):
            contain = True
    return contain

def filter_video(uri, html, video_tag):
    root = HT.document_fromstring(html)

    contain = encode_video(uri, root, video_tag)

    doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
    if html.strip().startswith(b'<!DOCTYPE'):
        index = html.find(b"<html")
        if index != -1:
            doctype = html[:index].strip()
    new_text = HT.tostring(root, method="html", doctype=doctype)
    new_html = new_text

    return new_html, contain

def encode_math_html(uri, html, encoding):
    encode_table = {
        b"<": b"[[[less]]]",
        b">": b"[[[large]]]",
    }

    tag_head_mathml  = b"[[[math-ml]]]"
    tag_tail_mathml  = b"[[[/math-ml]]]"
    tag_head_mathtex = b"[[[math-tex]]]"
    tag_tail_mathtex = b"[[[/math-tex]]]"
    
    start_end_strs = (
        (b"<maths", b"</maths>"),#1
        (b"<math>", b"</math>"),#2
        (b"<math ", b"</math>"),#2
        (b"<annotation encoding='application/x-tex'>", b"</annotation>"),
        (b'<annotation encoding="application/x-tex">', b'</annotation>'),
        (b"<span class='math-formula'>", b"</span>"),
        (b'<span class="math-formula">', b'</span>'),
        (b'<script type="math/mml"', b'</script>'),
        (b"<script type='math/mml'", b"</script>"),
        (b'<script type="math/tex"', b'</script>'),
        (b"<script type='math/tex'", b"</script>"),
        (b'<script type="math/latex"', b'</script>'),
        (b"<script type='math/latex'", b"</script>"),
        (b'<script type="math/asciimath"', b'</script>'),
        (b"<script type='math/asciimath'", b"</script>"),
    )

    sub_start_end_strs = (
        (b"<math", b"</math>"),#1
        (b"<annotation encoding='application/x-tex'>", b"</annotation>"),#2
        (b'<annotation encoding="application/x-tex">', b'</annotation>'),#2
    )

    assert tag_head_mathml not in html and tag_tail_mathml not in html
    assert tag_head_mathtex not in html and tag_tail_mathtex not in html

    contain_tag = False
    for (start_str, end_str) in start_end_strs:
        while start_str in html:
            content, before, after = separate_content_and_tag(html, start_str, end_str)

            if start_str[:5] == b"<math":
                for sub_start_str, sub_end_str in sub_start_end_strs:
                    if sub_start_str in content[len(start_str):-len(end_str)]:
                        content = content[len(start_str):-len(end_str)]
                        content, sub_before, sub_after = separate_content_and_tag(content, sub_start_str, sub_end_str)

            contain = True
            try:
                content_str = str(content, encoding)
            except:
                return html, False

            if contain and (tex_in_script_tag(content_str) or tex_in_math_tag(content_str)):
                try:
                    index1 = content.find(b">") + 1
                    index2 = content.rfind(b"<")
                    formula = content[index1:index2]
                    formula = formula.strip()
                    formula_str = str(formula, encoding)

                    if not check_latex(formula_str):
                        return html, False
                    for key1, key2 in encode_table.items():
                        formula = formula.replace(key1, key2)
                    content = b"<span>" + tag_head_mathtex + formula + tag_tail_mathtex + b"</span>"
                except:
                    contain = False
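            elif contain and (tex_in_math_tag2(content_str)):
                try:
                    index2 = content_str.find("</annotation>")
                    index1 = content_str[:index2].rfind("</mrow>") + len("</mrow>")
                    # content_str is already decoded, so slice it directly and
                    # re-encode afterwards for the byte-level replacements below.
                    formula_str = content_str[index1:index2].strip()
                    if formula_str.startswith("<annotation"):
                        # drop the opening <annotation ...> tag, keeping only the TeX source.
                        formula_str = formula_str[formula_str.find(">") + 1:].strip()
                    formula = bytes(formula_str, encoding)

                    if not check_latex(formula_str):
                        return html, False
                    for key1, key2 in encode_table.items():
                        formula = formula.replace(key1, key2)
                    content = b"<span>" + tag_head_mathtex + formula + tag_tail_mathtex + b"</span>"
                except:
                    contain = False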
            elif contain and (mathml_in_script_tag(content_str) or mathml_in_math_tag(content_str)):
                try:
                    # convert mathml to latex.
                    if "<semantics>" in content_str and "</semantics>" not in content_str:
                        content_str = content_str.replace("<semantics>", "")
                    try:
                        formula_str = mathml_to_latex1(content_str)
                    except:
                        formula_str = mathml_to_latex2(content_str)
                    formula = bytes(formula_str, encoding)
                    formula = formula.replace(codecs.BOM_UTF8, b"")
                    formula = formula.strip(b"$")
                    formula = formula.strip()
                    formula_str = str(formula, encoding)

                    if not check_latex(formula_str):
                        return html, False
                    for key1, key2 in encode_table.items():
                        formula = formula.replace(key1, key2)
                    content = b"<span>" + tag_head_mathml + formula + tag_tail_mathml + b"</span>"
                except:
                    contain = False
            else:
                contain = False

            if contain:
                html = before + content + after
                contain_tag = True
            else:
                html = before + after

    return html, contain_tag

def get_tag_info(tag):
    start_tag = f"<{tag}>".encode()
    end_tag = f"</{tag}>".encode()
    encode_start_tag = f"[[[{tag}]]]".encode()
    encode_end_tag = f"[[[/{tag}]]]".encode()
    tag = tag.encode()
    return tag, start_tag, end_tag, encode_start_tag, encode_end_tag

def encode_code_html(uri, html, encoding):
    code_tag_str = "code-encode"
    not_code_tag_str = "not-code-encode"
    code_tag, code_start_tag, code_end_tag, code_encode_start_tag, code_encode_end_tag = get_tag_info(code_tag_str)
    not_code_tag, not_code_start_tag, not_code_end_tag, not_code_encode_start_tag, not_code_encode_end_tag = get_tag_info(not_code_tag_str)
    assert code_start_tag not in html and code_end_tag not in html
    assert not_code_start_tag not in html and not_code_end_tag not in html

    try:
        html, contain = filter_code(html, code_tag_str, not_code_tag_str)

        if contain:
            html = html.replace(code_start_tag, b"<pre>" + b"\n" + code_encode_start_tag + b"\n")
            html = html.replace(code_end_tag, b"\n" + code_encode_end_tag + b"\n" + b"</pre>")

            #html = html.replace(not_code_start_tag, b"<pre>" + b"\n" + not_code_encode_start_tag + b"\n")# debug
            #html = html.replace(not_code_end_tag, b"\n" + not_code_encode_end_tag + b"\n" + b"</pre>")# debug
    except:
        contain = False

    return html, contain

def encode_image_html(uri, html, encoding):
    image_tag_str = "image-encode"
    image_tag, image_start_tag, image_end_tag, image_encode_start_tag, image_encode_end_tag = get_tag_info(image_tag_str)
    assert image_start_tag not in html and image_end_tag not in html

    try:
        html, contain = filter_image(uri, html, image_tag_str)

        if contain:
            #html = html.replace(image_start_tag, b"<pre>" + b"\n" + image_encode_start_tag + b"\n")
            #html = html.replace(image_end_tag, b"\n" + image_encode_end_tag + b"\n" + b"</pre>")
            html = html.replace(image_start_tag, b"<span>" + b"\n" + image_encode_start_tag + b"\n")
            html = html.replace(image_end_tag, b"\n" + image_encode_end_tag + b"\n" + b"</span>")
    except Exception:
        contain = False

    return html, contain

def encode_video_html(uri, html, encoding):
    video_tag_str = "video-encode"
    video_tag, video_start_tag, video_end_tag, video_encode_start_tag, video_encode_end_tag = get_tag_info(video_tag_str)
    assert video_start_tag not in html and video_end_tag not in html

    try:
        html, contain = filter_video(uri, html, video_tag_str)

        if contain:
            #html = html.replace(video_start_tag, b"<pre>" + b"\n" + video_encode_start_tag + b"\n")
            #html = html.replace(video_end_tag, b"\n" + video_encode_end_tag + b"\n" + b"</pre>")
            html = html.replace(video_start_tag, b"<span>" + b"\n" + video_encode_start_tag + b"\n")
            html = html.replace(video_end_tag, b"\n" + video_encode_end_tag + b"\n" + b"</span>")
    except Exception:
        contain = False

    return html, contain

def encode_html(uri, html, encoding, TAG):
    if html is None:
        return None, False

    if TAG == "math":
        html, contain_tag = encode_math_html(uri, html, encoding)
    elif TAG == "code":
        html, contain_tag = encode_code_html(uri, html, encoding)
    elif TAG == "image":
        html, contain_tag = encode_image_html(uri, html, encoding)
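    elif TAG == "video":
        html, contain_tag = encode_video_html(uri, html, encoding)
    else:
        # fail with a clear message instead of an UnboundLocalError on contain_tag.
        raise ValueError(f"unsupported TAG: {TAG!r}")
    return html, contain_tag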


def warc_encode_layer(warc_file_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", TAG=None, DEFAULT_ENCODING=None, OVERWRITE=False):
    ret = list()
    try:
        src_warc_file_path = os.path.join(INPUT_FOLDER, warc_file_name)
        src_warc_file_path = util.to_real_path(src_warc_file_path, variables)
        dst_warc_file_path = os.path.join(OUTPUT_FOLDER, warc_file_name)
        dst_warc_file_path = util.to_real_path(dst_warc_file_path, variables)

        if os.path.exists(src_warc_file_path) and (OVERWRITE or not os.path.exists(dst_warc_file_path)):
            util.create_folder_by_file_path(dst_warc_file_path)
            with open(dst_warc_file_path, "wb") as output:
                writer = WARCWriter(output, gzip=True)
                with open(src_warc_file_path, "rb") as input:
                    records = ArchiveIterator(input, arc2warc=True)
                    for id, record in enumerate(records):
                        if record.rec_type == "response" and record.http_headers.get_header("Content-Type", "").startswith("text/html"):
                            try:
                                uri = record.rec_headers["WARC-Target-URI"]

                                # read raw html.
                                html = record.content_stream().read()

                                # check html codec.
                                charset = record.http_headers["Content-Type"].split(";")[-1].split("=")
                                if charset[0].strip().lower() == "charset":
                                    encoding = charset[1]
                                else:
                                    index1 = html.find(b'<meta charset="')
                                    if index1 >= 0:
                                        index1 += len(b'<meta charset="')
                                        index2 = html.find(b'"', index1)
                                        encoding = str(html[index1:index2], encoding="ascii")
                                    else:
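                                        try:
                                            # silence detector warnings only while detecting.
                                            logging.disable(logging.WARNING)
                                            encoding = detect(html)["encoding"]
                                        except Exception:
                                            encoding = ""
                                        finally:
                                            # restore logging; logging.disable(NOTSET) undoes the call above.
                                            logging.disable(logging.NOTSET)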
                                if encoding is not None:
                                    encoding = encoding.strip().strip('"').lower()

                                if encoding in ("",):
                                    encoding = DEFAULT_ENCODING
                                
                                # remove hidden tag.
                                if encoding is not None and b'aria-hidden="true"' in html:
                                #if encoding is not None and (b'aria-hidden="true"' in html or b'aria-readonly="true"' in html):
                                    try:
                                        html = remove_hidden_content(html)
                                    except:
                                        encoding = DEFAULT_ENCODING

                                # encode html.
                                if encoding is not None:
                                    if TAG is not None:
                                        html, contain_tag = encode_html(uri, html, encoding, TAG)
                                    else:
                                        contain_tag_cnt = 0
                                        TAGS = ("math", "code", "image")# "video"
                                        for tag in TAGS:
                                            html, contain_tag = encode_html(uri, html, encoding, tag)
                                            if contain_tag:
                                                contain_tag_cnt += 1
                                        contain_tag = contain_tag_cnt > 0
                                else:
                                    html = None
                                    contain_tag = False

                                # write encoded html.
                                if contain_tag and html is not None:
                                    content = BytesIO(html)
                                    assert content.getbuffer().nbytes == len(html)
                                    raw_length = len(html)
                                    record.raw_stream = LimitReader(content, raw_length)

                                    record.rec_headers["Content-Length"] = None
                                    record.length = None

                                    writer.write_record(record)
                            except Exception:
                                traceback.print_exc()

            ret = [warc_file_name]
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )


if __name__ == "__main__":
    warc_file_name = "CC-MAIN-20221127073607-20221127103607-00007.warc.gz"
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    TAG = "math"
    output = warc_encode_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, TAG=TAG)
    print(output)


================================================
FILE: DomainSpecific/core/layers/transform/warc_filter_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import re
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.limitreader import LimitReader
from warcio.archiveiterator import ArchiveIterator
import util

def warc_filter_layer(warc_file_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", TAGS=(), OVERWRITE=False):
    ret = list()
    try:
        src_warc_file_path = os.path.join(INPUT_FOLDER, warc_file_name)
        src_warc_file_path = util.to_real_path(src_warc_file_path, variables)
        dst_warc_file_path = os.path.join(OUTPUT_FOLDER, warc_file_name)
        dst_warc_file_path = util.to_real_path(dst_warc_file_path, variables)
        TAGS = list(map(lambda tag: bytes(tag, "ascii"), TAGS))
        regex = re.compile(b'|'.join(TAGS))

        if os.path.exists(src_warc_file_path) and (OVERWRITE or not os.path.exists(dst_warc_file_path)):
            util.create_folder_by_file_path(dst_warc_file_path)
            with open(dst_warc_file_path, "wb") as output:
                writer = WARCWriter(output, gzip=True)
                with open(src_warc_file_path, "rb") as input:
                    reader = ArchiveIterator(input, arc2warc=True)
                    for i, record in enumerate(reader):
                        if record.rec_type == "response" and record.http_headers.get_header("Content-Type", "").startswith("text/html"):
                            try:
                                # read raw html.
                                html = record.content_stream().read()

                                # filter.
                                if regex.search(html):
                                    content = BytesIO(html)
                                    assert len(html) == record.payload_length
                                    record.raw_stream = LimitReader(content, record.payload_length)
                                    writer.write_record(record)
                            except Exception:
                                traceback.print_exc()
            
            ret = [warc_file_name]
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )
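
# Illustrative sketch (an assumption, not what warc_filter_layer does today):
# building the tag filter with re.escape so that every tag is matched as a
# literal byte string rather than as a regular expression. The helper name
# `_sketch_build_tag_regex` is hypothetical. Note that an empty tag list
# still yields a pattern that matches every document.
def _sketch_build_tag_regex(tags):
    import re
    patterns = [re.escape(tag.encode("ascii")) for tag in tags]
    return re.compile(b"|".join(patterns))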


if __name__ == "__main__":
    warc_file_name = "CC-MAIN-20221127073607-20221127103607-00007.warc.gz"
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    TAGS = (
        "<math",
        "MathJax",
    )
    output = warc_filter_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, TAGS=TAGS)
    print(output)


================================================
FILE: DomainSpecific/core/layers/transform/warc_to_wet_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import util

def warc_to_wet_layer(warc_file_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", OVERWRITE=False):
    ret = list()
    try:
        wet_file_name = warc_file_name.replace(".warc.gz", ".warc.wet.gz")
        wat_file_name = warc_file_name.replace(".warc.gz", ".warc.wat.gz")

        src_warc_file_path = os.path.join(INPUT_FOLDER, warc_file_name)
        src_warc_file_path = util.to_real_path(src_warc_file_path, variables)

        dst_wet_file_path = os.path.join(OUTPUT_FOLDER, wet_file_name)
        dst_wet_file_path = util.to_real_path(dst_wet_file_path, variables)

        if os.path.exists(src_warc_file_path) and (OVERWRITE or not os.path.exists(dst_wet_file_path)):
            util.create_folder_by_file_path(dst_wet_file_path)

            # export SPARK_USER=$USER
            java_package = "./dependency/ia-hadoop-tools-jar-with-dependencies.jar"
            commandline = f"sudo java -jar {java_package} WEATGenerator -strictMode -skipExisting batch-id-xyz {src_warc_file_path}"
            exit_status1 = os.system(commandline)
            assert exit_status1 == 0

            tmp_base_path = os.path.dirname(src_warc_file_path)
            tmp_wet_file_path = os.path.join(tmp_base_path, "..", "wet/", wet_file_name)
            tmp_wat_file_path = os.path.join(tmp_base_path, "..", "wat/", wat_file_name)
            exit_status2 = os.system(f"sudo cp -f {tmp_wet_file_path} {dst_wet_file_path}")
            assert exit_status2 == 0

            os.system(f"sudo rm {tmp_wet_file_path}")
            os.system(f"sudo rm {tmp_wat_file_path}")

            ret = [wet_file_name]
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )


if __name__ == "__main__":
    warc_file_name = "CC-MAIN-20221127073607-20221127103607-00007.warc.gz"
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    output = warc_to_wet_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER)
    print(output)


================================================
FILE: DomainSpecific/core/layers/transform/wet_decode_layer.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import re
from io import BytesIO
from warcio.limitreader import LimitReader
from warcio.warcwriter import WARCWriter
from warcio.archiveiterator import ArchiveIterator
from pylatexenc.latex2text import LatexNodes2Text
from guesslang import Guess
import util

def decode_tag(tag):
    return tag.replace(b"[[[", b"<").replace(b"]]]", b">")
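
# Illustrative sketch (not part of the original pipeline): how a "[[[tag]]]"
# marker pair maps back to angle-bracket tags. The helper name
# `_sketch_marker_round_trip` is hypothetical; it inlines the same two
# replacements as decode_tag so that it can be run on its own.
def _sketch_marker_round_trip(tag_name):
    # spell the bracket-style markers the way the encode step writes them.
    encoded_start = f"[[[{tag_name}]]]".encode()
    encoded_end = f"[[[/{tag_name}]]]".encode()
    # map "[[[" to "<" and "]]]" to ">" to recover the tag form.
    to_tag = lambda marker: marker.replace(b"[[[", b"<").replace(b"]]]", b">")
    return to_tag(encoded_start), to_tag(encoded_end)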

def latex2text(latex, encoding="utf-8"):
    latexNodes2Text = LatexNodes2Text()
    latex = str(latex, encoding)
    text = latexNodes2Text.latex_to_text(latex)
    text = bytes(text, encoding)
    return text

def separate_content_and_tag(html, start_str, end_str):
    index = html.find(start_str)
    before = html[:index]
    html = html[index:]
    index = html.find(end_str) + len(end_str)
    content = html[:index]
    after = html[index:]
    return content, before, after
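
# Illustrative sketch (an assumption, not the pipeline's behavior): a variant
# of the split above that reports a missing marker by returning None instead
# of slicing with the -1 that bytes.find returns. The helper name
# `_sketch_split_on_markers` is hypothetical.
def _sketch_split_on_markers(data, start_marker, end_marker):
    start = data.find(start_marker)
    if start < 0:
        return None
    end = data.find(end_marker, start + len(start_marker))
    if end < 0:
        return None
    end += len(end_marker)
    # content keeps both markers, as separate_content_and_tag does.
    return data[start:end], data[:start], data[end:]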

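def remove_number_and_merge_snippet(html, NumberThred = 7):
    """Strip line-number gutters from code blocks and merge adjacent snippets.

    A run of lines holding consecutive integers that starts at 1 is treated as
    a line-number gutter and removed once it counts past NumberThred.
    Returns b'' when <code-encode> tags are nested or closed before being opened.
    """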
    lines = html.split(b'\n')

    for interval in (1, 2, 3, 4):
        line_no_list = list()
        last_code_no = -1
        for line_no in range(0, len(lines), interval):
            try:
                code_no = int(lines[line_no].strip())
            except:
                code_no = -1
            if (last_code_no == -1 and code_no == 1) or last_code_no + 1 == code_no:
                last_code_no = code_no
                line_no_list.append(line_no)
            else:
                if last_code_no > NumberThred:
                    for hist_line_no in line_no_list:
                        lines[hist_line_no] = b''
                line_no_list = list()
                last_code_no = -1
        lines = list(filter(lambda line: len(line) > 0, lines))

    for i in range(2):
        line_no_list = list()
        last_code_no = -1
        for line_no in range(len(lines)):
            try:
                code_no = int(lines[line_no].strip())
            except:
                code_no = -1
            if (last_code_no == -1 and code_no == 1) or last_code_no + 1 == code_no:
                last_code_no = code_no
                line_no_list.append(line_no)
            elif code_no == 0 or code_no == 1:
                if last_code_no > NumberThred:
                    for hist_line_no in line_no_list:
                        lines[hist_line_no] = b''
                line_no_list = [line_no]
                last_code_no = code_no
        lines = list(filter(lambda line: len(line) > 0, lines))
    
    for line_no in range(len(lines)):
        if len(lines[line_no].strip()) == 0:
            lines[line_no] = b''
    lines = list(filter(lambda line: len(line) > 0, lines))

    # merge a single-line code snippet into the snippet that directly precedes it.
    #html = re.sub(b"</code-encode>\n<code-encode>\n", b"\n", html)
    code_head = b"<code-encode>"
    code_tail = b"</code-encode>"
    for line_no in range(max(0, len(lines)-3)):
        if code_tail in lines[line_no] and code_head in lines[line_no+1] and code_tail in lines[line_no+3]:
            lines[line_no] = b''
            lines[line_no+1] = b''
    lines = list(filter(lambda line: len(line) > 0, lines))

    # reject html whose code tags are nested or closed before being opened.
    cnt = 0
    for line in lines:
        if code_head in line:
            cnt += 1
        elif code_tail in line:
            cnt -= 1
        # error happens.
        if cnt != 0 and cnt != 1:
            return b''
    
    html = b'\n'.join(lines)
    return html

guess = None
def identify_code(text):
    global guess
    if guess is None:
        guess = Guess()
    try:
        #name = guess.language_name(text)
        name, prob = guess.probabilities(text)[0]
    except:
        name, prob = "unknown", 1.0
    return name, prob

def decode_html(uri, html, encoding, TAG):
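    if html is None:
        return None, False
    if TAG not in ("math", "code", "image", "video"):
        # fail with a clear message instead of a NameError on start_end below.
        raise ValueError(f"unsupported TAG: {TAG!r}")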

    if TAG == "math":
        decode_table = {
            b"[[[less]]]": b"<",
            b"[[[large]]]": b">",
        }

        tag_head_mathml = b"[[[math-ml]]]"
        tag_tail_mathml = b"[[[/math-ml]]]"
        tag_head_mathtex = b"[[[math-tex]]]"
        tag_tail_mathtex = b"[[[/math-tex]]]"

        start_end = (
            (tag_head_mathml, tag_tail_mathml),
            (tag_head_mathtex, tag_tail_mathtex),
        )

        for (start, end) in start_end:
            while start in html:
                content, before, after = separate_content_and_tag(html, start, end)
                formula = content[len(start): -len(end)]

                if len(formula.strip()) != 0:
                    # decode < and >.
                    for key1, key2 in decode_table.items():
                        formula = formula.replace(key1, key2)
                    
                    # decode math tag.
                    content = decode_tag(start) + formula + decode_tag(end)

                    # dedup math formula around context.
                    formula_ascii = latex2text(formula).strip()
                    n = len(formula_ascii)
                    if n > 0 and before.rstrip()[-n:] == formula_ascii:
                        before = before.rstrip()[:-n]
                    elif n > 0 and after.lstrip()[:n] == formula_ascii:
                        after = after.lstrip()[n:]
                    html = before + content + after
                else:
                    # remove empty formula.
                    html = before + after

    elif TAG == "code":
        tag_head_code = b"[[[code-encode]]]"
        tag_tail_code = b"[[[/code-encode]]]"
        #tag_head_notcode = b"[[[not-code-encode]]]"# debug
        #tag_tail_notcode = b"[[[/not-code-encode]]]"# debug

        start_end = (
            (tag_head_code, tag_tail_code),
            #(tag_head_notcode, tag_tail_notcode),# debug
        )

        for (start, end) in start_end:
            while start in html:
                content, before, after = separate_content_and_tag(html, start, end)
                code = content[len(start): -len(end)].strip()

                if len(code) != 0:
                    lang, prob = identify_code(code)
                    #lcnt = code.count(b"\n")
                    #meta_lang = bytes(f"<metadata lang={lang} prob={prob:.2f} lines={lcnt} />", encoding=encoding)
                    meta_lang = bytes(f"<metadata lang={lang} prob={prob:.2f} />", encoding=encoding)
                    decode_start = decode_tag(start)
                    decode_end = decode_tag(end)
                    #content = decode_start + b"\n" + code + b"\n" + decode_end
                    content = decode_start + meta_lang + b"\n" + code + b"\n" + decode_end
                    html = before + content + after
                else:
                    # remove empty code.
                    html = before + after

        # remove number of code block.
        html = remove_number_and_merge_snippet(html)

    elif TAG == "image":
        tag_head_image = b"[[[image-encode]]]"
        tag_tail_image = b"[[[/image-encode]]]"

        start_end = (
            (tag_head_image, tag_tail_image),
        )

        for (start, end) in start_end:
            while start in html:
                content, before, after = separate_content_and_tag(html, start, end)
                image_meta = content[len(start): -len(end)].strip()

                if len(image_meta) != 0:
                    decode_start = decode_tag(start)
                    decode_end = decode_tag(end)
                    content = decode_start + image_meta + decode_end
                    html = before + content + after
                else:
                    # empty image marker: discard the whole document.
                    html = before + after
                    return None, False

    elif TAG == "video":
        tag_head_video = b"[[[video-encode]]]"
        tag_tail_video = b"[[[/video-encode]]]"

        start_end = (
            (tag_head_video, tag_tail_video),
        )

        for (start, end) in start_end:
            while start in html:
                content, before, after = separate_content_and_tag(html, start, end)
                video_meta = content[len(start): -len(end)].strip()

                if len(video_meta) != 0:
                    decode_start = decode_tag(start)
                    decode_end = decode_tag(end)
                    content = decode_start + video_meta + decode_end
                    html = before + content + after
                else:
                    # empty video marker: discard the whole document.
                    html = before + after
                    return None, False

    # collapse consecutive empty lines.
    if html is not None and len(html) > 0:
        html = re.sub(b"(\n\r)+", b"\n", html)
        html = re.sub(b"(\r\n)+", b"\n", html)
        html = re.sub(b"\n+", b"\n", html)

    contain = False
    for (start, end) in start_end:
        decode_start = decode_tag(start)
        if decode_start in html:
            contain = True

    return html, contain

def wet_decode_layer(wet_file_name, variables=dict(), INPUT_FOLDER="./", OUTPUT_FOLDER="./", TAG=None, OVERWRITE=False):
    ret = list()
    try:
        BLACK_URLS = ("blame.php", "diff.php")
        # escape each entry so that "." matches a literal dot only.
        regex = re.compile('|'.join(map(re.escape, BLACK_URLS)))
        src_wet_file_path = os.path.join(INPUT_FOLDER, wet_file_name)
        src_wet_file_path = util.to_real_path(src_wet_file_path, variables)
        dst_wet_file_path = os.path.join(OUTPUT_FOLDER, wet_file_name)
        dst_wet_file_path = util.to_real_path(dst_wet_file_path, variables)

        if os.path.exists(src_wet_file_path) and (OVERWRITE or not os.path.exists(dst_wet_file_path)):
            util.create_folder_by_file_path(dst_wet_file_path)
            with open(dst_wet_file_path, "wb") as output:
                writer = WARCWriter(output, gzip=True)
                with open(src_wet_file_path, "rb") as input:
                    records = ArchiveIterator(input, arc2warc=False)
                    for id, record in enumerate(records):
                        #lang = record.rec_headers["WARC-Identified-Content-Language"]
                        #if lang != "en":
                        #    continue

                        if record.rec_type == "conversion":
                            try:
                                uri = record.rec_headers["WARC-Target-URI"]
                                if regex.search(uri):
                                    continue

                                # read raw html.
                                html = record.content_stream().read()
                                encoding = "utf-8"

                                # decode html.
                                if encoding is not None:
                                    if TAG is not None:
                                        html, contain_tag = decode_html(uri, html, encoding, TAG)
                                    else:
                                        contain_tag_cnt = 0
                                        TAGS = ("math", "code", "image")# "video"
                                        for tag in TAGS:
                                            html, contain_tag = decode_html(uri, html, encoding, tag)
                                            if contain_tag:
                                                contain_tag_cnt += 1
                                        contain_tag = contain_tag_cnt > 0
                                else:
                                    html = None
                                    contain_tag = False

                                # write decoded html.
                                if contain_tag and html is not None:
                                    content = BytesIO(html)
                                    assert content.getbuffer().nbytes == len(html)
                                    raw_length = len(html)
                                    record.raw_stream = LimitReader(content, raw_length)

                                    record.rec_headers["Content-Length"] = None
                                    record.length = None

                                    writer.write_record(record)
                            except Exception:
                                traceback.print_exc()
            #ret = [wet_file_name]
            ret = [dst_wet_file_path]
    except KeyboardInterrupt:
        sys.exit()
    except Exception as ex:
        traceback.print_exc()
    return (ret, )


if __name__ == "__main__":
    warc_file_name = "CC-MAIN-20221127073607-20221127103607-00007.warc.gz"
    INPUT_FOLDER = "$(input_data_folder)"
    OUTPUT_FOLDER = "$(output_data_folder)"
    TAG = "math"
    output = wet_decode_layer(warc_file_name, INPUT_FOLDER=INPUT_FOLDER, OUTPUT_FOLDER=OUTPUT_FOLDER, TAG=TAG)
    print(output)


================================================
FILE: DomainSpecific/core/layers/util.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import copy
import yaml
import hashlib
import logging
import datetime
import requests
from urllib.parse import urljoin
from azure.storage.blob import ContainerClient, BlobSasPermissions, generate_blob_sas
from azure.identity import DefaultAzureCredential
logging.getLogger("requests").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

def load_yaml(config_path):
    config = None
    if os.path.exists(config_path):
        with open(config_path, "r") as file:
            config = yaml.safe_load(file)
    return config

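def save_yaml(config, config_path):
    # a bare file name has an empty dirname, which means the current folder.
    folder_path = os.path.dirname(config_path)
    if len(folder_path) == 0 or os.path.exists(folder_path):
        with open(config_path, "w") as file:
            yaml.safe_dump(config, file)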

def str2bytes(data):
    data = bytes(data, "utf-8")
    return data

def md5(data):
    if isinstance(data, str):
        data = str2bytes(data)
    md5 = hashlib.md5(data).hexdigest()
    return md5

def sha256(data):
    if isinstance(data, str):
        data = str2bytes(data)
    sha256 = hashlib.sha256(data).hexdigest()
    return sha256

def suffix(path):
    suffix = os.path.splitext(path)[1]
    return suffix

def relative2absolute_path(prefix, link):
    # Root-relative path.
    if link.startswith("/"):
        link = urljoin(prefix, link)
    else:
        colon_count = link[:10].count(":")
        # Document-relative path.
        if link.startswith(".") or colon_count == 0:
            link = urljoin(prefix, link)
        # Absolute paths, such as `http://`, `https://`, `ftp://`, or 'file://'.
        else:
            link = link
    return link
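
# Illustrative sketch: the three link shapes handled above, resolved with
# urllib.parse.urljoin. The URLs are made-up examples and the helper name
# `_sketch_resolve_links` is hypothetical. urljoin returns an absolute link
# unchanged, which matches the pass-through branch above.
def _sketch_resolve_links():
    from urllib.parse import urljoin
    page = "https://example.com/docs/page.html"
    return {
        "root_relative": urljoin(page, "/img/a.png"),
        "document_relative": urljoin(page, "img/a.png"),
        "absolute": urljoin(page, "https://cdn.example.org/a.png"),
    }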

def create_folder_by_file_path(local_file_path):
    local_folder_path = os.path.dirname(local_file_path)
    if not os.path.exists(local_folder_path) and len(local_folder_path.strip()) != 0:
        try:
            os.makedirs(local_folder_path, exist_ok=True)
        except OSError:
            # best effort: a later open() on the file surfaces any real failure.
            pass

def to_real_path(path, variables):
    keys = ("workspace_dir", "worker_id", "worker_num")
    path = copy.copy(path)
    for name, value in variables.items():
        if name in keys:
            path = path.replace("{%s}" % name, str(value))
    return path
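
# Illustrative sketch of the placeholder expansion above. The path template
# and values used to try it out are made-up examples, and
# `_sketch_expand_placeholders` is a hypothetical name. Only the three
# whitelisted keys are substituted; any other "{name}" is left as-is.
def _sketch_expand_placeholders(path, variables):
    allowed = ("workspace_dir", "worker_id", "worker_num")
    for name in allowed:
        if name in variables:
            path = path.replace("{%s}" % name, str(variables[name]))
    return path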

def get_container_client(storage_config):
    if isinstance(storage_config, ContainerClient):
        return storage_config

    if isinstance(storage_config, str) and os.path.exists(storage_config):
        storage_config = load_yaml(storage_config)

    account_domain = "blob.core.windows.net"
    account_name = storage_config["azstorage"]["account-name"]
    #account_key = storage_config["azstorage"]["account-key"]
    container_name = storage_config["azstorage"]["container"]
    identity_id = storage_config["azstorage"]["appid"]
    credential = DefaultAzureCredential(managed_identity_client_id=identity_id)

    container_client = ContainerClient(
        account_url=f"https://{account_name}.{account_domain}/",
        container_name=container_name,
        credential=credential#account_key
    )

    return container_client

def get_blob_client(storage_config, blob_path):
    container_client = get_container_client(storage_config)
    blob_client = container_client.get_blob_client(blob_path)
    return blob_client

def exist_blob(container_client, blob_path):
    with container_client.get_blob_client(blob_path) as blob_client:
        blob_path_exists = blob_client.exists()
        return blob_path_exists

def get_blob_size(container_client, blob_path):
    with container_client.get_blob_client(blob_path) as blob_client:
        properties = blob_client.get_blob_properties()
        size = properties.size
        return size

def list_blob_dir(container_client, blob_path):
    names = list()
    for blob in container_client.walk_blobs(name_starts_with=blob_path):
        names.append(blob.name)
    return names

def create_blob_dir(container_client, blob_path):
    container_client.upload_blob(name=os.path.join(blob_path, "_"), data=b"", overwrite=True)

def upload_bytes_to_blob(storage_config, content, blob_path):
    with get_blob_client(storage_config, blob_path) as blob_client:
        blob_client.upload_blob(content, overwrite=True)
    return blob_path

def upload_file_to_blob(storage_config, local_path, blob_path):
    with open(local_path, "rb") as content:
        upload_bytes_to_blob(storage_config, content, blob_path)
    return blob_path

def upload_bytes_to_internet(content, blob_path):
    # TODO: to be implemented.
    return blob_path

def upload_file_to_internet(local_path, blob_path):
    # TODO: to be implemented.
    return blob_path

def download_bytes_from_blob(storage_config, blob_path):
    with get_blob_client(storage_config, blob_path) as blob_client:
        content = blob_client.download_blob().readall()
    return content

def download_file_from_blob(storage_config, blob_path, local_path):
    content = download_bytes_from_blob(storage_config, blob_path)
    create_folder_by_file_path(local_path)
    with open(local_path, "wb") as data:
        data.write(content)
    return local_path

def download_bytes_from_internet(url, timeout=3):
    try:
        resp = requests.get(url, allow_redirects=True, timeout=timeout)
        if resp.status_code == 200:
            content = resp.content
            return content
        else:
            return None
    except:
        return None

def download_file_from_internet(url, local_path):
    try:
        content = download_bytes_from_internet(url)
        if content is not None:
            create_folder_by_file_path(local_path)
            with open(local_path, "wb") as data:
                data.write(content)
            return local_path, len(content)
        else:
            return None, 0
    except:
        return None, 0


================================================
FILE: DomainSpecific/core/network.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
from core.layers import LayerType, util

class Network:
    def __init__(self):
        self.type = None
        self.input_names = list()
        self.output_names = list()
        self.datas = dict()
        self.layers = dict()

    def set_input_names(self, input_names):
        self.input_names = input_names

    def set_output_names(self, output_names):
        self.output_names = output_names

    def add_data(self, name, value):
        self.datas[name] = value

    def add_layer(self, name, value):
        self.layers[name] = value

    def next_layer(self, invisited_layer_names):
        for name in invisited_layer_names:
            layer = self.layers[name]
            input_names = layer.input_names
            if set(input_names) <= set(self.datas.keys()):
                input_values = [self.datas[input_name] for input_name in input_names]
                invisited_layer_names.remove(name)
                return layer, name, input_values
        return None
    
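    def __call__(self, inputs=None, worker_id=0, worker_num=1, variables=None):
        # Use None defaults: mutable default arguments would be shared across calls.
        inputs = list() if inputs is None else inputs
        variables = dict() if variables is None else variables
        outputs = list()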
        try:
            if len(inputs) == len(self.input_names):
                for name, value in zip(self.input_names, inputs):
                    self.add_data(name, value)
            
            invisited_layer_names = sorted(list(self.layers.keys()))
            while len(invisited_layer_names) > 0:
                item = self.next_layer(invisited_layer_names)
                if item is None:
                    raise Exception(f"Some layers are missing input data: {invisited_layer_names}")
                layer, layer_name, input_values = item
                print(f"{layer_name} - input: {layer.input_names}, output: {layer.output_names}", flush=True)

                output_values = layer(input_values, worker_id=worker_id, worker_num=worker_num, variables=variables)
                for name, value in zip(layer.output_names, output_values):
                    self.add_data(name, value)
            outputs = [self.datas[output_name] for output_name in self.output_names]
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
        return outputs

    """
    def spark(self, inputs, spark_session, spark_context, worker_num=1, variables=dict()):
        from pyspark import TaskContext, StorageLevel

        def merge(x, n):
            if n == 0:
                return []
            elif n == 1:
                return [x]
            elif n == 2:
                return list(x)
            else:
                for _ in range(n - 2):
                    x = x[0] + x[1:]
                return list(x)
        
        def func(layer, input, worker_id, worker_num, variables):
            input = list(input)
            assert len(input) == 1
            input = input[0]
            output = layer(input, worker_id=worker_id, worker_num=worker_num, variables=variables)
            return [output]
        
        outputs = list()
        try:
            if len(inputs) == len(self.input_names):
                for name, value in zip(self.input_names, inputs):
                    self.add_data(name, value)
            
            for name, data in self.datas.items():
                input_rdd = spark_context.parallelize(worker_num * [data], worker_num)
                # Avoid recomputation, because each rdd may be used multiple times.
                input_rdd.persist(StorageLevel.MEMORY_AND_DISK)
                self.add_data(name, input_rdd)
            
            invisited_layer_names = sorted(list(self.layers.keys()))
            while len(invisited_layer_names) > 0:
                item = self.next_layer(invisited_layer_names)
                if item is None:
                    raise Exception("There are some layers which misses input data.")
                layer, layer_name, input_values = item

                input_rdds = None
                for i, input_rdd in enumerate(input_values):
                    input_rdds = input_rdd if i == 0 else input_rdds.zip(input_rdd)
                input_rdds = input_rdds.map(lambda x: merge(x, len(layer.input_names)))

                native_io = True
                if native_io:
                    output_rdds = input_rdds.mapPartitionsWithIndex(
                        lambda worker_id, input: 
                        func(layer, input, worker_id, worker_num, variables), preservesPartitioning=True
                    )
                else:# (Deprecated)
                    #if layer.type in (LayerType.To_Line_File, LayerType.To_Jsonl_File, LayerType.To_Parquet_File):
                    if layer.type == LayerType.To_Line_File:
                        inputs = input_rdds.collect()
                        outputs = list()
                        for worker_id, input in enumerate(inputs):
                            variables["worker_id"] = worker_id
                            variables["worker_num"] = worker_num
                            assert len(input) == 2
                            file_path = util.to_real_path(input[1], variables)
                            
                            spark_context.parallelize(input[0], 1).saveAsTextFile(file_path)
                            #rdd = spark_context.parallelize(input[0], 1)
                            #rdd.toDF().write.mode("overwrite").text(file_path)
                            #rdd.toDF().write.mode("overwrite").json(file_path)
                            #rdd.toDF().write.mode("overwrite").parquet(file_path)
                            
                            output = [file_path]
                            outputs.append(output)
                        output_rdds = spark_context.parallelize(outputs, worker_num)
                    #elif layer.type in (LayerType.From_Line_File, LayerType.From_Jsonl_File, LayerType.From_Parquet_File):
                    elif layer.type == LayerType.From_Line_File:
                        inputs = input_rdds.collect()
                        outputs = list()
                        for worker_id, input in enumerate(inputs):
                            variables["worker_id"] = worker_id
                            variables["worker_num"] = worker_num
                            assert len(input) == 1
                            file_path = util.to_real_path(input[0], variables)
                            
                            lines = spark_context.textFile(file_path).collect()
                            #rdd = spark_session.read.option("mode", "DROPMALFORMED").text(file_path).rdd
                            #rdd = spark_session.read.option("mode", "DROPMALFORMED").json(file_path).rdd
                            #rdd = spark_session.read.option("mode", "DROPMALFORMED").parquet(file_path).rdd
                            #lines = rdd.collect()
                            
                            output = [lines]
                            outputs.append(output)
                        output_rdds = spark_context.parallelize(outputs, worker_num)
                    else:
                        output_rdds = input_rdds.mapPartitionsWithIndex(
                            lambda worker_id, input: 
                            func(layer, input, worker_id, worker_num, variables), preservesPartitioning=True
                        )

                # Avoid recomputation, because each rdd may be used multiple times.
                output_rdds.persist(StorageLevel.MEMORY_AND_DISK)
                for i, name in enumerate(layer.output_names):
                    output_rdd = output_rdds.map(lambda _:_[i])
                    # Avoid recomputation, because each rdd may be used multiple times.
                    output_rdd.persist(StorageLevel.MEMORY_AND_DISK)
                    self.add_data(name, output_rdd)

                print(f"{layer_name} - {layer.input_names}, {layer.output_names}", flush=True)
            outputs = [self.datas[output_name].collect() for output_name in self.output_names]
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
        return outputs
    """


if __name__ == "__main__":
    network = Network()
    print(network)
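

# Illustrative sketch (an addition, not used by Network): the scheduling rule
# that next_layer() and __call__() implement together, written as a pure
# function so it can be checked without real layers. The function name and
# the plain-dict arguments are assumptions made for this example.
def sketch_execution_order(layer_inputs, layer_outputs, available):
    """Return layer names in the order Network.__call__ would run them.

    layer_inputs / layer_outputs map a layer name to its list of data names;
    available is the collection of data names present before any layer runs.
    """
    available = set(available)
    pending = sorted(layer_inputs)
    order = []
    while pending:
        # Pick the first pending layer (in sorted order) whose inputs all exist.
        ready = next((n for n in pending if set(layer_inputs[n]) <= available), None)
        if ready is None:
            raise ValueError(f"Layers with missing input data: {pending}")
        pending.remove(ready)
        order.append(ready)
        available.update(layer_outputs[ready])
    return order


# Example: "a_write" sorts before "b_filter" but needs its output, so
# sketch_execution_order(
#     {"a_write": ["kept"], "b_filter": ["raw"]},
#     {"a_write": [], "b_filter": ["kept"]},
#     ["raw"],
# ) returns ["b_filter", "a_write"].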


================================================
FILE: DomainSpecific/dependency/gpt_api.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
import time
import traceback
import tiktoken
import collections
from datetime import datetime
import openai
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


class GPTAPI:
    def __init__(self, engine, endpoint, identity_id):
        """
        For setup details, see: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/managed-identity
        Supported authentication methods include key-based, CLI-based, and identity-based authentication.
        This class uses identity-based authentication; switch to another method if needed.
        """
        self.keep_history = False
        self.user_QAs = collections.defaultdict(list)
        self.max_tokens_per_requests = 8192 - 800 - 192
        self.quato_tokens_per_minute = 120000#140000
        self.quato_requests_per_minute = 720#840
        self.last_minute = -1
        self.acc_tokens = 0
        self.acc_requests = 0

        try:
            self.enc = tiktoken.encoding_for_model("gpt-4")
        except Exception:
            self.enc = None
        self.engine = engine
        self.endpoint = endpoint

        token_provider = get_bearer_token_provider(DefaultAzureCredential(managed_identity_client_id=identity_id), "https://cognitiveservices.azure.com/.default")
        self.client = AzureOpenAI(
            azure_endpoint=endpoint,
            azure_ad_token_provider=token_provider,
            #api_version="2024-02-15-preview",
            api_version="2024-08-01-preview",
            max_retries=0,
        )

    def switch_api(self, api_idx=-1):
        # TBD: not implemented yet. 
        pass

    def get_tokens(self, text):
        tokens = max(len(text.split()), len(text) // 4)
        return tokens

    def run(self, system, question, engine=None, uid=None, temperature=0.0, max_tokens=800):
        if engine is None:
            engine = self.engine
        
        if self.enc is None:
            return ""

        # question check.
        #if self.get_tokens(question) > self.max_tokens_per_requests:
        #    question = question[:self.max_tokens_per_requests * 4]
        tokens = self.enc.encode(question)
        tokens_len = len(tokens)
        if tokens_len > self.max_tokens_per_requests:
            offset = (tokens_len - self.max_tokens_per_requests) // 2
            cut_tokens = tokens[offset:offset+self.max_tokens_per_requests]
            question = self.enc.decode(cut_tokens)

        # system setting.
        messages = [{"role": "system", "content": system}]
        
        # chat setting.
        if self.keep_history:
            for Q, A in self.user_QAs[uid]:
                messages.append({"role": "user", "content": Q})
                messages.append({"role": "assistant", "content": A})
        messages.append({"role": "user", "content": question})

        # quota check (currently disabled).
        """
        while True:
            cur_minute = datetime.now().minute
            cur_tokens = self.get_tokens(str(messages))
            if self.last_minute != cur_minute:
                self.last_minute = cur_minute
                self.acc_tokens = 0
                self.acc_requests = 0
            if self.acc_requests + 1  < self.quato_requests_per_minute and self.acc_tokens + cur_tokens < self.quato_tokens_per_minute:
                self.acc_requests += 1
                self.acc_tokens += cur_tokens
                break
            time.sleep(1)
        """

        # robot running.
        response = None  # stays None if the request itself raises.
        try:
            response = self.client.chat.completions.create(
                model=engine,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                #top_p=0.95,
                #frequency_penalty=0,
                #presence_penalty=0,
                #stop=None
            )
            answer = response.choices[0].message.content
        # https://github.com/openai/openai-python/blob/main/openai/error.py
        except (openai.RateLimitError, openai.APITimeoutError, openai.APIConnectionError) as e:
            time.sleep(2)
            #seconds = int(str(e).split("Please retry after")[1].split("second")[0].strip())
            #time.sleep(seconds)
            #traceback.print_exc()
            self.switch_api()
            # retry with the caller's max_tokens instead of silently falling back to the default.
            return self.run(system, question, engine, uid, temperature, max_tokens)
        except openai.BadRequestError as e:
            answer = ""
            if e.code == "context_length_exceeded":
                # trim one eighth from each end and retry; skip the retry if the question is too short to trim.
                offset = len(question) // 8
                if offset > 0:
                    try:
                        return self.run(system, question[offset:-offset], engine, uid, temperature, max_tokens)
                    except Exception:
                        traceback.print_exc()
            elif e.code != "content_filter":
                traceback.print_exc()
        except Exception:
            answer = ""
            content_filtered = response is not None and response.choices[0].finish_reason == "content_filter"
            if not content_filtered:
                traceback.print_exc()
        
        # update history chat.
        if self.keep_history:
            self.user_QAs[uid].append((question, answer))
            while len(self.user_QAs[uid]) > 10:
                self.user_QAs[uid].pop(0)
        
        return answer
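

# Illustrative sketch (an addition, not called by GPTAPI): the centered
# truncation that run() applies to over-long questions, shown on a plain list
# so it can be checked without tiktoken. The function name is an assumption.
def sketch_center_truncate(tokens, limit):
    """Keep the middle `limit` items of `tokens`, dropping both ends evenly."""
    if len(tokens) <= limit:
        return tokens
    offset = (len(tokens) - limit) // 2
    return tokens[offset:offset + limit]


# Example: sketch_center_truncate(list(range(10)), 4) returns [3, 4, 5, 6].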

if __name__ == "__main__":
    engine = "gpt-4"
    endpoint = "https://XXX.openai.azure.com/"# to be filled.
    identity_id = ""# to be filled.
    gpt_api = GPTAPI(engine, endpoint, identity_id)
    system = "You are my assistant"
    question = "give me a latex math formula"
    answer = gpt_api.run(system=system, question=question)
    print(answer)


================================================
FILE: DomainSpecific/dependency/ia-hadoop-tools-jar-with-dependencies.jar
================================================
[File too large to display: 57.3 MB]

================================================
FILE: DomainSpecific/dependency/install.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/../wrapper/utility")
import time
import argparse
from load_yaml import load_yaml
from save_yaml import save_yaml
from azure_env import get_local_rank, get_world_rank

ENV_READY = "env_ready"
OS_VERSION = "ubuntu/18.04"# ubuntu/18.04, ubuntu/20.04, ubuntu/22.04

def install(local_id, storage_path):
    local_id = get_local_rank() if get_local_rank() is not None else local_id
    if local_id == 0:
        if os.path.exists(ENV_READY):
            return

        # install python dependencies.
        os.system(f"pip install --upgrade pip")
        os.system(f"pip install -r dependency/requirements.txt")
        os.system(f"pip install guesslang==2.2.1 --no-deps")# don't change the version.

        # install others.
        os.system(f"sudo wget https://packages.microsoft.com/config/{OS_VERSION}/packages-microsoft-prod.deb")
        os.system(f"sudo dpkg -i packages-microsoft-prod.deb")
        os.system(f"sudo apt-get -y update")
        os.system(f"sudo apt-get -y install axel")# for fast file download.

        os.system(f"sudo apt update")
        os.system(f"sudo apt -y install git")
        os.system(f"sudo apt -y install git-lfs")
        os.system(f"sudo apt -y install maven")
        os.system(f"sudo apt -y install openjdk-11-jdk")# for Java-based third-party libraries.
        os.system(f"ulimit -n 65536")# note: runs in a child shell, so it does not raise the limit for this process.

        # mount folder: default mount the storage.
        storage_config = load_yaml(storage_path)
        if storage_config.get("mount", True):
            # install blobfuse libraries
            os.system(f"sudo apt-get -y install libcurl3-gnutls")
            os.system(f"sudo apt-get -y install blobfuse")
            os.system(f"sudo apt-get -y install libfuse2")
            os.system(f"sudo apt-get -y install blobfuse2")

            # create folder to be mounted
            workspace_dir = storage_config["workspace_dir"]
            filecache_dir = storage_config["file_cache"]["path"]

            try:
                os.system(f"sudo umount -l {workspace_dir}")# debug
                #os.system("ps -ef | grep blobfuse | grep -v grep | awk -F ' ' '{print $2}' | xargs sudo kill -9")# debug
            except:
                pass
            
            os.system(f"sudo mkdir -p {workspace_dir}")
            os.system(f"sudo chown $(whoami) {workspace_dir}")

            if os.path.exists(filecache_dir):
                try:
                    os.system(f"sudo rm -rf {filecache_dir}")# debug
                except:
                    pass
            
            os.system(f"sudo mkdir -p {filecache_dir}")
            os.system(f"sudo chown $(whoami) {filecache_dir}")

            os.system(f"sudo blobfuse2 mount {workspace_dir} --config-file={storage_path}")
            print("Azure storage account mount command issued.")
        else:
            print("Azure storage account not mounted (mount disabled in storage config).")

        # create env tag
        os.system(f"sudo rm -rf packages-microsoft-prod.deb")
        os.system(f"sudo touch {ENV_READY}")
    else:
        # other local workers wait until local rank 0 creates the ready marker.
        mounting = True
        while mounting:
            mounting = not os.path.exists(ENV_READY)
            time.sleep(1)
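

# Illustrative sketch (an addition, not called by install()): the same
# marker-file barrier as the else-branch above, with a timeout so that waiting
# workers do not block forever if local rank 0 fails before creating the
# marker. The function name and the timeout parameter are assumptions.
import os    # repeated so the sketch is self-contained
import time


def sketch_wait_for_marker(marker_path, poll_seconds=1.0, timeout_seconds=3600.0):
    """Poll until marker_path exists; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(marker_path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True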


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Install dependencies of Data Network.")
    parser.add_argument('--local_id', type=int, default=0, help="The id of local worker.")
    parser.add_argument('--storage_path', type=str, default="./resources/storage/llmstore.yaml", help="The path of storage config file.")
    args = parser.parse_args()
    install(args.local_id, args.storage_path)


================================================
FILE: DomainSpecific/dependency/requirements.txt
================================================
lxml==5.1.0
#fasttext==0.9.2
fasttext-wheel==0.9.2
sentencepiece==0.1.99
trafilatura==1.6.1
html5lib==1.1
newspaper3k==0.2.8
beautifulsoup4==4.12.2
warcio==1.7.4
markdownify==0.11.6
#cchardet==2.1.7
numpy==1.24.4
scipy==1.10.1
requests==2.32.2
pyarrow==14.0.1
jsonlines==3.1.0
#networkx==3.1
matplotlib==3.7.2
pyyaml==6.0
psutil==5.9.5
tqdm==4.66.3
py_asciimath==0.3.0
pylatexenc==2.10
charset-normalizer==3.2.0
tensorflow==2.12.1
#guesslang==2.2.1
#typing_extensions==4.12.0
faiss-cpu==1.7.4
#torch==2.0.1
#fairscale==0.4.13
sentence_transformers==2.2.2
#PyMuPDF==1.23.6
tiktoken==0.5.2
gensim==4.3.2
openai==1.30.2
boto3==1.34.100
datasets==2.16.0
azure-ai-ml==1.16.0
azure-batch==14.2.0
azure-identity==1.16.1
azure-storage-blob==12.19.1
azure.keyvault.secrets==4.8.0


================================================
FILE: DomainSpecific/dependency/xsltml_2.0/cmarkup.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
		version='1.0'>
                
<!-- ====================================================================== -->
<!-- $id: tokens.xsl, 2002/22/11 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<!-- 4.4.1.1 cn -->
<xsl:template match="m:cn"><xsl:apply-templates/></xsl:template>

<xsl:template match="m:cn[@type='complex-cartesian']">
	<xsl:apply-templates select="text()[1]"/>
  	<xsl:text>+</xsl:text>
	<xsl:apply-templates select="text()[2]"/>
	<xsl:text>i</xsl:text>
</xsl:template>

<xsl:template match="m:cn[@type='rational']">
	<xsl:apply-templates select="text()[1]"/>
	<xsl:text>/</xsl:text>
	<xsl:apply-templates select="text()[2]"/>
</xsl:template>

<xsl:template match="m:cn[@type='integer' and @base!=10]">
		<xsl:apply-templates/>
		<xsl:text>_{</xsl:text><xsl:value-of select="@base"/><xsl:text>}</xsl:text>
</xsl:template>

<xsl:template match="m:cn[@type='complex-polar']">
	<xsl:apply-templates select="text()[1]"/>
	<xsl:text>e^{i </xsl:text>
	<xsl:apply-templates select="text()[2]"/>
	<xsl:text>}</xsl:text>
</xsl:template>

<xsl:template match="m:cn[@type='e-notation']">
    <xsl:apply-templates select="text()[1]"/>
    <xsl:text>E</xsl:text>
    <xsl:apply-templates select="text()[2]"/>
</xsl:template>

<!-- 4.4.1.2 ci 4.4.1.3 csymbol -->
<xsl:template match="m:ci | m:csymbol">
	<xsl:choose>
		<xsl:when test="string-length(normalize-space(text()))>1">
			<xsl:text>\mathrm{</xsl:text><xsl:apply-templates/><xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise><xsl:apply-templates/></xsl:otherwise>
	</xsl:choose>
</xsl:template>

<!-- 4.4.2.1 apply 4.4.2.2 reln -->
<xsl:template match="m:apply | m:reln">
	<xsl:apply-templates select="*[1]">
	<!-- <? -->
		<xsl:with-param name="p" select="10"/>
	</xsl:apply-templates>
	<!-- ?> -->
 	<xsl:text>(</xsl:text>
	<xsl:for-each select="*[position()>1]">
		<xsl:apply-templates select="."/>
		<xsl:if test="not(position()=last())"><xsl:text>, </xsl:text></xsl:if>
	</xsl:for-each>
 	<xsl:text>)</xsl:text>
</xsl:template>

<!-- 4.4.2.3 fn -->
<xsl:template match="m:fn[m:apply[1]]"> <!-- for m:fn using default rule -->
	<xsl:text>(</xsl:text><xsl:apply-templates/><xsl:text>)</xsl:text>
</xsl:template>

<!-- 4.4.2.4 interval -->
<xsl:template match="m:interval[*[2]]">
	<xsl:choose>
		<xsl:when test="@closure='open' or @closure='open-closed'">
			<xsl:text>\left(</xsl:text>		
		</xsl:when>
		<xsl:otherwise><xsl:text>\left[</xsl:text></xsl:otherwise> 
	</xsl:choose>
	<xsl:apply-templates select="*[1]"/>
	<xsl:text> , </xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:choose>
		<xsl:when test="@closure='open' or @closure='closed-open'">
			<xsl:text>\right)</xsl:text>		
		</xsl:when>
		<xsl:otherwise><xsl:text>\right]</xsl:text></xsl:otherwise> 
	</xsl:choose>
</xsl:template>

<xsl:template match="m:interval">
	<xsl:text>\left\{</xsl:text><xsl:apply-templates/><xsl:text>\right\}</xsl:text>
</xsl:template>

<!-- 4.4.2.5 inverse -->
<xsl:template match="m:apply[*[1][self::m:inverse]]">
	<xsl:apply-templates select="*[2]"/><xsl:text>^{(-1)}</xsl:text>
</xsl:template>

<!-- 4.4.2.6 sep 4.4.2.7 condition -->
<xsl:template match="m:sep | m:condition"><xsl:apply-templates/></xsl:template>

<!-- 4.4.2.9 lambda -->
<xsl:template match="m:lambda">
	<xsl:text>\mathrm{lambda}\: </xsl:text>
  	<xsl:apply-templates select="m:bvar/*"/>
  	<xsl:text>.\: </xsl:text>
  <xsl:apply-templates select="*[last()]"/>
</xsl:template>

<!-- 4.4.2.10 compose -->
<xsl:template match="m:apply[*[1][self::m:compose]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\circ </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.2.11 ident -->
<xsl:template match="m:ident"><xsl:text>\mathrm{id}</xsl:text></xsl:template>

<!-- 4.4.2.12 domain 4.4.2.13 codomain 4.4.2.14 image 4.4.3.21 arg 4.4.3.24 lcm
		4.4.5.9 grad 4.4.5.10 curl 4.4.9.4 median 4.4.9.5 mode-->
<xsl:template match="m:domain | m:codomain | m:image | m:arg | m:lcm | m:grad |
								 m:curl | m:median | m:mode">
	<xsl:text>\mathop{\mathrm{</xsl:text>
	<xsl:value-of select="local-name()"/>
	<xsl:text>}}</xsl:text>
</xsl:template>

<!-- 4.4.2.15 domainofapplication -->
<xsl:template match="m:domainofapplication"/>

<!-- 4.4.2.16 piecewise -->
<xsl:template match="m:piecewise">
	<xsl:text>\begin{cases}</xsl:text>
	<xsl:apply-templates select="m:piece"/>
	<xsl:apply-templates select="m:otherwise"/>
	<xsl:text>\end{cases}</xsl:text>
</xsl:template>

<xsl:template match="m:piece">
		<xsl:apply-templates select="*[1]"/>
		<xsl:text> &amp; \text{if $</xsl:text>
		<xsl:apply-templates select="*[2]"/>
		<xsl:text>$}</xsl:text>
		<xsl:if test="not(position()=last()) or ../m:otherwise"><xsl:text>\\ </xsl:text></xsl:if>
</xsl:template>

<xsl:template match="m:otherwise">
	<xsl:apply-templates select="*[1]"/>
	<xsl:text> &amp; \text{otherwise}</xsl:text>
</xsl:template>

<!-- 4.4.3.1 quotient -->
<xsl:template match="m:apply[*[1][self::m:quotient]]">
	<xsl:text>\left\lfloor\frac{</xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>}{</xsl:text>
	<xsl:apply-templates select="*[3]"/>
	<xsl:text>}\right\rfloor </xsl:text>
</xsl:template>

<!-- 4.4.3.2 factorial -->
<xsl:template match="m:apply[*[1][self::m:factorial]]">
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
	<xsl:text>!</xsl:text>
</xsl:template>

<!-- 4.4.3.3 divide -->
<xsl:template match="m:apply[*[1][self::m:divide]]">
	<xsl:param name="p" select="0"/>
  <xsl:param name="this-p" select="3"/>
  <xsl:if test="$this-p &lt; $p"><xsl:text>\left(</xsl:text></xsl:if>
  <xsl:text>\frac{</xsl:text>
	<xsl:apply-templates select="*[2]"/>
<!--		<xsl:with-param name="p" select="$this-p"/>
	</xsl:apply-templates>-->
	<xsl:text>}{</xsl:text>
	<xsl:apply-templates select="*[3]"/>
<!--    	<xsl:with-param name="p" select="$this-p"/>
	</xsl:apply-templates>-->
	<xsl:text>}</xsl:text>
	<xsl:if test="$this-p &lt; $p"><xsl:text>\right)</xsl:text></xsl:if>
</xsl:template>

<!-- 4.4.3.4 max min -->
<xsl:template match="m:apply[*[1][self::m:max or self::m:min]]">
	<xsl:text>\</xsl:text>
	<xsl:value-of select="local-name(*[1])"/>
	<xsl:text>\{</xsl:text>
   <xsl:choose>
		<xsl:when test="m:condition">
   		<xsl:apply-templates select="*[last()]"/>
   		<xsl:text>, </xsl:text>
			<xsl:apply-templates select="m:condition/node()"/>
		</xsl:when>
		<xsl:otherwise>
			<xsl:for-each select="*[position() &gt; 1]">
				<xsl:apply-templates select="."/>
				<xsl:if test="position() !=last()"><xsl:text> , </xsl:text></xsl:if>
			</xsl:for-each>
		</xsl:otherwise>
   </xsl:choose>
	<xsl:text>\}</xsl:text>
</xsl:template>

<!-- 4.4.3.5  minus-->
<xsl:template match="m:apply[*[1][self::m:minus] and count(*)=2]">
	<xsl:text>-</xsl:text>
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="5"/>
	</xsl:apply-templates>
</xsl:template>

<xsl:template match="m:apply[*[1][self::m:minus] and count(*)&gt;2]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="mo">-</xsl:with-param>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="this-p" select="2"/>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.3.6  plus-->
<xsl:template match="m:apply[*[1][self::m:plus]]">
  <xsl:param name="p" select="0"/>
  <xsl:if test="$p &gt; 2">
		<xsl:text>(</xsl:text>
	</xsl:if>
  <xsl:for-each select="*[position()&gt;1]">
   <xsl:if test="position() &gt; 1">
    <xsl:choose>
      <xsl:when test="self::m:apply[*[1][self::m:times] and
      *[2][self::m:apply/*[1][self::m:minus] or self::m:cn[not(m:sep) and
      (number(.) &lt; 0)]]]">-</xsl:when>
      <xsl:otherwise>+</xsl:otherwise>
    </xsl:choose>
   </xsl:if>   
    <xsl:choose>
      <xsl:when test="self::m:apply[*[1][self::m:times] and
      *[2][self::m:cn[not(m:sep) and (number(.) &lt;0)]]]">
			<xsl:value-of select="-(*[2])"/>
			<xsl:apply-templates select=".">
		     <xsl:with-param name="first" select="2"/>
		     <xsl:with-param name="p" select="2"/>
		   </xsl:apply-templates>
       </xsl:when>
      <xsl:when test="self::m:apply[*[1][self::m:times] and
      *[2][self::m:apply/*[1][self::m:minus]]]">
				<xsl:apply-templates select="./*[2]/*[2]"/>
				<xsl:apply-templates select=".">
					<xsl:with-param name="first" select="2"/>
					<xsl:with-param name="p" select="2"/>
				</xsl:apply-templates>
			</xsl:when>
			<xsl:otherwise>
				<xsl:apply-templates select=".">
					<xsl:with-param name="p" select="2"/>
				</xsl:apply-templates>
			</xsl:otherwise>
		</xsl:choose>
	</xsl:for-each>
	<xsl:if test="$p &gt; 2">
		<xsl:text>)</xsl:text>
	</xsl:if>
</xsl:template>

<!-- 4.4.3.7 power -->
<xsl:template match="m:apply[*[1][self::m:power]]">
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="5"/>
	</xsl:apply-templates>
	<xsl:text>^{</xsl:text>
	<xsl:apply-templates select="*[3]">
		<xsl:with-param name="p" select="5"/>
	</xsl:apply-templates>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.3.8 remainder -->
<xsl:template match="m:apply[*[1][self::m:rem]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="mo">\mod </xsl:with-param>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="this-p" select="3"/>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.3.9  times-->
<xsl:template match="m:apply[*[1][self::m:times]]" name="times">
  <xsl:param name="p" select="0"/>
  <xsl:param name="first" select="1"/>
  <xsl:if test="$p &gt; 3"><xsl:text>(</xsl:text></xsl:if>
  <xsl:for-each select="*[position()&gt;1]">
		<xsl:if test="position() &gt; 1">
			<xsl:choose>
				<xsl:when test="self::m:cn">\times <!-- times --></xsl:when>
				<xsl:otherwise><!--invisible times--></xsl:otherwise>
			</xsl:choose>
		</xsl:if> 
		<xsl:if test="position()&gt;= $first">
			<xsl:apply-templates select=".">
				<xsl:with-param name="p" select="3"/>
			</xsl:apply-templates>
		</xsl:if>
	</xsl:for-each>
  <xsl:if test="$p &gt; 3"><xsl:text>)</xsl:text></xsl:if>
</xsl:template>

<!-- 4.4.3.10 root -->
<xsl:template match="m:apply[*[1][self::m:root]]">
	<xsl:text>\sqrt</xsl:text>
	<xsl:if test="m:degree!=2">
		<xsl:text>[</xsl:text>
		<xsl:apply-templates select="m:degree/*"/>
		<xsl:text>]</xsl:text>
	</xsl:if>
	<xsl:text>{</xsl:text>
	<xsl:apply-templates select="*[position()&gt;1 and not(self::m:degree)]"/>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.3.11 gcd -->
<xsl:template match="m:gcd"><xsl:text>\gcd </xsl:text></xsl:template>

<!-- 4.4.3.12 and -->
<xsl:template match="m:apply[*[1][self::m:and]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\land <!-- and --></xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.3.13 or -->
<xsl:template match="m:apply[*[1][self::m:or]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="3"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\lor </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.3.14 xor -->
<xsl:template match="m:apply[*[1][self::m:xor]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="3"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\mathop{\mathrm{xor}}</xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.3.15 not -->
<xsl:template match="m:apply[*[1][self::m:not]]">
	<xsl:text>\neg </xsl:text>
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
</xsl:template>

<!-- 4.4.3.16 implies -->
<xsl:template match="m:apply[*[1][self::m:implies]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="mo">\implies </xsl:with-param>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="this-p" select="3"/>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.3.17 forall 4.4.3.18 exists -->
<xsl:template match="m:apply[*[1][self::m:forall or self::m:exists]]">
	<xsl:text>\</xsl:text>
	<xsl:value-of select="local-name(*[1])"/>
	<xsl:text> </xsl:text>
	<xsl:apply-templates select="m:bvar"/>
	<xsl:if test="m:condition">
		<xsl:text>, </xsl:text><xsl:apply-templates select="m:condition"/>
	</xsl:if>
	<xsl:if test="*[last()][local-name()!='condition'][local-name()!='bvar']">
		<xsl:text>\colon </xsl:text>
	  <xsl:apply-templates select="*[last()]"/>
  </xsl:if>
</xsl:template>

<!-- 4.4.3.19 abs -->
<xsl:template match="m:apply[*[1][self::m:abs]]">
	<xsl:text>\left|</xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>\right|</xsl:text>
</xsl:template>

<!-- 4.4.3.20 conjugate -->
<xsl:template match="m:apply[*[1][self::m:conjugate]]">
	<xsl:text>\overline{</xsl:text><xsl:apply-templates select="*[2]"/><xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.3.22 real -->
<xsl:template match="m:real"><xsl:text>\Re </xsl:text></xsl:template>

<!-- 4.4.3.23 imaginary -->
<xsl:template match="m:imaginary"><xsl:text>\Im </xsl:text></xsl:template>

<!-- 4.4.3.25 floor -->
<xsl:template match="m:apply[*[1][self::m:floor]]">
	<xsl:text>\lfloor </xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>\rfloor </xsl:text>
</xsl:template>

<!-- 4.4.3.26 ceiling -->
<xsl:template match="m:apply[*[1][self::m:ceiling]]">
	<xsl:text>\lceil </xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>\rceil </xsl:text>
</xsl:template>

<!-- 4.4.4.1 eq -->
<xsl:template match="m:apply[*[1][self::m:eq]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">=</xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.2 neq -->
<xsl:template match="m:apply[*[1][self::m:neq]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\neq </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.3 gt -->
<xsl:template match="m:apply[*[1][self::m:gt]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">&gt; </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.4 lt -->
<xsl:template match="m:apply[*[1][self::m:lt]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">&lt; </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.5 geq -->
<xsl:template match="m:apply[*[1][self::m:geq]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\ge </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.6 leq -->
<xsl:template match="m:apply[*[1][self::m:leq]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\le </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.7 equivalent -->
<xsl:template match="m:apply[*[1][self::m:equivalent]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\equiv </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.8 approx -->
<xsl:template match="m:apply[*[1][self::m:approx]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="1"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\approx </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.4.9 factorof -->
<xsl:template match="m:apply[*[1][self::m:factorof]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="mo"> | </xsl:with-param>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="this-p" select="3"/>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.5.1 int -->
<xsl:template match="m:apply[*[1][self::m:int]]">
	<xsl:text>\int</xsl:text>
	<xsl:if test="m:lowlimit/*|m:interval/*[1]|m:condition/*">
		<xsl:text>_{</xsl:text>
		<xsl:apply-templates select="m:lowlimit/*|m:interval/*[1]|m:condition/*"/>
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:if test="m:uplimit/*|m:interval/*[2]">
		<xsl:text>^{</xsl:text>
		<xsl:apply-templates select="m:uplimit/*|m:interval/*[2]"/>
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:text> </xsl:text>
	<xsl:apply-templates select="*[last()]"/>
	<xsl:text>\,d </xsl:text>
	<xsl:apply-templates select="m:bvar"/>
</xsl:template>

<!-- 4.4.5.2 diff -->
<xsl:template match="m:apply[*[1][self::m:diff] and m:ci and count(*)=2]" priority="2">
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>^\prime </xsl:text>
</xsl:template>

<xsl:template match="m:apply[*[1][self::m:diff]]" priority="1">
	<xsl:text>\frac{</xsl:text>
	<xsl:choose>
		<xsl:when test="m:bvar/m:degree">
			<xsl:text>d^{</xsl:text>
			<xsl:apply-templates select="m:bvar/m:degree/node()"/>
			<xsl:text>}</xsl:text>
			<xsl:apply-templates select="*[last()]"/>
			<xsl:text>}{d</xsl:text>
			<xsl:apply-templates select="m:bvar/node()"/>
			<xsl:text>^{</xsl:text>
			<xsl:apply-templates select="m:bvar/m:degree/node()"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:text>d </xsl:text>
			<xsl:apply-templates select="*[last()]"/>
			<xsl:text>}{d </xsl:text>
			<xsl:apply-templates select="m:bvar"/>
		</xsl:otherwise>
	</xsl:choose>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.5.3 partialdiff -->
<xsl:template match="m:apply[*[1][self::m:partialdiff] and m:list and m:ci and count(*)=3]" priority="2">
	<xsl:text>D_{</xsl:text>
	<xsl:for-each select="m:list[1]/*">
		<xsl:apply-templates select="."/>
		<xsl:if test="position()&lt;last()"><xsl:text>, </xsl:text></xsl:if>
	</xsl:for-each>
	<xsl:text>}</xsl:text>
	<xsl:apply-templates select="*[3]"/>
</xsl:template>

<xsl:template match="m:apply[*[1][self::m:partialdiff]]" priority="1">
	<xsl:text>\frac{\partial^{</xsl:text>
	<xsl:choose>
		<xsl:when test="m:degree">
			<xsl:apply-templates select="m:degree/node()"/>
		</xsl:when>
		<xsl:when test="m:bvar/m:degree[string(number(.))='NaN']">
			<xsl:for-each select="m:bvar/m:degree">
				<xsl:apply-templates select="node()"/>
				<xsl:if test="position()&lt;last()"><xsl:text>+</xsl:text></xsl:if>
			</xsl:for-each>
			<xsl:if test="count(m:bvar[not(m:degree)])&gt;0">
				<xsl:text>+</xsl:text>
				<xsl:value-of select="count(m:bvar[not(m:degree)])"/>
			</xsl:if>
		</xsl:when>
		<xsl:otherwise>
			<xsl:value-of select="sum(m:bvar/m:degree)+count(m:bvar[not(m:degree)])"/>
		</xsl:otherwise>
	</xsl:choose>
	<xsl:text>}</xsl:text>
	<xsl:apply-templates select="*[last()]"/>
	<xsl:text>}{</xsl:text>
	<xsl:for-each select="m:bvar">
		<xsl:text>\partial </xsl:text>
		<xsl:apply-templates select="node()"/>
		<xsl:if test="m:degree">
			<xsl:text>^{</xsl:text>
			<xsl:apply-templates select="m:degree/node()"/>
			<xsl:text>}</xsl:text>
		</xsl:if>
	</xsl:for-each>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.2.8 declare 4.4.5.4 lowlimit 4.4.5.5 uplimit 4.4.5.7 degree 4.4.9.7 momentabout -->
<xsl:template match="m:declare | m:lowlimit | m:uplimit | m:degree | m:momentabout"/>

<!-- 4.4.5.6  bvar-->
<xsl:template match="m:bvar">
	<xsl:apply-templates/>
	<xsl:if test="following-sibling::m:bvar"><xsl:text>, </xsl:text></xsl:if>
</xsl:template>

<!-- 4.4.5.8 divergence-->
<xsl:template match="m:divergence"><xsl:text>\mathop{\mathrm{div}}</xsl:text></xsl:template>

<!-- 4.4.5.11 laplacian-->
<xsl:template match="m:laplacian"><xsl:text>\nabla^2 </xsl:text></xsl:template>

<!-- 4.4.6.1 set -->
<xsl:template match="m:set">
	<xsl:text>\{</xsl:text><xsl:call-template name="set"/><xsl:text>\}</xsl:text>
</xsl:template>

<!-- 4.4.6.2 list -->
<xsl:template match="m:list">
	<xsl:text>\left[</xsl:text><xsl:call-template name="set"/><xsl:text>\right]</xsl:text>
</xsl:template>

<xsl:template name="set">
	<xsl:choose>
		<xsl:when test="m:condition">
			<xsl:apply-templates select="m:bvar/*[not(self::m:bvar or self::m:condition)]"/>
			<xsl:text>\colon </xsl:text>
			<xsl:apply-templates select="m:condition/node()"/>
		</xsl:when>
		<xsl:otherwise>
			<xsl:for-each select="*">
				<xsl:apply-templates select="."/>
				<xsl:if test="position()!=last()"><xsl:text>, </xsl:text></xsl:if>
			</xsl:for-each>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<!-- 4.4.6.3 union -->
<xsl:template match="m:apply[*[1][self::m:union]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\cup </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.4 intersect -->
<xsl:template match="m:apply[*[1][self::m:intersect]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="3"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\cap </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.5 in -->
<xsl:template match="m:apply[*[1][self::m:in]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="mo">\in </xsl:with-param>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="this-p" select="3"/>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.6 notin -->
<xsl:template match="m:apply[*[1][self::m:notin]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="mo">\notin </xsl:with-param>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="this-p" select="3"/>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.7 subset -->
<xsl:template match="m:apply[*[1][self::m:subset]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\subseteq </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.8 prsubset -->
<xsl:template match="m:apply[*[1][self::m:prsubset]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\subset </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.9 notsubset -->
<xsl:template match="m:apply[*[1][self::m:notsubset]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\nsubseteq </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.10 notprsubset -->
<xsl:template match="m:apply[*[1][self::m:notprsubset]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\not\subset </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.11 setdiff -->
<xsl:template match="m:apply[*[1][self::m:setdiff]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\setminus </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.6.12 card -->
<xsl:template match="m:apply[*[1][self::m:card]]">
	<xsl:text>|</xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>|</xsl:text>
</xsl:template>

<!-- 4.4.6.13 cartesianproduct 4.4.10.7 vectorproduct -->
<xsl:template match="m:apply[*[1][self::m:cartesianproduct or self::m:vectorproduct]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\times </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<xsl:template
match="m:apply[*[1][self::m:cartesianproduct][count(following-sibling::m:reals)=count(following-sibling::*)]]"
priority="2">
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="5"/>
	</xsl:apply-templates>
	<xsl:text>^{</xsl:text>
	<xsl:value-of select="count(*)-1"/>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.7.1 sum -->
<xsl:template match="m:apply[*[1][self::m:sum]]">
	<xsl:text>\sum</xsl:text><xsl:call-template name="series"/>
</xsl:template>

<!-- 4.4.7.2 product -->
<xsl:template match="m:apply[*[1][self::m:product]]">
	<xsl:text>\prod</xsl:text><xsl:call-template name="series"/>
</xsl:template>
	
<xsl:template name="series">
	<xsl:if test="m:lowlimit/*|m:interval/*[1]|m:condition/*">
		<xsl:text>_{</xsl:text>
		<xsl:if test="not(m:condition)">
			<xsl:apply-templates select="m:bvar"/>
			<xsl:text>=</xsl:text>
		</xsl:if>
		<xsl:apply-templates select="m:lowlimit/*|m:interval/*[1]|m:condition/*"/>
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:if test="m:uplimit/*|m:interval/*[2]">
		<xsl:text>^{</xsl:text>
		<xsl:apply-templates select="m:uplimit/*|m:interval/*[2]"/>
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:text> </xsl:text>
	<xsl:apply-templates select="*[last()]"/>
</xsl:template>

<!-- 4.4.7.3 limit -->
<xsl:template match="m:apply[*[1][self::m:limit]]">
	<xsl:text>\lim_{</xsl:text>
	<xsl:apply-templates select="m:lowlimit|m:condition/*"/>
	<xsl:text>}</xsl:text>
	<xsl:apply-templates select="*[last()]"/>
</xsl:template>

<xsl:template match="m:apply[m:limit]/m:lowlimit" priority="3">
	<xsl:apply-templates select="../m:bvar/node()"/>
	<xsl:text>\to </xsl:text>
	<xsl:apply-templates/>
</xsl:template>

<!-- 4.4.7.4 tendsto -->
<xsl:template match="m:apply[*[1][self::m:tendsto]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="binary">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">
			<xsl:choose>
				<xsl:when test="@type='above'">\searrow </xsl:when>
				<xsl:when test="@type='below'">\nearrow </xsl:when>
				<xsl:when test="@type='two-sided'">\rightarrow </xsl:when>
				<xsl:otherwise>\to </xsl:otherwise>
			</xsl:choose>
		</xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.8.1 common trigonometric functions 4.4.8.3 natural logarithm -->
<xsl:template match="m:apply[*[1][
 self::m:sin or 		self::m:cos or 	self::m:tan or		self::m:sec or
 self::m:csc or 		self::m:cot or 	self::m:sinh or	 	self::m:cosh or
 self::m:tanh or 		self::m:coth or	self::m:arcsin or 	self::m:arccos or
 self::m:arctan or 	self::m:ln]]">
	<xsl:text>\</xsl:text>
	<xsl:value-of select="local-name(*[1])"/>
	<xsl:text> </xsl:text>
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
</xsl:template>

<xsl:template match="m:sin | m:cos | m:tan | m:sec | m:csc |
								 m:cot | m:sinh | m:cosh | m:tanh | m:coth |
								 m:arcsin | m:arccos | m:arctan | m:ln">
	<xsl:text>\</xsl:text>
	<xsl:value-of select="local-name(.)"/>
	<xsl:text> </xsl:text>
</xsl:template>

<xsl:template match="m:apply[*[1][
 self::m:sech or 		self::m:csch or		self::m:arccosh or
 self::m:arccot or 	self::m:arccoth or 	self::m:arccsc or
 self::m:arccsch or self::m:arcsec or 	self::m:arcsech or
 self::m:arcsinh or self::m:arctanh]]">
	<xsl:text>\mathrm{</xsl:text>
	<xsl:value-of select="local-name(*[1])"/>
	<xsl:text>\,}</xsl:text>
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
</xsl:template>

<xsl:template match="m:sech | m:csch | m:arccosh | m:arccot |
								 m:arccoth | m:arccsc |m:arccsch |m:arcsec |
								 m:arcsech | m:arcsinh | m:arctanh">
	<xsl:text>\mathrm{</xsl:text>
	<xsl:value-of select="local-name(.)"/>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.8.2 exp -->
<xsl:template match="m:apply[*[1][self::m:exp]]">
	<xsl:text>e^{</xsl:text><xsl:apply-templates select="*[2]"/><xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.8.4 log -->
<xsl:template match="m:apply[*[1][self::m:log]]">
	<xsl:text>\lg </xsl:text>
	<xsl:apply-templates select="*[last()]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
</xsl:template>

<xsl:template match="m:apply[*[1][self::m:log] and m:logbase != 10]" priority="2">
	<xsl:text>\log_{</xsl:text>
	<xsl:apply-templates select="m:logbase/node()"/>
	<xsl:text>}</xsl:text>
	<xsl:apply-templates select="*[last()]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
</xsl:template>

<!-- 4.4.9.1 mean -->
<xsl:template match="m:apply[*[1][self::m:mean]]">
	<xsl:text>\langle </xsl:text>
	<xsl:for-each select="*[position()&gt;1]">
		<xsl:apply-templates select="."/>
		<xsl:if test="position() !=last()"><xsl:text>, </xsl:text></xsl:if>
	</xsl:for-each>
	<xsl:text>\rangle </xsl:text>
</xsl:template>

<!-- 4.4.9.2 sdev -->
<xsl:template match="m:sdev"><xsl:text>\sigma </xsl:text></xsl:template>

<!-- 4.4.9.3 variance -->
<xsl:template match="m:apply[*[1][self::m:variance]]">
	<xsl:text>\sigma(</xsl:text>
	<xsl:apply-templates select="*[2]"/>
	<xsl:text>)^2</xsl:text>
</xsl:template>

<!-- 4.4.9.6 moment -->
<xsl:template match="m:apply[*[1][self::m:moment]]">
	<xsl:text>\langle </xsl:text>
	<xsl:apply-templates select="*[last()]"/>
	<xsl:text>^{</xsl:text>
	<xsl:apply-templates select="m:degree/node()"/>
	<xsl:text>}\rangle</xsl:text>
	<xsl:if test="m:momentabout">
		<xsl:text>_{</xsl:text>
		<xsl:apply-templates select="m:momentabout/node()"/>
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:text> </xsl:text>
</xsl:template>

<!-- 4.4.10.1 vector  -->
<xsl:template match="m:vector">
	<xsl:text>\left(\begin{array}{c}</xsl:text>
	<xsl:for-each select="*">
		<xsl:apply-templates select="."/>
		<xsl:if test="position()!=last()"><xsl:text>\\ </xsl:text></xsl:if>
	</xsl:for-each>
	<xsl:text>\end{array}\right)</xsl:text>
</xsl:template>

<!-- 4.4.10.2 matrix  -->
<xsl:template match="m:matrix">
	<xsl:text>\begin{pmatrix}</xsl:text>
	<xsl:apply-templates/>
	<xsl:text>\end{pmatrix}</xsl:text>
</xsl:template>

<!-- 4.4.10.3 matrixrow  -->
<xsl:template match="m:matrixrow">
	<xsl:for-each select="*">
		<xsl:apply-templates select="."/>
		<xsl:if test="position()!=last()"><xsl:text> &amp; </xsl:text></xsl:if>
	</xsl:for-each>
	<xsl:if test="position()!=last()"><xsl:text>\\ </xsl:text></xsl:if>
</xsl:template>

<!-- 4.4.10.4 determinant  -->
<xsl:template match="m:apply[*[1][self::m:determinant]]">
	<xsl:text>\det </xsl:text>
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
</xsl:template>

<xsl:template match="m:apply[*[1][self::m:determinant]][*[2][self::m:matrix]]" priority="2">
	<xsl:text>\begin{vmatrix}</xsl:text>
	<xsl:apply-templates select="m:matrix/*"/>
	<xsl:text>\end{vmatrix}</xsl:text>
</xsl:template>

<!-- 4.4.10.5 transpose -->
<xsl:template match="m:apply[*[1][self::m:transpose]]">
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
	<xsl:text>^T</xsl:text>
</xsl:template>

<!-- 4.4.10.6 selector -->
<xsl:template match="m:apply[*[1][self::m:selector]]">
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="7"/>
	</xsl:apply-templates>
	<xsl:text>_{</xsl:text>
	<xsl:for-each select="*[position()&gt;2]">
		<xsl:apply-templates select="."/>
		<xsl:if test="position() !=last()"><xsl:text>, </xsl:text></xsl:if>
	</xsl:for-each>
	<xsl:text>}</xsl:text>
</xsl:template>

<!-- 4.4.10.8 scalarproduct 4.4.10.9 outerproduct -->
<xsl:template match="m:apply[*[1][self::m:scalarproduct or self::m:outerproduct]]">
	<xsl:param name="p" select="0"/>
	<xsl:call-template name="infix">
		<xsl:with-param name="this-p" select="2"/>
		<xsl:with-param name="p" select="$p"/>
		<xsl:with-param name="mo">\cdot </xsl:with-param>
	</xsl:call-template>
</xsl:template>

<!-- 4.4.11.2 semantics -->
<xsl:template match="m:semantics"><xsl:apply-templates select="*[1]"/></xsl:template>

<xsl:template match="m:semantics[m:annotation/@encoding='TeX']">
	<xsl:apply-templates select="m:annotation[@encoding='TeX']/node()"/>
</xsl:template>

<!-- 4.4.12.1 integers -->
<xsl:template match="m:integers"><xsl:text>\mathbb{Z}</xsl:text></xsl:template>

<!-- 4.4.12.2 reals -->
<xsl:template match="m:reals"><xsl:text>\mathbb{R}</xsl:text></xsl:template>

<!-- 4.4.12.3 rationals -->
<xsl:template match="m:rationals"><xsl:text>\mathbb{Q}</xsl:text></xsl:template>

<!-- 4.4.12.4 naturalnumbers -->
<xsl:template match="m:naturalnumbers"><xsl:text>\mathbb{N}</xsl:text></xsl:template>

<!-- 4.4.12.5 complexes -->
<xsl:template match="m:complexes"><xsl:text>\mathbb{C}</xsl:text></xsl:template>

<!-- 4.4.12.6 primes -->
<xsl:template match="m:primes"><xsl:text>\mathbb{P}</xsl:text></xsl:template>
	
<!-- 4.4.12.7 exponentiale -->
<xsl:template match="m:exponentiale"><xsl:text>e</xsl:text></xsl:template>

<!-- 4.4.12.8 imaginaryi -->
<xsl:template match="m:imaginaryi"><xsl:text>i</xsl:text></xsl:template>

<!-- 4.4.12.9 notanumber -->
<xsl:template match="m:notanumber"><xsl:text>NaN</xsl:text></xsl:template>

<!-- 4.4.12.10 true -->
<xsl:template match="m:true"><xsl:text>\mbox{true}</xsl:text></xsl:template>

<!-- 4.4.12.11 false -->
<xsl:template match="m:false"><xsl:text>\mbox{false}</xsl:text></xsl:template>

<!-- 4.4.12.12 emptyset -->
<xsl:template match="m:emptyset"><xsl:text>\emptyset </xsl:text></xsl:template>

<!-- 4.4.12.13 pi -->
<xsl:template match="m:pi"><xsl:text>\pi </xsl:text></xsl:template>

<!-- 4.4.12.14 eulergamma -->
<xsl:template match="m:eulergamma"><xsl:text>\gamma </xsl:text></xsl:template>

<!-- 4.4.12.15 infinity -->
<xsl:template match="m:infinity"><xsl:text>\infty </xsl:text></xsl:template>

<!-- ****************************** -->
<xsl:template name="infix" >
	<xsl:param name="mo"/>
	<xsl:param name="p" select="0"/>
	<xsl:param name="this-p" select="0"/>
	<xsl:if test="$this-p &lt; $p"><xsl:text>(</xsl:text></xsl:if>
	<xsl:for-each select="*[position()&gt;1]">
		<xsl:if test="position() &gt; 1">
			<xsl:copy-of select="$mo"/>
		</xsl:if>
		<xsl:apply-templates select=".">
			<xsl:with-param name="p" select="$this-p"/>
		</xsl:apply-templates>
	</xsl:for-each>
	<xsl:if test="$this-p &lt; $p"><xsl:text>)</xsl:text></xsl:if>
</xsl:template>

<xsl:template name="binary" >
	<xsl:param name="mo"/>
	<xsl:param name="p" select="0"/>
	<xsl:param name="this-p" select="0"/>
	<xsl:if test="$this-p &lt; $p"><xsl:text>(</xsl:text></xsl:if>
	<xsl:apply-templates select="*[2]">
		<xsl:with-param name="p" select="$this-p"/>
	</xsl:apply-templates>
	<xsl:value-of select="$mo"/>
	<xsl:apply-templates select="*[3]">
		<xsl:with-param name="p" select="$this-p"/>
	</xsl:apply-templates>
	<xsl:if test="$this-p &lt; $p"><xsl:text>)</xsl:text></xsl:if>
</xsl:template>
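
<!--
	Illustrative note (a sketch, not part of the upstream library documentation).
	The "infix" and "binary" helpers above appear to parenthesise an expression
	only when its own precedence (this-p) is lower than the precedence passed
	down by the enclosing operator (p). A hypothetical input such as

		<apply><intersect/>
			<apply><union/><ci>A</ci><ci>B</ci></apply>
			<ci>C</ci>
		</apply>

	would be processed as follows: intersect calls "infix" with this-p=3 and
	hands p=3 to each operand; the nested union has this-p=2, and since 2 is
	less than 3 it is wrapped in parentheses. Assuming the m:ci template
	(defined elsewhere in this stylesheet) emits the identifier text unchanged,
	the expected LaTeX is roughly

		(A\cup B)\cap C

	With the nesting reversed (an intersect inside a union) no parentheses are
	added, because 3 is not less than 2.
-->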

</xsl:stylesheet>

================================================
FILE: DomainSpecific/dependency/xsltml_2.0/entities.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
                version='1.0'>
                
<!-- ====================================================================== -->
<!-- $id: entities.xsl, 2002/22/11 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<xsl:template name="replaceEntities">
	<xsl:param name="content"/>
	<xsl:if test="string-length($content)>0">
	<xsl:choose>
		<xsl:when test="starts-with($content,'&#x0025B;')"><xsl:value-of select="'\varepsilon '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0025B;')"/></xsl:call-template></xsl:when>	<!--/varepsilon -->

<!-- ====================================================================== -->
<!-- 	Unicode 3.2
	Greek
	Range: 0370-03FF
	http://www.unicode.org/charts/PDF/U0370.pdf	                    -->
<!-- ====================================================================== -->	
		<xsl:when test="starts-with($content,'&#x00393;')"><xsl:value-of select="'\Gamma '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x00393;')"/></xsl:call-template></xsl:when>	<!--/Gamma capital Gamma, Greek -->
		<xsl:when test="starts-with($content,'&#x00394;')"><xsl:value-of select="'\Delta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x00394;')"/></xsl:call-template></xsl:when>	<!--/Delta capital Delta, Greek -->
		<xsl:when test="starts-with($content,'&#x00398;')"><xsl:value-of select="'\Theta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x00398;')"/></xsl:call-template></xsl:when>	<!--/Theta capital Theta, Greek -->
		<xsl:when test="starts-with($content,'&#x0039B;')"><xsl:value-of select="'\Lambda '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0039B;')"/></xsl:call-template></xsl:when>	<!--/Lambda capital Lambda, Greek -->
		<xsl:when test="starts-with($content,'&#x0039E;')"><xsl:value-of select="'\Xi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0039E;')"/></xsl:call-template></xsl:when>	<!--/Xi capital Xi, Greek -->
		<xsl:when test="starts-with($content,'&#x003A0;')"><xsl:value-of select="'\Pi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003A0;')"/></xsl:call-template></xsl:when>	<!--/Pi capital Pi, Greek -->
		<xsl:when test="starts-with($content,'&#x003A3;')"><xsl:value-of select="'\Sigma '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003A3;')"/></xsl:call-template></xsl:when>	<!--/Sigma capital Sigma, Greek -->
		<xsl:when test="starts-with($content,'&#x003A6;')"><xsl:value-of select="'\Phi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003A6;')"/></xsl:call-template></xsl:when>	<!--/Phi capital Phi, Greek -->
		<xsl:when test="starts-with($content,'&#x003A8;')"><xsl:value-of select="'\Psi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003A8;')"/></xsl:call-template></xsl:when>	<!--/Psi capital Psi, Greek -->
		<xsl:when test="starts-with($content,'&#x003A9;')"><xsl:value-of select="'\Omega '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003A9;')"/></xsl:call-template></xsl:when>	<!--/Omega capital Omega, Greek -->
		<xsl:when test="starts-with($content,'&#x003B1;')"><xsl:value-of select="'\alpha '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B1;')"/></xsl:call-template></xsl:when>	<!--/alpha small alpha, Greek -->
		<xsl:when test="starts-with($content,'&#x003B2;')"><xsl:value-of select="'\beta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B2;')"/></xsl:call-template></xsl:when>	<!--/beta small beta, Greek -->
		<xsl:when test="starts-with($content,'&#x003B3;')"><xsl:value-of select="'\gamma '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B3;')"/></xsl:call-template></xsl:when>	<!--/gamma small gamma, Greek -->
		<xsl:when test="starts-with($content,'&#x003B4;')"><xsl:value-of select="'\delta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B4;')"/></xsl:call-template></xsl:when>	<!--/delta small delta, Greek -->
		<xsl:when test="starts-with($content,'&#x003B5;')"><xsl:value-of select="'\epsilon '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B5;')"/></xsl:call-template></xsl:when>	<!--/straightepsilon, small epsilon, Greek -->
		<xsl:when test="starts-with($content,'&#x003B6;')"><xsl:value-of select="'\zeta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B6;')"/></xsl:call-template></xsl:when>	<!--/zeta small zeta, Greek -->
		<xsl:when test="starts-with($content,'&#x003B7;')"><xsl:value-of select="'\eta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B7;')"/></xsl:call-template></xsl:when>	<!--/eta small eta, Greek -->
		<xsl:when test="starts-with($content,'&#x003B8;')"><xsl:value-of select="'\theta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B8;')"/></xsl:call-template></xsl:when>	<!--/theta straight theta, small theta, Greek -->
		<xsl:when test="starts-with($content,'&#x003B9;')"><xsl:value-of select="'\iota '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003B9;')"/></xsl:call-template></xsl:when>	<!--/iota small iota, Greek -->
		<xsl:when test="starts-with($content,'&#x003BA;')"><xsl:value-of select="'\kappa '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003BA;')"/></xsl:call-template></xsl:when>	<!--/kappa small kappa, Greek -->
		<xsl:when test="starts-with($content,'&#x003BB;')"><xsl:value-of select="'\lambda '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003BB;')"/></xsl:call-template></xsl:when>	<!--/lambda small lambda, Greek -->
		<xsl:when test="starts-with($content,'&#x003BC;')"><xsl:value-of select="'\mu '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003BC;')"/></xsl:call-template></xsl:when>	<!--/mu small mu, Greek -->
		<xsl:when test="starts-with($content,'&#x003BD;')"><xsl:value-of select="'\nu '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003BD;')"/></xsl:call-template></xsl:when>	<!--/nu small nu, Greek -->
		<xsl:when test="starts-with($content,'&#x003BE;')"><xsl:value-of select="'\xi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003BE;')"/></xsl:call-template></xsl:when>	<!--/xi small xi, Greek -->
		<xsl:when test="starts-with($content,'&#x003C0;')"><xsl:value-of select="'\pi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C0;')"/></xsl:call-template></xsl:when>	<!--/pi small pi, Greek -->
		<xsl:when test="starts-with($content,'&#x003C1;')"><xsl:value-of select="'\rho '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C1;')"/></xsl:call-template></xsl:when>	<!--/rho small rho, Greek -->
		<xsl:when test="starts-with($content,'&#x003C2;')"><xsl:value-of select="'\varsigma '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C2;')"/></xsl:call-template></xsl:when>	<!--/varsigma -->
		<xsl:when test="starts-with($content,'&#x003C3;')"><xsl:value-of select="'\sigma '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C3;')"/></xsl:call-template></xsl:when>	<!--/sigma small sigma, Greek -->
		<xsl:when test="starts-with($content,'&#x003C4;')"><xsl:value-of select="'\tau '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C4;')"/></xsl:call-template></xsl:when>	<!--/tau small tau, Greek -->
		<xsl:when test="starts-with($content,'&#x003C5;')"><xsl:value-of select="'\upsilon '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C5;')"/></xsl:call-template></xsl:when>	<!--/upsilon small upsilon, Greek -->
		<xsl:when test="starts-with($content,'&#x003C6;')"><xsl:value-of select="'\phi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C6;')"/></xsl:call-template></xsl:when>	<!--/straightphi - small phi, Greek -->
		<xsl:when test="starts-with($content,'&#x003C7;')"><xsl:value-of select="'\chi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C7;')"/></xsl:call-template></xsl:when>	<!--/chi small chi, Greek -->
		<xsl:when test="starts-with($content,'&#x003C8;')"><xsl:value-of select="'\psi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C8;')"/></xsl:call-template></xsl:when>	<!--/psi small psi, Greek -->
		<xsl:when test="starts-with($content,'&#x003C9;')"><xsl:value-of select="'\omega '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003C9;')"/></xsl:call-template></xsl:when>	<!--/omega small omega, Greek -->
		<xsl:when test="starts-with($content,'&#x003D1;')"><xsl:value-of select="'\vartheta '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003D1;')"/></xsl:call-template></xsl:when>	<!--/vartheta - curly or open theta -->
		<xsl:when test="starts-with($content,'&#x003D2;')"><xsl:value-of select="'\Upsilon '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003D2;')"/></xsl:call-template></xsl:when>	<!--/Upsilon capital Upsilon, Greek -->
		<xsl:when test="starts-with($content,'&#x003D5;')"><xsl:value-of select="'\varphi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003D5;')"/></xsl:call-template></xsl:when>	<!--/varphi - curly or open phi -->
		<xsl:when test="starts-with($content,'&#x003D6;')"><xsl:value-of select="'\varpi '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003D6;')"/></xsl:call-template></xsl:when>		<!--/varpi -->
		<xsl:when test="starts-with($content,'&#x003F0;')"><xsl:value-of select="'\varkappa '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003F0;')"/></xsl:call-template></xsl:when>	<!--/varkappa --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x003F1;')"><xsl:value-of select="'\varrho '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x003F1;')"/></xsl:call-template></xsl:when>	<!--/varrho -->
		
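<!-- ====================================================================== -->
<!-- 	Unicode 3.2
	General Punctuation
	Range: 2000-206F
	http://www.unicode.org/charts/PDF/U2000.pdf	                    -->
<!-- ====================================================================== -->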
		<xsl:when test="starts-with($content,'&#x0200B;')"><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0200B;')"/></xsl:call-template></xsl:when>						<!--short form of  &InvisibleComma; -->
		<xsl:when test="starts-with($content,'&#x02026;')"><xsl:value-of select="'\dots '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02026;')"/></xsl:call-template></xsl:when>		<!--/dots - horizontal ellipsis -->
		<xsl:when test="starts-with($content,'&#x02032;')"><xsl:value-of select="'\prime '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02032;')"/></xsl:call-template></xsl:when>		<!--/prime prime or minute -->
		<xsl:when test="starts-with($content,'&#x02061;')"><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02061;')"/></xsl:call-template></xsl:when>						<!-- ApplyFunction -->
		<xsl:when test="starts-with($content,'&#x02062;')"><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02062;')"/></xsl:call-template></xsl:when>						<!-- InvisibleTimes -->
<!-- ====================================================================== -->
<!-- 	Unicode 3.2
	Letterlike Symbols
	Range: 2100-214F
	http://www.unicode.org/charts/PDF/U2100.pdf	                    -->
<!-- ====================================================================== -->
		<xsl:when test="starts-with($content,'&#x0210F;&#x0FE00;')"><xsl:value-of select="'\hbar '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0210F;&#x0FE00;')"/></xsl:call-template></xsl:when>	<!--/hbar - Planck's over 2pi -->
		<xsl:when test="starts-with($content,'&#x0210F;')"><xsl:value-of select="'\hslash '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0210F;')"/></xsl:call-template></xsl:when>	<!--/hslash - variant Planck's over 2pi --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02111;')"><xsl:value-of select="'\Im '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02111;')"/></xsl:call-template></xsl:when>		<!--/Im - imaginary   -->
		<xsl:when test="starts-with($content,'&#x02113;')"><xsl:value-of select="'\ell '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02113;')"/></xsl:call-template></xsl:when>		<!--/ell - cursive small l -->
		<xsl:when test="starts-with($content,'&#x02118;')"><xsl:value-of select="'\wp '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02118;')"/></xsl:call-template></xsl:when>		<!--/wp - Weierstrass p -->
		<xsl:when test="starts-with($content,'&#x0211C;')"><xsl:value-of select="'\Re '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0211C;')"/></xsl:call-template></xsl:when>		<!--/Re - real -->
		<xsl:when test="starts-with($content,'&#x02127;')"><xsl:value-of select="'\mho '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02127;')"/></xsl:call-template></xsl:when>		<!--/mho - conductance -->
		<xsl:when test="starts-with($content,'&#x02135;')"><xsl:value-of select="'\aleph '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02135;')"/></xsl:call-template></xsl:when>		<!--/aleph aleph, Hebrew -->
		<xsl:when test="starts-with($content,'&#x02136;')"><xsl:value-of select="'\beth '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02136;')"/></xsl:call-template></xsl:when>		<!--/beth - beth, Hebrew --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02137;')"><xsl:value-of select="'\gimel '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02137;')"/></xsl:call-template></xsl:when>		<!--/gimel - gimel, Hebrew --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02138;')"><xsl:value-of select="'\daleth '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02138;')"/></xsl:call-template></xsl:when>	<!--/daleth - daleth, Hebrew --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02145;')"><xsl:value-of select="'D'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02145;')"/></xsl:call-template></xsl:when>		<!--D for use in differentials, e.g., within integrals -->
		<xsl:when test="starts-with($content,'&#x02146;')"><xsl:value-of select="'d'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02146;')"/></xsl:call-template></xsl:when>		<!--d for use in differentials, e.g., within integrals -->
		<xsl:when test="starts-with($content,'&#x02147;')"><xsl:value-of select="'e'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02147;')"/></xsl:call-template></xsl:when>		<!--e use for the exponential base of the natural logarithms -->
		<xsl:when test="starts-with($content,'&#x02148;')"><xsl:value-of select="'i'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02148;')"/></xsl:call-template></xsl:when>		<!--i for use as a square root of -1 -->

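<!-- ====================================================================== -->
<!-- 	Unicode 3.2
	Arrows
	Range: 2190-21FF
	http://www.unicode.org/charts/PDF/U2190.pdf	                    -->
<!-- ====================================================================== -->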
		<xsl:when test="starts-with($content,'&#x02192;')"><xsl:value-of select="'\to '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02192;')"/></xsl:call-template></xsl:when>		<!--/rightarrow /to A: =rightward arrow -->
		
<!-- ====================================================================== -->
<!-- 	Unicode 3.2
	Mathematical Operators
	Range: 2200-22FF
	http://www.unicode.org/charts/PDF/U2200.pdf                         -->
<!-- ====================================================================== -->	
		<xsl:when test="starts-with($content,'&#x02200;')"><xsl:value-of select="'\forall '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02200;')"/></xsl:call-template></xsl:when>	<!--/forall for all -->
		<xsl:when test="starts-with($content,'&#x02201;')"><xsl:value-of select="'\complement '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02201;')"/></xsl:call-template></xsl:when>	<!--/complement - complement sign --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02202;')"><xsl:value-of select="'\partial '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02202;')"/></xsl:call-template></xsl:when>	<!--/partial partial differential -->
		<xsl:when test="starts-with($content,'&#x02203;')"><xsl:value-of select="'\exists '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02203;')"/></xsl:call-template></xsl:when>	<!--/exists at least one exists -->
		<xsl:when test="starts-with($content,'&#x02204;')"><xsl:value-of select="'\nexists '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02204;')"/></xsl:call-template></xsl:when>	<!--/nexists - negated exists --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02205;&#x0FE00;')"><xsl:value-of select="'\emptyset '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02205;&#x0FE00;')"/></xsl:call-template></xsl:when>	<!--/emptyset - zero, slash -->
		<xsl:when test="starts-with($content,'&#x02205;')"><xsl:value-of select="'\varnothing '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02205;')"/></xsl:call-template></xsl:when>	<!--/varnothing - circle, slash --> <!-- Required amssymb -->
<!--		<xsl:when test="starts-with($content,'&#x02206;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02206;')"/></xsl:call-template></xsl:when>-->
		<xsl:when test="starts-with($content,'&#x02207;')"><xsl:value-of select="'\nabla '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02207;')"/></xsl:call-template></xsl:when>		<!--/nabla del, Hamilton operator -->
		<xsl:when test="starts-with($content,'&#x02208;')"><xsl:value-of select="'\in '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02208;')"/></xsl:call-template></xsl:when>		<!--/in R: set membership  -->
		<xsl:when test="starts-with($content,'&#x02209;')"><xsl:value-of select="'\notin '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02209;')"/></xsl:call-template></xsl:when>		<!--/notin N: negated set membership -->
		<xsl:when test="starts-with($content,'&#x0220B;')"><xsl:value-of select="'\ni '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0220B;')"/></xsl:call-template></xsl:when>		<!--/ni /owns R: contains -->
		<xsl:when test="starts-with($content,'&#x0220C;')"><xsl:value-of select="'\not\ni '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0220C;')"/></xsl:call-template></xsl:when>	<!--negated contains -->
		<xsl:when test="starts-with($content,'&#x0220F;')"><xsl:value-of select="'\prod '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0220F;')"/></xsl:call-template></xsl:when>		<!--/prod L: product operator -->
		<xsl:when test="starts-with($content,'&#x02210;')"><xsl:value-of select="'\coprod '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02210;')"/></xsl:call-template></xsl:when>	<!--/coprod L: coproduct operator -->
		<xsl:when test="starts-with($content,'&#x02211;')"><xsl:value-of select="'\sum '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02211;')"/></xsl:call-template></xsl:when>		<!--/sum L: summation operator -->
		<xsl:when test="starts-with($content,'&#x02212;')"><xsl:value-of select="'-'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02212;')"/></xsl:call-template></xsl:when>		<!--B: minus sign -->		
		<xsl:when test="starts-with($content,'&#x02213;')"><xsl:value-of select="'\mp '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02213;')"/></xsl:call-template></xsl:when>		<!--/mp B: minus-or-plus sign -->
		<xsl:when test="starts-with($content,'&#x02214;')"><xsl:value-of select="'\dotplus '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02214;')"/></xsl:call-template></xsl:when>	<!--/dotplus B: plus sign, dot above --> <!-- Required amssymb -->
<!--		<xsl:when test="starts-with($content,'&#x02215;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02215;')"/></xsl:call-template></xsl:when>-->
		<xsl:when test="starts-with($content,'&#x02216;')"><xsl:value-of select="'\setminus '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02216;')"/></xsl:call-template></xsl:when>	<!--/setminus B: reverse solidus -->
		<xsl:when test="starts-with($content,'&#x02217;')"><xsl:value-of select="'\ast '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02217;')"/></xsl:call-template></xsl:when>		<!--low asterisk -->
		<xsl:when test="starts-with($content,'&#x02218;')"><xsl:value-of select="'\circ '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02218;')"/></xsl:call-template></xsl:when>		<!--/circ B: composite function (small circle) -->
		<xsl:when test="starts-with($content,'&#x02219;')"><xsl:value-of select="'\bullet '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02219;')"/></xsl:call-template></xsl:when>	<!--/bullet B: bullet operator -->
		<xsl:when test="starts-with($content,'&#x0221A;')"><xsl:value-of select="'\surd '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0221A;')"/></xsl:call-template></xsl:when>		<!--/surd radical -->
		<xsl:when test="starts-with($content,'&#x0221D;')"><xsl:value-of select="'\propto '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0221D;')"/></xsl:call-template></xsl:when>	<!--/propto R: is proportional to -->
		<xsl:when test="starts-with($content,'&#x0221E;')"><xsl:value-of select="'\infty '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0221E;')"/></xsl:call-template></xsl:when>		<!--/infty infinity -->
<!--		<xsl:when test="starts-with($content,'&#x0221F;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0221F;')"/></xsl:call-template></xsl:when>		right (90 degree) angle -->
		<xsl:when test="starts-with($content,'&#x02220;')"><xsl:value-of select="'\angle '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02220;')"/></xsl:call-template></xsl:when>		<!--/angle - angle -->
		<xsl:when test="starts-with($content,'&#x02221;')"><xsl:value-of select="'\measuredangle '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02221;')"/></xsl:call-template></xsl:when>	<!--/measuredangle - angle-measured -->	<!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02222;')"><xsl:value-of select="'\sphericalangle '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02222;')"/></xsl:call-template></xsl:when><!--/sphericalangle angle-spherical -->	<!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02223;')"><xsl:value-of select="'\mid '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02223;')"/></xsl:call-template></xsl:when>		<!--/mid R: -->
		<xsl:when test="starts-with($content,'&#x02224;&#x0FE00;')"><xsl:value-of select="'\nshortmid '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02224;&#x0FE00;')"/></xsl:call-template></xsl:when>	<!--/nshortmid --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02224;')"><xsl:value-of select="'\nmid '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02224;')"/></xsl:call-template></xsl:when>		<!--/nmid --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02225;')"><xsl:value-of select="'\parallel '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02225;')"/></xsl:call-template></xsl:when>	<!--/parallel R: parallel -->
		<xsl:when test="starts-with($content,'&#x02226;&#x0FE00;')"><xsl:value-of select="'\nshortparallel '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02226;&#x0FE00;')"/></xsl:call-template></xsl:when>	<!--/nshortparallel N: not short par --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02226;')"><xsl:value-of select="'\nparallel '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02226;')"/></xsl:call-template></xsl:when>	<!--/nparallel N: not parallel --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02227;')"><xsl:value-of select="'\wedge '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02227;')"/></xsl:call-template></xsl:when>		<!--/wedge /land B: logical and -->
		<xsl:when test="starts-with($content,'&#x02228;')"><xsl:value-of select="'\vee '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02228;')"/></xsl:call-template></xsl:when>		<!--/vee /lor B: logical or -->
		<xsl:when test="starts-with($content,'&#x02229;')"><xsl:value-of select="'\cap '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02229;')"/></xsl:call-template></xsl:when>		<!--/cap B: intersection -->
		<xsl:when test="starts-with($content,'&#x0222A;')"><xsl:value-of select="'\cup '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0222A;')"/></xsl:call-template></xsl:when>		<!--/cup B: union or logical sum -->		
		<xsl:when test="starts-with($content,'&#x0222B;')"><xsl:value-of select="'\int '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0222B;')"/></xsl:call-template></xsl:when>		<!--/int L: integral operator -->
		<xsl:when test="starts-with($content,'&#x0222C;')"><xsl:value-of select="'\iint '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0222C;')"/></xsl:call-template></xsl:when>		<!--/iint double integral operator --> <!-- Required amsmath -->
		<xsl:when test="starts-with($content,'&#x0222D;')"><xsl:value-of select="'\iiint '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0222D;')"/></xsl:call-template></xsl:when>		<!--/iiint triple integral operator -->	<!-- Required amsmath -->
		<xsl:when test="starts-with($content,'&#x0222E;')"><xsl:value-of select="'\oint '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0222E;')"/></xsl:call-template></xsl:when>		<!--/oint L: contour integral operator -->
<!--		<xsl:when test="starts-with($content,'&#x0222F;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0222F;')"/></xsl:call-template></xsl:when>-->
<!--		<xsl:when test="starts-with($content,'&#x02230;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02230;')"/></xsl:call-template></xsl:when>-->
<!--		<xsl:when test="starts-with($content,'&#x02231;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02231;')"/></xsl:call-template></xsl:when>-->
<!--		<xsl:when test="starts-with($content,'&#x02232;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02232;')"/></xsl:call-template></xsl:when>-->
<!--		<xsl:when test="starts-with($content,'&#x02233;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02233;')"/></xsl:call-template></xsl:when>-->
		<xsl:when test="starts-with($content,'&#x02234;')"><xsl:value-of select="'\therefore '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02234;')"/></xsl:call-template></xsl:when>	<!--/therefore R: therefore --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02235;')"><xsl:value-of select="'\because '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02235;')"/></xsl:call-template></xsl:when>	<!--/because R: because --> <!-- Required amssymb -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x02236;')"><xsl:value-of select="':'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02236;')"/></xsl:call-template></xsl:when>		<!--/ratio -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x02237;')"><xsl:value-of select="'\colon\colon '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02237;')"/></xsl:call-template></xsl:when>	<!--/Colon, two colons -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x02238;')"><xsl:value-of select="'\dot{-}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02238;')"/></xsl:call-template></xsl:when>		<!--/dotminus B: minus sign, dot above -->
<!--		<xsl:when test="starts-with($content,'&#x02239;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02239;')"/></xsl:call-template></xsl:when>		-->
<!--		<xsl:when test="starts-with($content,'&#x0223A;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0223A;')"/></xsl:call-template></xsl:when>		minus with four dots, geometric properties -->		
<!--		<xsl:when test="starts-with($content,'&#x0223B;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0223B;')"/></xsl:call-template></xsl:when>		homothetic -->
		<xsl:when test="starts-with($content,'&#x0223C;')"><xsl:value-of select="'\sim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0223C;')"/></xsl:call-template></xsl:when>		<!--/sim R: similar -->
		<xsl:when test="starts-with($content,'&#x0223D;')"><xsl:value-of select="'\backsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0223D;')"/></xsl:call-template></xsl:when>	<!--/backsim R: reverse similar --> <!-- Required amssymb -->
<!--		<xsl:when test="starts-with($content,'&#x0223E;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0223E;')"/></xsl:call-template></xsl:when>		most positive -->
<!--		<xsl:when test="starts-with($content,'&#x0223F;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0223F;')"/></xsl:call-template></xsl:when>		ac current -->
		<xsl:when test="starts-with($content,'&#x02240;')"><xsl:value-of select="'\wr '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02240;')"/></xsl:call-template></xsl:when>		<!--/wr B: wreath product -->
		<xsl:when test="starts-with($content,'&#x02241;')"><xsl:value-of select="'\nsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02241;')"/></xsl:call-template></xsl:when>		<!--/nsim N: not similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02242;')"><xsl:value-of select="'\eqsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02242;')"/></xsl:call-template></xsl:when>		<!--/eqsim R: equals, similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02243;')"><xsl:value-of select="'\simeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02243;')"/></xsl:call-template></xsl:when>		<!--/simeq R: similar, equals -->
		<xsl:when test="starts-with($content,'&#x02244;')"><xsl:value-of select="'\not\simeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02244;')"/></xsl:call-template></xsl:when>	<!--/nsimeq N: not similar, equals -->
		<xsl:when test="starts-with($content,'&#x02245;')"><xsl:value-of select="'\cong '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02245;')"/></xsl:call-template></xsl:when>		<!--/cong R: congruent with -->
<!--		<xsl:when test="starts-with($content,'&#x02246;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02246;')"/></xsl:call-template></xsl:when>		similar, not equals -->
		<xsl:when test="starts-with($content,'&#x02247;')"><xsl:value-of select="'\ncong '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02247;')"/></xsl:call-template></xsl:when>		<!--/ncong N: not congruent with --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02248;')"><xsl:value-of select="'\approx '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02248;')"/></xsl:call-template></xsl:when>	<!--/approx R: approximate -->
<!--		<xsl:when test="starts-with($content,'&#x02249;&#x00338;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02249;&#x00338;')"/></xsl:call-template></xsl:when>	not, vert, approximate -->
		<xsl:when test="starts-with($content,'&#x02249;')"><xsl:value-of select="'\not\approx '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02249;')"/></xsl:call-template></xsl:when>	<!--/napprox N: not approximate -->
		<xsl:when test="starts-with($content,'&#x0224A;')"><xsl:value-of select="'\approxeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0224A;')"/></xsl:call-template></xsl:when>	<!--/approxeq R: approximate, equals --> <!-- Required amssymb -->
<!--		<xsl:when test="starts-with($content,'&#x0224B;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0224B;')"/></xsl:call-template></xsl:when>		approximately identical to -->
<!--		<xsl:when test="starts-with($content,'&#x0224C;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0224C;')"/></xsl:call-template></xsl:when>		/backcong R: reverse congruent -->
		<xsl:when test="starts-with($content,'&#x0224D;')"><xsl:value-of select="'\asymp '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0224D;')"/></xsl:call-template></xsl:when>		<!--/asymp R: asymptotically equal to -->
		<xsl:when test="starts-with($content,'&#x0224E;')"><xsl:value-of select="'\Bumpeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0224E;')"/></xsl:call-template></xsl:when>	<!--/Bumpeq R: bumpy equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x0224F;')"><xsl:value-of select="'\bumpeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0224F;')"/></xsl:call-template></xsl:when>	<!--/bumpeq R: bumpy equals, equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02250;')"><xsl:value-of select="'\doteq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02250;')"/></xsl:call-template></xsl:when>		<!--/doteq R: equals, single dot above -->
		<xsl:when test="starts-with($content,'&#x02251;')"><xsl:value-of select="'\doteqdot '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02251;')"/></xsl:call-template></xsl:when>	<!--/doteqdot /Doteq R: eq, even dots --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02252;')"><xsl:value-of select="'\fallingdotseq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02252;')"/></xsl:call-template></xsl:when>	<!--/fallingdotseq R: eq, falling dots --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02253;')"><xsl:value-of select="'\risingdotseq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02253;')"/></xsl:call-template></xsl:when>	<!--/risingdotseq R: eq, rising dots --> <!-- Required amssymb -->
<!--		<xsl:when test="starts-with($content,'&#x02254;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02254;')"/></xsl:call-template></xsl:when>		/coloneq R: colon, equals -->
<!--		<xsl:when test="starts-with($content,'&#x02255;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02255;')"/></xsl:call-template></xsl:when>		/eqcolon R: equals, colon -->
		<xsl:when test="starts-with($content,'&#x02256;')"><xsl:value-of select="'\eqcirc '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02256;')"/></xsl:call-template></xsl:when>	<!--/eqcirc R: circle on equals sign --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02257;')"><xsl:value-of select="'\circeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02257;')"/></xsl:call-template></xsl:when>	<!--/circeq R: circle, equals --> <!-- Required amssymb -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x02258;')"><xsl:value-of select="'\stackrel{\frown}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02258;')"/></xsl:call-template></xsl:when>
<!-- ? -->	<xsl:when test="starts-with($content,'&#x02259;')"><xsl:value-of select="'\stackrel{\wedge}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02259;')"/></xsl:call-template></xsl:when>	<!--/wedgeq R: corresponds to (wedge, equals) -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x0225A;')"><xsl:value-of select="'\stackrel{\vee}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0225A;')"/></xsl:call-template></xsl:when>	<!--logical or, equals -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x0225B;')"><xsl:value-of select="'\stackrel{\star}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0225B;')"/></xsl:call-template></xsl:when>	<!--equal, asterisk above -->
		<xsl:when test="starts-with($content,'&#x0225C;')"><xsl:value-of select="'\triangleq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0225C;')"/></xsl:call-template></xsl:when>	<!--/triangleq R: triangle, equals --> <!-- Required amssymb -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x0225D;')"><xsl:value-of select="'\stackrel{\scriptscriptstyle\mathrm{def}}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0225D;')"/></xsl:call-template></xsl:when>
<!-- ? -->	<xsl:when test="starts-with($content,'&#x0225E;')"><xsl:value-of select="'\stackrel{\scriptscriptstyle\mathrm{m}}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0225E;')"/></xsl:call-template></xsl:when>	
<!-- ? -->	<xsl:when test="starts-with($content,'&#x0225F;')"><xsl:value-of select="'\stackrel{?}{=}'" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0225F;')"/></xsl:call-template></xsl:when>	<!--/questeq R: equal with questionmark -->
<!--		<xsl:when test="starts-with($content,'&#x02260;&#x0FE00;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02260;&#x0FE00;')"/></xsl:call-template></xsl:when>	not equal, dot -->
		<xsl:when test="starts-with($content,'&#x02260;')"><xsl:value-of select="'\ne '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02260;')"/></xsl:call-template></xsl:when>		<!--/ne /neq R: not equal -->
<!--		<xsl:when test="starts-with($content,'&#x02261;&#x020E5;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02261;&#x020E5;')"/></xsl:call-template></xsl:when>	reverse not equivalent -->
		<xsl:when test="starts-with($content,'&#x02261;')"><xsl:value-of select="'\equiv '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02261;')"/></xsl:call-template></xsl:when>		<!--/equiv R: identical with -->
		<xsl:when test="starts-with($content,'&#x02262;')"><xsl:value-of select="'\not\equiv '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02262;')"/></xsl:call-template></xsl:when>	<!--/nequiv N: not identical with -->
<!--		<xsl:when test="starts-with($content,'&#x02263;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02263;')"/></xsl:call-template></xsl:when>		-->
		<xsl:when test="starts-with($content,'&#x02264;')"><xsl:value-of select="'\le '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02264;')"/></xsl:call-template></xsl:when>		<!--/leq /le R: less-than-or-equal -->
		<xsl:when test="starts-with($content,'&#x02265;')"><xsl:value-of select="'\ge '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02265;')"/></xsl:call-template></xsl:when>		<!--/geq /ge R: greater-than-or-equal -->
		<xsl:when test="starts-with($content,'&#x02266;')"><xsl:value-of select="'\leqq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02266;')"/></xsl:call-template></xsl:when>		<!--/leqq R: less, double equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02267;')"><xsl:value-of select="'\geqq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02267;')"/></xsl:call-template></xsl:when>		<!--/geqq R: greater, double equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02268;')"><xsl:value-of select="'\lneqq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02268;')"/></xsl:call-template></xsl:when>		<!--/lneqq N: less, not double equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02269;')"><xsl:value-of select="'\gneqq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02269;')"/></xsl:call-template></xsl:when>		<!--/gneqq N: greater, not dbl equals --> <!-- Required amssymb -->
<!--		<xsl:when test="starts-with($content,'&#x0226A;&#x00338;&#x0FE00;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226A;&#x00338;&#x0FE00;')"/></xsl:call-template></xsl:when>	not much less than, variant -->
<!--		<xsl:when test="starts-with($content,'&#x0226A;&#x00338;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226A;&#x00338;')"/></xsl:call-template></xsl:when>	not, vert, much less than -->
		<xsl:when test="starts-with($content,'&#x0226A;')"><xsl:value-of select="'\ll '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226A;')"/></xsl:call-template></xsl:when>		<!--/ll R: double less-than sign -->
<!--		<xsl:when test="starts-with($content,'&#x0226B;&#x00338;&#x0FE00;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226B;&#x00338;&#x0FE00;')"/></xsl:call-template></xsl:when>	not much greater than, variant -->
<!--		<xsl:when test="starts-with($content,'&#x0226B;&#x00338;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226B;&#x00338;')"/></xsl:call-template></xsl:when>	not, vert, much greater than -->
		<xsl:when test="starts-with($content,'&#x0226B;')"><xsl:value-of select="'\gg '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226B;')"/></xsl:call-template></xsl:when>		<!--/gg R: dbl greater-than sign -->
		<xsl:when test="starts-with($content,'&#x0226C;')"><xsl:value-of select="'\between '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226C;')"/></xsl:call-template></xsl:when>	<!--/between R: between --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x0226D;')"><xsl:value-of select="'\not\asymp '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226D;')"/></xsl:call-template></xsl:when>
		<xsl:when test="starts-with($content,'&#x0226E;')"><xsl:value-of select="'\nless '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226E;')"/></xsl:call-template></xsl:when>		<!--/nless N: not less-than --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x0226F;')"><xsl:value-of select="'\ngtr '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0226F;')"/></xsl:call-template></xsl:when>		<!--/ngtr N: not greater-than --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02270;&#x020E5;')"><xsl:value-of select="'\nleq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02270;&#x020E5;')"/></xsl:call-template></xsl:when>	<!--/nleq N: not less-than-or-equal --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02270;')"><xsl:value-of select="'\nleqq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02270;')"/></xsl:call-template></xsl:when>		<!--/nleqq N: not less, dbl equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02271;&#x020E5;')"><xsl:value-of select="'\ngeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02271;&#x020E5;')"/></xsl:call-template></xsl:when>	<!--/ngeq N: not greater-than-or-equal --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02271;')"><xsl:value-of select="'\ngeqq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02271;')"/></xsl:call-template></xsl:when>		<!--/ngeqq N: not greater, dbl equals --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02272;')"><xsl:value-of select="'\lesssim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02272;')"/></xsl:call-template></xsl:when>	<!--/lesssim R: less, similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02273;')"><xsl:value-of select="'\gtrsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02273;')"/></xsl:call-template></xsl:when>	<!--/gtrsim R: greater, similar --> <!-- Required amssymb -->		
		<xsl:when test="starts-with($content,'&#x02274;')"><xsl:value-of select="'\not\lesssim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02274;')"/></xsl:call-template></xsl:when>	<!--not less, similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02275;')"><xsl:value-of select="'\not\gtrsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02275;')"/></xsl:call-template></xsl:when>	<!--not greater, similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02276;')"><xsl:value-of select="'\lessgtr '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02276;')"/></xsl:call-template></xsl:when>	<!--/lessgtr R: less, greater --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02277;')"><xsl:value-of select="'\gtrless '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02277;')"/></xsl:call-template></xsl:when>	<!--/gtrless R: greater, less --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02278;')"><xsl:value-of select="'\not\lessgtr '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02278;')"/></xsl:call-template></xsl:when>	<!--not less, greater --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02279;')"><xsl:value-of select="'\not\gtrless '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02279;')"/></xsl:call-template></xsl:when>	<!--not greater, less --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x0227A;')"><xsl:value-of select="'\prec '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0227A;')"/></xsl:call-template></xsl:when>		<!--/prec R: precedes -->
		<xsl:when test="starts-with($content,'&#x0227B;')"><xsl:value-of select="'\succ '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0227B;')"/></xsl:call-template></xsl:when>		<!--/succ R: succeeds -->
		<xsl:when test="starts-with($content,'&#x0227C;')"><xsl:value-of select="'\preccurlyeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0227C;')"/></xsl:call-template></xsl:when>	<!--/preccurlyeq R: precedes, curly eq --> <!-- Required amssymb -->		
		<xsl:when test="starts-with($content,'&#x0227D;')"><xsl:value-of select="'\succcurlyeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0227D;')"/></xsl:call-template></xsl:when>	<!--/succcurlyeq R: succeeds, curly eq --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x0227E;')"><xsl:value-of select="'\precsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0227E;')"/></xsl:call-template></xsl:when>	<!--/precsim R: precedes, similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x0227F;')"><xsl:value-of select="'\succsim '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0227F;')"/></xsl:call-template></xsl:when>	<!--/succsim R: succeeds, similar --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02280;')"><xsl:value-of select="'\nprec '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02280;')"/></xsl:call-template></xsl:when>		<!--/nprec N: not precedes --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02281;')"><xsl:value-of select="'\nsucc '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02281;')"/></xsl:call-template></xsl:when>		<!--/nsucc N: not succeeds --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x02282;')"><xsl:value-of select="'\subset '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02282;')"/></xsl:call-template></xsl:when>	<!--/subset R: subset or is implied by -->
		<xsl:when test="starts-with($content,'&#x02283;')"><xsl:value-of select="'\supset '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02283;')"/></xsl:call-template></xsl:when>	<!--/supset R: superset or implies -->
		<xsl:when test="starts-with($content,'&#x02284;')"><xsl:value-of select="'\not\subset '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02284;')"/></xsl:call-template></xsl:when>	<!--not subset -->
		<xsl:when test="starts-with($content,'&#x02285;')"><xsl:value-of select="'\not\supset '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02285;')"/></xsl:call-template></xsl:when>	<!--not superset -->
		<xsl:when test="starts-with($content,'&#x02286;')"><xsl:value-of select="'\subseteq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02286;')"/></xsl:call-template></xsl:when>	<!--/subseteq R: subset, equals -->
		<xsl:when test="starts-with($content,'&#x02287;')"><xsl:value-of select="'\supseteq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02287;')"/></xsl:call-template></xsl:when>	<!--/supseteq R: superset, equals -->
		<xsl:when test="starts-with($content,'&#x0228E;')"><xsl:value-of select="'\uplus '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0228E;')"/></xsl:call-template></xsl:when>		<!--/uplus B: plus sign in union -->
		<xsl:when test="starts-with($content,'&#x02293;')"><xsl:value-of select="'\sqcap '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02293;')"/></xsl:call-template></xsl:when>		<!--/sqcap B: square intersection -->
		<xsl:when test="starts-with($content,'&#x02294;')"><xsl:value-of select="'\bigsqcup '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02294;')"/></xsl:call-template></xsl:when>		<!--/sqcup B: square union -->
		<xsl:when test="starts-with($content,'&#x02295;')"><xsl:value-of select="'\oplus '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02295;')"/></xsl:call-template></xsl:when>		<!--/oplus B: plus sign in circle -->
		<xsl:when test="starts-with($content,'&#x02296;')"><xsl:value-of select="'\ominus '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02296;')"/></xsl:call-template></xsl:when>	<!--/ominus B: minus sign in circle -->
		<xsl:when test="starts-with($content,'&#x02297;')"><xsl:value-of select="'\otimes '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02297;')"/></xsl:call-template></xsl:when>	<!--/otimes B: multiply sign in circle -->
		<xsl:when test="starts-with($content,'&#x02298;')"><xsl:value-of select="'\oslash '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02298;')"/></xsl:call-template></xsl:when>	<!--/oslash B: solidus in circle -->
<!-- ? -->	<xsl:when test="starts-with($content,'&#x02299;')"><xsl:value-of select="'\odot '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x02299;')"/></xsl:call-template></xsl:when>		<!--/odot B: middle dot in circle --> <!--/bigodot L: circle dot operator -->
		<xsl:when test="starts-with($content,'&#x0229F;')"><xsl:value-of select="'\boxminus '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x0229F;')"/></xsl:call-template></xsl:when>	<!--/boxminus B: minus sign in box --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x022A4;')"><xsl:value-of select="'\top '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022A4;')"/></xsl:call-template></xsl:when>		<!--/top top -->
		<xsl:when test="starts-with($content,'&#x022A5;')"><xsl:value-of select="'\perp '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022A5;')"/></xsl:call-template></xsl:when>		<!--/perp R: perpendicular --><!--/bot bottom -->
		<xsl:when test="starts-with($content,'&#x022A6;')"><xsl:value-of select="'\vdash '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022A6;')"/></xsl:call-template></xsl:when>		<!--/vdash R: vertical, dash -->
		<xsl:when test="starts-with($content,'&#x022A7;')"><xsl:value-of select="'\vDash '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022A7;')"/></xsl:call-template></xsl:when>		<!--/vDash R: vertical, dbl dash --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x022A8;')"><xsl:value-of select="'\models '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022A8;')"/></xsl:call-template></xsl:when>	<!--/models R: -->
		<xsl:when test="starts-with($content,'&#x022AA;')"><xsl:value-of select="'\Vvdash '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022AA;')"/></xsl:call-template></xsl:when>	<!--/Vvdash R: triple vertical, dash --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x022C0;')"><xsl:value-of select="'\bigwedge '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C0;')"/></xsl:call-template></xsl:when>	<!--/bigwedge L: logical or operator -->
		<xsl:when test="starts-with($content,'&#x022C1;')"><xsl:value-of select="'\bigvee '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C1;')"/></xsl:call-template></xsl:when>	<!--/bigcap L: intersection operator -->
		<xsl:when test="starts-with($content,'&#x022C2;')"><xsl:value-of select="'\bigcap '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C2;')"/></xsl:call-template></xsl:when>	<!--/bigvee L: logical and operator -->
		<xsl:when test="starts-with($content,'&#x022C3;')"><xsl:value-of select="'\bigcup '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C3;')"/></xsl:call-template></xsl:when>	<!--/bigcup L: union operator -->
		<xsl:when test="starts-with($content,'&#x022C4;')"><xsl:value-of select="'\diamond '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C4;')"/></xsl:call-template></xsl:when>	<!--/diamond B: open diamond -->
		<xsl:when test="starts-with($content,'&#x022C5;')"><xsl:value-of select="'\cdot '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C5;')"/></xsl:call-template></xsl:when>		<!--/cdot B: small middle dot -->
		<xsl:when test="starts-with($content,'&#x022C6;')"><xsl:value-of select="'\star '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C6;')"/></xsl:call-template></xsl:when>		<!--/star B: small star, filled -->
		<xsl:when test="starts-with($content,'&#x022C7;')"><xsl:value-of select="'\divideontimes '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C7;')"/></xsl:call-template></xsl:when>	<!--/divideontimes B: division on times --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x022C8;')"><xsl:value-of select="'\bowtie '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022C8;')"/></xsl:call-template></xsl:when>	<!--/bowtie R: -->
		<xsl:when test="starts-with($content,'&#x022CD;')"><xsl:value-of select="'\backsimeq '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022CD;')"/></xsl:call-template></xsl:when>	<!--/backsimeq R: reverse similar, eq --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x022EF;')"><xsl:value-of select="'\cdots '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022EF;')"/></xsl:call-template></xsl:when>		<!--/cdots, three dots, centered -->
<!--		<xsl:when test="starts-with($content,'&#x022F0;')"><xsl:value-of select="' '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022F0;')"/></xsl:call-template></xsl:when>		three dots, ascending -->
		<xsl:when test="starts-with($content,'&#x022F1;')"><xsl:value-of select="'\ddots '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x022F1;')"/></xsl:call-template></xsl:when>		<!--/ddots, three dots, descending -->

<!-- ====================================================================== -->		
		<xsl:when test="starts-with($content,'&#x025A1;')"><xsl:value-of select="'\square '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x025A1;')"/></xsl:call-template></xsl:when>	<!--/square, square --> <!-- Required amssymb -->
		<xsl:when test="starts-with($content,'&#x025AA;')"><xsl:value-of select="'\blacksquare '" /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '&#x025AA;')"/></xsl:call-template></xsl:when>	<!--/blacksquare, square, filled  --> <!-- Required amssymb -->
		
		<xsl:when test='starts-with($content,"&apos;")'><xsl:value-of select='"\text{&apos;}"' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select='substring-after($content, "&apos;")'/></xsl:call-template></xsl:when><!-- \text required amslatex -->
		<xsl:when test='starts-with($content,"(")'><xsl:value-of select='"\left("' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '(')"/></xsl:call-template></xsl:when>
		<xsl:when test='starts-with($content,")")'><xsl:value-of select='"\right)"' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, ')')"/></xsl:call-template></xsl:when>
		<xsl:when test='starts-with($content,"[")'><xsl:value-of select='"\left["' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '[')"/></xsl:call-template></xsl:when>
		<xsl:when test='starts-with($content,"]")'><xsl:value-of select='"\right]"' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, ']')"/></xsl:call-template></xsl:when>
		<xsl:when test='starts-with($content,"{")'><xsl:value-of select='"\left\{"' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '{')"/></xsl:call-template></xsl:when>
		<xsl:when test='starts-with($content,"}")'><xsl:value-of select='"\right\}"' /><xsl:call-template name="replaceEntities"><xsl:with-param name="content" select="substring-after($content, '}')"/></xsl:call-template></xsl:when>
		

		<xsl:otherwise>
			<xsl:value-of select="substring($content,1,1)"/>
			<xsl:call-template name="replaceEntities">
				<xsl:with-param name="content" select="substring($content, 2)"/>
			</xsl:call-template>
		</xsl:otherwise>
	</xsl:choose></xsl:if>
</xsl:template>

<xsl:template name="replaceMtextEntities">
	<xsl:param name="content"/>
	<xsl:choose>
	<xsl:when test="contains($content,'&#x02009;&#x0200A;&#x0200A;')">	<!-- ThickSpace - space of width 5/18 em -->
		<xsl:call-template name="replaceMtextEntities">
			<xsl:with-param name="content" select="concat(substring-before($content,'&#x02009;&#x0200A;&#x0200A;'),'\hspace{0.28em}',substring-after($content,'&#x02009;&#x0200A;&#x0200A;'))"/>
		</xsl:call-template>
	</xsl:when>
	<xsl:when test="contains($content,'&#x02009;')">	<!-- ThinSpace - space of width 3/18 em -->
		<xsl:call-template name="replaceMtextEntities">
			<xsl:with-param name="content" select="concat(substring-before($content,'&#x02009;'),'\hspace{0.17em}',substring-after($content,'&#x02009;'))"/>
		</xsl:call-template>
	</xsl:when>
	<xsl:otherwise>
		<xsl:value-of select="normalize-space($content)"/>
	</xsl:otherwise>
	</xsl:choose>
</xsl:template>

</xsl:stylesheet>

================================================
FILE: DomainSpecific/dependency/xsltml_2.0/glayout.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
                version='1.0'>

<!-- ====================================================================== -->
<!-- $id: glayout.xsl, 2002/17/05 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<xsl:template match="m:mfrac">
	<xsl:choose>
		<xsl:when test="@bevelled='true'">
<!--			<xsl:text>\raisebox{1ex}{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}\!\left/ \!\raisebox{-1ex}{</xsl:text>
			<xsl:apply-templates select="./*[2]"/>
			<xsl:text>}\right.</xsl:text>-->
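			<!-- Bevelled rendering is disabled above; fall back to \frac so the braces emitted below stay balanced. -->
			<xsl:text>\frac{</xsl:text>
		</xsl:when>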
		<xsl:when test="@linethickness">
			<xsl:text>\genfrac{}{}{</xsl:text>
			<xsl:choose>
				<xsl:when test="number(@linethickness)">
					<xsl:value-of select="@linethickness div 10"/>
					<xsl:text>ex</xsl:text>
				</xsl:when>
				<xsl:when test="@linethickness='thin'">
					<xsl:text>.05ex</xsl:text>
				</xsl:when>
				<xsl:when test="@linethickness='medium'"/>
				<xsl:when test="@linethickness='thick'">
					<xsl:text>.2ex</xsl:text>
				</xsl:when>
				<xsl:otherwise>
					<xsl:value-of select="@linethickness"/>
				</xsl:otherwise>
			</xsl:choose>
			<xsl:text>}{}{</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:text>\frac{</xsl:text>
		</xsl:otherwise>
	</xsl:choose>
	<xsl:if test="@numalign='right'">
		<xsl:text>\hfill </xsl:text>
	</xsl:if>
	<xsl:apply-templates select="./*[1]"/>
	<xsl:if test="@numalign='left'">
		<xsl:text>\hfill </xsl:text>
	</xsl:if>
	<xsl:text>}{</xsl:text>	
	<xsl:if test="@denomalign='right'">
		<xsl:text>\hfill </xsl:text>
	</xsl:if>
	<xsl:apply-templates select="./*[2]"/>
	<xsl:if test="@denomalign='left'">
		<xsl:text>\hfill </xsl:text>
	</xsl:if>
	<xsl:text>}</xsl:text>
</xsl:template>

<xsl:template match="m:mroot">
	<xsl:choose>
		<xsl:when test="count(./*)=2">
			<xsl:text>\sqrt[</xsl:text>
			<xsl:apply-templates select="./*[2]"/>
			<xsl:text>]{</xsl:text>	
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>	
		</xsl:when>
		<xsl:otherwise>
		<!-- number of arguments is not 2 - code 25 -->
			<xsl:message>exception 25:</xsl:message>
			<xsl:text>\text{exception 25:}</xsl:text> 
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template match="m:msqrt">
	<xsl:text>\sqrt{</xsl:text>
	<xsl:apply-templates/>
	<xsl:text>}</xsl:text>
</xsl:template>

<xsl:template match="m:mfenced">
	<xsl:choose>
		<xsl:when test="@open">
			<xsl:if test="translate(@open,'{}[]()|','{{{{{{{')='{'">
				<xsl:text>\left</xsl:text>
			</xsl:if>
			<xsl:if test="@open='{' or @open='}'">
				<xsl:text>\</xsl:text>
			</xsl:if>
			<xsl:value-of select="@open"/>
		</xsl:when>
		<xsl:otherwise><xsl:text>\left(</xsl:text></xsl:otherwise>
	</xsl:choose>
	<xsl:choose>
		<xsl:when test="count(./*)>1">
			<xsl:variable name="symbol">
				<xsl:choose>
					<xsl:when test="@separators">
						<xsl:call-template name="startspace">
							<xsl:with-param name="symbol" select="@separators"/>
						</xsl:call-template>
					</xsl:when>
					<xsl:otherwise>,</xsl:otherwise>
				</xsl:choose>
			</xsl:variable>
			<xsl:for-each select="./*">
				<xsl:apply-templates select="."/>
				<xsl:if test="not(position()=last())">
					<xsl:choose>
						<xsl:when test="position()>string-length($symbol)">
							<xsl:value-of select="substring($symbol,string-length($symbol))"/>
						</xsl:when>
						<xsl:otherwise>
							<xsl:value-of select="substring($symbol,position(),1)"/>
						</xsl:otherwise>
					</xsl:choose>
				</xsl:if>
			</xsl:for-each>
		</xsl:when>
		<xsl:otherwise>
			<xsl:apply-templates/>
		</xsl:otherwise>
	</xsl:choose>
	<xsl:choose>
		<xsl:when test="@close">
			<xsl:if test="translate(@open,'{}[]()|','{{{{{{{')='{'">
				<xsl:text>\right</xsl:text>
			</xsl:if>
			<xsl:if test="@close='{' or @close='}'">
				<xsl:text>\</xsl:text>
			</xsl:if>		
			<xsl:value-of select="@close"/>
		</xsl:when>
		<xsl:otherwise><xsl:text>\right)</xsl:text></xsl:otherwise>
	</xsl:choose>	
</xsl:template>

<xsl:template match="m:mphantom">
	<xsl:text>\phantom{</xsl:text>
	<xsl:apply-templates/>
	<xsl:text>}</xsl:text>
</xsl:template>

<xsl:template match="m:menclose">
	<xsl:choose>
		<xsl:when test="@notation = 'actuarial'">
			<xsl:text>\overline{</xsl:text>
			<xsl:apply-templates/>
			<xsl:text>\hspace{.2em}|}</xsl:text>
		</xsl:when>
		<xsl:when test="@notation = 'radical'">
			<xsl:text>\sqrt{</xsl:text>
			<xsl:apply-templates/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:text>\overline{)</xsl:text>
			<xsl:apply-templates/>
			<xsl:text>}</xsl:text>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template match="m:mrow">
	<xsl:apply-templates/>
</xsl:template>

<xsl:template match="m:mstyle">
	<xsl:if test="@background">
		<xsl:text>\colorbox[rgb]{</xsl:text>
		<xsl:call-template name="color">
			<xsl:with-param name="color" select="@background"/>
		</xsl:call-template>
		<xsl:text>}{$</xsl:text>
	</xsl:if>
	<xsl:if test="@color">
		<xsl:text>\textcolor[rgb]{</xsl:text>
		<xsl:call-template name="color">
			<xsl:with-param name="color" select="@color"/>
		</xsl:call-template>
		<xsl:text>}{</xsl:text>
	</xsl:if>
	<xsl:apply-templates/>
	<xsl:if test="@color">
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:if test="@background">
		<xsl:text>$}</xsl:text>
	</xsl:if>
</xsl:template>
<!--

<xsl:template match="m:mstyle">
	<xsl:if test="@displaystyle='true'">
		<xsl:text>{\displaystyle</xsl:text>
	</xsl:if>			
	<xsl:if test="@scriptlevel=2">
		<xsl:text>{\scriptscriptstyle</xsl:text>	
	</xsl:if>
	<xsl:apply-templates/>
	<xsl:if test="@scriptlevel=2">
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:if test="@displaystyle='true'">
		<xsl:text>}</xsl:text>
	</xsl:if>
</xsl:template>
-->

<xsl:template match="m:merror">
	<xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

================================================
FILE: DomainSpecific/dependency/xsltml_2.0/mmltex.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
                version='1.0'>
                
<xsl:output method="text" indent="no" encoding="UTF-8"/>

<!-- ====================================================================== -->
<!-- $id: mmltex.xsl, 2002/22/11 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<xsl:include href="tokens.xsl"/>
<xsl:include href="glayout.xsl"/>
<xsl:include href="scripts.xsl"/>
<xsl:include href="tables.xsl"/>
<xsl:include href="entities.xsl"/>
<xsl:include href="cmarkup.xsl"/>

<!-- Note: variables colora (template color) and symbola (template startspace) are needed only for Sablotron -->

<xsl:template name="startspace">
	<xsl:param name="symbol"/>
	<xsl:if test="contains($symbol,' ')">
		<xsl:variable name="symbola" select="concat(substring-before($symbol,' '),substring-after($symbol,' '))"/>
		<xsl:call-template name="startspace">
			<xsl:with-param name="symbol" select="$symbola"/>
		</xsl:call-template>
	</xsl:if>
	<xsl:if test="not(contains($symbol,' '))">
		<xsl:value-of select="$symbol"/>
	</xsl:if>
</xsl:template>

<xsl:strip-space elements="m:*"/>

<xsl:template match="m:math">
	<xsl:text>&#x00024;</xsl:text>
	<xsl:apply-templates/>
	<xsl:text>&#x00024;</xsl:text>
</xsl:template>

</xsl:stylesheet>

================================================
FILE: DomainSpecific/dependency/xsltml_2.0/scripts.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
                version='1.0'>
                
<!-- ====================================================================== -->
<!-- $Id: scripts.xsl,v 1.1.1.1 2002/10/26 14:20:06 shade33 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<xsl:template match="m:munderover">
	<xsl:variable name="base">
		<xsl:call-template name="startspace">
			<xsl:with-param name="symbol" select="./*[1]"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="under">
		<xsl:call-template name="startspace">
			<xsl:with-param name="symbol" select="./*[2]"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="over">
		<xsl:call-template name="startspace">
			<xsl:with-param name="symbol" select="./*[3]"/>
		</xsl:call-template>
	</xsl:variable>
	
	<xsl:choose>
		<xsl:when test="$over='&#x000AF;'">	<!-- OverBar - over bar -->
			<xsl:text>\overline{</xsl:text>
			<xsl:call-template name="munder">
				<xsl:with-param name="base" select="$base"/>
				<xsl:with-param name="under" select="$under"/>
			</xsl:call-template>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="$over='&#x0FE37;'">	<!-- OverBrace - over brace -->
			<xsl:text>\overbrace{</xsl:text>
			<xsl:call-template name="munder">
				<xsl:with-param name="base" select="$base"/>
				<xsl:with-param name="under" select="$under"/>
			</xsl:call-template>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="$under='&#x00332;'">	<!-- UnderBar - combining low line -->
			<xsl:text>\underline{</xsl:text>
			<xsl:call-template name="mover">
				<xsl:with-param name="base" select="$base"/>
				<xsl:with-param name="over" select="$over"/>
				<xsl:with-param name="pos_over" select="3"/>
			</xsl:call-template>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="$under='&#x0FE38;'">	<!-- UnderBrace - under brace -->
			<xsl:text>\underbrace{</xsl:text>
			<xsl:call-template name="mover">
				<xsl:with-param name="base" select="$base"/>
				<xsl:with-param name="over" select="$over"/>
				<xsl:with-param name="pos_over" select="3"/>
			</xsl:call-template>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="translate($base,'&#x0220F;&#x02210;&#x022c2;&#x022c3;&#x02294;',
						'&#x02211;&#x02211;&#x02211;&#x02211;&#x02211;')='&#x02211;'">
<!-- if $base is operator, such as
			&#x02211;	/sum L: summation operator
			&#x0220F;	/prod L: product operator
			&#x02210;	/coprod L: coproduct operator
			&#x022c2;	/bigcap
			&#x022c3;	/bigcup
			&#x02294;	/bigsqcup
-->
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>_{</xsl:text>
			<xsl:apply-templates select="./*[2]"/>
			<xsl:text>}^{</xsl:text>
			<xsl:apply-templates select="./*[3]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:text>\underset{</xsl:text>
			<xsl:apply-templates select="./*[2]"/>
			<xsl:text>}{\overset{</xsl:text>
			<xsl:apply-templates select="./*[3]"/>
			<xsl:text>}{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}}</xsl:text>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template match="m:mover">
	<xsl:call-template name="mover">
		<xsl:with-param name="base">
			<xsl:call-template name="startspace">
				<xsl:with-param name="symbol" select="./*[1]"/>
			</xsl:call-template>
		</xsl:with-param>
		<xsl:with-param name="over">
			<xsl:call-template name="startspace">
				<xsl:with-param name="symbol" select="./*[2]"/>
			</xsl:call-template>
		</xsl:with-param>
	</xsl:call-template>
</xsl:template>

<xsl:template match="m:munder">
	<xsl:call-template name="munder">
		<xsl:with-param name="base">
			<xsl:call-template name="startspace">
				<xsl:with-param name="symbol" select="./*[1]"/>
			</xsl:call-template>
		</xsl:with-param>
		<xsl:with-param name="under">
			<xsl:call-template name="startspace">
				<xsl:with-param name="symbol" select="./*[2]"/>
			</xsl:call-template>
		</xsl:with-param>
	</xsl:call-template>
</xsl:template>

<xsl:template name="mover">
	<xsl:param name="base"/>
	<xsl:param name="over"/>
	<xsl:param name="pos_over" select="2"/>
	<xsl:choose>
		<xsl:when test="$over='&#x000AF;'">	<!-- OverBar - over bar -->
			<xsl:text>\overline{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="$over='&#x0FE37;'">	<!-- OverBrace - over brace -->
			<xsl:text>\overbrace{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="translate($base,'&#x0220F;&#x02210;&#x022c2;&#x022c3;&#x02294;',
						'&#x02211;&#x02211;&#x02211;&#x02211;&#x02211;')='&#x02211;'">
<!-- if $base is operator, such as
			&#x02211;	/sum L: summation operator
			&#x0220F;	/prod L: product operator
			&#x02210;	/coprod L: coproduct operator
			&#x022c2;	/bigcap
			&#x022c3;	/bigcup
			&#x02294;	/bigsqcup
-->
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>^{</xsl:text>
			<xsl:apply-templates select="./*[$pos_over]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:text>\stackrel{</xsl:text>
			<xsl:apply-templates select="./*[$pos_over]"/>
			<xsl:text>}{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>
			<!--
			<xsl:text>\overset{</xsl:text>
			<xsl:apply-templates select="./*[$pos_over]"/>
			<xsl:text>}{</xsl:text>	
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>-->
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template name="munder">
	<xsl:param name="base"/>
	<xsl:param name="under"/>
	<xsl:choose>
		<xsl:when test="$under='&#x00332;'">	<!-- UnderBar - combining low line -->
			<xsl:text>\underline{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="$under='&#x0FE38;'">	<!-- UnderBrace - under brace -->
			<xsl:text>\underbrace{</xsl:text>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:when test="translate($base,'&#x0220F;&#x02210;&#x022c2;&#x022c3;&#x02294;',
						'&#x02211;&#x02211;&#x02211;&#x02211;&#x02211;')='&#x02211;'">
<!-- if $base is operator, such as
			&#x02211;	/sum L: summation operator
			&#x0220F;	/prod L: product operator
			&#x02210;	/coprod L: coproduct operator
			&#x022c2;	/bigcap
			&#x022c3;	/bigcup
			&#x02294;	/bigsqcup
-->
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>_{</xsl:text>
			<xsl:apply-templates select="./*[2]"/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:text>\underset{</xsl:text>		<!-- Required AmsMath package -->
			<xsl:apply-templates select="./*[2]"/>
			<xsl:text>}{</xsl:text>	
			<xsl:apply-templates select="./*[1]"/>
			<xsl:text>}</xsl:text>	
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template match="m:msubsup">
	<xsl:text>{</xsl:text>	
	<xsl:apply-templates select="./*[1]"/>
	<xsl:text>}_{</xsl:text>
	<xsl:apply-templates select="./*[2]"/>
	<xsl:text>}^{</xsl:text>	
	<xsl:apply-templates select="./*[3]"/>
	<xsl:text>}</xsl:text>	
</xsl:template>

<xsl:template match="m:msup">
	<xsl:text>{</xsl:text>	
	<xsl:apply-templates select="./*[1]"/>
	<xsl:text>}^{</xsl:text>	
	<xsl:apply-templates select="./*[2]"/>
	<xsl:text>}</xsl:text>	
</xsl:template>

<xsl:template match="m:msub">
	<xsl:text>{</xsl:text>	
	<xsl:apply-templates select="./*[1]"/>
	<xsl:text>}_{</xsl:text>	
	<xsl:apply-templates select="./*[2]"/>
	<xsl:text>}</xsl:text>	
</xsl:template>

<xsl:template match="m:mmultiscripts" mode="mprescripts">
	<xsl:for-each select="m:mprescripts/following-sibling::*">
		<xsl:if test="position() mod 2 and local-name(.)!='none'">
			<xsl:text>{}_{</xsl:text>	
			<xsl:apply-templates select="."/>
			<xsl:text>}</xsl:text>	
		</xsl:if>
		<xsl:if test="not(position() mod 2) and local-name(.)!='none'">
			<xsl:text>{}^{</xsl:text>	
			<xsl:apply-templates select="."/>
			<xsl:text>}</xsl:text>	
		</xsl:if>
	</xsl:for-each>
	<xsl:apply-templates select="./*[1]"/>
	<xsl:for-each select="m:mprescripts/preceding-sibling::*[position()!=last()]">
		<xsl:if test="position()>2 and local-name(.)!='none'">
			<xsl:text>{}</xsl:text>	
		</xsl:if>
		<xsl:if test="position() mod 2 and local-name(.)!='none'">
			<xsl:text>_{</xsl:text>	
			<xsl:apply-templates select="."/>
			<xsl:text>}</xsl:text>	
		</xsl:if>
		<xsl:if test="not(position() mod 2) and local-name(.)!='none'">
			<xsl:text>^{</xsl:text>	
			<xsl:apply-templates select="."/>
			<xsl:text>}</xsl:text>	
		</xsl:if>
	</xsl:for-each>
</xsl:template>

<xsl:template match="m:mmultiscripts">
	<xsl:choose>
		<xsl:when test="m:mprescripts">
			<xsl:apply-templates select="." mode="mprescripts"/>
		</xsl:when>
		<xsl:otherwise>
			<xsl:apply-templates select="./*[1]"/>
			<xsl:for-each select="*[position()>1]">
				<xsl:if test="position()>2 and local-name(.)!='none'">
					<xsl:text>{}</xsl:text>	
				</xsl:if>
				<xsl:if test="position() mod 2 and local-name(.)!='none'">
					<xsl:text>_{</xsl:text>	
					<xsl:apply-templates select="."/>
					<xsl:text>}</xsl:text>	
				</xsl:if>
				<xsl:if test="not(position() mod 2) and local-name(.)!='none'">
					<xsl:text>^{</xsl:text>	
					<xsl:apply-templates select="."/>
					<xsl:text>}</xsl:text>	
				</xsl:if>
			</xsl:for-each>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

</xsl:stylesheet>

================================================
FILE: DomainSpecific/dependency/xsltml_2.0/tables.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
                version='1.0'>
                
<!-- ====================================================================== -->
<!-- $id: tables.xsl, 2002/17/05 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<xsl:template match="m:mtd[@columnspan]">
	<xsl:text>\multicolumn{</xsl:text>
	<xsl:value-of select="@columnspan"/>
	<xsl:text>}{c}{</xsl:text>
	<xsl:apply-templates/>
	<xsl:text>}</xsl:text>
	<xsl:if test="count(following-sibling::*)>0">
		<xsl:text>&amp; </xsl:text>
	</xsl:if>
</xsl:template>


<xsl:template match="m:mtd">
	<xsl:if test="@columnalign='right' or @columnalign='center'">
		<xsl:text>\hfill </xsl:text>
	</xsl:if>
	<xsl:apply-templates/>
	<xsl:if test="@columnalign='left' or @columnalign='center'">
		<xsl:text>\hfill </xsl:text>
	</xsl:if>
	<xsl:if test="count(following-sibling::*)>0">
<!--    this test is valid for Sablotron; an alternative form is test="not(position()=last())".
	The same applies to m:mtd[@columnspan] and m:mtr  -->
		<xsl:text>&amp; </xsl:text>
	</xsl:if>
</xsl:template>

<xsl:template match="m:mtr">
	<xsl:apply-templates/>
	<xsl:if test="count(following-sibling::*)>0">
		<xsl:text>\\ </xsl:text>
	</xsl:if>
</xsl:template>

<xsl:template match="m:mtable">
	<xsl:text>\begin{array}{</xsl:text>
	<xsl:if test="@frame='solid'">
		<xsl:text>|</xsl:text>
	</xsl:if>
	<xsl:variable name="numbercols" select="count(./m:mtr[1]/m:mtd[not(@columnspan)])+sum(./m:mtr[1]/m:mtd/@columnspan)"/>
	<xsl:choose>
		<xsl:when test="@columnalign">
			<xsl:variable name="colalign">
				<xsl:call-template name="colalign">
					<xsl:with-param name="colalign" select="@columnalign"/>
				</xsl:call-template>
			</xsl:variable>
			<xsl:choose>
				<xsl:when test="string-length($colalign) > $numbercols">
					<xsl:value-of select="substring($colalign,1,$numbercols)"/>
				</xsl:when>
				<xsl:when test="string-length($colalign) &lt; $numbercols">
					<xsl:value-of select="$colalign"/>
					<xsl:call-template name="generate-string">
						<xsl:with-param name="text" select="substring($colalign,string-length($colalign))"/>
						<xsl:with-param name="count" select="$numbercols - string-length($colalign)"/>
					</xsl:call-template>
				</xsl:when>
				<xsl:otherwise>
					<xsl:value-of select="$colalign"/>
				</xsl:otherwise>
			</xsl:choose>
		</xsl:when>
		<xsl:otherwise>
			<xsl:call-template name="generate-string">
				<xsl:with-param name="text" select="'c'"/>
				<xsl:with-param name="count" select="$numbercols"/>
			</xsl:call-template>
		</xsl:otherwise>
	</xsl:choose>
	<xsl:if test="@frame='solid'">
		<xsl:text>|</xsl:text>
	</xsl:if>
	<xsl:text>}</xsl:text>
	<xsl:if test="@frame='solid'">
		<xsl:text>\hline </xsl:text>
	</xsl:if>
	<xsl:apply-templates/>
	<xsl:if test="@frame='solid'">
		<xsl:text>\\ \hline</xsl:text>
	</xsl:if>
	<xsl:text>\end{array}</xsl:text>
</xsl:template>

<xsl:template name="colalign">
	<xsl:param name="colalign"/>
	<xsl:choose>
		<xsl:when test="contains($colalign,' ')">
			<xsl:value-of select="substring($colalign,1,1)"/>
			<xsl:call-template name="colalign">
				<xsl:with-param name="colalign" select="substring-after($colalign,' ')"/>
			</xsl:call-template>
		</xsl:when>
		<xsl:otherwise>
			<xsl:value-of select="substring($colalign,1,1)"/>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template name="generate-string">
<!-- template from XSLT Standard Library v1.1 -->
    <xsl:param name="text"/>
    <xsl:param name="count"/>

    <xsl:choose>
      <xsl:when test="string-length($text) = 0 or $count &lt;= 0"/>

      <xsl:otherwise>
	<xsl:value-of select="$text"/>
	<xsl:call-template name="generate-string">
	  <xsl:with-param name="text" select="$text"/>
	  <xsl:with-param name="count" select="$count - 1"/>
	</xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
</xsl:template>

</xsl:stylesheet>

================================================
FILE: DomainSpecific/dependency/xsltml_2.0/tokens.xsl
================================================
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:m="http://www.w3.org/1998/Math/MathML"
                version='1.0'>
                
<!-- ====================================================================== -->
<!-- $id: tokens.xsl, 2002/22/11 Exp $
     This file is part of the XSLT MathML Library distribution.
     See ./README or http://www.raleigh.ru/MathML/mmltex for
     copyright and other information                                        -->
<!-- ====================================================================== -->

<xsl:template match="m:mi|m:mn|m:mo|m:mtext|m:ms">
	<xsl:call-template name="CommonTokenAtr"/>
</xsl:template>

<xsl:template name="mi">
	<xsl:choose>
		<xsl:when test="string-length(normalize-space(.))>1 and not(@mathvariant)">
			<xsl:text>\mathrm{</xsl:text>
				<xsl:apply-templates/>
			<xsl:text>}</xsl:text>
		</xsl:when>
		<xsl:otherwise>
			<xsl:apply-templates/>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template name="mn">
	<xsl:apply-templates/>
</xsl:template>

<xsl:template name="mo">
	<xsl:apply-templates/>
</xsl:template>

<xsl:template name="mtext">
	<xsl:variable name="content">
		<xsl:call-template name="replaceMtextEntities">
			<xsl:with-param name="content" select="."/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:text>\text{</xsl:text>
	<xsl:value-of select="$content"/>
	<xsl:text>}</xsl:text>
</xsl:template>

<xsl:template match="m:mspace">
	<xsl:text>\phantom{\rule</xsl:text>
	<xsl:if test="@depth">
		<xsl:text>[-</xsl:text>
		<xsl:value-of select="@depth"/>
		<xsl:text>]</xsl:text>
	</xsl:if>
	<xsl:text>{</xsl:text>
	<xsl:if test="not(@width)">
		<xsl:text>0ex</xsl:text>
	</xsl:if>
	<xsl:value-of select="@width"/>
	<xsl:text>}{</xsl:text>
	<xsl:if test="not(@height)">
		<xsl:text>0ex</xsl:text>
	</xsl:if>
	<xsl:value-of select="@height"/>
	<xsl:text>}}</xsl:text>
</xsl:template>

<xsl:template name="ms">
	<xsl:choose>
		<xsl:when test="@lquote"><xsl:value-of select="@lquote"/></xsl:when>
		<xsl:otherwise><xsl:text>"</xsl:text></xsl:otherwise>
	</xsl:choose><xsl:apply-templates/><xsl:choose>
		<xsl:when test="@rquote"><xsl:value-of select="@rquote"/></xsl:when>
		<xsl:otherwise><xsl:text>"</xsl:text></xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template name="CommonTokenAtr">
	<xsl:if test="@mathbackground">
		<xsl:text>\colorbox[rgb]{</xsl:text>
		<xsl:call-template name="color">
			<xsl:with-param name="color" select="@mathbackground"/>
		</xsl:call-template>
		<xsl:text>}{$</xsl:text>
	</xsl:if>
	<xsl:if test="@color or @mathcolor"> <!-- Note: @color is deprecated in MathML 2.0 -->
		<xsl:text>\textcolor[rgb]{</xsl:text>
		<xsl:call-template name="color">
			<xsl:with-param name="color" select="@color|@mathcolor"/>
		</xsl:call-template>
		<xsl:text>}{</xsl:text>
	</xsl:if>
	<xsl:if test="@mathvariant">
		<xsl:choose>
			<xsl:when test="@mathvariant='normal'">
				<xsl:text>\mathrm{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='bold'">
				<xsl:text>\mathbf{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='italic'">
				<xsl:text>\mathit{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='bold-italic'">	<!-- Required definition -->
				<xsl:text>\mathbit{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='double-struck'">	<!-- Required amsfonts -->
				<xsl:text>\mathbb{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='bold-fraktur'">	<!-- Error -->
				<xsl:text>{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='script'">
				<xsl:text>\mathcal{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='bold-script'">	<!-- Error -->
				<xsl:text>\mathsc{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='fraktur'">	<!-- Required amsfonts -->
				<xsl:text>\mathfrak{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='sans-serif'">
				<xsl:text>\mathsf{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='bold-sans-serif'"> <!-- Required definition -->
				<xsl:text>\mathbsf{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='sans-serif-italic'"> <!-- Required definition -->
				<xsl:text>\mathsfit{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='sans-serif-bold-italic'">	<!-- Error -->
				<xsl:text>\mathbsfit{</xsl:text>
			</xsl:when>
			<xsl:when test="@mathvariant='monospace'">
				<xsl:text>\mathtt{</xsl:text>
			</xsl:when>
			<xsl:otherwise>
				<xsl:text>{</xsl:text>
			</xsl:otherwise>
		</xsl:choose>
	</xsl:if>
	<xsl:call-template name="selectTemplate"/>
	<xsl:if test="@mathvariant">
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:if test="@color or @mathcolor">
		<xsl:text>}</xsl:text>
	</xsl:if>
	<xsl:if test="@mathbackground">
		<xsl:text>$}</xsl:text>
	</xsl:if>
</xsl:template>

<xsl:template name="selectTemplate">
<!--	<xsl:variable name="name" select="local-name()"/>
	<xsl:call-template name="{$name}"/>-->
	<xsl:choose>
		<xsl:when test="local-name(.)='mi'">
			<xsl:call-template name="mi"/>
		</xsl:when>
		<xsl:when test="local-name(.)='mn'">
			<xsl:call-template name="mn"/>
		</xsl:when>
		<xsl:when test="local-name(.)='mo'">
			<xsl:call-template name="mo"/>
		</xsl:when>
		<xsl:when test="local-name(.)='mtext'">
			<xsl:call-template name="mtext"/>
		</xsl:when>
		<xsl:when test="local-name(.)='ms'">
			<xsl:call-template name="ms"/>
		</xsl:when>
	</xsl:choose>
</xsl:template>

<xsl:template name="color">
<!-- NB: Variables colora and valueColor{n} only for Sablotron -->
	<xsl:param name="color"/>
	<xsl:variable name="colora" select="translate($color,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')"/>
	<xsl:choose>
	<xsl:when test="starts-with($colora,'#') and string-length($colora)=4">
		<xsl:variable name="valueColor">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,2,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:value-of select="$valueColor div 15"/><xsl:text>,</xsl:text>
		<xsl:variable name="valueColor1">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,3,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:value-of select="$valueColor1 div 15"/><xsl:text>,</xsl:text>
		<xsl:variable name="valueColor2">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,4,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:value-of select="$valueColor2 div 15"/>
	</xsl:when>
	<xsl:when test="starts-with($colora,'#') and string-length($colora)=7">
		<xsl:variable name="valueColor1">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,2,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:variable name="valueColor2">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,3,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:value-of select="($valueColor1*16 + $valueColor2) div 255"/><xsl:text>,</xsl:text>
		<xsl:variable name="valueColor1a">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,4,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:variable name="valueColor2a">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,5,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:value-of select="($valueColor1a*16 + $valueColor2a) div 255"/><xsl:text>,</xsl:text>
		<xsl:variable name="valueColor1b">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,6,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:variable name="valueColor2b">
			<xsl:call-template name="Hex2Decimal">
				<xsl:with-param name="arg" select="substring($colora,7,1)"/>
			</xsl:call-template>
		</xsl:variable>
		<xsl:value-of select="($valueColor1b*16 + $valueColor2b) div 255"/>
	</xsl:when>
<!-- ======================= if color specified as an html-color-name ========================================== -->
	<xsl:when test="$colora='aqua'"><xsl:text>0,1,1</xsl:text></xsl:when>
	<xsl:when test="$colora='black'"><xsl:text>0,0,0</xsl:text></xsl:when>
	<xsl:when test="$colora='blue'"><xsl:text>0,0,1</xsl:text></xsl:when>
	<xsl:when test="$colora='fuchsia'"><xsl:text>1,0,1</xsl:text></xsl:when>
	<xsl:when test="$colora='gray'"><xsl:text>.5,.5,.5</xsl:text></xsl:when>
	<xsl:when test="$colora='green'"><xsl:text>0,.5,0</xsl:text></xsl:when>
	<xsl:when test="$colora='lime'"><xsl:text>0,1,0</xsl:text></xsl:when>
	<xsl:when test="$colora='maroon'"><xsl:text>.5,0,0</xsl:text></xsl:when>
	<xsl:when test="$colora='navy'"><xsl:text>0,0,.5</xsl:text></xsl:when>
	<xsl:when test="$colora='olive'"><xsl:text>.5,.5,0</xsl:text></xsl:when>
	<xsl:when test="$colora='purple'"><xsl:text>.5,0,.5</xsl:text></xsl:when>
	<xsl:when test="$colora='red'"><xsl:text>1,0,0</xsl:text></xsl:when>
	<xsl:when test="$colora='silver'"><xsl:text>.75,.75,.75</xsl:text></xsl:when>
	<xsl:when test="$colora='teal'"><xsl:text>0,.5,.5</xsl:text></xsl:when>
	<xsl:when test="$colora='white'"><xsl:text>1,1,1</xsl:text></xsl:when>
	<xsl:when test="$colora='yellow'"><xsl:text>1,1,0</xsl:text></xsl:when>
	<xsl:otherwise>
		<xsl:message>Exception at color template</xsl:message>
	</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template name="Hex2Decimal">
	<xsl:param name="arg"/>
	<xsl:choose>
		<xsl:when test="$arg='f'">
			<xsl:value-of select="15"/>
		</xsl:when>
		<xsl:when test="$arg='e'">
			<xsl:value-of select="14"/>
		</xsl:when>
		<xsl:when test="$arg='d'">
			<xsl:value-of select="13"/>
		</xsl:when>
		<xsl:when test="$arg='c'">
			<xsl:value-of select="12"/>
		</xsl:when>
		<xsl:when test="$arg='b'">
			<xsl:value-of select="11"/>
		</xsl:when>
		<xsl:when test="$arg='a'">
			<xsl:value-of select="10"/>
		</xsl:when>
		<xsl:when test="translate($arg, '0123456789', '9999999999')='9'"> <!-- if $arg is number -->
			<xsl:value-of select="$arg"/>
		</xsl:when>
		<xsl:otherwise>
			<xsl:message>Exception at Hex2Decimal template</xsl:message>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

<xsl:template match="m:*/text()">
	<xsl:call-template name="replaceEntities">
		<xsl:with-param name="content" select="normalize-space()"/>
	</xsl:call-template>
</xsl:template>

</xsl:stylesheet>

================================================
FILE: DomainSpecific/readme.md
================================================
# Domain-specific Knowledge Extraction from CommonCrawl

## Introduction 
Developing data workflows for specific requirements in distributed computing environments can be challenging for data engineers. They often face the following hurdles:

 - Learning to use distributed computing platforms from scratch.
 - Developing data processing modules, even when many are standard and reusable.
 - Constructing data pipelines by assembling these modules into their workflows.

A unified framework can mitigate many of these challenges. To that end, we propose the DataNetwork project, which aims to let engineers meet customized and diverse data requirements efficiently using distributed computing resources and shared data storage.

## Getting Started
This section guides you through setting up and running the DataNetwork framework on your system.
The framework has been tested in the environment listed below. Other operating systems, such as Ubuntu 18.04/22.04 or Windows, should work in principle but have not been tested yet.

1.	Environment
 - [Ubuntu-20.04.1](https://ubuntu.com/download/desktop)
 - [Git-2.41.0](https://git-scm.com/downloads)
 - [Git-lfs-3.4.0](https://git-lfs.com/)
 - [Conda-23.3.1](https://conda.io/projects/conda/en/stable/user-guide/install/download.html)
 - [Python-3.10.14](https://www.python.org/downloads/)
 - Python dependencies in [requirements.txt](requirements.txt) file

2.  Installation
 
```
# Install the required libraries.
pip install -r requirements.txt
```
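
To confirm that the pinned versions were actually installed, a check along the following lines may help. This is a minimal sketch and not part of the framework: the helper names `parse_pins` and `find_mismatches` are hypothetical, and it assumes every relevant line of requirements.txt has the `name==version` form.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed_or_None)} for pins that are not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

For example, `find_mismatches(parse_pins(open("requirements.txt")))` returns an empty dict when every pinned package is installed at the requested version.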

3. Download filters
   
Download the filtering models used for domain-specific data processing [here](https://drive.google.com/file/d/1TQ112I1rjNqkH8acmile7i9ERQzSEmC4/view?usp=sharing), then extract the archive. For sample code that applies these models, see core/layers/transform/{math/mcq/openquestion}_filter_layer.py.

```
tar -zxvf models.tar.gz
rm models.tar.gz
```
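
Where the `tar` command is not available, the same unpack-and-delete step could be scripted in Python. This is only a sketch: the helper name `unpack_models` is hypothetical, and the archive name `models.tar.gz` is taken from the commands above.

```python
import os
import tarfile


def unpack_models(archive_path="models.tar.gz", target_dir="."):
    """Extract a .tar.gz archive into target_dir, then delete the archive.

    Hypothetical helper mirroring `tar -zxvf` followed by `rm`.
    Returns the list of member names found in the archive.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target_dir)
    # Remove the archive only after extraction has finished without errors.
    os.remove(archive_path)
    return names
```

As with any archive, extract it only if it comes from a source you trust.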

4.	Usage:

```
# On first use, the runtime dependencies are installed and an 'env_ready' file is generated.
python submit.py --network_path=${network_path} --run_mode=${run_mode} --computation_path=${computation_path} --storage_path=${storage_path} --docker_path=${docker_path}
``` 

 - network_path: path to the configuration file that defines an instance of a data network.
 - run_mode: running mode of the data network; supported values are Single, MultiProcess, and Batch.
 - computation_path: path to the settings file that describes the computation resource.
 - storage_path: path to the settings file that describes the storage resource.
 - docker_path: path to the settings file that describes the environment resource (not implemented yet; it can be ignored).
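
To make the three run modes concrete, the sketch below reproduces how `wrapper/runner.py` assigns worker IDs in each mode. It is a simplified illustration rather than the framework's API: the function name `plan_workers` is hypothetical, and in the real Batch mode the values of `node_id`, `node_num`, and `process_per_node` are read from the `NODE_ID`, `NODE_NUM`, and `PROCESS_PER_NODE` environment variables set on each task.

```python
def plan_workers(run_mode, worker_num=1, node_id=0, node_num=1, process_per_node=1):
    """Return (worker_ids, total_workers, parallel) for the current machine.

    Hypothetical helper that mirrors the branching in wrapper/runner.py.
    """
    if run_mode == "Single":
        # All workers run one after another inside the current process.
        return list(range(worker_num)), worker_num, False
    if run_mode == "MultiProcess":
        # One local process is started per worker.
        return list(range(worker_num)), worker_num, True
    if run_mode == "Batch":
        # Each node runs a contiguous block of process_per_node worker IDs.
        first = process_per_node * node_id
        total = process_per_node * node_num
        return list(range(first, first + process_per_node)), total, True
    raise ValueError(f"Unknown run mode: {run_mode}")


print(plan_workers("Batch", node_id=1, node_num=3, process_per_node=2))
# prints ([2, 3], 6, True)
```

In Batch mode with 3 nodes and 2 processes per node, node 1 therefore handles workers 2 and 3 out of 6 in total.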

5.	Examples:
 - Toy Sample:
```
# Run this command first to verify that the installation is correct.
# If it fails (for example, because of a mismatched environment), manually fix the missing dependencies in the dependency/requirements.txt file.
python submit.py --network_path=./configs/network_template.json --run_mode=Single
```

 - Domain-specific Knowledge Data Extraction from CommonCrawl:
```
# Refer to sample_run.sh script for details.
bash sample_run.sh
```
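
Before submitting a custom network, it can help to check that its configuration file contains the top-level fields the interpreter expects. The sketch below is an illustration under stated assumptions: the field list is copied from `wrapper/interpreter.py`, while the helpers `missing_fields` and `load_and_check` and the sample dictionary are hypothetical and not part of the framework.

```python
import json

# Top-level fields that wrapper/interpreter.py asserts are present in a config.
REQUIRED_FIELDS = ("name", "description", "date", "version",
                   "author", "input", "output", "layer")


def missing_fields(config):
    """Return the required top-level fields that are absent from config."""
    return [field for field in REQUIRED_FIELDS if field not in config]


def load_and_check(config_path):
    """Load a JSON network config and return its missing top-level fields."""
    with open(config_path, "r") as f:
        return missing_fields(json.load(f))


# Hypothetical, incomplete config used only to illustrate the check.
example = {"name": "demo", "input": {}, "output": {}, "layer": {}}
print(missing_fields(example))
# prints ['description', 'date', 'version', 'author']
```

An empty list means the config passes this first check; the interpreter additionally validates data types, layer types, and joints.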


================================================
FILE: DomainSpecific/requirements.txt
================================================
pyyaml==6.0
wheel==0.43.0
setuptools==70.0.0
azure-ai-ml==1.16.0
azure-batch==14.2.0
azure-identity==1.16.1
azure-storage-blob==12.19.1


================================================
FILE: DomainSpecific/resources/computation/batch_dca_eastus.yaml
================================================
# To be filled.
batch_url: ${batch_url}
batch_pool_id: ${pool_id}
batch_node_num: ${node_num}
batch_process_per_node: ${process_per_node}


================================================
FILE: DomainSpecific/resources/computation/local.yaml
================================================
#worker_num: 1
worker_num: 2


================================================
FILE: DomainSpecific/resources/environment/amlt_sing.yaml
================================================
name: datanetwork
description: Environment for DataNetwork
# To be filled.
image: ${image_repo}


================================================
FILE: DomainSpecific/resources/environment/local.yaml
================================================
name: datanetwork
description: Environment for DataNetwork
image: local


================================================
FILE: DomainSpecific/resources/storage/llmstore.yaml
================================================
allow-other: true

logging:
  type: syslog
  level: log_debug

components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

libfuse:
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240

file_cache:
  path: /mnt/resource/blobfusetmp
  timeout-sec: 360
  max-size-mb: 4096

attr_cache:
  timeout-sec: 7200

# To be filled.
azstorage:
  type: adls
  account-name: ${account_name}
  container: ${container_name}
  endpoint: ${az_storage_endpoint}
  mode: msi
  appid: ${appid}

# To be filled.
resource_id: ${resource_id}

# The upper part is configuration of azure storage account.

workspace_dir: ./workspace/
mount: true


================================================
FILE: DomainSpecific/resources/storage/local.yaml
================================================
workspace_dir: ./workspace/
mount: false


================================================
FILE: DomainSpecific/sample_run.sh
================================================
#!/usr/bin/env bash

# --------------------------------------------------------------------------------------------------------------
# Part 1 - knowledge extraction from html page.
# step 1 - download CC warc url list.
# Put one or more Common Crawl WARC file URLs into the workspace/urls.CC-MAIN-2023-23.txt file,
# for example:
wget -P workspace https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/warc.paths.gz
gzip -d workspace/warc.paths.gz
mv workspace/warc.paths workspace/urls.CC-MAIN-2023-23.txt

# step 2 - download CC warc.
python submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_download.CC-MAIN-2023-23.json
cat ./workspace/cc_warcs/CC-MAIN-2023-23/paths.*.txt > ./workspace/cc_warcs/CC-MAIN-2023-23/paths.txt

# step 3 - prefilter CC warc.
python submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_filter.CC-MAIN-2023-23.json
cat ./workspace/cc_filtered_warc/CC-MAIN-2023-23/paths.*.txt > ./workspace/cc_filtered_warc/CC-MAIN-2023-23/paths.txt

# step 4 - extract code from html tag.
python submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_to_wet.code.CC-MAIN-2023-23.json

# step 5 - extract math from html tag.
python submit.py --run_mode MultiProcess --network_path ./configs/cc_warc_to_wet.math.CC-MAIN-2023-23.json

# --------------------------------------------------------------------------------------------------------------
# Part 2 - knowledge extraction from text page.
# extract text doc from CC html doc, filter text doc, and save them to parquet files.
# please refer to GeneralDomain processing to get the text pages in parquet format, then uncomment the below commands for further processing.

# step 1 - extract math from plain text.
#python submit.py --run_mode MultiProcess --network_path ./configs/cc_math_filter.CC-MAIN-2023-23.json

# step 2 - extract open questions from plain text.
#python submit.py --run_mode MultiProcess --network_path ./configs/cc_openquestion_filter.CC-MAIN-2023-23.json


================================================
FILE: DomainSpecific/submit.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import argparse

def submit_job(network_path, run_mode, docker_path, computation_path, storage_path):
    if run_mode in ("Single", "MultiProcess",):
        from tools.submit_local_job import submit_local_job as func
    elif run_mode == "Batch":
        from tools.submit_batch_job import submit_batch_job as func
    else:
        raise ValueError(f"Unknown run_mode: {run_mode!r} (expected Single, MultiProcess, or Batch).")
    func(network_path, run_mode, docker_path, computation_path, storage_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tool of job submission.")
    parser.add_argument("--network_path", type=str, default="./configs/network_template.json", help="The config path of data network.")
    parser.add_argument('--run_mode', type=str, default="Single", help="The running mode: Single, MultiProcess, and Batch.")
    parser.add_argument('--docker_path', type=str, default="./resources/environment/local.yaml", help="The path of environment (docker) config file.")
    parser.add_argument('--computation_path', type=str, default="./resources/computation/local.yaml", help="The path of computation config file.")
    parser.add_argument('--storage_path', type=str, default="./resources/storage/local.yaml", help="The path of storage config file.")
    args = parser.parse_args()
    
    submit_job(args.network_path, args.run_mode, args.docker_path, args.computation_path, args.storage_path)


================================================
FILE: DomainSpecific/tools/__init__.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
from .submit_local_job import submit_local_job
from .submit_batch_job import submit_batch_job

__all__ = ["submit_local_job", "submit_batch_job"]


================================================
FILE: DomainSpecific/tools/submit_batch_job.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
import argparse
os.sys.path.append("./core/layers/")
import util
import uuid
import datetime
from azure.batch import BatchServiceClient
from azure.common.credentials import BasicTokenAuthentication
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential, AzureCliCredential
from azure.batch.models import JobAddParameter, PoolInformation, TaskAddParameter, UserIdentity
from azure.batch.models import AutoUserSpecification, ElevationLevel, TaskConstraints
from azure.batch.models import EnvironmentSetting, ResourceFile, OnAllTasksComplete, ComputeNodeIdentityReference

def submit_batch_job(network_path, run_mode, docker_path, computation_path, storage_path):
    docker_config = util.load_yaml(docker_path)
    computation_config = util.load_yaml(computation_path)
    storage_config = util.load_yaml(storage_path)

    workspace_dir = storage_config["workspace_dir"]
    endpoint = storage_config["azstorage"]["endpoint"]
    container = storage_config["azstorage"]["container"]
    resource_id = storage_config["resource_id"]
    identity = ComputeNodeIdentityReference(resource_id=resource_id)
    mount_blob = storage_config.get("mount", True)

    node_num = computation_config["batch_node_num"]
    process_per_node = computation_config["batch_process_per_node"]
    batch_url = computation_config["batch_url"]
    pool_id = computation_config["batch_pool_id"]

    # credential
    ##########################################
    try:
        credential = AzureCliCredential()
        # Check if given credential can get token successfully.
        credential.get_token("https://management.azure.com/.default")
    except Exception as ex:
        # Fall back to InteractiveBrowserCredential if AzureCliCredential does not work.
        credential = InteractiveBrowserCredential()
    token = credential.get_token("https://batch.core.windows.net/.default")
    credential2 = BasicTokenAuthentication({"access_token": token.token})

    batch_client = BatchServiceClient(credential2, batch_url=batch_url)
    pool = batch_client.pool.get(pool_id)
    resource_files = list()

    # upload source code.
    package_local_path = "DataNetwork.tar.gz"
    package_blob_path = os.path.join("yanghuan", "package", os.path.basename(package_local_path))
    if True:
        os.system(f"sudo tar \
                    --exclude=env_ready \
                    --exclude=workspace \
                    --exclude=dependency/models \
                    -czf {package_local_path} *")
        util.upload_file_to_blob(storage_config, package_local_path, package_blob_path)
        os.system(f"sudo rm {package_local_path}")
    if True:
        package_url = f"{endpoint}/{container}/{package_blob_path}"
        package_file = ResourceFile(http_url=package_url, file_path=package_blob_path, identity_reference=identity)
        package_path = package_file.file_path
        resource_files.append(package_file)
    else:
        package_path = os.path.join(workspace_dir, package_blob_path)

    # upload model files.
    models_local_path = "models.tar.gz"
    models_blob_path = os.path.join("yanghuan", "package", os.path.basename(models_local_path))
    if True:
        #            --exclude=dependency/models/math.bin \
        #            --exclude=dependency/models/openquestion.bin \
        #            --exclude=dependency/models/mcq.pytorch \
        #            --exclude=dependency/models/mcq.bin \
        os.system(f"sudo tar \
                    -czf {models_local_path} dependency/models/*")
        util.upload_file_to_blob(storage_config, models_local_path, models_blob_path)
        os.system(f"sudo rm {models_local_path}")
    if not mount_blob:
        model_url = f"{endpoint}/{container}/{models_blob_path}"
        models_file = ResourceFile(http_url=model_url, file_path=models_blob_path, identity_reference=identity)
        models_path = models_file.file_path
        resource_files.append(models_file)
    else:
        models_path = os.path.join(workspace_dir, models_blob_path)

    # Azure Batch job IDs are strings, so convert the UUID explicitly.
    job_id = str(uuid.uuid4())
    job = JobAddParameter(id=job_id, pool_info=PoolInformation(pool_id=pool_id), on_all_tasks_complete=OnAllTasksComplete.terminate_job)
    batch_client.job.add(job)

    tasks = []
    for node_id in range(node_num):
        batch_script_dependency = "./dependency/install.py"
        batch_script_entry = "./wrapper/runner.py"
        batch_commandline = f"bash -c '\
            sudo tar -xzf {package_path} && \
            sudo apt install python-is-python3 && \
            python {batch_script_dependency} --storage_path={storage_path} && \
            sudo tar -xzf {models_path} && \
            python {batch_script_entry} --network_path={network_path} --run_mode={run_mode} --worker_num={node_num} --workspace_dir={workspace_dir}\
        '"

        task = TaskAddParameter(
            id=f'{job_id}_{node_id}',
            command_line=batch_commandline,
            resource_files=resource_files,
            environment_settings=[EnvironmentSetting(name="NODE_NUM", value=str(node_num)), EnvironmentSetting(name="NODE_ID", value=str(node_id)), EnvironmentSetting(name="PROCESS_PER_NODE", value=str(process_per_node))],
            constraints=TaskConstraints(max_task_retry_count=3, retention_time=datetime.timedelta(days=30)),
            user_identity=UserIdentity(auto_user=AutoUserSpecification(elevation_level=ElevationLevel.admin))
        )
        tasks.append(task)

    batch_client.task.add_collection(job_id, tasks)
    print(f"job id: {job.id}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tool of job submission to Azure Batch.")
    parser.add_argument('--network_path', type=str, default="./configs/network_template.json", help="The config path of data network.")
    parser.add_argument('--run_mode', type=str, default="Batch", help="The running mode: Batch.")
    parser.add_argument('--docker_path', type=str, default="./resources/environment/local.yaml", help="The path of environment (docker) config file.")
    parser.add_argument('--computation_path', type=str, default="./resources/computation/batch_dca_eastus.yaml", help="The path of computation config file.")
    parser.add_argument('--storage_path', type=str, default="./resources/storage/llmstore.yaml", help="The path of storage config file.")
    args = parser.parse_args()
    submit_batch_job(args.network_path, args.run_mode, args.docker_path, args.computation_path, args.storage_path)


================================================
FILE: DomainSpecific/tools/submit_local_job.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
import argparse
os.sys.path.append("./core/layers/")
import util

def submit_local_job(network_path, run_mode, docker_path, computation_path, storage_path):
    docker_config = util.load_yaml(docker_path)
    computation_config = util.load_yaml(computation_path)
    storage_config = util.load_yaml(storage_path)

    script_entry = "./wrapper/runner.py"
    script_dependency = "./dependency/install.py"
    commandline = f"python {script_dependency} --storage_path={storage_path} && python {script_entry} --network_path={network_path} --run_mode={run_mode} --workspace_dir={storage_config['workspace_dir']} --worker_num={computation_config['worker_num']}"
    os.system(commandline)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tool of job submission in local machine.")
    parser.add_argument('--network_path', type=str, default="./configs/network_template.json", help="The config path of data network.")
    parser.add_argument('--run_mode', type=str, default="Single", help="The running mode: Single, MultiProcess.")
    parser.add_argument('--docker_path', type=str, default="./resources/environment/local.yaml", help="The path of environment (docker) config file.")
    parser.add_argument('--computation_path', type=str, default="./resources/computation/local.yaml", help="The path of computation config file.")
    parser.add_argument('--storage_path', type=str, default="./resources/storage/local.yaml", help="The path of storage config file.")
    args = parser.parse_args()
    submit_local_job(args.network_path, args.run_mode, args.docker_path, args.computation_path, args.storage_path)


================================================
FILE: DomainSpecific/wrapper/__init__.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
from .parser import Parser
from .interpreter import Interpreter
from .runner import Runner, RunMode
from .utility import *

__all__ = ["Parser", "Interpreter", "Runner", "RunMode"]


================================================
FILE: DomainSpecific/wrapper/interpreter.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import traceback
import collections
from core import DataType
from core import Layer, LayerType, JointType, LayerType2Func
from core import Network
from wrapper import Parser

class Interpreter:
    def __init__(self):
        self.fields = ("name", "description", "date", "version", "author", "input", "output", "layer")
        self.parser = Parser()

    def check_config(self, config):
        try:
            # fields check.
            for field in self.fields:
                assert field in config

            data_name2type = collections.defaultdict(set)

            # check imported modules.
            module_data2type = dict()
            module_names = config.get("import", list())
            for name in module_names:
                sub_config = self.parser(f"./configs/{name.replace('.', '/')}.json")
                self.check_config(sub_config)
                for name, data in sub_config["input"].items():
                    module_data2type[name] = DataType[data["type"]]
                for name, data in sub_config["output"].items():
                    module_data2type[name] = DataType[data["type"]]

            # check input.
            inputs = config.get("input", dict())
            for name, data in inputs.items():
                assert data["type"] in DataType.__members__
                data_type = DataType[data["type"]]
                data_name2type[name].add(data_type)

            # check output.
            outputs = config.get("output", dict())
            for name, data in outputs.items():
                assert data["type"] in DataType.__members__
                data_type = DataType[data["type"]]
                data_name2type[name].add(data_type)

            # check layer.
            layers = config.get("layer", dict())
            for _, layer in layers.items():
                assert layer["type"] in LayerType.__members__ or layer["type"] in module_names
                input_names = layer["input"]
                output_names = layer["output"]
                if layer["type"] in LayerType.__members__:
                    layer_type = LayerType[layer["type"]]
                    func, input_types, output_types, enabled = LayerType2Func[layer_type]
                else:
                    input_types = list(map(lambda input_name: module_data2type[input_name], input_names))
                    output_types = list(map(lambda output_name: module_data2type[output_name], output_names))
                assert len(input_names) == len(input_types)
                assert len(output_names) == len(output_types)
                assert layer.get("joint", "Default") in JointType.__members__
                joint_type = JointType[layer.get("joint", "Default")]
                for name, data_type in zip(input_names, input_types):
                    if joint_type in (JointType.Map, JointType.FlatMap):
                        data_type = DataType(data_type.value + 10)
                    data_name2type[name].add(data_type)
                for name, data_type in zip(output_names, output_types):
                    if joint_type in (JointType.Map,):
                        data_type = DataType(data_type.value + 10)
                    data_name2type[name].add(data_type)

            # check joint.
            for data_name, data_type in data_name2type.items():
                for t1 in data_type:
                    for t2 in data_type:
                        assert DataType.belong(t1, t2) or DataType.belong(t2, t1)
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
            sys.exit()

    def __call__(self, config_path):
        # parse config file.
        config = self.parser(config_path)

        # interpret network.
        network = Network()
        try:
            assert config is not None and isinstance(config, dict)
            config["base_dir"] = os.path.dirname(config_path)

            # check config.
            self.check_config(config)

            # imported modules.
            name2module = dict()
            module_names = config.get("import", list())
            for name in module_names:
                name2module[name] = self(f"./configs/{name.replace('.', '/')}.json")

            # input datas.
            input_datas = config.get("input", dict())
            network.set_input_names(list(input_datas.keys()))
            for name, data in input_datas.items():
                value = data.get("value")
                network.add_data(name, value)

            # output datas
            output_datas = config.get("output", dict())
            network.set_output_names(list(output_datas.keys()))

            # layers in graph.
            layers = config.get("layer", dict())
            for name, layer in layers.items():
                if layer["type"] in name2module:
                    value = name2module[layer["type"]]
                    # set params of sub-network.
                    for layers_param_name, param_value in layer.get("param", dict()).items():
                        layers_param_name = layers_param_name.split(".")
                        layers_name = layers_param_name[:-1]
                        param_name = layers_param_name[-1]
                        net = value
                        for layer_name in layers_name:
                            net = net.layers[layer_name]
                        net.param[param_name] = param_value
                else:
                    value = Layer(
                        type=layer["type"], 
                        joint=layer.get("joint", "Default"), 
                        repetition=layer.get("repetition", 1),
                        param=layer.get("param", dict()),
                        input_names=layer.get("input", list()),
                        output_names=layer.get("output", list()),
                    )
                network.add_layer(name, value)
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
        return network


if __name__ == "__main__":
    config_path = f"{os.path.dirname(os.path.realpath(__file__))}/../configs/network_template.json"
    
    interpreter = Interpreter()
    network = interpreter(config_path)
    
    # compute in network.
    outputs = network()
    #from core import DataType
    #inputs = [["a", "b", "c", "d", "e"]]
    #outputs = network(inputs)
    
    print(outputs[0])


================================================
FILE: DomainSpecific/wrapper/parser.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
import json
import traceback

class Parser:
    def __init__(self):
        pass
        
    def __call__(self, config_path):
        config = None
        try:
            if config_path is None or not os.path.exists(config_path):
                raise Exception("Invalid config file path or not exists.")

            with open(config_path, "r") as f:
                config = json.load(f)
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
        return config


if __name__ == "__main__":
    config_path = f"{os.path.dirname(os.path.realpath(__file__))}/../configs/network_template.json"
    parser = Parser()
    config = parser(config_path)
    print(config)


================================================
FILE: DomainSpecific/wrapper/runner.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import sys
os.sys.path.append(f"{os.path.dirname(os.path.realpath(__file__))}/..")
import argparse
import traceback
from enum import Enum
from threading import Thread
from multiprocessing import Process
from wrapper import Interpreter
from wrapper.utility import get_world_rank, get_world_size, get_process_per_node

class RunMode(Enum):
    Single = 0
    MultiProcess = 1
    Batch = 2

class Runner:
    def __init__(self, network_path):
        interpreter = Interpreter()
        self.network = interpreter(network_path)

    def __call__(self, run_mode, worker_id, worker_num, workspace_dir):
        try:
            input = list()
            variables = {"workspace_dir": workspace_dir}
            if run_mode == RunMode.Single:
                for worker_id in range(worker_num):
                    self.network(input, worker_id, worker_num, variables)
            elif run_mode == RunMode.MultiProcess:
                processes = list()
                for worker_id in range(worker_num):
                    process = Process(target=self.network, args=(input, worker_id, worker_num, variables))
                    process.start()
                    processes.append(process)
                for process in processes:
                    process.join()
            elif run_mode == RunMode.Batch:
                process_per_node = get_process_per_node()
                worker_id = process_per_node * get_world_rank()
                worker_num = process_per_node * get_world_size()
                processes = list()
                for worker_id in range(worker_id, worker_id + process_per_node):
                    process = Process(target=self.network, args=(input, worker_id, worker_num, variables))
                    process.start()
                    processes.append(process)
                for process in processes:
                    process.join()
            else:
                raise Exception(f"Unknown running mode: {run_mode}")
        except KeyboardInterrupt:
            sys.exit()
        except Exception as ex:
            traceback.print_exc()
            return False
        return True


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Runner of Data Network.")
    parser.add_argument('--network_path', type=str, default="./configs/network_template.json", help="The config path of data network.")
    parser.add_argument('--run_mode', type=str, default="Single", help="The running mode: Single, MultiProcess, and Batch.")
    parser.add_argument('--workspace_dir', type=str, default="./workspace/", help="The path of workspace folder.")
    parser.add_argument('--worker_id', type=int, default=0, help="The id of world worker.")
    parser.add_argument('--worker_num', type=int, default=1, help="The number of world worker.")
    args = parser.parse_args()

    runner = Runner(args.network_path)
    success = runner(RunMode[args.run_mode], args.worker_id, args.worker_num, args.workspace_dir)


================================================
FILE: DomainSpecific/wrapper/utility/__init__.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
from .logger import Logger
from .cpu_count import cpu_count
from .load_yaml import load_yaml
from .save_yaml import save_yaml
from .azure_env import get_local_rank, get_world_rank, get_world_size, get_process_per_node

__all__ = ["Logger", "cpu_count", "load_yaml", "save_yaml", "get_local_rank", "get_world_rank", "get_world_size", "get_process_per_node"]


================================================
FILE: DomainSpecific/wrapper/utility/azure_env.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os

def get_local_rank():
    # Azure Singularity.
    if "OMPI_COMM_WORLD_LOCAL_RANK" in os.environ:
        return int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    return None

def get_world_rank():
    # Azure Singularity.
    if "OMPI_COMM_WORLD_RANK" in os.environ:
        return int(os.environ["OMPI_COMM_WORLD_RANK"])
    # Azure Batch.
    elif "NODE_ID" in os.environ:
        return int(os.environ["NODE_ID"])
    return None

def get_world_size():
    # Azure Singularity.
    if "OMPI_COMM_WORLD_SIZE" in os.environ:
        return int(os.environ["OMPI_COMM_WORLD_SIZE"])
    # Azure Batch.
    elif "NODE_NUM" in os.environ:
        return int(os.environ["NODE_NUM"])
    # Azure Spark.
    elif "NUM_EXECUTORS" in os.environ:
        return int(os.environ["NUM_EXECUTORS"])
    return None

def get_process_per_node():
    # Azure Batch.
    if "PROCESS_PER_NODE" in os.environ:
        return int(os.environ["PROCESS_PER_NODE"])
    # Azure Spark.
    elif "EXECUTOR_CORES" in os.environ:
        return int(os.environ["EXECUTOR_CORES"])
    return None


================================================
FILE: DomainSpecific/wrapper/utility/cpu_count.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import re
import subprocess

def cpu_count():
    """ Number of available virtual or physical CPUs on this system, i.e.
    user/real as output by time(1) when called with an optimally scaling
    userspace-only program"""

    # cpuset
    # cpuset may restrict the number of *available* processors
    try:
        m = re.search(r'(?m)^Cpus_allowed:\s*(.*)$',
                      open('/proc/self/status').read())
        if m:
            res = bin(int(m.group(1).replace(',', ''), 16)).count('1')
            if res > 0:
                return res
    except IOError:
        pass

    # Python 2.6+
    try:
        import multiprocessing
        return multiprocessing.cpu_count()
    except (ImportError, NotImplementedError):
        pass

    # https://github.com/giampaolo/psutil
    try:
        import psutil
        return psutil.cpu_count()   # psutil.NUM_CPUS on old versions
    except (ImportError, AttributeError):
        pass

    # POSIX
    try:
        res = int(os.sysconf('SC_NPROCESSORS_ONLN'))

        if res > 0:
            return res
    except (AttributeError, ValueError):
        pass

    # Windows
    try:
        res = int(os.environ['NUMBER_OF_PROCESSORS'])

        if res > 0:
            return res
    except (KeyError, ValueError):
        pass

    """
    # jython
    try:
        from java.lang import Runtime
        runtime = Runtime.getRuntime()
        res = runtime.availableProcessors()
        if res > 0:
            return res
    except ImportError:
        pass
    """

    # BSD
    try:
        sysctl = subprocess.Popen(['sysctl', '-n', 'hw.ncpu'],
                                  stdout=subprocess.PIPE)
        scStdout = sysctl.communicate()[0]
        res = int(scStdout)

        if res > 0:
            return res
    except (OSError, ValueError):
        pass

    # Linux
    try:
        res = open('/proc/cpuinfo').read().count('processor\t:')

        if res > 0:
            return res
    except IOError:
        pass

    # Solaris
    try:
        pseudoDevices = os.listdir('/devices/pseudo/')
        res = 0
        for pd in pseudoDevices:
            if re.match(r'^cpuid@[0-9]+$', pd):
                res += 1

        if res > 0:
            return res
    except OSError:
        pass

    # Other UNIXes (heuristic)
    try:
        try:
            dmesg = open('/var/run/dmesg.boot').read()
        except IOError:
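            dmesgProcess = subprocess.Popen(['dmesg'], stdout=subprocess.PIPE)
            # communicate() returns bytes on Python 3; decode so the str search below works
            dmesg = dmesgProcess.communicate()[0].decode('utf-8', 'replace')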

        res = 0
        while '\ncpu' + str(res) + ':' in dmesg:
            res += 1

        if res > 0:
            return res
    except OSError:
        pass

    raise Exception('Can not determine number of CPUs on this system')


================================================
FILE: DomainSpecific/wrapper/utility/load_yaml.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import yaml

def load_yaml(config_path):
    config = None
    if os.path.exists(config_path):
        with open(config_path, "r") as file:
            config = yaml.safe_load(file)
    return config


================================================
FILE: DomainSpecific/wrapper/utility/logger.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import logging

logger = None

class Logger:
    def __init__(self):
        pass
    
    @staticmethod
    def init(log_path=None):
        global logger
        
        if log_path is not None:
            logging.basicConfig(filename=log_path,
                                format="%(asctime)s %(message)s",
                                filemode="w")
        
        # Creating an object
        logger = logging.getLogger()
        
        # Setting the threshold of logger to INFO
        logger.setLevel(logging.INFO)

    @staticmethod
    def debug(msg):
        logger.debug(msg)
    
    @staticmethod
    def info(msg):
        logger.info(msg)
    
    @staticmethod
    def warning(msg):
        logger.warning(msg)
    
    @staticmethod
    def error(msg):
        logger.error(msg)

    @staticmethod
    def critical(msg):
        logger.critical(msg)


if __name__ == "__main__":
    Logger.init()
    Logger.debug("unit test: debug")
    Logger.info("unit test: info")
    Logger.warning("unit test: warning")
    Logger.error("unit test: error")
    Logger.critical("unit test: critical")


================================================
FILE: DomainSpecific/wrapper/utility/save_yaml.py
================================================
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
import os
import yaml

def save_yaml(config, config_path):
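    # A bare filename has an empty dirname; treat it as the current directory.
    config_dir = os.path.dirname(config_path) or "."
    if os.path.exists(config_dir):
        with open(config_path, "w") as file:
            yaml.safe_dump(config, file)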


================================================
FILE: GeneralDomain/.gitignore
================================================
__pycache__/

================================================
FILE: GeneralDomain/README.md
================================================
# RedStone General CC

Library for reproducing the general CC part of the RedStone dataset from the released index Parquet file.

## How to use

### Install the lib

```bash
pip install "redstone-cc @ git+https://github.com/microsoft/RedStone#subdirectory=GeneralDomain"
```

### From CLI

```bash
python -m redstone_cc {input_index_path} {output_parquet_path}
```
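
The command takes two positional arguments: the path of the released index file, then the path where the output Parquet file is written. As a rough sketch of that interface (it mirrors the parser in `redstone_cc/__main__.py`; `build_parser` is a hypothetical helper and the sample paths are placeholders):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Two positional arguments, in the same order as the CLI call above.
    parser = argparse.ArgumentParser(prog="redstone_cc")
    parser.add_argument("index_path")   # input: released index file
    parser.add_argument("output_path")  # output: Parquet file to write
    return parser


# Placeholder paths, for illustration only.
args = build_parser().parse_args(["index.parquet", "output.parquet"])
print(args.index_path, args.output_path)
```

Both arguments are required; omitting either one makes `argparse` exit with a usage message.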

### From Python

```python3
from redstone_cc import process_file

index_file_path = '/path/to/index/file'
items = process_file(index_file_path)

for item in items:
    print(item['uri'], item['text'])
```
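
To store the items rather than print them, one option is a JSON Lines file. The sketch below assumes only what the example above shows: each item is a dict with `uri` and `text` keys. It also assumes the values are JSON-serializable. `write_jsonl` is a hypothetical helper, not part of `redstone_cc`, and the sample records are stand-ins for the output of `process_file`.

```python
import json
import os
import tempfile


def write_jsonl(items, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in records; in practice these would come from process_file(...).
sample_items = [
    {"uri": "https://example.com/a", "text": "first document"},
    {"uri": "https://example.com/b", "text": "second document"},
]

out_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
written = write_jsonl(sample_items, out_path)

# Read the file back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```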

## FAQ

- About trafilatura processing failures
    - Our original data was processed with `trafilatura` version 1.8.1, which may behave differently from the current release. To reproduce our results exactly, consider pinning that version manually (for example, `pip install trafilatura==1.8.1`).
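
A pipeline can also check the installed version at startup before processing anything. This is a minimal sketch using the standard-library `importlib.metadata`; `installed_version` and `matches_pin` are hypothetical helpers, not part of `redstone_cc`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, expected):
    """True only if `package` is installed at exactly version `expected`."""
    return installed_version(package) == expected


if not matches_pin("trafilatura", "1.8.1"):
    print("warning: trafilatura is not pinned to 1.8.1; results may differ")
```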


================================================
FILE: GeneralDomain/pyproject.toml
================================================
[build-system]
requires = ["flit_core >=3.2, <4"]
build-backend = "flit_core.buildapi"

[project]
name = "redstone-cc"
description = "Library for reproducing the general CC part of RedStone dataset from the released index Parquet file."
version='0.0.1'
requires-python = ">=3.8"
authors = [
  { name = "Tengchao Lv", email = "tengchaolv@microsoft.com" },
  { name = "Qinzheng Sun", email = "qinsu@microsoft.com" }
]
dependencies = [
  'numpy == 1.*',
  'datasketch',
  'regex',
  'nltk',
  'ftfy',
  'sentence_splitter',
  'brotlicffi',
  'faust-cchardet',
  'lxml',
  'trafilatura[all]',
  'warcio',
  'loguru',
  'stopit',
  "fasttext; platform_system != 'Windows'",
  "fasttext-wheel == 0.9.2; platform_system == 'Windows'",
  'pyarrow',
  'tqdm',
  'requests',
]

[project.optional-dependencies]
dev = [
  'pytest',
  'black',
]


================================================
FILE: GeneralDomain/redstone_cc/__init__.py
================================================
from .process import process_file, process_items


================================================
FILE: GeneralDomain/redstone_cc/__main__.py
================================================
import argparse

import pyarrow as pa
import pyarrow.parquet as pq
from loguru import logger

from .process import process_file


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("index_path")
    parser.add_argument("output_path")
    args = parser.parse_args()

    logger.info(f"input path: {args.index_path}")
    logger.info(f"output path: {args.output_path}")
    logger.info("processing...")
    res = process_file(args.index_path)

    logger.info("writing results...")
    table = pa.Table.from_pylist(res)
    pq.write_table(table, args.output_path)
    logger.info("finished.")


if __name__ == "__main__":
    main()


================================================
FILE: GeneralDomain/redstone_cc/algos/__init__.py
================================================


================================================
FILE: GeneralDomain/redstone_cc/algos/deduplication/__init__.py
================================================


================================================
FILE: GeneralDomain/redstone_cc/algos/deduplication/minhash.py
================================================
from __future__ import annotations  # keeps the list[...] hints below valid on Python 3.8

import hashlib

import numpy as np
from datasketch.lsh import _optimal_param

DEFAULT_MER = 2**61 - 1
DEFAULT_SEED = 1


def gen_lsh_param(num_perm, lsh_threshold):
    return _optimal_param(lsh_threshold, num_perm, 0.5, 0.5)


class CalcMinhash:
    def __init__(self, num_perm, seed=DEFAULT_SEED, mer=DEFAULT_MER):
        self.mer = mer
        self.num_perm = num_perm

        self.gen = np.random.RandomState(seed)
        self.a = self.gen.randint(1, self.mer, (self.num_perm,), dtype="u8")
        self.b = self.gen.randint(0, self.mer, (self.num_perm,), dtype="u8")

    def _sha1_hash(self, val):
        val = int.from_bytes(hashlib.sha1(val).digest()[:8], "little")
        val &= self.mer
        return np.uint64(val)

    def hash(self, sequence: list[str]) -> np.ndarray:
        res = np.ones(self.num_perm, dtype="u8") * self.mer
        for token in sequence:
            hash0 = self._sha1_hash(token.encode("utf8"))
            hash_vec = hash0 * self.a + self.b
            hash_vec %= self.mer
            res = np.minimum(res, hash_vec)
        return res


class CalcLsh:
    def __init__(self, b, r):
        self.b = b
        self.r = r
        self.hashranges = [(i * r, (i + 1) * r) for i in range(b)]

    def gen_lsh(self, minhash) -> list[bytearray]:
        return [bytearray(minhash[start:end]) for start, end in self.hashranges]


class CalcMinhashLsh:
    def __init__(self, b, r, seed=DEFAULT_SEED, mer=DEFAULT_MER):
        num_perm = b * r
        self.minhash = CalcMinhash(num_perm, seed, mer)
        self.lsh = CalcLsh(b, r)

    def hash(self, tokens) -> list[bytearray]:
        minhash = self.minhash.hash(tokens)
        lsh = self.lsh.gen_lsh(minhash)
        return lsh


class LocalMinhashLshDedup:
    def __init__(self, b, r, seed=DEFAULT_SEED, mer=DEFAULT_MER):
        self.calc_minhash_lsh = CalcMinhashLsh(b, r, seed, mer)
        self.data = []
        self.b = b

    def add(self, id, tokens):
        hval = self.calc_minhash_lsh.hash(tokens)
        self.data.append((id, hval))

    def dedup(self):
        self.data.sort(key=lambda x: x[0])
        dedup_set = [set() for _ in range(self.b)]
        exclude = []
        for line_id, hash_list in self.data:
            flag_dup = False
            for i, hval in enumerate(hash_list):
                # bytearray is unhashable; convert to bytes before the set lookup
                hval = bytes(hval)
                if hval in dedup_set[i]:
                    flag_dup = True
                else:
                    dedup_set[i].add(hval)

            if flag_dup:
                exclude.append(line_id)

        return exclude


================================================
FILE: GeneralDomain/redstone_cc/algos/deduplication/sha1.py
================================================
import hashlib

from .utils import ccnet_normalize

DEFAULT_HASH_SIZE = 8


def sha1_hash(line, hash_size=DEFAULT_HASH_SIZE) -> bytes:
    line = ccnet_normalize(line)

    return hashlib.sha1(bytes(line, encoding="utf-8")).digest()[:hash_size]


class LocalSha1Dedup:
    def __init__(self, hash_size):
        self.hash_size = hash_size

        self.data = []

    def add_line(self, line_id, line):
        hval = sha1_hash(line, self.hash_size)
        self.data.append((line_id, hval))

    def add_hashes(self, line_id, hval):
        assert isinstance(hval, bytes) and len(hval) == self.hash_size
        self.data.append((line_id, hval))

    def dedup(self):
        self.data.sort(key=lambda item: item[0])
        dedup_set = set()
        exclude = []
        for line_id, hval in self.data:
            if hval in dedup_set:
                exclude.append(line_id)
            else:
                dedup_set.add(hval)
        return exclude


================================================
FILE: GeneralDomain/redstone_cc/algos/deduplication/utils.py
================================================
import unicodedata
import re
import string

import regex
import ftfy
from nltk import ngrams

DIGIT_RE = regex.compile(r"\d")
PUNCT_OR_NON_PRINTING_CHARS_RE = regex.compile(r"(\p{P}|\p{C})")


def ccnet_normalize(line) -> str:
    line = line.strip()
    if not line:
        return line
    # normalize
    line = unicodedata.normalize("NFKC", line)
    # case
    line = line.lower()
    # numbers
    line = DIGIT_RE.sub("0", line)
    line = PUNCT_OR_NON_PRINTING_CHARS_RE.sub("", line)
    return line


SLIMPAJAMA_LENGTH_THRESHOLD = 200


# https://github.com/Cerebras/modelzoo/blob/de67aaec12ba684ebedc6fb841e0c4d0ff8cd2e8/modelzoo/transformers/data_processing/slimpajama/preprocessing/filter.py#L28
def slimpajama_tokenize(text, num_ngrams=13):
    text = ftfy.fix_text(text, normalization="NFC")
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text.strip())
    if len(text) < SLIMPAJAMA_LENGTH_THRESHOLD:
        return
    tokens = map(lambda x: "".join(x), ngrams(text, num_ngrams))
    return tokens


def spm_tokenize(text, spm_model, num_ngrams=5):
    text = text.lower()
    tokens = spm_model.encode(text, out_type=str)
    tokens = ngrams(tokens, num_ngrams)
    tokens = {" ".join(t).strip() for t in tokens}
    return tokens


================================================
FILE: GeneralDomain/redstone_cc/algos/fasttext_classifier.py
================================================
import fasttext

fasttext.FastText.eprint = lambda x: None

FASTTEXT_LID_176_URL = (
    "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
)


class FastTextClassifier:
    def __init__(self, model_path):
        self.model = fasttext.load_model(model_path)

    def predict(self, text):
        if isinstance(text, list):
            text = " ".join(text)
        text = text.replace("\n", " ")

        labels, scores = self.model.predict(text, k=1)
        label, score = labels[0], scores[0]
        label = label.replace("__label__", "")

        return label, score


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/__init__.py
================================================


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/func/__init__.py
================================================


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/func/document.py
================================================
import regex


def document_word_count(words):
    return len(words)


def document_mean_word_length(words):
    return sum(len(x) for x in words) / len(words)


RE_ALPHA = regex.compile(r"\p{L}")


def document_alpha_words(words):
    return sum(int(RE_ALPHA.search(word) is not None) for word in words)


BULLET_POINT_SYMBOLS = (
    "\u2022",  # bullet point
    "\u2023",  # triangular bullet point
    "\u25B6",  # black right pointing triangle
    "\u25C0",  # black left pointing triangle
    "\u25E6",  # white bullet point
    "\u25A0",  # black square
    "\u25A1",  # white square
    "\u25AA",  # black small square
    "\u25AB",  # white small square
    "\u2013",  # en dash
)


def document_start_with_bullet(lines):
    cnt = 0
    for line in lines:
        line = line.lstrip()
        for symbol in BULLET_POINT_SYMBOLS:
            if line.startswith(symbol):
                cnt += 1
                break
    return cnt


ELLIPSIS = "..."


def document_end_with_ellipsis(lines):
    return sum(int(x.strip().endswith(ELLIPSIS)) for x in lines)


GOPHER_SYMBOLS = ("#", "...")


def document_gopher_symbols(text):
    return sum(text.count(x) for x in GOPHER_SYMBOLS)


GOPHER_STOPWORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def document_gopher_stopwords(words):
    return sum(int(word in GOPHER_STOPWORDS) for word in words)


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/func/line.py
================================================
import regex

RE_UPPER = regex.compile(r"\p{Lu}")
RE_LETTER = regex.compile(r"\p{L}")


def line_uppercase_ratio(line):
    cnt_upper = len(RE_UPPER.findall(line))
    cnt_letter = len(RE_LETTER.findall(line))
    if cnt_letter == 0:
        return 0
    return cnt_upper / cnt_letter


RE_NUMERICAL = regex.compile(r"^(\p{N}|\p{Z}|\p{C})+$")


def line_all_numeric(line):
    return RE_NUMERICAL.fullmatch(line) is not None


RE_REFINEDWEB_COUNTER = regex.compile(r"^\d+\s+[a-zA-Z]+$")


def line_refinedweb_counter(line):
    return RE_REFINEDWEB_COUNTER.fullmatch(line.strip()) is not None


def line_regex_match(line, patterns):
    for pattern in patterns:
        if regex.search(pattern, line) is not None:
            return True
    return False


def test_line_uppercase_ratio():
    line = "ASDzxczxc a././.,./,/.123"
    res = line_uppercase_ratio(line)
    # ignore number, space and puncts
    assert res == 3 / 10
    line = ".,/./././"
    res = line_uppercase_ratio(line)
    assert res == 0


def test_line_all_numeric():
    line = "1231    34\t345345"
    assert line_all_numeric(line)
    line = "asd1231as"
    assert not line_all_numeric(line)


def test_line_refinedweb_counter():
    line = "3 emails"
    assert line_refinedweb_counter(line)
    line = "3 emails emails"
    assert not line_refinedweb_counter(line)


def test_line_regex_match():
    pattern = "^sign in"
    line = "sign in 123"
    assert line_regex_match(line, [pattern])
    line = "123 sign in 123"
    assert not line_regex_match(line, [pattern])

    pattern = "read more...$"
    line = "123 read more..."
    assert line_regex_match(line, [pattern])
    line = "read more...."
    assert not line_regex_match(line, [pattern])

    pattern = "target"
    line = "asdtargetasd"
    assert line_regex_match(line, [pattern])


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/func/repetition.py
================================================
from collections import Counter

import numpy as np
from nltk.util import ngrams


def repetition_ngram_top_char_frac(words, n: int):
    items = list(ngrams(words, n))
    counter = Counter(items)
    most_common = counter.most_common(1)
    if len(most_common) == 0:
        return 0
    most_common_ngram, count = most_common[0]
    if count == 1:
        return 0
    total_chars = sum(len(w) for w in words)
    top_chars = sum(len(w) for w in most_common_ngram) * count

    return top_chars / total_chars


def repetition_ngram_dup_char_frac(words, n: int):
    items = list(ngrams(words, n))
    counter = Counter(items)

    flag_dup = np.zeros(len(words), dtype="bool")
    for i, item in enumerate(items):
        if counter[item] > 1:
            flag_dup[i : i + n] = True
    total_chars = sum(len(w) for w in words)
    dup_chars = sum(len(w) for i, w in enumerate(words) if flag_dup[i])
    return dup_chars / total_chars


def repetition_line_dup_frac(lines):
    if len(lines) == 0:
        return 0, 0

    dup_lines = 0
    dup_chars = 0
    counter = Counter(lines)
    for line, count in counter.items():
        if count > 1:
            dup_lines += count
            dup_chars += len(line) * count
    total_chars = sum(len(line) for line in lines)
    if total_chars == 0:
        return 0, 0

    return dup_lines / len(lines), dup_chars / total_chars


def test_ngram_top():
    words = "a b c a b d a b".split()
    res = repetition_ngram_top_char_frac(words, 2)
    assert res == 6 / len(words)

    # no repetition
    res = repetition_ngram_top_char_frac(words, 3)
    assert res == 0

    words = "a b c a b c a b".split()
    res = repetition_ngram_top_char_frac(words, 3)
    assert res == 6 / len(words)


def test_ngram_dup():
    words = "a b c a b d a b".split()
    res = repetition_ngram_dup_char_frac(words, 2)
    assert res == 6 / len(words)

    words = "a b c a b c a b".split()
    res = repetition_ngram_dup_char_frac(words, 3)
    assert res == 1


def test_dup_line():
    lines = ["a", "b", "c"]
    frac, char_frac = repetition_line_dup_frac(lines)
    assert frac == 0 and char_frac == 0
    lines = []
    frac, char_frac = repetition_line_dup_frac(lines)
    assert frac == 0 and char_frac == 0
    lines = ["", "", ""]
    frac, char_frac = repetition_line_dup_frac(lines)
    assert frac == 0 and char_frac == 0
    lines = ["abc", "de", "abc"]
    frac, char_frac = repetition_line_dup_frac(lines)
    assert frac == 2 / 3 and char_frac == 6 / 8


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/model/__init__.py
================================================


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/model/document.py
================================================
import sys
from functools import cached_property

import stopit
from loguru import logger
from sentence_splitter import split_text_into_sentences

from ..utils import normalize


# sys.platform is never "posix" (it is e.g. "linux" or "darwin"), so test for non-Windows instead
if sys.platform != "win32":
    stopit_method = stopit.SignalTimeout
else:
    stopit_method = stopit.ThreadingTimeout


class Document:
    def __init__(self, text, lang):
        self.text = text
        self.lang = lang

    @cached_property
    def sents(self):
        with stopit_method(60) as ctx:
            res = split_text_into_sentences(self.text, self.lang)
        if ctx:
            return res
        else:
            logger.warning("sentence splitter timeout")
            return self.text.split("\n")

    @cached_property
    def paragraphs(self):
        return self.text.split("\n")

    @cached_property
    def normalized_text(self):
        return normalize(self.text)

    @cached_property
    def normalized_sents(self):
        return [normalize(sent) for sent in self.sents]

    @cached_property
    def normalized_words(self):
        return self.normalized_text.split()


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/model/violations.py
================================================
from __future__ import annotations  # keeps the `str | None` hint valid before Python 3.10

from typing import List

from .document import Document


class Violations:
    def __init__(self):
        self.doc_violations = set()
        self.line_violations = {}
        self.excluded_lines = set()

    def doc(self, key):
        if key in self.doc_violations:
            raise KeyError(f"Document violation {key} has already been set")
        self.doc_violations.add(key)

    def line(self, key, lines: List[int]):
        if key in self.line_violations:
            raise KeyError(f"Line violation {key} has already been set")
        lines = list(set(lines))
        lines.sort()
        self.line_violations[key] = lines
        self.excluded_lines.update(lines)

    def apply_to_doc(self, doc: Document) -> str | None:
        if len(self.doc_violations) > 0:
            return None

        res = []
        for i, line in enumerate(doc.sents):
            if i not in self.excluded_lines:
                res.append(line)
        return "\n".join(res)


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/ruleset/__init__.py
================================================


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/ruleset/gopher.py
================================================
from ..model.document import Document
from ..model.violations import Violations
from ..func.document import (
    document_alpha_words,
    document_end_with_ellipsis,
    document_gopher_stopwords,
    document_gopher_symbols,
    document_mean_word_length,
    document_start_with_bullet,
    document_word_count,
)
from ..func.repetition import (
    repetition_ngram_top_char_frac,
    repetition_ngram_dup_char_frac,
    repetition_line_dup_frac,
)

KEY_PREFIX_TOP_NGRAM = "rr_ngram_top_"
THRESHOLD_TOP_NGRAM = {2: 0.2, 3: 0.18, 4: 0.16}
KEY_PREFIX_DUP_NGRAM = "rr_ngram_dup_"
THRESHOLD_DUP_NGRAM = {5: 0.15, 6: 0.14, 7: 0.13, 8: 0.12, 9: 0.11, 10: 0.10}


def gopher_filter(doc: Document):
    violations = Violations()
    # repetition
    for n, thresh in THRESHOLD_TOP_NGRAM.items():
        val = repetition_ngram_top_char_frac(doc.normalized_words, n)
        if val > thresh:
            violations.doc(KEY_PREFIX_TOP_NGRAM + str(n))
    for n, thresh in THRESHOLD_DUP_NGRAM.items():
        val = repetition_ngram_dup_char_frac(doc.normalized_words, n)
        if val > thresh:
            violations.doc(KEY_PREFIX_DUP_NGRAM + str(n))
    sent_frac, sent_char_frac = repetition_line_dup_frac(doc.sents)
    if sent_frac > 0.3:
        violations.doc("rr_sent_frac")
    if sent_char_frac > 0.2:
        violations.doc("rr_sent_char_frac")
    para_frac, para_char_frac = repetition_line_dup_frac(doc.paragraphs)
    if para_frac > 0.3:
        violations.doc("rr_para_frac")
    if para_char_frac > 0.2:
        violations.doc("rr_para_char_frac")
    # document
    word_count = document_word_count(doc.normalized_words)
    if word_count < 50 or word_count > 100_000:
        violations.doc("doc_word_count")
    mean_word_len = document_mean_word_length(doc.normalized_words)
    if mean_word_len < 3 or mean_word_len > 10:
        violations.doc("doc_mean_word_len")
    symbol_to_word = document_gopher_symbols(doc.normalized_text) / len(
        doc.normalized_words
    )
    if symbol_to_word > 0.1:
        violations.doc("doc_gopher_symbol_to_word")
    alpha_word_rate = document_alpha_words(doc.normalized_words) / len(
        doc.normalized_words
    )
    if alpha_word_rate < 0.8:
        violations.doc("doc_alpha_word_rate")
    el_end_line_rate = document_end_with_ellipsis(doc.normalized_sents) / len(
        doc.normalized_sents
    )
    if el_end_line_rate > 0.3:
        violations.doc("doc_el_end_line_rate")
    bullet_start_line_rate = document_start_with_bullet(doc.normalized_sents) / len(
        doc.normalized_sents
    )
    if bullet_start_line_rate > 0.9:
        violations.doc("doc_bullet_start_line_rate")
    stopword_cnt = document_gopher_stopwords(doc.normalized_words)
    if stopword_cnt < 2:
        violations.doc("doc_gopher_stopword_count")

    return violations


def apply_gopher_rules(text, lang):
    doc = Document(text, lang)
    violations = gopher_filter(doc)
    filtered_text = violations.apply_to_doc(doc)
    return filtered_text


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/ruleset/refinedweb.py
================================================
import regex
from .gopher import gopher_filter
from ..model.document import Document
from ..func.line import (
    line_all_numeric,
    line_uppercase_ratio,
    line_refinedweb_counter,
    line_regex_match,
)

EXCLUDE_PATTERNS = (
    "^sign in",
    "^sign-in",
    "^sign up",
    "^sign-up",
    "read more...$",
    "items in cart",
)
# Materialize as a tuple: a generator would be exhausted after the first line is checked
EXCLUDE_PATTERNS = tuple(regex.compile(x) for x in EXCLUDE_PATTERNS)


def refinedweb_filter(doc: Document):
    violations = gopher_filter(doc)
    # line
    res = []
    for i, line in enumerate(doc.sents):
        upper_ratio = line_uppercase_ratio(line)
        if upper_ratio > 0.6:
            res.append(i)
    violations.line("line_upper_ratio", res)

    res = []
    for i, line in enumerate(doc.normalized_sents):
        if line_all_numeric(line):
            res.append(i)
    violations.line("line_all_numeric", res)

    res = []
    for i, line in enumerate(doc.normalized_sents):
        if line_refinedweb_counter(line):
            res.append(i)
    violations.line("line_refinedweb_counter", res)

    res = []
    for i, line in enumerate(doc.normalized_sents):
        if len(line.split()) == 1:
            res.append(i)
    violations.line("line_one_word", res)

    res = []
    for i, line in enumerate(doc.normalized_sents):
        if line_regex_match(line, EXCLUDE_PATTERNS):
            res.append(i)
    violations.line("line_exclude_patterns", res)

    total_words = sum(len(line.split()) for line in doc.normalized_sents)
    excluded_words = sum(
        len(line.split())
        for i, line in enumerate(doc.normalized_sents)
        if i in violations.excluded_lines
    )
    if excluded_words / total_words > 0.05:
        violations.doc("line_document_discarded")

    return violations


def apply_refinedweb_rules(text, lang):
    doc = Document(text, lang)
    violations = refinedweb_filter(doc)
    filtered_text = violations.apply_to_doc(doc)
    return filtered_text


================================================
FILE: GeneralDomain/redstone_cc/algos/rule_based_filters/utils.py
================================================
import unicodedata

import regex

RE_PUNCT = regex.compile(r"\p{P}")
RE_URL = regex.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)

RE_LINE_SEPARATORS = regex.compile(r"(\p{Zl}|\p{Zp})+")
RE_SPACE_SEPARATORS = regex.compile(r"\p{Zs}+")


def remove_url(text):
    return RE_URL.sub("", text)


def remove_consecutive_new_lines(text):
    return RE_LINE_SEPARATORS.sub("\n", text)


def remove_punct(text):
    return RE_PUNCT.sub("", text)


def normalize(text):
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = text.strip()
    text = remove_consecutive_new_lines(text)
    text = RE_SPACE_SEPARATORS.sub(" ", text)
    return text


================================================
FILE: GeneralDomain/redstone_cc/algos/trafilatura_process.py
================================================
import zlib
import re

import brotlicffi
import lxml.etree as ET
from lxml.html import tostring
from trafilatura import bare_extraction
from trafilatura.xml import xmltotxt
from trafilatura.meta import reset_caches as trafilatura_reset_caches

FLAG_TRAFILATURA_RESET_CACHE = False
ZIP_BOMB_SIZE_THRESHOLD = 100 * 1000 * 1000


class EmptyResultException(Exception):
    pass


def _remove_dup_newline(text):
    fields = text.split("\n")
    for i in range(len(fields)):
        fields[i] = fields[i].strip()

    text = "\n".join(fields)

    return re.sub("\n{2,}", "\n\n", text).strip()


def _normalize_whitespace(tree):
    def _normalize(text):
        text = text.replace("\n", "")
        text = re.sub(r"[\t ]+", " ", text)
        return text

    for item in tree.xpath(
        "//*[not(ancestor-or-self::pre) and not(ancestor-or-self::textarea)]"
    ):
        if item.text is not None:
            item.text = _normalize(item.text)
        for c in item:
            if c.tail is not None:
                c.tail = _normalize(c.tail)
    return tree


def _traf_xml_to_html(tree):
    # replace tag
    for elem in tree.iter(
        "hi", "list", "item", "head", "lb", "quote", "del", "row", "cell", "ab"
    ):
        if elem.tag == "hi":
            rend = elem.get("rend", "b")
            if rend == "#i":
                elem.tag = "i"
            elif rend == "#b":
                elem.tag = "b"
            elif rend == "#u":
                elem.tag = "u"
            elif rend == "#t":
                elem.tag = "code"
            elif rend == "#sub":
                elem.tag = "sub"
            elif rend == "#sup":
                elem.tag = "sup"
            if "rend" in elem.attrib:
                elem.attrib.pop("rend")
        elif elem.tag == "list":
            rend = elem.get("rend", "ul")
            elem.tag = rend
            if "rend" in elem.attrib:
                elem.attrib.pop("rend")
        elif elem.tag == "item":
            rend = elem.get("rend")
            if not rend:
                elem.tag = "li"
            else:
                tag, _idx = rend.split("-", 1)
                elem.tag = tag
            if "rend" in elem.attrib:
                elem.attrib.pop("rend")
        elif elem.tag == "head":
            rend = elem.get("rend", "h6")
            elem.tag = rend
            if "rend" in elem.attrib:
                elem.attrib.pop("rend")
        elif elem.tag == "lb":
            elem.tag = "br"
        elif elem.tag == "quote":
            elem.tag = "pre"
        elif elem.tag == "delete":
            elem.tag = "del"
        elif elem.tag == "row":
            elem.tag = "tr"
        elif elem.tag == "cell":
            if "role" in elem:
                if elem["role"] == "head":
                    elem.tag = "th"
                    elem.attrib.pop("role")
                    continue
            elem.tag = "td"
        elif elem.tag == "ab":
            if "type" in elem:
                if elem["type"] == "header":
                    elem.tag = "h6"
                    elem.attrib.pop("type")
                    continue
            elem.tag = "p"
    return tree


def _build_traf_doc_full(traf_bare_res):
    title = traf_bare_res.get("title", "")
    main = traf_bare_res["body"]
    comments = traf_bare_res.get("commentsbody")
    output = ET.Element("body")
    if title is not None and len(title) > 0:
        ele = ET.Element("h1")
        ele.text = title
        output.append(ele)
    main.tag = "p"
    output.append(main)
    if comments is not None:
        comments.tag = "p"
        output.append(comments)

    output = _traf_xml_to_html(output)
    return output


# no title no comments
def _build_traf_doc(traf_bare_res):
    output = ET.Element("body")

    main = traf_bare_res["body"]
    main.tag = "div"
    output.append(main)

    output = _traf_xml_to_html(output)
    return output


_RESET_CACHES_INTERVAL = 100
_reset_caches_counter = 0


def _reset_caches():
    global _reset_caches_counter, _RESET_CACHES_INTERVAL
    _reset_caches_counter += 1
    if _reset_caches_counter >= _RESET_CACHES_INTERVAL:
        trafilatura_reset_caches()
        _reset_caches_counter = 0


def _detect_zip_bomb(data):
    if isinstance(data, bytes):
        if data[:2] == b"\x1f\x8b":
            try:
                count = 0
                dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
                for i in range(0, len(data), 64):
                    chunk = data[i : i + 64]
                    rv = dec.decompress(chunk)
                    count += len(rv)
                    if count > ZIP_BOMB_SIZE_THRESHOLD:
                        return True
            except (EOFError, OSError):
                pass
        # try brotli
        else:
            try:
                count = 0
                dec = brotlicffi.Decompressor()
                for i in range(0, len(data), 64):
                    chunk = data[i : i + 64]
                    rv = dec.decompress(chunk)
                    count += len(rv)
                    if count > ZIP_BOMB_SIZE_THRESHOLD:
                        return True
            except brotlicffi.error:
                pass  # logging.debug('invalid Brotli file')

    return False


# ref: https://gitlab.gnome.org/GNOME/libxml2/-/blame/master/include/libxml/parserInternals.h#L45
HTML_LENGTH_THRESHOLD = 10_000_000


def trafilatura_process(html):
    assert not _detect_zip_bomb(html), "zip bomb detected"
    assert len(html) < HTML_LENGTH_THRESHOLD, "Skip html that exceed length limit"

    # article extraction
    traf_res = bare_extraction(
        html,
        output_format="txt",
        include_comments=False,
        favor_precision=True,
        include_formatting=True,
        include_tables=True,
        include_images=False,
        include_links=False,
        deduplicate=False,
    )
    if traf_res is None:
        raise EmptyResultException("Trafilatura empty result")
    traf_html_tree = _build_traf_doc(traf_res)
    traf_html_tree = _normalize_whitespace(traf_html_tree)
    traf_html = tostring(traf_html_tree, encoding="unicode")
    traf_text = xmltotxt(traf_html_tree, False)
    traf_text = _remove_dup_newline(traf_text)

    if FLAG_TRAFILATURA_RESET_CACHE:
        _reset_caches()

    return {"text": traf_text, "html": traf_html}


__all__ = [
    "trafilatura_process",
]


================================================
FILE: GeneralDomain/redstone_cc/download_utils.py
================================================
import os
import subprocess
import shlex
import shutil
from functools import lru_cache
from urllib.parse import urlparse

import requests
from loguru import logger


def _url_basename(url):
    parse_res = urlparse(url)
    return os.path.split(parse_res.path)[1]


def _normalize_dst(src, dst):
    if os.path.isdir(dst):
        dst = os.path.join(dst, _url_basename(src))

    return dst


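@lru_cache
def detect_aria2():
    # run without a shell so that "--version" is actually passed to aria2c
    try:
        p = subprocess.run(["aria2c", "--version"], capture_output=True)
    except FileNotFoundError:
        return False
    return p.returncode == 0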


def download_with_aria2(src, dst, num_connections=16, quiet=False, extra_args=None):
    if not detect_aria2():
        raise RuntimeError("aria2c not detected")

    dst = _normalize_dst(src, dst)
    if extra_args is None:
        extra_args = []
    elif not isinstance(extra_args, list):
        raise ValueError(f"Invalid extra_args type {type(extra_args)}")

    parts = [
        "aria2c",
        "-x",
        str(num_connections),
        "-s",
        str(num_connections),
        "--retry-wait",
        "3",
        *extra_args,
    ]
    if quiet:
        parts.append("-q")
    else:
        parts.append("--console-log-level=error")
        parts.append("--download-result=hide")
        # known issue: tqdm progress bar may still be overridden by aria2
        parts.append("--show-console-readout=false")

    parts.append(src)
    dst_dir = os.path.dirname(dst)
    dst_name = os.path.basename(dst)
    parts.append("-d")
    parts.append(dst_dir)
    parts.append("-o")
    parts.append(dst_name)
    # pass the argument list directly; no shell is needed
    subprocess.run(parts, check=True)

    return dst


def download_with_requests(src, dst):
    dst = _normalize_dst(src, dst)
    with requests.get(src, stream=True) as r:
        r.raise_for_status()
        with open(dst, "wb") as f:
            shutil.copyfileobj(r.raw, f)

    return dst


def download(src, dst):
    if detect_aria2():
        return download_with_aria2(src, dst)
    else:
        logger.info("aria2 not found, falling back to requests")
        return download_with_requests(src, dst)


================================================
FILE: GeneralDomain/redstone_cc/process.py
================================================
import tempfile
import os

import pyarrow.parquet as pq
from tqdm import tqdm
from warcio.archiveiterator import ArchiveIterator
from loguru import logger

from .download_utils import download
from .algos.trafilatura_process import trafilatura_process, EmptyResultException
from .algos.fasttext_classifier import FASTTEXT_LID_176_URL, FastTextClassifier
from .algos.rule_based_filters.ruleset.refinedweb import apply_refinedweb_rules

LA_PROB_THRESHOLD = 0.65


def process_items(remote_cc_path, items, disable_tqdm=False):
    # items to dict
    uri_to_item = dict()
    for item in items:
        assert item["cc_path"] == remote_cc_path
        uri_to_item[item["uri"]] = item

    # main processing
    with tempfile.TemporaryDirectory(dir=os.getcwd()) as tmp_dir:
        logger.info(f"downloading warc file {remote_cc_path}")
        local_cc_file = download(remote_cc_path, tmp_dir)
        # prepare lid model
        logger.info(f"downloading fasttext lid model {FASTTEXT_LID_176_URL}")
        local_lid_model = download(FASTTEXT_LID_176_URL, tmp_dir)
        lid_classfier = FastTextClassifier(local_lid_model)

        results = []
        with open(local_cc_file, "rb") as fd:
            for record in tqdm(ArchiveIterator(fd), disable=disable_tqdm):
                warc_type = record.rec_headers.get_header("WARC-Type")
                if warc_type != "response":
                    continue

                uri = record.rec_headers.get_header("WARC-Target-URI")
                if uri not in uri_to_item:
                    continue
                # article extraction
                raw_html = record.content_stream().read()
                try:
                    traf_res = trafilatura_process(raw_html)
                except EmptyResultException:
                    logger.warning(f"trafilatura: failed to convert record: {uri}")
                    # skip this record; traf_res is not valid here
                    continue

                traf_text = traf_res["text"]
                # lid
                la, la_prob = lid_classfier.predict(traf_text)
                if la != "en" or la_prob < LA_PROB_THRESHOLD:
                    continue
                # rule based filter
                filtered_text = apply_refinedweb_rules(traf_text, la)
                if filtered_text is None:
                    continue

                result_item = {
                    **uri_to_item[uri],
                    "text": filtered_text,
                }

                results.append(result_item)

    return results


def process_file(index_path):
    items = pq.read_table(index_path).to_pylist()
    assert len(items) > 0
    cc_path = items[0]["cc_path"]
    return process_items(cc_path, items)


================================================
FILE: LICENSE
================================================
    MIT License

    Copyright (c) Microsoft Corporation.

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.


================================================
FILE: README.md
================================================
<p align="center">
  <img src="assets/icon.png" width="150">
  <br />
  <br />
  <a href="https://huggingface.co/datasets/zjsd/RedStone"><img alt="Hugging Face Dataset" src="https://img.shields.io/badge/Hugging%20Face-Dataset-orange?logo=huggingface" /></a>
  <a href="https://arxiv.org/abs/2412.03398"><img alt="arXiv 2412.03398" src="https://img.shields.io/badge/ArXiv-2412.03398-green.svg" /></a>
  <a href="https://github.com/microsoft/RedStone/blob/main/LICENSE"><img alt="MIT License" src="https://img.shields.io/badge/license-MIT-blue.svg" /></a>
</p>

--------------------------------------------------------------------------------

# [REDSTONE : Curating General, Code, Math, and QA Data for Large Language Models](https://arxiv.org/abs/2412.03398)

**RedStone** is an innovative and scalable pipeline designed to extract and process data from a vast amount of web content, facilitating the creation of diverse and comprehensive pre-training datasets. We demonstrate its capabilities by building pre-training datasets across multiple domains, including general, code, mathematics, and question-answering. REDSTONE's flexibility allows it to easily adapt to various specialized fields.

# Dataset
| Datasets        | Tokens (B) | Link |
|-----------------|------------| ---- |
| REDSTONE-Web    | 3,170.2    | [REDSTONE-Web](https://huggingface.co/datasets/zjsd/RedStone) |
| REDSTONE-Code   | 250.2      | [REDSTONE-Code-python (Python Only)](https://huggingface.co/datasets/zjsd/RedStone-Code-python) |
| REDSTONE-Math   | 15.9       | [REDSTONE-Math](https://huggingface.co/datasets/zjsd/RedStone-Math) |
| REDSTONE-QA     | 51.4       | [REDSTONE-OpenQuestion](https://huggingface.co/datasets/zjsd/RedStone-QA-oq) [REDSTONE-MultiChoiceQuestion](https://huggingface.co/datasets/zjsd/RedStone-QA-mcq) |

**UPDATE [2/10/2025]**: All open-source datasets are reproduced by [@zjsd](https://huggingface.co/zjsd) based on our open-source code. We have verified the scale of these datasets and manually reviewed some samples; they are comparable to our internal datasets in both size and quality.

**Note [12/08/2024]:** Since **we do not have the permission to open-source the processed data**, we provide all the code for RedStone to process both general and domain-specific data, along with an [index](https://huggingface.co/datasets/microsoft/RedStone) of high-quality data from Common Crawl after filtering. You can download the raw Common Crawl data, use the provided index to find high-quality pages, and process them with RedStone's scripts.

If you have the appropriate licenses, **we encourage you to use these scripts to reproduce the dataset and contribute it to the open-source community**. We will reference the data here for easy access. Additionally, we welcome you to use RedStone to expand domain-specific categories beyond just code, math, and QA.
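
As a rough illustration of the index-driven workflow described in the note above, the sketch below groups index rows by WARC file so that each file needs to be downloaded only once and its records can be looked up by URI. The field names `cc_path` and `uri` follow `GeneralDomain/redstone_cc/process.py`. The helper `group_index_by_warc` and the sample rows are hypothetical, added here for illustration only.

```python
from collections import defaultdict


def group_index_by_warc(items):
    """Group index rows by WARC path, mapping each target URI to its row.

    Hypothetical helper: it mirrors the URI lookup table that
    process_items builds for a single WARC file.
    """
    grouped = defaultdict(dict)
    for item in items:
        grouped[item["cc_path"]][item["uri"]] = item
    return dict(grouped)


# Made-up rows for illustration; real rows come from the published index.
index_rows = [
    {"cc_path": "https://example.org/a.warc.gz", "uri": "https://site.example/1"},
    {"cc_path": "https://example.org/a.warc.gz", "uri": "https://site.example/2"},
    {"cc_path": "https://example.org/b.warc.gz", "uri": "https://site.example/3"},
]

grouped = group_index_by_warc(index_rows)
for cc_path, uri_to_item in grouped.items():
    print(cc_path, len(uri_to_item))
```

Each group could then be handed to the repository's own processing functions. See the Getting Started guides for the supported entry points.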

# Performance
### General Domain Data
| Datasets      | ARC-c | ARC-e | HellaSwag | OpenBookQA | PIQA  | Winogrande | AVERAGE |
|---------------|-------|-------|-----------|------------|-------|------------|---------|
| RedPajama     | 0.2270| 0.4386| 0.3171    | 0.1900     | 0.5968| **0.5296** | 0.3832  |
| FineWeb       | 0.1928| 0.4428| 0.3506    | 0.1740     | 0.6681| 0.5288     | 0.3929  |
| RefinedWeb    | 0.2125| 0.4369| 0.3380    | 0.2100     | 0.6491| 0.5264     | 0.3955  |
| DCLM          | 0.2159| 0.4848| 0.3614    | 0.1760     | 0.6615| 0.5082     | 0.4013  |
| FineWeb-Edu   | **0.2722**| **0.5648**| 0.3637    | 0.1940     | 0.6676| 0.5051     | 0.4279  |
| **REDSTONE-Web**  | 0.2662| 0.5181| **0.3722**| **0.2340** | **0.6795**| 0.5162     | **0.4310** |

<sub>**The results are based on models trained with 1.3 billion parameters on 50 billion tokens.**</sub>
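
The AVERAGE column appears to be the unweighted mean of the six benchmark scores in each row. The README does not state this explicitly, so treat it as an assumption. A quick check against the REDSTONE-Web row:

```python
from statistics import mean

# REDSTONE-Web scores copied from the table above.
redstone_web = {
    "ARC-c": 0.2662,
    "ARC-e": 0.5181,
    "HellaSwag": 0.3722,
    "OpenBookQA": 0.2340,
    "PIQA": 0.6795,
    "Winogrande": 0.5162,
}

# Unweighted mean across the six benchmarks.
average = mean(redstone_web.values())
print(f"{average:.4f}")  # prints 0.4310, matching the AVERAGE column
```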

### Domain-specific Data
#### REDSTONE-Code
| Dataset         | HumanEval pass@1 | HumanEval pass@10 | MBPP pass@1 | MBPP pass@10 |
|-----------------|------------------|-------------------|-------------|--------------|
| REDSTONE-Web    | 0.0125           | 0.0168            | 0.0751      | 0.1566       |
| + **REDSTONE-Code** | **0.0555**       | **0.1035**        | **0.1311**  | **0.2458**   |

#### REDSTONE-Math
| Dataset                    | GSM8k  | MATH   |
|----------------------------|--------|--------|
| OpenWebMath       | 3.2503 | 3.1288 |
| **REDSTONE-Math**              | **3.1125** | **3.0557** |

#### REDSTONE-QA
| Model               | MMLU  | Arc Challenge | Arc Easy | OpenbookQA | Winogrande | AVERAGE |
|---------------------|-------|---------------|----------|------------|------------|---------|
| StableLM-2-1.6B     | 0.3135| 0.3481        | **0.6860**| 0.2780     | 0.6354     | 0.4522  |
| + FLAN v2           | 0.3525| 0.3601        | 0.6406   | **0.2860** | 0.6125     | 0.4503  |
| + Open Orca         | 0.3569| 0.3089        | 0.5821   | 0.2660     | 0.5675     | 0.4163  |
| + **REDSTONE-QA**       | **0.4582**| **0.3643**| 0.6839   | 0.2760     | **0.6377** | **0.4840** |

**<sub>For evaluations on the domain-specific datasets, we used the same architecture as StableLM-2-1.6B.</sub>**

# Getting Started

| Domain | Link |
|----------------------|--------------------------------------------------------------------------------------------|
| General Domain Data  |[Getting Started](https://github.com/microsoft/RedStone/blob/main/GeneralDomain/README.md)  | 
| Domain-specific Data |[Getting Started](https://github.com/microsoft/RedStone/blob/main/DomainSpecific/readme.md) |

# Responsible AI FAQ
- **What is RedStone Source Code?**
    - RedStone is a pipeline designed to extract a wide range of specified knowledge from Common Crawl on a large scale. It is composed of three modules, Collection, Filtering and Extraction. As an example, we use RedStone to build extensive domain-specific datasets in the fields of code, mathematics, question answering (QA), and general data. Utilizing RedStone, it is possible to easily acquire valuable knowledge from a multitude of other domains within Common Crawl.
- **What can RedStone Source Code do?**
    - RedStone Source Code provides sample code for the pipeline’s components and workflow, together with an index of source locations, enabling anyone to construct large-scale datasets for various domains from Common Crawl, including general web content, web code, web mathematics, and web QA data.
- **What is/are RedStone Source Code’s intended use(s)?**
    - We release RedStone, aiming to provide this resource to the research community to accelerate the development of large language models and for demonstrating a novel method of constructing training datasets. Given the research nature of this work, production or commercial uses are out of scope without further testing and mitigation.
- **How was RedStone Source Code evaluated? What metrics are used to measure performance?**
    - We use RedStone to build domain-specific datasets in the fields of code, mathematics, question answering (QA), and general datasets as examples. We evaluate the performance of the datasets across multiple benchmarks, demonstrating that RedStone significantly enhances model performance in mathematics, code, and QA tasks.
- **What are the limitations of RedStone Source Code? How can users minimize the impact of these limitations when using the system?**
    - RedStone takes several domains as examples to verify the methodologies and pipelines. We believe the approach should work for other fields. However, the source code repo is customized for these domains and for English materials only. Obtaining data for a different domain or language, or running in your own environment, takes extra effort to revise the code for your task and settings.
    - RedStone employs quality filters to get content with correct grammar, logical consistency, and factual accuracy. Despite our efforts to remove toxic content, some harmful content may be present.
    - RedStone experimented with the scope of deduplication, and the results indicate that narrowing the scope yields the highest scores. A possible explanation is that a narrower deduplication scope results in a data distribution that more closely mirrors the real world, where frequently occurring data in real life also appears multiple times in the dataset. However, we are currently unable to verify this hypothesis and will investigate it.
    - There might be incorrect data in raw data that could not be filtered out, which may result in inaccurate answers for some questions.
    - Common Crawl data may not be suitable for all downstream uses due to copyright or other legal reasons. Users are responsible for verifying the legal right to use Common Crawl data for their intended purpose.
- **What operational factors and settings allow for effective and responsible use of RedStone Source Code?**
    - The user is responsible for validating the safety and accuracy of any datasets developed using RedStone Source Code, or any model developed using a dataset constructed using our methods.
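
To make the notion of "deduplication scope" from the limitations above concrete, here is a minimal sketch. It assumes exact-match deduplication on a text hash and a hypothetical `snapshot` field as the scope; the README does not specify the actual deduplication method or the scopes that were compared.

```python
import hashlib


def dedup(docs, scope_key):
    """Drop exact-duplicate texts, but only within the same scope."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        key = (scope_key(doc), digest)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# Made-up documents; "snapshot" is a hypothetical scope field.
docs = [
    {"snapshot": "A", "text": "hello world"},
    {"snapshot": "A", "text": "hello world"},
    {"snapshot": "B", "text": "hello world"},
    {"snapshot": "B", "text": "something else"},
]

# Wide scope: one shared bucket, so a text survives only once overall.
global_kept = dedup(docs, scope_key=lambda d: None)
# Narrow scope: duplicates are removed only within each snapshot.
per_snapshot_kept = dedup(docs, scope_key=lambda d: d["snapshot"])

print(len(global_kept), len(per_snapshot_kept))
```

With the narrower scope, text that recurs across snapshots is kept once per snapshot, which is the effect the hypothesis above refers to.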

# Citation
If you find this repository useful, please consider citing our work:
```
@article{redstone,
  title={{RedStone}: {Curating} General, Code, Math, and {QA} Data for Large Language Models},
  author={Chang, Yaoyao and Cui, Lei and Dong, Li and Huang, Shaohan and Huang, Yangyu and Huang, Yupan and Li, Scarlett and Lv, Tengchao and Ma, Shuming and Sun, Qinzheng and others},
  journal={arXiv preprint arXiv:2412.03398},
  year={2024}
}
```

# License
The content of this project itself is licensed under the [MIT License](./LICENSE).

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

# Contact
For help or issues using RedStone, please submit a GitHub issue.

For other communications related to RedStone, please contact [Lei Cui](mailto:lecu@microsoft.com) or [Furu Wei](mailto:fuwei@microsoft.com).


================================================
FILE: SECURITY.md
================================================
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).

If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com).  If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). 

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

  * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
  * Full paths of source file(s) related to the manifestation of the issue
  * The location of the affected source code (tag/branch/commit or direct URL)
  * Any special configuration required to reproduce the issue
  * Step-by-step instructions to reproduce the issue
  * Proof-of-concept or exploit code (if possible)
  * Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->


================================================
FILE: SUPPORT.md
================================================
# TODO: The maintainer of this repo has not yet edited this file

**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?

- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.

*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*

# Support

## How to file issues and get help  

This project uses GitHub Issues to track bugs and feature requests. Please search the existing 
issues before filing new issues to avoid duplicates.  For new issues, file your bug or 
feature request as a new Issue.

For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE 
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.

## Microsoft Support Policy  

Support for this **PROJECT or PRODUCT** is limited to the resources listed above.