[
  {
    "path": ".github/workflows/duplicate-code-detection.yml",
    "content": "name: Duplicate code\n\non: pull_request\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    runs-on: ubuntu-22.04\n    steps:\n      - uses: actions/checkout@v4\n      - name: Check for duplicate code\n        uses: ./\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \"./\"\n          # Only examine .h and .cpp files\n          file_extensions: \"py\"\n          ignore_below: 1\n          fail_above: 70\n          warn_above: 15\n          one_comment: true\n"
  },
  {
    "path": ".gitignore",
    "content": ".vscode\n"
  },
  {
    "path": ".pre-commit-hooks.yaml",
    "content": "-   id: duplicate-code-detection\n    name: Detect duplicate code\n    description: This hook will run duplicate code detection.\n    entry: duplicate-code-detection -f\n    language: python\n    types: [text]"
  },
  {
    "path": "Dockerfile",
    "content": "FROM python:3.7-slim\n\nRUN apt-get update\nRUN apt-get -y install git jq\n\nCOPY duplicate_code_detection.py requirements.txt run_action.py entrypoint.sh /action/\n\nRUN pip3 install -r /action/requirements.txt requests && \\\n    python3 -c \"import nltk; nltk.download('punkt')\" && \\\n    ln -s /root/nltk_data /usr/local/nltk_data \n\nENTRYPOINT [\"/action/entrypoint.sh\"]\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2018 Dimitris Platis\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# Duplicate Code Detection Tool\nA simple Python3 tool (also available as a [GitHub Action](#github-action)) to detect\nsimilarities between files within a repository.\n\n## What?\nA command line tool that receives a directory or a list of files and determines\nthe degree of similarity between them.\n\n## Why?\nThe tool intends guide the refactoring efforts of a developer who wishes\nto reduce code duplication within a component and improve its software\narchitecture.\n\nIts development was initiated within the context of the\n[DAT265 - Software Evolution Project](https://pingpong.chalmers.se/public/courseId/9754/lang-en/publicPage.do).\n\n## How?\nThe tool uses the [gensim](https://radimrehurek.com/gensim/) Python library to\ndetermine the similarity between source code files, supplied by the user.\nThe default supported languages are C, C++, JAVA, Python and C#.\n\n### Dependencies\nThe following Python packages have to be installed:\n  * nltk\n    * `pip3 install --user nltk`\n  * gensim\n    * `pip3 install --user gensim`\n  * astor\n    * `pip3 install --user astor`\n  * punkt\n    * `python3 -m nltk.downloader punkt`\n\n## Get started\nSuppress the warnings (generated by the used libraries)\nas `python3 -W ignore duplicate_code_detection.py` and then supply the necessary\narguments. More details can be found by running the tool with the `--help` option.\n\n**Notice:** Due to the way the models are created, the more source files you\nprovide the tool the more accurate the similarity calculations are. 
In other\nwords, the bigger the project, the more useful the tool is.\n\n### Example\nIf `duplicate-code-detection-tool` is the name where the tool resides in and\n`smartcar_shield/src` contains the repository you want to check for source code\nsimilarities between the files, then you can run the following to get the\nsimilarity report:\n\n`python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/`\n\nThe result should look something like this:\n\n![code duplication tool screenshot](https://i.imgur.com/wi1TnVM.png)\n\n## GitHub Action\n\nThe tool is also available as a [GitHub Action](https://docs.github.com/en/actions) for easy integration\nwith projects hosted on GitHub. An example output of the tool can be seen\n[here](https://github.com/platisd/smartcar_shield/pull/36#issuecomment-778635111).\n\nThe Action is meant to be triggered during **pull requests** to give the developers an impression\nover the **degree of similarity** between the files in the source code. 
Below you will find a sample\nworkflow files that illustrate the usage.\n\nDepending on the *size* of your project, you may want to have the tool running multiple times\n(i.e in diffferent steps) that test specific parts of your repository for duplicate code.\nThis way you will not compare each file in your codebase with everything else and get back more\nmeaningful reports.\n\n### Bare minimum\n\nIn the following example the tool will examine source code (the languages supported by default)\nin the `src/` and `test/ut` directories *relative* to the root directory of your repository.\nThe results will be posted as a comment in the **pull request** that was opened.\n\n```yaml\nname: Duplicate code\n\non: pull_request\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    runs-on: ubuntu-20.04\n    steps:\n      - name: Check for duplicate code\n        uses: platisd/duplicate-code-detection-tool@master\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \"src/, test/ut\"\n```\n\n### Trigger on pull request comment\n\nIf you want to avoid the \"spam\" you should configure the tool to not always run. Specifically, if you\nwish to trigger the Action manually, you can do so by leaving a comment in the pull request.\n\nThe following action will trigger the tool to be run when a comment containig `run_duplicate_code_detection_tool`\nis posted in a pull request. 
The tool will run using the code in the pull request.\n\n```yaml\nname: Duplicate code\n\non: issue_comment\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    # Trigger the tool only when a comment containing the keyword is published in a pull request\n    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection_tool')\n    runs-on: ubuntu-20.04\n    steps:\n      - name: Check for duplicate code\n        uses: platisd/duplicate-code-detection-tool@master\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \".\"\n```\n\n**Important:** Please note that due to the way GitHub Actions work, you will *first* have to merge this into your main\nbranch so it starts taking effect.\n\n### Optional configuration\n\nIt may not make sense to compare all files or get a files with very low similarity reported.\nIn the following workflow, the different *optional* arguments are demonstrated.\n\nFor the various default values, please consult [action.yml](action.yml).\n\n```yaml\nname: Duplicate code\n\non: pull_request\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    runs-on: ubuntu-20.04\n    steps:\n      - name: Check for duplicate code\n        uses: platisd/duplicate-code-detection-tool@master\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \"src\"\n          # Ignore the specified directories\n          ignore_directories: \"src/external_libraries\"\n          # Only examine .h and .cpp files\n          file_extensions: \"h, cpp\"\n          # Only report similarities above 5%\n          ignore_below: 5\n          # If a file is more than 70% similar to another, then the job fails\n          fail_above: 70\n          # If a file is more than 15% similar to another, show a warning symbol in the report\n          warn_above: 15\n          # Remove `src/` from the file paths when reporting 
similarities\n          project_root_dir: \"src\"\n          # Remove docstrings from code before analysis\n          # For python source code only. This is checked on a per-file basis\n          only_code: true\n          # Leave only one comment with the report and update it for consecutive runs\n          one_comment: true\n          # The message to be displayed at the start of the report\n          header_message_start: \"The following files have a similarity above the threshold:\"\n```\n## Using duplicate-code-check with pre-commit\nTo use Duplicate Code Detection Tool as a pre-commit hook with [pre-commit](https://pre-commit.com/) add the following to your `.pre-commit-config.yaml` file:\n```yaml\n-   repo: https://github.com/platisd/duplicate-code-detection-tool.git\n    rev: ''  # Use the sha / tag you want to point at\n    hooks:\n    -   id: duplicate-code-detection\n```\n> **_NOTE:_** that this repository sets args: `-f`, if you are configuring duplicate-code-detection-tool using args you'll want to include either `-f` (`--files`) or `-d` (`--directories`).\n\n## Limitations\n\n- `only_code` option only works with python files for now\n"
  },
  {
    "path": "action.yml",
    "content": "name: 'Duplicate code detection tool'\ndescription: 'Detect similarities between source code files'\ninputs:\n  github_token:\n    description: 'The GitHub token'\n    required: true\n  directories:\n    description: 'A comma-separated list of the directories containing the source code'\n    required: true\n  ignore_directories:\n    description: 'A comma-separated list of directories that should be ignored'\n    required: false\n    default: ''\n  project_root_dir:\n    description: 'The relative path to filter out when reporting results'\n    required: false\n    default: './'\n  file_extensions:\n    description: 'A comma-separated list of source code file extensions to check for similarities'\n    required: false\n    default: 'h, hpp, c, cpp, cc, java, py, cs'\n  ignore_below:\n    description: 'The minimum similarity percentage to be reported'\n    required: false\n    default: 10\n  fail_above:\n    description: 'The maximum allowed similarity percentage before the action fails'\n    required: false\n    default: 100\n  warn_above:\n    description: 'The maximum allowed similarity percentage before the action warns'\n    required: false\n    default: 100\n  only_code:\n    description: \"Removes comments and docstrings from the source code before analysis\"\n    required: false\n    default: false\n  one_comment:\n    description: 'Duplication report will be left as a single comment, which will be updated, instead of multiple ones'\n    required: false\n    default: false\n  header_message_start:\n    description: 'The message to be displayed at the start of the duplication report.\n                  It is used by the bot to identify previous reports and update them, so it must be unique.\n                  If you want to use the Action in multiple steps of the same workflow,\n                  then you can change this message in each step to avoid conflicts'\n    required: false\n    default: '## 📌 Duplicate code detection tool 
report'\nruns:\n  using: 'docker'\n  image: 'Dockerfile'\nbranding:\n  icon: 'check'\n  color: 'green'\n"
  },
  {
    "path": "duplicate_code_detection.py",
    "content": "\"\"\"\nA simple Python3 tool to detect similarities between files within a repository.\n\nDocument similarity code adapted from Jonathan Mugan's tutorial:\nhttps://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python\n\"\"\"\nimport os\nimport sys\nimport argparse\nimport gensim\nimport ast\nimport csv\nimport astor\nimport re\nimport tempfile\nimport json\nfrom enum import Enum\nfrom nltk.tokenize import word_tokenize\nfrom collections import OrderedDict\n\nsource_code_file_extensions = [\"h\", \"c\", \"cpp\", \"cc\", \"java\", \"py\", \"cs\"]\nfile_column_label = \"File\"\nfile_loc_label = \",#LoC\"\nsimilarity_column_label = \"Similarity (%)\"\nsimilarity_label_length = len(similarity_column_label)\nloc_label = \"#LoC\"\nsimilarity_label = \"Similarity\"\n\n\nclass ReturnCode(Enum):\n    SUCCESS = 0\n    BAD_INPUT = 1\n    THRESHOLD_EXCEEDED = 2\n\n\nclass CliColors:\n    HEADER = \"\\033[95m\"\n    OKBLUE = \"\\033[94m\"\n    OKGREEN = \"\\033[92m\"\n    WARNING = \"\\033[93m\"\n    FAIL = \"\\033[91m\"\n    ENDC = \"\\033[0m\"\n    BOLD = \"\\033[1m\"\n    UNDERLINE = \"\\033[4m\"\n\n\ndef get_all_source_code_from_directory(directory, file_extensions):\n    \"\"\"Get a list with all the source code files within the directory\"\"\"\n    source_code_files = list()\n    for dirpath, _, filenames in os.walk(directory):\n        for name in filenames:\n            _, file_extension = os.path.splitext(name)\n            if file_extension[1:] in file_extensions:\n                filename = os.path.join(dirpath, name)\n                source_code_files.append(filename)\n\n    return source_code_files\n\n\ndef conditional_print(text, machine_friendly_output):\n    if not machine_friendly_output:\n        print(text)\n\n\ndef remove_comments_and_docstrings(source_code: str) -> str:\n    \"\"\"Strip comments and docstrings from source code\n\n    .. 
seealso::\n\n        https://gist.github.com/phpdude/1ae6f19de213d66286c8183e9e3b9ec1\n\n    :param source_code: Raw source code as a single string\n    :type source_code: str\n    :return: Stripped source code as a single string\n    :rtype: str\n    \"\"\"\n    parsed = ast.parse(source_code)\n    for node in ast.walk(parsed):\n        if not isinstance(\n            node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef, ast.Module)\n        ):\n            continue\n\n        if not len(node.body):\n            continue\n\n        if not isinstance(node.body[0], ast.Expr):\n            continue\n\n        if not hasattr(node.body[0], \"value\") or not isinstance(\n            node.body[0].value, ast.Str\n        ):\n            continue\n\n        node.body = node.body[1:]\n\n    source_code_clean = astor.to_source(parsed)\n    return source_code_clean\n\n\ndef get_loc_count(file_path):\n    lines_count = -1\n    try:\n        with open(os.path.normpath(file_path), 'r') as the_file:\n            lines_count = len(the_file.readlines())\n    except Exception as err:\n        print(f\"WARNING: Failed to get lines count for file {file_path}, reason: {str(err)}\")\n    return lines_count\n\n\ndef get_loc_to_print(loc_count):\n    loc_to_print = str(loc_count) if loc_count >= 0 else \"\"\n    return loc_to_print\n\n\ndef main():\n    parser_description = (\n        CliColors.HEADER\n        + CliColors.BOLD\n        + \"=== Duplicate Code Detection Tool ===\"\n        + CliColors.ENDC\n    )\n    parser = argparse.ArgumentParser(description=parser_description)\n    parser.add_argument(\n        \"-t\",\n        \"--fail-threshold\",\n        type=int,\n        default=100,\n        help=\"The maximum allowed similarity before the script exits with an error.\",\n    )\n    group = parser.add_mutually_exclusive_group(required=True)\n    group.add_argument(\n        \"-d\",\n        \"--directories\",\n        nargs=\"+\",\n        help=\"Check for similarities 
between all files of the specified directories.\",\n    )\n    group.add_argument(\n        \"-f\",\n        \"--files\",\n        nargs=\"+\",\n        help=\"Check for similarities between specified files. \\\n                        The more files are supplied, the more accurate the results are.\",\n    )\n    parser.add_argument(\n        \"--ignore-directories\", nargs=\"+\", default=list(), help=\"Directories to ignore.\"\n    )\n    parser.add_argument(\"--ignore-files\", nargs=\"+\", help=\"Files to ignore.\")\n    parser.add_argument(\n        \"-j\", \"--json\", action=\"store_true\", help=\"Print output as JSON.\"\n    )\n    parser.add_argument(\n        \"--project-root-dir\",\n        type=str,\n        default=str(),\n        help=\"The relative path to the project root directory to be removed when printing out results.\",\n    )\n    parser.add_argument(\n        \"--file-extensions\",\n        nargs=\"+\",\n        default=source_code_file_extensions,\n        help=\"File extensions to check for similarities.\",\n    )\n    parser.add_argument(\n        \"--ignore-threshold\",\n        type=int,\n        default=0,\n        help=\"Don't print out similarity below the ignore threshold\",\n    )\n    parser.add_argument(\n        \"--only-code\",\n        action=\"store_true\",\n        help=\"Removes comments and docstrings from the source code before analysis\",\n    )\n    parser.add_argument(\n        \"--csv-output\",\n        type=str,\n        default=str(),\n        help=\"Outputs results as a CSV to the specified CSV path\",\n    )\n    parser.add_argument(\n        \"--show-loc\",\n        action=\"store_true\",\n        help=\"Add file line counts, including blank lines and comments, to all outputs.\",\n    )\n    args = parser.parse_args()\n\n    result = run(\n        args.fail_threshold,\n        args.directories,\n        args.files,\n        args.ignore_directories,\n        args.ignore_files,\n        args.json,\n        
args.project_root_dir,\n        args.file_extensions,\n        args.ignore_threshold,\n        args.only_code,\n        args.csv_output,\n        args.show_loc,\n    )\n\n    return result\n\n\ndef run(\n    fail_threshold,\n    directories,\n    files,\n    ignore_directories,\n    ignore_files,\n    json_output,\n    project_root_dir,\n    file_extensions,\n    ignore_threshold,\n    only_code,\n    csv_output,\n    show_loc,\n):\n    # Determine which files to compare for similarities\n    source_code_files = list()\n    files_to_ignore = list()\n    if directories:\n        for directory in directories:\n            if not os.path.isdir(directory):\n                print(\"Path does not exist or is not a directory:\", directory)\n                return (ReturnCode.BAD_INPUT, {})\n            source_code_files += get_all_source_code_from_directory(\n                directory, file_extensions\n            )\n        for directory in ignore_directories:\n            files_to_ignore += get_all_source_code_from_directory(\n                directory, file_extensions\n            )\n    else:\n        if len(files) < 2:\n            print(\"Too few files to compare, you need to supply at least 2\")\n            return (ReturnCode.BAD_INPUT, {})\n        for supplied_file in files:\n            if not os.path.isfile(supplied_file):\n                print(\"Supplied file does not exist:\", supplied_file)\n                return (ReturnCode.BAD_INPUT, {})\n        source_code_files = files\n\n    files_to_ignore += ignore_files if ignore_files else list()\n    files_to_ignore = [os.path.normpath(f) for f in files_to_ignore]\n    source_code_files = [os.path.normpath(f) for f in source_code_files]\n    source_code_files = list(set(source_code_files) - set(files_to_ignore))\n    if len(source_code_files) < 2:\n        print(\"Not enough source code files found\")\n        return (ReturnCode.BAD_INPUT, {})\n    # Sort the sources, so the results are sorted too and are 
reproducible\n    source_code_files.sort()\n    source_code_files = [os.path.abspath(f) for f in source_code_files]\n\n    # Get the absolute project root directory path to remove when printing out the results\n    if project_root_dir:\n        if not os.path.isdir(project_root_dir):\n            print(\n                \"The project root directory does not exist or is not a directory:\",\n                project_root_dir,\n            )\n            return (ReturnCode.BAD_INPUT, {})\n        project_root_dir = os.path.abspath(project_root_dir)\n        project_root_dir = os.path.join(project_root_dir, \"\")  # Add the trailing slash\n\n    # Find the largest string length to format the textual output\n    largest_string_length = len(\n        max(source_code_files, key=len).replace(project_root_dir, \"\")\n    )\n\n    # Parse the contents of all the source files\n    source_code = OrderedDict()\n    for source_code_file in source_code_files:\n        try:\n            # read file but also recover from encoding errors in source files\n            with open(source_code_file, \"r\", errors=\"surrogateescape\") as f:\n                # Store source code with the file path as the key\n                content = f.read()\n                if only_code and source_code_file.endswith(\".py\"):\n                    content = remove_comments_and_docstrings(content)\n                source_code[source_code_file] = content\n        except Exception as err:\n            print(f\"ERROR: Failed to open file {source_code_file}, reason: {str(err)}\")\n\n    # Create a Similarity object of all the source code\n    gen_docs = [\n        [word.lower() for word in word_tokenize(source_code[source_file])]\n        for source_file in source_code\n    ]\n    dictionary = gensim.corpora.Dictionary(gen_docs)\n    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]\n    tf_idf = gensim.models.TfidfModel(corpus)\n    sims = gensim.similarities.Similarity(\n        
tempfile.gettempdir() + os.sep, tf_idf[corpus], num_features=len(dictionary)\n    )\n\n    column_label = file_column_label\n    if show_loc:\n        column_label += file_loc_label\n        largest_string_length += len(file_loc_label)\n\n    exit_code = ReturnCode.SUCCESS\n    code_similarity = dict()\n    for source_file in source_code:\n        # Check for similarities\n        query_doc = [w.lower() for w in word_tokenize(source_code[source_file])]\n        query_doc_bow = dictionary.doc2bow(query_doc)\n        query_doc_tf_idf = tf_idf[query_doc_bow]\n\n        loc_info = \"\"\n        source_file_loc = -1\n        if show_loc:\n            source_file_loc = get_loc_count(source_file)\n            loc_info = \",\" + get_loc_to_print(source_file_loc)\n\n        short_source_file_path = source_file.replace(project_root_dir, \"\")\n        conditional_print(\n            \"\\n\\n\\n\"\n            + CliColors.HEADER\n            + \"Code duplication probability for \"\n            + short_source_file_path\n            + loc_info\n            + CliColors.ENDC,\n            json_output,\n        )\n        conditional_print(\n            \"-\" * (largest_string_length + similarity_label_length), json_output\n        )\n        conditional_print(\n            CliColors.BOLD\n            + \"%s %s\"\n            % (column_label.center(largest_string_length), similarity_column_label)\n            + CliColors.ENDC,\n            json_output,\n        )\n        conditional_print(\n            \"-\" * (largest_string_length + similarity_label_length), json_output\n        )\n\n        empty_length = 0\n        code_similarity[short_source_file_path] = dict()\n        if show_loc:\n            code_similarity[short_source_file_path][loc_label] = source_file_loc\n            empty_length = len(code_similarity[short_source_file_path])\n        for similarity, source in zip(sims[query_doc_tf_idf], source_code):\n            # Ignore similarities for the same file\n           
 if source == source_file:\n                continue\n            similarity_percentage = similarity * 100\n            # Ignore very low similarity\n            if similarity_percentage < ignore_threshold:\n                continue\n            short_source_path = source.replace(project_root_dir, \"\")\n            if show_loc:\n                code_similarity[short_source_file_path][short_source_path] = dict()\n                code_similarity[short_source_file_path][short_source_path][loc_label] = get_loc_count(\n                    source\n                )\n                code_similarity[short_source_file_path][short_source_path][similarity_label]  = round(\n                    similarity_percentage, 2\n                )\n            else:\n                code_similarity[short_source_file_path][short_source_path] = round(\n                    similarity_percentage, 2\n                )\n            if similarity_percentage > fail_threshold:\n                exit_code = ReturnCode.THRESHOLD_EXCEEDED\n            color = (\n                CliColors.OKGREEN\n                if similarity_percentage < 10\n                else (\n                    CliColors.WARNING if similarity_percentage < 20 else CliColors.FAIL\n                )\n            )\n            info_to_print = short_source_path\n            if show_loc:\n                info_to_print += \",\" + get_loc_to_print(get_loc_count(source))\n\n            conditional_print(\n                \"%s     \" % (info_to_print.ljust(largest_string_length))\n                + color\n                + \"%.2f\" % (similarity_percentage)\n                + CliColors.ENDC,\n                json_output,\n            )\n        # If no similarities found for the particular file, remove it from the report\n        if len(code_similarity[short_source_file_path]) == empty_length:\n            del code_similarity[short_source_file_path]\n    if exit_code == ReturnCode.THRESHOLD_EXCEEDED:\n        conditional_print(\n     
       \"Code duplication threshold exceeded. Please consult logs.\", json_output\n        )\n\n    if json_output:\n        similarities_json = json.dumps(code_similarity, indent=4)\n        print(similarities_json)\n\n    if csv_output:\n        with open(csv_output, \"w\") as csv_file:\n            writer = csv.writer(csv_file)\n            if show_loc:\n                writer.writerow([\"File A\", \"#LoC A\", \"File B\", \"#LoC B\", \"Similarity\"])\n                for first_file in code_similarity:\n                    for second_file in code_similarity[first_file]:\n                        if second_file != loc_label:\n                            \n                            writer.writerow(\n                                [\n                                    first_file,\n                                    get_loc_to_print(get_loc_count(os.path.join(project_root_dir, first_file))),\n                                    second_file,\n                                    get_loc_to_print(get_loc_count(os.path.join(project_root_dir, second_file))),\n                                    code_similarity[first_file][second_file][similarity_label],\n                                ]\n                            )\n            else:\n                writer.writerow([\"File A\", \"File B\", \"Similarity\"])\n                for first_file in code_similarity:\n                    for second_file in code_similarity[first_file]:\n                        writer.writerow(\n                            [\n                                first_file,\n                                second_file,\n                                code_similarity[first_file][second_file],\n                            ]\n                        )\n\n    return (exit_code, code_similarity)\n\n\nif __name__ == \"__main__\":\n    exit_code, _ = main()\n    sys.exit(exit_code.value)\n"
  },
  {
    "path": "entrypoint.sh",
    "content": "#!/bin/bash\nset -eu\n\nscript_dir=\"$(dirname \"$0\")\"\ncd $script_dir\n\npull_request_id=$(cat \"$GITHUB_EVENT_PATH\" | jq 'if (.issue.number != null) then .issue.number else .number end')\nbranch_name=\"pull_request_branch\"\n\nif [ $pull_request_id == \"null\" ]; then\n  echo \"Could not find a pull request ID. Is this a pull request?\"\n  exit 1\nfi\n\nmaintainer=${GITHUB_REPOSITORY%/*}\neval git clone \"https://${maintainer}:${INPUT_GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git\" ${GITHUB_REPOSITORY}\ncd $GITHUB_REPOSITORY\neval git config remote.origin.fetch +refs/heads/*:refs/remotes/origin/*\neval git fetch origin pull/$pull_request_id/head:$branch_name\neval git checkout $branch_name\n\nlatest_head=$(git rev-parse HEAD)\n\neval python3 /action/run_action.py --latest-head $latest_head --pull-request-id $pull_request_id\n"
  },
  {
    "path": "requirements.txt",
    "content": "gensim>=3.8\nnltk>=3.5\nastor>=0.8.1"
  },
  {
    "path": "run_action.py",
    "content": "#!/usr/bin/env python\n\nimport os\nimport sys\nimport json\nimport requests\nimport argparse\n\nimport duplicate_code_detection\n\nWARNING_SUFFIX = \" ⚠️\"\n\n\ndef make_markdown_table(array):\n    \"\"\"Input: Python list with rows of table as lists\n               First element as header.\n        Output: String to put into a .md file\n\n    Ex Input:\n        [[\"Name\", \"Age\", \"Height\"],\n         [\"Jake\", 20, 5'10],\n         [\"Mary\", 21, 5'7]]\n\n     Adopted from: https://gist.github.com/m0neysha/219bad4b02d2008e0154\n    \"\"\"\n    markdown = \"\\n\" + str(\"| \")\n\n    for e in array[0]:\n        to_add = \" \" + str(e) + str(\" |\")\n        markdown += to_add\n    markdown += \"\\n\"\n\n    markdown += \"|\"\n    for i in range(len(array[0])):\n        markdown += str(\"-------------- | \")\n    markdown += \"\\n\"\n\n    markdown_characters = 0\n    max_characters = 65000\n    for entry in array[1:]:\n        markdown += str(\"| \")\n        for e in entry:\n            to_add = str(e) + str(\" | \")\n            markdown += to_add\n        markdown += \"\\n\"\n        markdown_characters += len(markdown)\n        if markdown_characters > max_characters:\n            markdown += \"\\n\" + WARNING_SUFFIX + \" \"\n            markdown += \"Results were omitted because the report was too large. 
\"\n            markdown += \"Please consider ignoring results below a certain threshold.\\n\"\n            break\n\n    return markdown + \"\\n\"\n\n\ndef get_markdown_link(file, url):\n    return \"[%s](%s%s)\" % (file, url, file)\n\n\ndef get_warning(similarity, warn_threshold):\n    return (\n        str(similarity)\n        if similarity < int(warn_threshold)\n        else str(similarity) + WARNING_SUFFIX\n    )\n\n\ndef similarities_to_markdown(similarities, url_prefix, warn_threshold):\n    markdown = str()\n    for checked_file in similarities.keys():\n        markdown += \"<details><summary>%s</summary>\\n\\n\" % checked_file\n        markdown += \"### 📄 %s\\n\" % get_markdown_link(checked_file, url_prefix)\n\n        table_header = [\"File\", \"Similarity (%)\"]\n        table_contents = [\n            [get_markdown_link(f, url_prefix), get_warning(s, warn_threshold)]\n            for (f, s) in similarities[checked_file].items()\n        ]\n        # Sort table contents based on similarity\n        table_contents.sort(\n            reverse=True, key=lambda row: float(row[1].replace(WARNING_SUFFIX, \"\"))\n        )\n        entire_table = [[] for _ in range(len(table_contents) + 1)]\n        entire_table[0] = table_header\n        for i in range(1, len(table_contents) + 1):\n            entire_table[i] = table_contents[i - 1]\n\n        markdown += make_markdown_table(entire_table)\n        markdown += \"</details>\\n\"\n\n    return markdown\n\n\ndef split_and_trim(input_list):\n    return [token.strip() for token in input_list.split(\",\")]\n\n\ndef to_absolute_path(paths):\n    return [os.path.abspath(path) for path in paths]\n\n\ndef main():\n    parser = argparse.ArgumentParser(\n        description=\"Duplicate code detection action runner\"\n    )\n    parser.add_argument(\n        \"--latest-head\",\n        type=str,\n        default=\"master\",\n        help=\"The latest commit hash or branch\",\n    )\n    parser.add_argument(\n        
\"--pull-request-id\", type=str, required=True, help=\"The pull request id\"\n    )\n    args = parser.parse_args()\n\n    fail_threshold = os.environ.get(\"INPUT_FAIL_ABOVE\")\n    directories = os.environ.get(\"INPUT_DIRECTORIES\")\n    ignore_directories = os.environ.get(\"INPUT_IGNORE_DIRECTORIES\")\n    project_root_dir = os.environ.get(\"INPUT_PROJECT_ROOT_DIR\")\n    file_extensions = os.environ.get(\"INPUT_FILE_EXTENSIONS\")\n    ignore_threshold = os.environ.get(\"INPUT_IGNORE_BELOW\")\n    only_code = os.environ.get(\"INPUT_ONLY_CODE\")\n\n    directories_list = split_and_trim(directories)\n    directories_list = to_absolute_path(directories_list)\n    ignore_directories_list = (\n        split_and_trim(ignore_directories) if ignore_directories != \"\" else list()\n    )\n    ignore_directories_list = to_absolute_path(ignore_directories_list)\n    file_extensions_list = split_and_trim(file_extensions)\n    project_root_dir = os.path.abspath(project_root_dir)\n\n    files_list = None\n    ignore_files_list = None\n    json_output = True\n    csv_output_path = \"\"  # No CSV output by default for now in GitHub Actions\n    show_loc = False\n\n    detection_result, code_similarity = duplicate_code_detection.run(\n        int(fail_threshold),\n        directories_list,\n        files_list,\n        ignore_directories_list,\n        ignore_files_list,\n        json_output,\n        project_root_dir,\n        file_extensions_list,\n        int(ignore_threshold),\n        bool(only_code),\n        csv_output_path,\n        show_loc,\n    )\n\n    if detection_result == duplicate_code_detection.ReturnCode.BAD_INPUT:\n        print(\"Action aborted due to bad user input\")\n        return detection_result.value\n    elif detection_result == duplicate_code_detection.ReturnCode.THRESHOLD_EXCEEDED:\n        print(\n            \"Action failed due to maximum similarity threshold exceeded, check the report\"\n        )\n\n    repo = 
os.environ.get(\"GITHUB_REPOSITORY\")\n    files_url_prefix = \"https://github.com/%s/blob/%s/\" % (repo, args.latest_head)\n    warn_threshold = os.environ.get(\"INPUT_WARN_ABOVE\")\n\n    header_message_start = os.environ.get(\"INPUT_HEADER_MESSAGE_START\") + \"\\n\"\n    message = header_message_start\n    message += \"The [tool](https://github.com/platisd/duplicate-code-detection-tool)\"\n    message += \" analyzed your source code and found the following degree of\"\n    message += \" similarity between the files:\\n\"\n    message += similarities_to_markdown(\n        code_similarity, files_url_prefix, warn_threshold\n    )\n\n    github_token = os.environ.get(\"INPUT_GITHUB_TOKEN\")\n    github_api_url = os.environ.get(\"GITHUB_API_URL\")\n\n    request_url = \"%s/repos/%s/issues/%s/comments\" % (\n        github_api_url,\n        repo,\n        args.pull_request_id,\n    )\n\n    headers = {\n        \"Authorization\": \"token %s\" % github_token,\n    }\n    report = {\"body\": message}\n\n    update_existing_comment = os.environ.get(\"INPUT_ONE_COMMENT\", \"false\").lower() in (\n        \"true\",\n        \"1\",\n    )\n    comment_updated = False\n    if update_existing_comment:\n        # If the bot has posted many comments, update the last one\n        pr_comments = requests.get(request_url, headers=headers).json()\n        for pr_comment in pr_comments[::-1]:\n            if pr_comment[\"body\"].startswith(header_message_start):\n                update_result = requests.patch(\n                    pr_comment[\"url\"],\n                    json=report,\n                    headers=headers,\n                )\n                if update_result.status_code != 200:\n                    print(\n                        \"Updating existing comment failed with code: \"\n                        + str(update_result.status_code)\n                    )\n                    print(update_result.text)\n                    print(\"Attempting to post a new comment 
instead\")\n                else:\n                    comment_updated = True\n                break\n\n    if not comment_updated:\n        post_result = requests.post(\n            request_url,\n            json=report,\n            headers=headers,\n        )\n\n        if post_result.status_code != 201:\n            print(\n                \"Posting results to GitHub failed with code: \"\n                + str(post_result.status_code)\n            )\n            print(post_result.text)\n\n    with open(\"message.md\", \"w\") as f:\n        f.write(message)\n\n    return detection_result.value\n\n\nif __name__ == \"__main__\":\n    sys.exit(main())\n"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup\n\nsetup(\n    name='duplicate code detection tool',\n    entry_points={\n        'console_scripts': ['duplicate-code-detection=duplicate_code_detection:main']\n    },\n    py_modules=['duplicate_code_detection'],\n    package_dir={\n        'duplicate_code_detection': '.',\n    },\n    install_requires=[\n        'gensim>=3.8',\n        'nltk>=3.5',\n        'astor>=0.8.1'\n    ],\n    setuptools_git_versioning={\n        \"enabled\": True,\n    },\n    setup_requires=[\"setuptools-git-versioning<2\"],\n)\n"
  }
]