Repository: platisd/duplicate-code-detection-tool
Branch: master
Commit: f0413b1571a9
Files: 12
Total size: 34.3 KB

Directory structure:
gitextract_hdgkqkk1/
├── .github/
│   └── workflows/
│       └── duplicate-code-detection.yml
├── .gitignore
├── .pre-commit-hooks.yaml
├── Dockerfile
├── LICENSE
├── README.md
├── action.yml
├── duplicate_code_detection.py
├── entrypoint.sh
├── requirements.txt
├── run_action.py
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/duplicate-code-detection.yml
================================================
name: Duplicate code
on: pull_request

jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Check for duplicate code
        uses: ./
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "./"
          # Only examine .py files
          file_extensions: "py"
          ignore_below: 1
          fail_above: 70
          warn_above: 15
          one_comment: true

================================================
FILE: .gitignore
================================================
.vscode

================================================
FILE: .pre-commit-hooks.yaml
================================================
- id: duplicate-code-detection
  name: Detect duplicate code
  description: This hook will run duplicate code detection.
  entry: duplicate-code-detection -f
  language: python
  types: [text]

================================================
FILE: Dockerfile
================================================
FROM python:3.7-slim

RUN apt-get update
RUN apt-get -y install git jq

COPY duplicate_code_detection.py requirements.txt run_action.py entrypoint.sh /action/
RUN pip3 install -r /action/requirements.txt requests && \
    python3 -c "import nltk; nltk.download('punkt')" && \
    ln -s /root/nltk_data /usr/local/nltk_data

ENTRYPOINT ["/action/entrypoint.sh"]

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Dimitris Platis

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# Duplicate Code Detection Tool

A simple Python3 tool (also available as a [GitHub Action](#github-action)) to detect similarities between files within a repository.

## What?
A command line tool that receives a directory or a list of files and determines the degree of similarity between them.

## Why?

The tool intends to guide the refactoring efforts of a developer who wishes to reduce code duplication within a component and improve its software architecture. Its development was initiated within the context of the [DAT265 - Software Evolution Project](https://pingpong.chalmers.se/public/courseId/9754/lang-en/publicPage.do).

## How?

The tool uses the [gensim](https://radimrehurek.com/gensim/) Python library to determine the similarity between source code files, supplied by the user. The default supported languages are C, C++, Java, Python and C#.

### Dependencies

The following Python packages have to be installed:

* nltk
  * `pip3 install --user nltk`
* gensim
  * `pip3 install --user gensim`
* astor
  * `pip3 install --user astor`
* punkt
  * `python3 -m nltk.downloader punkt`

## Get started

Suppress the warnings (generated by the used libraries) by running the tool as `python3 -W ignore duplicate_code_detection.py` and then supply the necessary arguments. More details can be found by running the tool with the `--help` option.

**Notice:** Due to the way the models are created, the more source files you provide the tool, the more accurate the similarity calculations are. In other words, the bigger the project, the more useful the tool is.

### Example

If `duplicate-code-detection-tool` is the directory where the tool resides and `smartcar_shield/src` contains the repository you want to check for source code similarities between files, then you can run the following to get the similarity report:

`python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/`

The result should look something like this:

![code duplication tool screenshot](https://i.imgur.com/wi1TnVM.png)

## GitHub Action

The tool is also available as a [GitHub Action](https://docs.github.com/en/actions) for easy integration with projects hosted on GitHub.
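The gensim-based pipeline described under "How?" boils down to: tokenize each file, build a bag-of-words dictionary, weight the terms with TF-IDF, and compare documents with cosine similarity. As a rough, standard-library-only sketch of that idea (not the tool's actual code, which uses gensim's `TfidfModel` and `Similarity`; the function names below are hypothetical):

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build simple TF-IDF vectors (token -> weight dicts) for a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency;
        # tokens shared by all documents get weight 0
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Files sharing many rare tokens score close to 1, while files with disjoint vocabularies score 0, which is why larger corpora (more files) make the IDF weighting, and therefore the report, more meaningful.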
An example output of the tool can be seen [here](https://github.com/platisd/smartcar_shield/pull/36#issuecomment-778635111). The Action is meant to be triggered during **pull requests** to give the developers an impression of the **degree of similarity** between the files in the source code. Below you will find sample workflow files that illustrate the usage.

Depending on the *size* of your project, you may want to have the tool run multiple times (i.e. in different steps) that test specific parts of your repository for duplicate code. This way you will not compare each file in your codebase with everything else, and you will get back more meaningful reports.

### Bare minimum

In the following example the tool will examine source code (in the languages supported by default) in the `src/` and `test/ut` directories *relative* to the root directory of your repository. The results will be posted as a comment in the **pull request** that was opened.

```yaml
name: Duplicate code
on: pull_request
jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-20.04
    steps:
      - name: Check for duplicate code
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "src/, test/ut"
```

### Trigger on pull request comment

If you want to avoid the "spam" you can configure the tool so that it does not always run. Specifically, if you wish to trigger the Action manually, you can do so by leaving a comment in the pull request. The following workflow will trigger the tool when a comment containing `run_duplicate_code_detection_tool` is posted in a pull request. The tool will run using the code in the pull request.
```yaml
name: Duplicate code
on: issue_comment
jobs:
  duplicate-code-check:
    name: Check for duplicate code
    # Trigger the tool only when a comment containing the keyword is published in a pull request
    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection_tool')
    runs-on: ubuntu-20.04
    steps:
      - name: Check for duplicate code
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "."
```

**Important:** Please note that due to the way GitHub Actions work, you will *first* have to merge this workflow into your main branch before it starts taking effect.

### Optional configuration

It may not make sense to compare all files, or to get files with very low similarity reported. The following workflow demonstrates the different *optional* arguments. For the various default values, please consult [action.yml](action.yml).

```yaml
name: Duplicate code
on: pull_request
jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-20.04
    steps:
      - name: Check for duplicate code
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "src"
          # Ignore the specified directories
          ignore_directories: "src/external_libraries"
          # Only examine .h and .cpp files
          file_extensions: "h, cpp"
          # Only report similarities above 5%
          ignore_below: 5
          # If a file is more than 70% similar to another, then the job fails
          fail_above: 70
          # If a file is more than 15% similar to another, show a warning symbol in the report
          warn_above: 15
          # Remove `src/` from the file paths when reporting similarities
          project_root_dir: "src"
          # Remove docstrings from code before analysis
          # For python source code only.
          # This is checked on a per-file basis
          only_code: true
          # Leave only one comment with the report and update it for consecutive runs
          one_comment: true
          # The message to be displayed at the start of the report
          header_message_start: "The following files have a similarity above the threshold:"
```

## Using duplicate-code-check with pre-commit

To use Duplicate Code Detection Tool as a pre-commit hook with [pre-commit](https://pre-commit.com/), add the following to your `.pre-commit-config.yaml` file:

```yaml
- repo: https://github.com/platisd/duplicate-code-detection-tool.git
  rev: ''  # Use the sha / tag you want to point at
  hooks:
    - id: duplicate-code-detection
```

> **_NOTE:_** This repository's hook sets the `-f` argument by default. If you configure duplicate-code-detection-tool with your own `args`, you will want to include either `-f` (`--files`) or `-d` (`--directories`).

## Limitations

- The `only_code` option only works with Python files for now.

================================================
FILE: action.yml
================================================
name: 'Duplicate code detection tool'
description: 'Detect similarities between source code files'
inputs:
  github_token:
    description: 'The GitHub token'
    required: true
  directories:
    description: 'A comma-separated list of the directories containing the source code'
    required: true
  ignore_directories:
    description: 'A comma-separated list of directories that should be ignored'
    required: false
    default: ''
  project_root_dir:
    description: 'The relative path to filter out when reporting results'
    required: false
    default: './'
  file_extensions:
    description: 'A comma-separated list of source code file extensions to check for similarities'
    required: false
    default: 'h, hpp, c, cpp, cc, java, py, cs'
  ignore_below:
    description: 'The minimum similarity percentage to be reported'
    required: false
    default: 10
  fail_above:
    description: 'The maximum allowed similarity percentage before the action fails'
    required: false
    default: 100
  warn_above:
    description: 'The maximum allowed similarity percentage before the action warns'
    required: false
    default: 100
  only_code:
    description: "Removes comments and docstrings from the source code before analysis"
    required: false
    default: false
  one_comment:
    description: 'Duplication report will be left as a single comment, which will be updated, instead of multiple ones'
    required: false
    default: false
  header_message_start:
    description: 'The message to be displayed at the start of the duplication report. It is used by the bot to identify previous reports and update them, so it must be unique. If you want to use the Action in multiple steps of the same workflow, then you can change this message in each step to avoid conflicts'
    required: false
    default: '## 📌 Duplicate code detection tool report'
runs:
  using: 'docker'
  image: 'Dockerfile'
branding:
  icon: 'check'
  color: 'green'

================================================
FILE: duplicate_code_detection.py
================================================
"""
A simple Python3 tool to detect similarities between files within a repository.
Document similarity code adapted from Jonathan Mugan's tutorial:
https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
"""
import os
import sys
import argparse
import gensim
import ast
import csv
import astor
import re
import tempfile
import json
from enum import Enum
from nltk.tokenize import word_tokenize
from collections import OrderedDict

source_code_file_extensions = ["h", "c", "cpp", "cc", "java", "py", "cs"]
file_column_label = "File"
file_loc_label = ",#LoC"
similarity_column_label = "Similarity (%)"
similarity_label_length = len(similarity_column_label)
loc_label = "#LoC"
similarity_label = "Similarity"


class ReturnCode(Enum):
    SUCCESS = 0
    BAD_INPUT = 1
    THRESHOLD_EXCEEDED = 2


class CliColors:
    HEADER = "\033[95m"
    OKBLUE = "\033[94m"
    OKGREEN = "\033[92m"
    WARNING = "\033[93m"
    FAIL = "\033[91m"
    ENDC = "\033[0m"
    BOLD = "\033[1m"
    UNDERLINE = "\033[4m"


def get_all_source_code_from_directory(directory, file_extensions):
    """Get a list with all the source code files within the directory"""
    source_code_files = list()
    for dirpath, _, filenames in os.walk(directory):
        for name in filenames:
            _, file_extension = os.path.splitext(name)
            if file_extension[1:] in file_extensions:
                filename = os.path.join(dirpath, name)
                source_code_files.append(filename)
    return source_code_files


def conditional_print(text, machine_friendly_output):
    if not machine_friendly_output:
        print(text)


def remove_comments_and_docstrings(source_code: str) -> str:
    """Strip comments and docstrings from source code

    .. seealso:: https://gist.github.com/phpdude/1ae6f19de213d66286c8183e9e3b9ec1

    :param source_code: Raw source code as a single string
    :type source_code: str
    :return: Stripped source code as a single string
    :rtype: str
    """
    parsed = ast.parse(source_code)
    for node in ast.walk(parsed):
        if not isinstance(
            node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef, ast.Module)
        ):
            continue
        if not len(node.body):
            continue
        if not isinstance(node.body[0], ast.Expr):
            continue
        if not hasattr(node.body[0], "value") or not isinstance(
            node.body[0].value, ast.Str
        ):
            continue
        node.body = node.body[1:]
    source_code_clean = astor.to_source(parsed)
    return source_code_clean


def get_loc_count(file_path):
    lines_count = -1
    try:
        with open(os.path.normpath(file_path), 'r') as the_file:
            lines_count = len(the_file.readlines())
    except Exception as err:
        print(f"WARNING: Failed to get lines count for file {file_path}, reason: {str(err)}")
    return lines_count


def get_loc_to_print(loc_count):
    loc_to_print = str(loc_count) if loc_count >= 0 else ""
    return loc_to_print


def main():
    parser_description = (
        CliColors.HEADER
        + CliColors.BOLD
        + "=== Duplicate Code Detection Tool ==="
        + CliColors.ENDC
    )
    parser = argparse.ArgumentParser(description=parser_description)
    parser.add_argument(
        "-t",
        "--fail-threshold",
        type=int,
        default=100,
        help="The maximum allowed similarity before the script exits with an error.",
    )
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        "-d",
        "--directories",
        nargs="+",
        help="Check for similarities between all files of the specified directories.",
    )
    group.add_argument(
        "-f",
        "--files",
        nargs="+",
        help="Check for similarities between specified files. \
            The more files are supplied the more accurate are the results.",
    )
    parser.add_argument(
        "--ignore-directories", nargs="+", default=list(), help="Directories to ignore."
    )
    parser.add_argument("--ignore-files", nargs="+", help="Files to ignore.")
    parser.add_argument(
        "-j", "--json", action="store_true", help="Print output as JSON."
    )
    parser.add_argument(
        "--project-root-dir",
        type=str,
        default=str(),
        help="The relative path to the project root directory to be removed when printing out results.",
    )
    parser.add_argument(
        "--file-extensions",
        nargs="+",
        default=source_code_file_extensions,
        help="File extensions to check for similarities.",
    )
    parser.add_argument(
        "--ignore-threshold",
        type=int,
        default=0,
        help="Don't print out similarity below the ignore threshold",
    )
    parser.add_argument(
        "--only-code",
        action="store_true",
        help="Removes comments and docstrings from the source code before analysis",
    )
    parser.add_argument(
        "--csv-output",
        type=str,
        default=str(),
        help="Outputs results as a CSV to the specified CSV path",
    )
    parser.add_argument(
        "--show-loc",
        action="store_true",
        help="Add file line counts, including blank lines and comments, to all outputs.",
    )
    args = parser.parse_args()

    result = run(
        args.fail_threshold,
        args.directories,
        args.files,
        args.ignore_directories,
        args.ignore_files,
        args.json,
        args.project_root_dir,
        args.file_extensions,
        args.ignore_threshold,
        args.only_code,
        args.csv_output,
        args.show_loc,
    )
    return result


def run(
    fail_threshold,
    directories,
    files,
    ignore_directories,
    ignore_files,
    json_output,
    project_root_dir,
    file_extensions,
    ignore_threshold,
    only_code,
    csv_output,
    show_loc,
):
    # Determine which files to compare for similarities
    source_code_files = list()
    files_to_ignore = list()
    if directories:
        for directory in directories:
            if not os.path.isdir(directory):
                print("Path does not exist or is not a directory:", directory)
                return (ReturnCode.BAD_INPUT, {})
            source_code_files += get_all_source_code_from_directory(
                directory, file_extensions
            )
        for directory in ignore_directories:
            files_to_ignore += get_all_source_code_from_directory(
                directory, file_extensions
            )
    else:
        if len(files) < 2:
            print("Too few files to compare, you need to supply at least 2")
            return (ReturnCode.BAD_INPUT, {})
        for supplied_file in files:
            if not os.path.isfile(supplied_file):
                print("Supplied file does not exist:", supplied_file)
                return (ReturnCode.BAD_INPUT, {})
        source_code_files = files
        files_to_ignore += ignore_files if ignore_files else list()

    files_to_ignore = [os.path.normpath(f) for f in files_to_ignore]
    source_code_files = [os.path.normpath(f) for f in source_code_files]
    source_code_files = list(set(source_code_files) - set(files_to_ignore))
    if len(source_code_files) < 2:
        print("Not enough source code files found")
        return (ReturnCode.BAD_INPUT, {})
    # Sort the sources, so the results are sorted too and are reproducible
    source_code_files.sort()
    source_code_files = [os.path.abspath(f) for f in source_code_files]

    # Get the absolute project root directory path to remove when printing out the results
    if project_root_dir:
        if not os.path.isdir(project_root_dir):
            print(
                "The project root directory does not exist or is not a directory:",
                project_root_dir,
            )
            return (ReturnCode.BAD_INPUT, {})
        project_root_dir = os.path.abspath(project_root_dir)
        project_root_dir = os.path.join(project_root_dir, "")  # Add the trailing slash

    # Find the largest string length to format the textual output
    largest_string_length = len(
        max(source_code_files, key=len).replace(project_root_dir, "")
    )

    # Parse the contents of all the source files
    source_code = OrderedDict()
    for source_code_file in source_code_files:
        try:
            # Read file but also recover from encoding errors in source files
            with open(source_code_file, "r", errors="surrogateescape") as f:
                # Store source code with the file path as the key
                content = f.read()
                if only_code and source_code_file.endswith("py"):
                    content = remove_comments_and_docstrings(content)
                source_code[source_code_file] = content
        except Exception as err:
            print(f"ERROR: Failed to open file {source_code_file}, reason: {str(err)}")

    # Create a Similarity object of all the source code
    gen_docs = [
        [word.lower() for word in word_tokenize(source_code[source_file])]
        for source_file in source_code
    ]
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity(
        tempfile.gettempdir() + os.sep, tf_idf[corpus], num_features=len(dictionary)
    )

    column_label = file_column_label
    if show_loc:
        column_label += file_loc_label
        largest_string_length += len(file_loc_label)

    exit_code = ReturnCode.SUCCESS
    code_similarity = dict()
    for source_file in source_code:
        # Check for similarities
        query_doc = [w.lower() for w in word_tokenize(source_code[source_file])]
        query_doc_bow = dictionary.doc2bow(query_doc)
        query_doc_tf_idf = tf_idf[query_doc_bow]

        loc_info = ""
        source_file_loc = -1
        if show_loc:
            source_file_loc = get_loc_count(source_file)
            loc_info = "," + get_loc_to_print(source_file_loc)

        short_source_file_path = source_file.replace(project_root_dir, "")
        conditional_print(
            "\n\n\n"
            + CliColors.HEADER
            + "Code duplication probability for "
            + short_source_file_path
            + loc_info
            + CliColors.ENDC,
            json_output,
        )
        conditional_print(
            "-" * (largest_string_length + similarity_label_length), json_output
        )
        conditional_print(
            CliColors.BOLD
            + "%s %s" % (column_label.center(largest_string_length), similarity_column_label)
            + CliColors.ENDC,
            json_output,
        )
        conditional_print(
            "-" * (largest_string_length + similarity_label_length), json_output
        )

        empty_length = 0
        code_similarity[short_source_file_path] = dict()
        if show_loc:
            code_similarity[short_source_file_path][loc_label] = source_file_loc
            empty_length = len(code_similarity[short_source_file_path])

        for similarity, source in zip(sims[query_doc_tf_idf], source_code):
            # Ignore similarities for the same file
            if source == source_file:
                continue
            similarity_percentage = similarity * 100
            # Ignore very low similarity
            if similarity_percentage < ignore_threshold:
                continue
            short_source_path = source.replace(project_root_dir, "")
            if show_loc:
                code_similarity[short_source_file_path][short_source_path] = dict()
                code_similarity[short_source_file_path][short_source_path][loc_label] = get_loc_count(
                    source
                )
                code_similarity[short_source_file_path][short_source_path][similarity_label] = round(
                    similarity_percentage, 2
                )
            else:
                code_similarity[short_source_file_path][short_source_path] = round(
                    similarity_percentage, 2
                )
            if similarity_percentage > fail_threshold:
                exit_code = ReturnCode.THRESHOLD_EXCEEDED
            color = (
                CliColors.OKGREEN
                if similarity_percentage < 10
                else (CliColors.WARNING if similarity_percentage < 20 else CliColors.FAIL)
            )
            info_to_print = short_source_path
            if show_loc:
                info_to_print += "," + get_loc_to_print(get_loc_count(source))
            conditional_print(
                "%s " % (info_to_print.ljust(largest_string_length))
                + color
                + "%.2f" % (similarity_percentage)
                + CliColors.ENDC,
                json_output,
            )

        # If no similarities found for the particular file, remove it from the report
        if len(code_similarity[short_source_file_path]) == empty_length:
            del code_similarity[short_source_file_path]

    if exit_code == ReturnCode.THRESHOLD_EXCEEDED:
        conditional_print(
            "Code duplication threshold exceeded. Please consult logs.", json_output
        )

    if json_output:
        similarities_json = json.dumps(code_similarity, indent=4)
        print(similarities_json)

    if csv_output:
        with open(csv_output, "w") as csv_file:
            writer = csv.writer(csv_file)
            if show_loc:
                writer.writerow(["File A", "#LoC A", "File B", "#LoC B", "Similarity"])
                for first_file in code_similarity:
                    for second_file in code_similarity[first_file]:
                        if second_file != loc_label:
                            writer.writerow(
                                [
                                    first_file,
                                    get_loc_to_print(get_loc_count(os.path.join(project_root_dir, first_file))),
                                    second_file,
                                    get_loc_to_print(get_loc_count(os.path.join(project_root_dir, second_file))),
                                    code_similarity[first_file][second_file][similarity_label],
                                ]
                            )
            else:
                writer.writerow(["File A", "File B", "Similarity"])
                for first_file in code_similarity:
                    for second_file in code_similarity[first_file]:
                        writer.writerow(
                            [
                                first_file,
                                second_file,
                                code_similarity[first_file][second_file],
                            ]
                        )

    return (exit_code, code_similarity)


if __name__ == "__main__":
    exit_code, _ = main()
    sys.exit(exit_code.value)

================================================
FILE: entrypoint.sh
================================================
#!/bin/bash
set -eu

script_dir="$(dirname "$0")"
cd $script_dir

pull_request_id=$(cat "$GITHUB_EVENT_PATH" | jq 'if (.issue.number != null) then .issue.number else .number end')
branch_name="pull_request_branch"

if [ $pull_request_id == "null" ]; then
    echo "Could not find a pull request ID. Is this a pull request?"
    exit 1
fi

maintainer=${GITHUB_REPOSITORY%/*}

eval git clone "https://${maintainer}:${INPUT_GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" ${GITHUB_REPOSITORY}
cd $GITHUB_REPOSITORY
eval git config remote.origin.fetch +refs/heads/*:refs/remotes/origin/*
eval git fetch origin pull/$pull_request_id/head:$branch_name
eval git checkout $branch_name
latest_head=$(git rev-parse HEAD)

eval python3 /action/run_action.py --latest-head $latest_head --pull-request-id $pull_request_id

================================================
FILE: requirements.txt
================================================
gensim>=3.8
nltk>=3.5
astor>=0.8.1

================================================
FILE: run_action.py
================================================
#!/usr/bin/env python
import os
import sys
import json
import requests
import argparse

import duplicate_code_detection

WARNING_SUFFIX = " ⚠️"


def make_markdown_table(array):
    """Input: Python list with rows of table as lists
    First element as header.
    Output: String to put into a .md file

    Ex Input:
        [["Name", "Age", "Height"],
         ["Jake", 20, 5'10],
         ["Mary", 21, 5'7]]

    Adopted from: https://gist.github.com/m0neysha/219bad4b02d2008e0154
    """
    markdown = "\n" + str("| ")

    for e in array[0]:
        to_add = " " + str(e) + str(" |")
        markdown += to_add
    markdown += "\n"

    markdown += "|"
    for i in range(len(array[0])):
        markdown += str("-------------- | ")
    markdown += "\n"

    markdown_characters = 0
    max_characters = 65000
    for entry in array[1:]:
        markdown += str("| ")
        for e in entry:
            to_add = str(e) + str(" | ")
            markdown += to_add
        markdown += "\n"
        markdown_characters += len(markdown)
        if markdown_characters > max_characters:
            markdown += "\n" + WARNING_SUFFIX + " "
            markdown += "Results were omitted because the report was too large. "
            markdown += "Please consider ignoring results below a certain threshold.\n"
            break

    return markdown + "\n"


def get_markdown_link(file, url):
    return "[%s](%s%s)" % (file, url, file)


def get_warning(similarity, warn_threshold):
    return (
        str(similarity)
        if similarity < int(warn_threshold)
        else str(similarity) + WARNING_SUFFIX
    )


def similarities_to_markdown(similarities, url_prefix, warn_threshold):
    markdown = str()
    for checked_file in similarities.keys():
        markdown += "%s\n\n" % checked_file
        markdown += "### 📄 %s\n" % get_markdown_link(checked_file, url_prefix)

        table_header = ["File", "Similarity (%)"]
        table_contents = [
            [get_markdown_link(f, url_prefix), get_warning(s, warn_threshold)]
            for (f, s) in similarities[checked_file].items()
        ]
        # Sort table contents based on similarity
        table_contents.sort(
            reverse=True, key=lambda row: float(row[1].replace(WARNING_SUFFIX, ""))
        )

        entire_table = [[] for _ in range(len(table_contents) + 1)]
        entire_table[0] = table_header
        for i in range(1, len(table_contents) + 1):
            entire_table[i] = table_contents[i - 1]

        markdown += make_markdown_table(entire_table)
        markdown += "\n"

    return markdown


def split_and_trim(input_list):
    return [token.strip() for token in input_list.split(",")]


def to_absolute_path(paths):
    return [os.path.abspath(path) for path in paths]


def main():
    parser = argparse.ArgumentParser(
        description="Duplicate code detection action runner"
    )
    parser.add_argument(
        "--latest-head",
        type=str,
        default="master",
        help="The latest commit hash or branch",
    )
    parser.add_argument(
        "--pull-request-id", type=str, required=True, help="The pull request id"
    )
    args = parser.parse_args()

    fail_threshold = os.environ.get("INPUT_FAIL_ABOVE")
    directories = os.environ.get("INPUT_DIRECTORIES")
    ignore_directories = os.environ.get("INPUT_IGNORE_DIRECTORIES")
    project_root_dir = os.environ.get("INPUT_PROJECT_ROOT_DIR")
    file_extensions = os.environ.get("INPUT_FILE_EXTENSIONS")
    ignore_threshold = os.environ.get("INPUT_IGNORE_BELOW")
    only_code = os.environ.get("INPUT_ONLY_CODE")

    directories_list = split_and_trim(directories)
    directories_list = to_absolute_path(directories_list)
    ignore_directories_list = (
        split_and_trim(ignore_directories) if ignore_directories != "" else list()
    )
    ignore_directories_list = to_absolute_path(ignore_directories_list)
    file_extensions_list = split_and_trim(file_extensions)
    project_root_dir = os.path.abspath(project_root_dir)
    files_list = None
    ignore_files_list = None
    json_output = True
    csv_output_path = ""  # No CSV output by default for now in GitHub Actions
    show_loc = False

    detection_result, code_similarity = duplicate_code_detection.run(
        int(fail_threshold),
        directories_list,
        files_list,
        ignore_directories_list,
        ignore_files_list,
        json_output,
        project_root_dir,
        file_extensions_list,
        int(ignore_threshold),
        bool(only_code),
        csv_output_path,
        show_loc,
    )

    if detection_result == duplicate_code_detection.ReturnCode.BAD_INPUT:
        print("Action aborted due to bad user input")
        return detection_result.value
    elif detection_result == duplicate_code_detection.ReturnCode.THRESHOLD_EXCEEDED:
        print(
            "Action failed due to maximum similarity threshold exceeded, check the report"
        )

    repo = os.environ.get("GITHUB_REPOSITORY")
    files_url_prefix = "https://github.com/%s/blob/%s/" % (repo, args.latest_head)
    warn_threshold = os.environ.get("INPUT_WARN_ABOVE")
    header_message_start = os.environ.get("INPUT_HEADER_MESSAGE_START") + "\n"
    message = header_message_start
    message += "The [tool](https://github.com/platisd/duplicate-code-detection-tool)"
    message += " analyzed your source code and found the following degree of"
    message += " similarity between the files:\n"
    message += similarities_to_markdown(
        code_similarity, files_url_prefix, warn_threshold
    )

    github_token = os.environ.get("INPUT_GITHUB_TOKEN")
    github_api_url = os.environ.get("GITHUB_API_URL")
    request_url = "%s/repos/%s/issues/%s/comments" % (
        github_api_url,
        repo,
        args.pull_request_id,
    )
    headers = {
        "Authorization": "token %s" % github_token,
    }
    report = {"body": message}

    update_existing_comment = os.environ.get("INPUT_ONE_COMMENT", "false").lower() in (
        "true",
        "1",
    )
    comment_updated = False
    if update_existing_comment:
        # If the bot has posted many comments, update the last one
        pr_comments = requests.get(request_url, headers=headers).json()
        for pr_comment in pr_comments[::-1]:
            if pr_comment["body"].startswith(header_message_start):
                update_result = requests.patch(
                    pr_comment["url"],
                    json=report,
                    headers=headers,
                )
                if update_result.status_code != 200:
                    print(
                        "Updating existing comment failed with code: "
                        + str(update_result.status_code)
                    )
                    print(update_result.text)
                    print("Attempting to post a new comment instead")
                else:
                    comment_updated = True
                break

    if not comment_updated:
        post_result = requests.post(
            request_url,
            json=report,
            headers=headers,
        )
        if post_result.status_code != 201:
            print(
                "Posting results to GitHub failed with code: "
                + str(post_result.status_code)
            )
            print(post_result.text)

    with open("message.md", "w") as f:
        f.write(message)

    return detection_result.value


if __name__ == "__main__":
    sys.exit(main())
================================================
FILE: setup.py
================================================
from setuptools import setup

setup(
    name='duplicate code detection tool',
    entry_points={
        'console_scripts': ['duplicate-code-detection=duplicate_code_detection:main']
    },
    py_modules=['duplicate_code_detection'],
    package_dir={
        'duplicate_code_detection': '.',
    },
    install_requires=[
        'gensim>=3.8',
        'nltk>=3.5',
        'astor>=0.8.1'
    ],
    setuptools_git_versioning={
        "enabled": True,
    },
    setup_requires=["setuptools-git-versioning<2"],
)
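As a closing illustration of the report format that `run_action.py` assembles, here is a small, hypothetical helper (not part of the repository; the real implementation uses `make_markdown_table` and `similarities_to_markdown` above, which also cap the report size at roughly 65,000 characters) that turns a `{file: {other_file: percent}}` similarity dict, as produced by `duplicate_code_detection.run`, into a markdown table with the same warning-emoji convention:

```python
def similarity_dict_to_markdown(similarities, warn_above=15):
    """Render {file: {other_file: percent}} as GitHub-flavored markdown tables.

    Files more similar than `warn_above` percent get the same warning
    suffix that run_action.py appends via WARNING_SUFFIX.
    """
    lines = []
    for checked, matches in similarities.items():
        lines.append("### %s" % checked)
        lines.append("| File | Similarity (%) |")
        lines.append("| --- | --- |")
        # Most similar files first, mirroring the sort in similarities_to_markdown
        for other, pct in sorted(matches.items(), key=lambda kv: kv[1], reverse=True):
            suffix = " ⚠️" if pct > warn_above else ""
            lines.append("| %s | %.2f%s |" % (other, pct, suffix))
        lines.append("")
    return "\n".join(lines)
```

Feeding such a table to the GitHub issue-comment API as a `{"body": ...}` payload, as `run_action.py` does, renders it as the report shown in the README's linked example comment.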