Repository: platisd/duplicate-code-detection-tool
Branch: master
Commit: f0413b1571a9
Files: 12
Total size: 34.3 KB
Directory structure:
gitextract_hdgkqkk1/
├── .github/
│ └── workflows/
│ └── duplicate-code-detection.yml
├── .gitignore
├── .pre-commit-hooks.yaml
├── Dockerfile
├── LICENSE
├── README.md
├── action.yml
├── duplicate_code_detection.py
├── entrypoint.sh
├── requirements.txt
├── run_action.py
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/duplicate-code-detection.yml
================================================
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
- name: Check for duplicate code
uses: ./
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "./"
          # Only examine .py files
file_extensions: "py"
ignore_below: 1
fail_above: 70
warn_above: 15
one_comment: true
================================================
FILE: .gitignore
================================================
.vscode
================================================
FILE: .pre-commit-hooks.yaml
================================================
- id: duplicate-code-detection
name: Detect duplicate code
description: This hook will run duplicate code detection.
entry: duplicate-code-detection -f
language: python
types: [text]
================================================
FILE: Dockerfile
================================================
FROM python:3.7-slim
RUN apt-get update
RUN apt-get -y install git jq
COPY duplicate_code_detection.py requirements.txt run_action.py entrypoint.sh /action/
RUN pip3 install -r /action/requirements.txt requests && \
python3 -c "import nltk; nltk.download('punkt')" && \
ln -s /root/nltk_data /usr/local/nltk_data
ENTRYPOINT ["/action/entrypoint.sh"]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2018 Dimitris Platis
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Duplicate Code Detection Tool
A simple Python3 tool (also available as a [GitHub Action](#github-action)) to detect
similarities between files within a repository.
## What?
A command line tool that receives a directory or a list of files and determines
the degree of similarity between them.
## Why?
The tool is intended to guide the refactoring efforts of a developer who wishes
to reduce code duplication within a component and improve its software
architecture.
Its development was initiated within the context of the
[DAT265 - Software Evolution Project](https://pingpong.chalmers.se/public/courseId/9754/lang-en/publicPage.do).
## How?
The tool uses the [gensim](https://radimrehurek.com/gensim/) Python library to
determine the similarity between source code files, supplied by the user.
The default supported languages are C, C++, Java, Python, and C#.
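The approach can be sketched without gensim: tokenize each file, weight tokens with TF-IDF, and score file pairs by cosine similarity. Below is a minimal, dependency-free illustration of that idea (not the tool's actual implementation, which uses gensim's `TfidfModel` and `Similarity` classes):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (token -> weight) for each token list."""
    n = len(docs)
    # Document frequency: in how many documents each token occurs
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.hypot(*a.values()) * math.hypot(*b.values())
    return dot / norm if norm else 0.0

# Three tiny "files" as token lists; the first two are near-duplicates
docs = [
    "def add ( a , b ) : return a + b".split(),
    "def add ( x , y ) : return x + y".split(),
    "print ( 'hello' )".split(),
]
vecs = tfidf_vectors(docs)
print("similar pair:   %.2f%%" % (100 * cosine(vecs[0], vecs[1])))
print("unrelated pair: %.2f%%" % (100 * cosine(vecs[0], vecs[2])))
```

Because the IDF weights are estimated from the whole corpus, scores become more meaningful as more files are supplied, which is also why the tool works better on larger projects.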
### Dependencies
The following Python packages have to be installed:
* nltk
* `pip3 install --user nltk`
* gensim
* `pip3 install --user gensim`
* astor
* `pip3 install --user astor`
* punkt
* `python3 -m nltk.downloader punkt`
## Get started
Suppress the warnings generated by the underlying libraries by running the tool
as `python3 -W ignore duplicate_code_detection.py` and then supply the necessary
arguments. More details can be found by running the tool with the `--help` option.
**Notice:** Due to the way the models are created, the more source files you
provide to the tool, the more accurate the similarity calculations are. In other
words, the bigger the project, the more useful the tool is.
### Example
If `duplicate-code-detection-tool` is the directory where the tool resides and
`smartcar_shield/src` contains the source code you want to check for
similarities between its files, then you can run the following to get the
similarity report:
`python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/`
The result should look something like this:

*(screenshot of the similarity report omitted from this text extraction)*
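When run with `--json` (and without `--show-loc`), the tool prints a nested mapping of each file to its similar files and their similarity percentages. A small post-processing sketch (file names and percentages below are hypothetical):

```python
# Hypothetical report, shaped like the tool's --json output:
# { "file": { "other file": similarity percentage } }
report = {
    "src/car.cpp": {"src/truck.cpp": 42.15, "src/util.cpp": 7.30},
    "src/truck.cpp": {"src/car.cpp": 42.15, "src/util.cpp": 5.12},
}

threshold = 20  # only keep pairs above this similarity (%)
# Deduplicate (A, B)/(B, A) pairs by sorting each pair before collecting
flagged = sorted(
    {tuple(sorted((a, b)))
     for a, sims in report.items()
     for b, pct in sims.items() if pct > threshold}
)
print(flagged)  # [('src/car.cpp', 'src/truck.cpp')]
```

This is handy for wiring the report into other tooling, e.g. failing a custom CI step only for specific file pairs.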
## GitHub Action
The tool is also available as a [GitHub Action](https://docs.github.com/en/actions) for easy integration
with projects hosted on GitHub. An example output of the tool can be seen
[here](https://github.com/platisd/smartcar_shield/pull/36#issuecomment-778635111).
The Action is meant to be triggered during **pull requests** to give the developers an impression
of the **degree of similarity** between the files in the source code. Below you will find sample
workflow files that illustrate the usage.
Depending on the *size* of your project, you may want to run the tool multiple times
(i.e. in different steps), each testing a specific part of your repository for duplicate code.
This way you will not compare each file in your codebase with everything else, and you will get
more meaningful reports back.
### Bare minimum
In the following example the tool will examine source code (the languages supported by default)
in the `src/` and `test/ut` directories *relative* to the root directory of your repository.
The results will be posted as a comment in the **pull request** that was opened.
```yaml
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "src/, test/ut"
```
### Trigger on pull request comment
If you want to avoid the "spam", you can configure the tool not to run on every pull request. Specifically, if you
wish to trigger the Action manually, you can do so by leaving a comment in the pull request.
The following workflow will trigger the tool when a comment containing `run_duplicate_code_detection_tool`
is posted in a pull request. The tool will run using the code in the pull request.
```yaml
name: Duplicate code
on: issue_comment
jobs:
duplicate-code-check:
name: Check for duplicate code
# Trigger the tool only when a comment containing the keyword is published in a pull request
if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection_tool')
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "."
```
**Important:** Please note that due to the way GitHub Actions work, you will *first* have to merge this workflow into your main
branch before it starts taking effect.
### Optional configuration
It may not make sense to compare all files, or to have files with very low similarity reported.
In the following workflow, the different *optional* arguments are demonstrated.
For the various default values, please consult [action.yml](action.yml).
```yaml
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "src"
# Ignore the specified directories
ignore_directories: "src/external_libraries"
# Only examine .h and .cpp files
file_extensions: "h, cpp"
# Only report similarities above 5%
ignore_below: 5
# If a file is more than 70% similar to another, then the job fails
fail_above: 70
# If a file is more than 15% similar to another, show a warning symbol in the report
warn_above: 15
# Remove `src/` from the file paths when reporting similarities
project_root_dir: "src"
# Remove docstrings from code before analysis
# For python source code only. This is checked on a per-file basis
only_code: true
# Leave only one comment with the report and update it for consecutive runs
one_comment: true
# The message to be displayed at the start of the report
header_message_start: "The following files have a similarity above the threshold:"
```
## Using duplicate-code-check with pre-commit
To use Duplicate Code Detection Tool as a pre-commit hook with [pre-commit](https://pre-commit.com/) add the following to your `.pre-commit-config.yaml` file:
```yaml
- repo: https://github.com/platisd/duplicate-code-detection-tool.git
rev: '' # Use the sha / tag you want to point at
hooks:
- id: duplicate-code-detection
```
> **_NOTE:_** This repository's hook already sets the `-f` argument. If you configure duplicate-code-detection-tool with your own `args`, you'll want to include either `-f` (`--files`) or `-d` (`--directories`).
## Limitations
- The `only_code` option only works with Python files for now
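The docstring stripping behind `only_code` can be sketched with the standard library alone. This is a simplified variant of the tool's `remove_comments_and_docstrings`, using `ast.unparse` (Python 3.9+) instead of `astor`; comments are dropped automatically because `ast.parse` discards them:

```python
import ast

def strip_docstrings(source: str) -> str:
    """Return source with module/class/function docstrings removed.

    Simplified sketch; requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(
            node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        ):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Drop the docstring; keep the body valid with a `pass` if needed
            node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)

print(strip_docstrings('def f():\n    """A docstring."""\n    return 1\n'))
```

The tool itself targets Python 3.7 (see the Dockerfile), which is why it uses `astor.to_source` and the older `ast.Str` node instead.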
================================================
FILE: action.yml
================================================
name: 'Duplicate code detection tool'
description: 'Detect similarities between source code files'
inputs:
github_token:
description: 'The GitHub token'
required: true
directories:
description: 'A comma-separated list of the directories containing the source code'
required: true
ignore_directories:
description: 'A comma-separated list of directories that should be ignored'
required: false
default: ''
project_root_dir:
description: 'The relative path to filter out when reporting results'
required: false
default: './'
file_extensions:
description: 'A comma-separated list of source code file extensions to check for similarities'
required: false
default: 'h, hpp, c, cpp, cc, java, py, cs'
ignore_below:
description: 'The minimum similarity percentage to be reported'
required: false
default: 10
fail_above:
description: 'The maximum allowed similarity percentage before the action fails'
required: false
default: 100
warn_above:
description: 'The maximum allowed similarity percentage before the action warns'
required: false
default: 100
only_code:
description: "Removes comments and docstrings from the source code before analysis"
required: false
default: false
one_comment:
description: 'Duplication report will be left as a single comment, which will be updated, instead of multiple ones'
required: false
default: false
header_message_start:
description: 'The message to be displayed at the start of the duplication report.
It is used by the bot to identify previous reports and update them, so it must be unique.
If you want to use the Action in multiple steps of the same workflow,
then you can change this message in each step to avoid conflicts'
required: false
default: '## 📌 Duplicate code detection tool report'
runs:
using: 'docker'
image: 'Dockerfile'
branding:
icon: 'check'
color: 'green'
================================================
FILE: duplicate_code_detection.py
================================================
"""
A simple Python3 tool to detect similarities between files within a repository.
Document similarity code adapted from Jonathan Mugan's tutorial:
https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
"""
import os
import sys
import argparse
import gensim
import ast
import csv
import astor
import re
import tempfile
import json
from enum import Enum
from nltk.tokenize import word_tokenize
from collections import OrderedDict
source_code_file_extensions = ["h", "c", "cpp", "cc", "java", "py", "cs"]
file_column_label = "File"
file_loc_label = ",#LoC"
similarity_column_label = "Similarity (%)"
similarity_label_length = len(similarity_column_label)
loc_label = "#LoC"
similarity_label = "Similarity"
class ReturnCode(Enum):
SUCCESS = 0
BAD_INPUT = 1
THRESHOLD_EXCEEDED = 2
class CliColors:
HEADER = "\033[95m"
OKBLUE = "\033[94m"
OKGREEN = "\033[92m"
WARNING = "\033[93m"
FAIL = "\033[91m"
ENDC = "\033[0m"
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
def get_all_source_code_from_directory(directory, file_extensions):
"""Get a list with all the source code files within the directory"""
source_code_files = list()
for dirpath, _, filenames in os.walk(directory):
for name in filenames:
_, file_extension = os.path.splitext(name)
if file_extension[1:] in file_extensions:
filename = os.path.join(dirpath, name)
source_code_files.append(filename)
return source_code_files
def conditional_print(text, machine_friendly_output):
if not machine_friendly_output:
print(text)
def remove_comments_and_docstrings(source_code: str) -> str:
"""Strip comments and docstrings from source code
.. seealso::
https://gist.github.com/phpdude/1ae6f19de213d66286c8183e9e3b9ec1
:param source_code: Raw source code as a single string
:type source_code: str
:return: Stripped source code as a single string
:rtype: str
"""
parsed = ast.parse(source_code)
for node in ast.walk(parsed):
if not isinstance(
node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef, ast.Module)
):
continue
if not len(node.body):
continue
if not isinstance(node.body[0], ast.Expr):
continue
if not hasattr(node.body[0], "value") or not isinstance(
node.body[0].value, ast.Str
):
continue
node.body = node.body[1:]
source_code_clean = astor.to_source(parsed)
return source_code_clean
def get_loc_count(file_path):
lines_count = -1
try:
with open(os.path.normpath(file_path), 'r') as the_file:
lines_count = len(the_file.readlines())
except Exception as err:
print(f"WARNING: Failed to get lines count for file {file_path}, reason: {str(err)}")
return lines_count
def get_loc_to_print(loc_count):
loc_to_print = str(loc_count) if loc_count >= 0 else ""
return loc_to_print
def main():
parser_description = (
CliColors.HEADER
+ CliColors.BOLD
+ "=== Duplicate Code Detection Tool ==="
+ CliColors.ENDC
)
parser = argparse.ArgumentParser(description=parser_description)
parser.add_argument(
"-t",
"--fail-threshold",
type=int,
default=100,
help="The maximum allowed similarity before the script exits with an error.",
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
"-d",
"--directories",
nargs="+",
help="Check for similarities between all files of the specified directories.",
)
group.add_argument(
"-f",
"--files",
nargs="+",
help="Check for similarities between specified files. \
The more files are supplied the more accurate are the results.",
)
parser.add_argument(
"--ignore-directories", nargs="+", default=list(), help="Directories to ignore."
)
parser.add_argument("--ignore-files", nargs="+", help="Files to ignore.")
    parser.add_argument(
        "-j", "--json", action="store_true", help="Print output as JSON."
    )
parser.add_argument(
"--project-root-dir",
type=str,
default=str(),
help="The relative path to the project root directory to be removed when printing out results.",
)
parser.add_argument(
"--file-extensions",
nargs="+",
default=source_code_file_extensions,
help="File extensions to check for similarities.",
)
parser.add_argument(
"--ignore-threshold",
type=int,
default=0,
help="Don't print out similarity below the ignore threshold",
)
parser.add_argument(
"--only-code",
action="store_true",
help="Removes comments and docstrings from the source code before analysis",
)
parser.add_argument(
"--csv-output",
type=str,
default=str(),
help="Outputs results as a CSV to the specified CSV path",
)
parser.add_argument(
"--show-loc",
action="store_true",
help="Add file line counts, including blank lines and comments, to all outputs.",
)
args = parser.parse_args()
result = run(
args.fail_threshold,
args.directories,
args.files,
args.ignore_directories,
args.ignore_files,
args.json,
args.project_root_dir,
args.file_extensions,
args.ignore_threshold,
args.only_code,
args.csv_output,
args.show_loc,
)
return result
def run(
fail_threshold,
directories,
files,
ignore_directories,
ignore_files,
json_output,
project_root_dir,
file_extensions,
ignore_threshold,
only_code,
csv_output,
show_loc,
):
# Determine which files to compare for similarities
source_code_files = list()
files_to_ignore = list()
if directories:
for directory in directories:
if not os.path.isdir(directory):
print("Path does not exist or is not a directory:", directory)
return (ReturnCode.BAD_INPUT, {})
source_code_files += get_all_source_code_from_directory(
directory, file_extensions
)
for directory in ignore_directories:
files_to_ignore += get_all_source_code_from_directory(
directory, file_extensions
)
else:
if len(files) < 2:
print("Too few files to compare, you need to supply at least 2")
return (ReturnCode.BAD_INPUT, {})
for supplied_file in files:
if not os.path.isfile(supplied_file):
print("Supplied file does not exist:", supplied_file)
return (ReturnCode.BAD_INPUT, {})
source_code_files = files
files_to_ignore += ignore_files if ignore_files else list()
files_to_ignore = [os.path.normpath(f) for f in files_to_ignore]
source_code_files = [os.path.normpath(f) for f in source_code_files]
source_code_files = list(set(source_code_files) - set(files_to_ignore))
if len(source_code_files) < 2:
print("Not enough source code files found")
return (ReturnCode.BAD_INPUT, {})
# Sort the sources, so the results are sorted too and are reproducible
source_code_files.sort()
source_code_files = [os.path.abspath(f) for f in source_code_files]
# Get the absolute project root directory path to remove when printing out the results
if project_root_dir:
if not os.path.isdir(project_root_dir):
print(
"The project root directory does not exist or is not a directory:",
project_root_dir,
)
return (ReturnCode.BAD_INPUT, {})
project_root_dir = os.path.abspath(project_root_dir)
project_root_dir = os.path.join(project_root_dir, "") # Add the trailing slash
# Find the largest string length to format the textual output
largest_string_length = len(
max(source_code_files, key=len).replace(project_root_dir, "")
)
# Parse the contents of all the source files
source_code = OrderedDict()
for source_code_file in source_code_files:
try:
# read file but also recover from encoding errors in source files
with open(source_code_file, "r", errors="surrogateescape") as f:
# Store source code with the file path as the key
content = f.read()
                if only_code and source_code_file.endswith(".py"):
content = remove_comments_and_docstrings(content)
source_code[source_code_file] = content
except Exception as err:
print(f"ERROR: Failed to open file {source_code_file}, reason: {str(err)}")
# Create a Similarity object of all the source code
gen_docs = [
[word.lower() for word in word_tokenize(source_code[source_file])]
for source_file in source_code
]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.Similarity(
tempfile.gettempdir() + os.sep, tf_idf[corpus], num_features=len(dictionary)
)
column_label = file_column_label
if show_loc:
column_label += file_loc_label
largest_string_length += len(file_loc_label)
exit_code = ReturnCode.SUCCESS
code_similarity = dict()
for source_file in source_code:
# Check for similarities
query_doc = [w.lower() for w in word_tokenize(source_code[source_file])]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
loc_info = ""
source_file_loc = -1
if show_loc:
source_file_loc = get_loc_count(source_file)
loc_info = "," + get_loc_to_print(source_file_loc)
short_source_file_path = source_file.replace(project_root_dir, "")
conditional_print(
"\n\n\n"
+ CliColors.HEADER
+ "Code duplication probability for "
+ short_source_file_path
+ loc_info
+ CliColors.ENDC,
json_output,
)
conditional_print(
"-" * (largest_string_length + similarity_label_length), json_output
)
conditional_print(
CliColors.BOLD
+ "%s %s"
% (column_label.center(largest_string_length), similarity_column_label)
+ CliColors.ENDC,
json_output,
)
conditional_print(
"-" * (largest_string_length + similarity_label_length), json_output
)
empty_length = 0
code_similarity[short_source_file_path] = dict()
if show_loc:
code_similarity[short_source_file_path][loc_label] = source_file_loc
empty_length = len(code_similarity[short_source_file_path])
for similarity, source in zip(sims[query_doc_tf_idf], source_code):
# Ignore similarities for the same file
if source == source_file:
continue
similarity_percentage = similarity * 100
# Ignore very low similarity
if similarity_percentage < ignore_threshold:
continue
short_source_path = source.replace(project_root_dir, "")
if show_loc:
code_similarity[short_source_file_path][short_source_path] = dict()
code_similarity[short_source_file_path][short_source_path][loc_label] = get_loc_count(
source
)
code_similarity[short_source_file_path][short_source_path][similarity_label] = round(
similarity_percentage, 2
)
else:
code_similarity[short_source_file_path][short_source_path] = round(
similarity_percentage, 2
)
if similarity_percentage > fail_threshold:
exit_code = ReturnCode.THRESHOLD_EXCEEDED
color = (
CliColors.OKGREEN
if similarity_percentage < 10
else (
CliColors.WARNING if similarity_percentage < 20 else CliColors.FAIL
)
)
info_to_print = short_source_path
if show_loc:
info_to_print += "," + get_loc_to_print(get_loc_count(source))
conditional_print(
"%s " % (info_to_print.ljust(largest_string_length))
+ color
+ "%.2f" % (similarity_percentage)
+ CliColors.ENDC,
json_output,
)
# If no similarities found for the particular file, remove it from the report
if len(code_similarity[short_source_file_path]) == empty_length:
del code_similarity[short_source_file_path]
if exit_code == ReturnCode.THRESHOLD_EXCEEDED:
conditional_print(
"Code duplication threshold exceeded. Please consult logs.", json_output
)
if json_output:
similarities_json = json.dumps(code_similarity, indent=4)
print(similarities_json)
if csv_output:
with open(csv_output, "w") as csv_file:
writer = csv.writer(csv_file)
if show_loc:
writer.writerow(["File A", "#LoC A", "File B", "#LoC B", "Similarity"])
for first_file in code_similarity:
for second_file in code_similarity[first_file]:
if second_file != loc_label:
writer.writerow(
[
first_file,
get_loc_to_print(get_loc_count(os.path.join(project_root_dir, first_file))),
second_file,
get_loc_to_print(get_loc_count(os.path.join(project_root_dir, second_file))),
code_similarity[first_file][second_file][similarity_label],
]
)
else:
writer.writerow(["File A", "File B", "Similarity"])
for first_file in code_similarity:
for second_file in code_similarity[first_file]:
writer.writerow(
[
first_file,
second_file,
code_similarity[first_file][second_file],
]
)
return (exit_code, code_similarity)
if __name__ == "__main__":
exit_code, _ = main()
sys.exit(exit_code.value)
================================================
FILE: entrypoint.sh
================================================
#!/bin/bash
set -eu
script_dir="$(dirname "$0")"
cd "$script_dir"
pull_request_id=$(cat "$GITHUB_EVENT_PATH" | jq 'if (.issue.number != null) then .issue.number else .number end')
branch_name="pull_request_branch"
if [ "$pull_request_id" == "null" ]; then
echo "Could not find a pull request ID. Is this a pull request?"
exit 1
fi
maintainer=${GITHUB_REPOSITORY%/*}
eval git clone "https://${maintainer}:${INPUT_GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" ${GITHUB_REPOSITORY}
cd "$GITHUB_REPOSITORY"
eval git config remote.origin.fetch +refs/heads/*:refs/remotes/origin/*
eval git fetch origin pull/$pull_request_id/head:$branch_name
eval git checkout $branch_name
latest_head=$(git rev-parse HEAD)
eval python3 /action/run_action.py --latest-head $latest_head --pull-request-id $pull_request_id
================================================
FILE: requirements.txt
================================================
gensim>=3.8
nltk>=3.5
astor>=0.8.1
================================================
FILE: run_action.py
================================================
#!/usr/bin/env python
import os
import sys
import json
import requests
import argparse
import duplicate_code_detection
WARNING_SUFFIX = " ⚠️"
def make_markdown_table(array):
"""Input: Python list with rows of table as lists
First element as header.
Output: String to put into a .md file
Ex Input:
[["Name", "Age", "Height"],
["Jake", 20, 5'10],
["Mary", 21, 5'7]]
    Adapted from: https://gist.github.com/m0neysha/219bad4b02d2008e0154
"""
markdown = "\n" + str("| ")
for e in array[0]:
to_add = " " + str(e) + str(" |")
markdown += to_add
markdown += "\n"
markdown += "|"
for i in range(len(array[0])):
markdown += str("-------------- | ")
markdown += "\n"
markdown_characters = 0
max_characters = 65000
for entry in array[1:]:
markdown += str("| ")
for e in entry:
to_add = str(e) + str(" | ")
markdown += to_add
markdown += "\n"
markdown_characters += len(markdown)
if markdown_characters > max_characters:
markdown += "\n" + WARNING_SUFFIX + " "
markdown += "Results were omitted because the report was too large. "
markdown += "Please consider ignoring results below a certain threshold.\n"
break
return markdown + "\n"
def get_markdown_link(file, url):
return "[%s](%s%s)" % (file, url, file)
def get_warning(similarity, warn_threshold):
return (
str(similarity)
if similarity < int(warn_threshold)
else str(similarity) + WARNING_SUFFIX
)
def similarities_to_markdown(similarities, url_prefix, warn_threshold):
markdown = str()
for checked_file in similarities.keys():
markdown += "<details><summary>%s</summary>\n\n" % checked_file
markdown += "### 📄 %s\n" % get_markdown_link(checked_file, url_prefix)
table_header = ["File", "Similarity (%)"]
table_contents = [
[get_markdown_link(f, url_prefix), get_warning(s, warn_threshold)]
for (f, s) in similarities[checked_file].items()
]
# Sort table contents based on similarity
table_contents.sort(
reverse=True, key=lambda row: float(row[1].replace(WARNING_SUFFIX, ""))
)
entire_table = [[] for _ in range(len(table_contents) + 1)]
entire_table[0] = table_header
for i in range(1, len(table_contents) + 1):
entire_table[i] = table_contents[i - 1]
markdown += make_markdown_table(entire_table)
markdown += "</details>\n"
return markdown
def split_and_trim(input_list):
return [token.strip() for token in input_list.split(",")]
def to_absolute_path(paths):
return [os.path.abspath(path) for path in paths]
def main():
parser = argparse.ArgumentParser(
description="Duplicate code detection action runner"
)
parser.add_argument(
"--latest-head",
type=str,
default="master",
help="The latest commit hash or branch",
)
parser.add_argument(
"--pull-request-id", type=str, required=True, help="The pull request id"
)
args = parser.parse_args()
fail_threshold = os.environ.get("INPUT_FAIL_ABOVE")
directories = os.environ.get("INPUT_DIRECTORIES")
ignore_directories = os.environ.get("INPUT_IGNORE_DIRECTORIES")
project_root_dir = os.environ.get("INPUT_PROJECT_ROOT_DIR")
file_extensions = os.environ.get("INPUT_FILE_EXTENSIONS")
ignore_threshold = os.environ.get("INPUT_IGNORE_BELOW")
only_code = os.environ.get("INPUT_ONLY_CODE")
directories_list = split_and_trim(directories)
directories_list = to_absolute_path(directories_list)
ignore_directories_list = (
split_and_trim(ignore_directories) if ignore_directories != "" else list()
)
ignore_directories_list = to_absolute_path(ignore_directories_list)
file_extensions_list = split_and_trim(file_extensions)
project_root_dir = os.path.abspath(project_root_dir)
files_list = None
ignore_files_list = None
json_output = True
csv_output_path = "" # No CSV output by default for now in GitHub Actions
show_loc = False
detection_result, code_similarity = duplicate_code_detection.run(
int(fail_threshold),
directories_list,
files_list,
ignore_directories_list,
ignore_files_list,
json_output,
project_root_dir,
file_extensions_list,
int(ignore_threshold),
bool(only_code),
csv_output_path,
show_loc,
)
if detection_result == duplicate_code_detection.ReturnCode.BAD_INPUT:
print("Action aborted due to bad user input")
return detection_result.value
elif detection_result == duplicate_code_detection.ReturnCode.THRESHOLD_EXCEEDED:
print(
"Action failed due to maximum similarity threshold exceeded, check the report"
)
repo = os.environ.get("GITHUB_REPOSITORY")
files_url_prefix = "https://github.com/%s/blob/%s/" % (repo, args.latest_head)
warn_threshold = os.environ.get("INPUT_WARN_ABOVE")
header_message_start = os.environ.get("INPUT_HEADER_MESSAGE_START") + "\n"
message = header_message_start
message += "The [tool](https://github.com/platisd/duplicate-code-detection-tool)"
message += " analyzed your source code and found the following degree of"
message += " similarity between the files:\n"
message += similarities_to_markdown(
code_similarity, files_url_prefix, warn_threshold
)
github_token = os.environ.get("INPUT_GITHUB_TOKEN")
github_api_url = os.environ.get("GITHUB_API_URL")
request_url = "%s/repos/%s/issues/%s/comments" % (
github_api_url,
repo,
args.pull_request_id,
)
headers = {
"Authorization": "token %s" % github_token,
}
report = {"body": message}
update_existing_comment = os.environ.get("INPUT_ONE_COMMENT", "false").lower() in (
"true",
"1",
)
comment_updated = False
if update_existing_comment:
# If the bot has posted many comments, update the last one
pr_comments = requests.get(request_url, headers=headers).json()
for pr_comment in pr_comments[::-1]:
if pr_comment["body"].startswith(header_message_start):
update_result = requests.patch(
pr_comment["url"],
json=report,
headers=headers,
)
if update_result.status_code != 200:
print(
"Updating existing comment failed with code: "
+ str(update_result.status_code)
)
print(update_result.text)
print("Attempting to post a new comment instead")
else:
comment_updated = True
break
if not comment_updated:
post_result = requests.post(
request_url,
json=report,
headers=headers,
)
if post_result.status_code != 201:
print(
"Posting results to GitHub failed with code: "
+ str(post_result.status_code)
)
print(post_result.text)
with open("message.md", "w") as f:
f.write(message)
return detection_result.value
if __name__ == "__main__":
sys.exit(main())
================================================
FILE: setup.py
================================================
from setuptools import setup
setup(
name='duplicate code detection tool',
entry_points={
'console_scripts': ['duplicate-code-detection=duplicate_code_detection:main']
},
py_modules=['duplicate_code_detection'],
package_dir={
'duplicate_code_detection': '.',
},
install_requires=[
'gensim>=3.8',
'nltk>=3.5',
'astor>=0.8.1'
],
setuptools_git_versioning={
"enabled": True,
},
setup_requires=["setuptools-git-versioning<2"],
)