Repository: platisd/duplicate-code-detection-tool
Branch: master
Commit: f0413b1571a9
Files: 12
Total size: 34.3 KB
Directory structure:
gitextract_hdgkqkk1/
├── .github/
│ └── workflows/
│ └── duplicate-code-detection.yml
├── .gitignore
├── .pre-commit-hooks.yaml
├── Dockerfile
├── LICENSE
├── README.md
├── action.yml
├── duplicate_code_detection.py
├── entrypoint.sh
├── requirements.txt
├── run_action.py
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/duplicate-code-detection.yml
================================================
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
- name: Check for duplicate code
uses: ./
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "./"
          # Only examine .py files
file_extensions: "py"
ignore_below: 1
fail_above: 70
warn_above: 15
one_comment: true
================================================
FILE: .gitignore
================================================
.vscode
================================================
FILE: .pre-commit-hooks.yaml
================================================
- id: duplicate-code-detection
name: Detect duplicate code
description: This hook will run duplicate code detection.
entry: duplicate-code-detection -f
language: python
types: [text]
================================================
FILE: Dockerfile
================================================
FROM python:3.7-slim
RUN apt-get update
RUN apt-get -y install git jq
COPY duplicate_code_detection.py requirements.txt run_action.py entrypoint.sh /action/
RUN pip3 install -r /action/requirements.txt requests && \
python3 -c "import nltk; nltk.download('punkt')" && \
ln -s /root/nltk_data /usr/local/nltk_data
ENTRYPOINT ["/action/entrypoint.sh"]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2018 Dimitris Platis
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Duplicate Code Detection Tool
A simple Python3 tool (also available as a [GitHub Action](#github-action)) to detect
similarities between files within a repository.
## What?
A command line tool that receives a directory or a list of files and determines
the degree of similarity between them.
## Why?
The tool is intended to guide the refactoring efforts of a developer who wishes
to reduce code duplication within a component and improve its software
architecture.
Its development was initiated within the context of the
[DAT265 - Software Evolution Project](https://pingpong.chalmers.se/public/courseId/9754/lang-en/publicPage.do).
## How?
The tool uses the [gensim](https://radimrehurek.com/gensim/) Python library to
determine the similarity between source code files, supplied by the user.
The default supported languages are C, C++, Java, Python, and C#.
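The approach can be sketched without gensim: tokenize each file, weight tokens with TF-IDF, and score file pairs by cosine similarity. Below is a minimal, dependency-free illustration of that idea (not the tool's actual implementation, which uses gensim's `TfidfModel` and `Similarity` classes):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (token -> weight) for each token list."""
    n = len(docs)
    # Document frequency: in how many documents each token occurs
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.hypot(*a.values()) * math.hypot(*b.values())
    return dot / norm if norm else 0.0

# Three tiny "files" as token lists; the first two are near-duplicates
docs = [
    "def add ( a , b ) : return a + b".split(),
    "def add ( x , y ) : return x + y".split(),
    "print ( 'hello' )".split(),
]
vecs = tfidf_vectors(docs)
print("similar pair:   %.2f%%" % (100 * cosine(vecs[0], vecs[1])))
print("unrelated pair: %.2f%%" % (100 * cosine(vecs[0], vecs[2])))
```

Because the IDF weights are estimated from the whole corpus, scores become more meaningful as more files are supplied, which is also why the tool works better on larger projects.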
### Dependencies
The following Python packages have to be installed:
* nltk
* `pip3 install --user nltk`
* gensim
* `pip3 install --user gensim`
* astor
* `pip3 install --user astor`
* punkt
* `python3 -m nltk.downloader punkt`
## Get started
Suppress the warnings generated by the underlying libraries by running the tool
as `python3 -W ignore duplicate_code_detection.py` and then supply the necessary
arguments. More details can be found by running the tool with the `--help` option.
**Notice:** Due to the way the models are created, the more source files you
provide to the tool, the more accurate the similarity calculations are. In other
words, the bigger the project, the more useful the tool is.
### Example
If `duplicate-code-detection-tool` is the directory where the tool resides and
`smartcar_shield/src` contains the source code you want to check for
similarities between its files, then you can run the following to get the
similarity report:
`python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/`
The result should look something like this:

*(screenshot of the similarity report omitted from this text extraction)*
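When run with `--json` (and without `--show-loc`), the tool prints a nested mapping of each file to its similar files and their similarity percentages. A small post-processing sketch (file names and percentages below are hypothetical):

```python
# Hypothetical report, shaped like the tool's --json output:
# { "file": { "other file": similarity percentage } }
report = {
    "src/car.cpp": {"src/truck.cpp": 42.15, "src/util.cpp": 7.30},
    "src/truck.cpp": {"src/car.cpp": 42.15, "src/util.cpp": 5.12},
}

threshold = 20  # only keep pairs above this similarity (%)
# Deduplicate (A, B)/(B, A) pairs by sorting each pair before collecting
flagged = sorted(
    {tuple(sorted((a, b)))
     for a, sims in report.items()
     for b, pct in sims.items() if pct > threshold}
)
print(flagged)  # [('src/car.cpp', 'src/truck.cpp')]
```

This is handy for wiring the report into other tooling, e.g. failing a custom CI step only for specific file pairs.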
## GitHub Action
The tool is also available as a [GitHub Action](https://docs.github.com/en/actions) for easy integration
with projects hosted on GitHub. An example output of the tool can be seen
[here](https://github.com/platisd/smartcar_shield/pull/36#issuecomment-778635111).
The Action is meant to be triggered during **pull requests** to give the developers an impression
of the **degree of similarity** between the files in the source code. Below you will find sample
workflow files that illustrate the usage.
Depending on the *size* of your project, you may want to run the tool multiple times
(i.e. in different steps), each testing a specific part of your repository for duplicate code.
This way you will not compare each file in your codebase with everything else, and you will get
more meaningful reports back.
### Bare minimum
In the following example the tool will examine source code (the languages supported by default)
in the `src/` and `test/ut` directories *relative* to the root directory of your repository.
The results will be posted as a comment in the **pull request** that was opened.
```yaml
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "src/, test/ut"
```
### Trigger on pull request comment
If you want to avoid the "spam", you can configure the tool not to run on every pull request. Specifically, if you
wish to trigger the Action manually, you can do so by leaving a comment in the pull request.
The following workflow will trigger the tool when a comment containing `run_duplicate_code_detection_tool`
is posted in a pull request. The tool will run using the code in the pull request.
```yaml
name: Duplicate code
on: issue_comment
jobs:
duplicate-code-check:
name: Check for duplicate code
# Trigger the tool only when a comment containing the keyword is published in a pull request
if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection_tool')
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "."
```
**Important:** Please note that due to the way GitHub Actions work, you will *first* have to merge this workflow into your main
branch before it starts taking effect.
### Optional configuration
It may not make sense to compare all files, or to have files with very low similarity reported.
In the following workflow, the different *optional* arguments are demonstrated.
For the various default values, please consult [action.yml](action.yml).
```yaml
name: Duplicate code
on: pull_request
jobs:
duplicate-code-check:
name: Check for duplicate code
runs-on: ubuntu-20.04
steps:
- name: Check for duplicate code
uses: platisd/duplicate-code-detection-tool@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
directories: "src"
# Ignore the specified directories
ignore_directories: "src/external_libraries"
# Only examine .h and .cpp files
file_extensions: "h, cpp"
# Only report similarities above 5%
ignore_below: 5
# If a file is more than 70% similar to another, then the job fails
fail_above: 70
# If a file is more than 15% similar to another, show a warning symbol in the report
warn_above: 15
# Remove `src/` from the file paths when reporting similarities
project_root_dir: "src"
# Remove docstrings from code before analysis
# For python source code only. This is checked on a per-file basis
only_code: true
# Leave only one comment with the report and update it for consecutive runs
one_comment: true
# The message to be displayed at the start of the report
header_message_start: "The following files have a similarity above the threshold:"
```
## Using duplicate-code-check with pre-commit
To use Duplicate Code Detection Tool as a pre-commit hook with [pre-commit](https://pre-commit.com/) add the following to your `.pre-commit-config.yaml` file:
```yaml
- repo: https://github.com/platisd/duplicate-code-detection-tool.git
rev: '' # Use the sha / tag you want to point at
hooks:
- id: duplicate-code-detection
```
> **_NOTE:_** This repository's hook already sets the `-f` argument. If you configure duplicate-code-detection-tool with your own `args`, you'll want to include either `-f` (`--files`) or `-d` (`--directories`).
## Limitations
- The `only_code` option only works with Python files for now
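The docstring stripping behind `only_code` can be sketched with the standard library alone. This is a simplified variant of the tool's `remove_comments_and_docstrings`, using `ast.unparse` (Python 3.9+) instead of `astor`; comments are dropped automatically because `ast.parse` discards them:

```python
import ast

def strip_docstrings(source: str) -> str:
    """Return source with module/class/function docstrings removed.

    Simplified sketch; requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(
            node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        ):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Drop the docstring; keep the body valid with a `pass` if needed
            node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)

print(strip_docstrings('def f():\n    """A docstring."""\n    return 1\n'))
```

The tool itself targets Python 3.7 (see the Dockerfile), which is why it uses `astor.to_source` and the older `ast.Str` node instead.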
================================================
FILE: action.yml
================================================
name: 'Duplicate code detection tool'
description: 'Detect similarities between source code files'
inputs:
github_token:
description: 'The GitHub token'
required: true
directories:
description: 'A comma-separated list of the directories containing the source code'
required: true
ignore_directories:
description: 'A comma-separated list of directories that should be ignored'
required: false
default: ''
project_root_dir:
description: 'The relative path to filter out when reporting results'
required: false
default: './'
file_extensions:
description: 'A comma-separated list of source code file extensions to check for similarities'
required: false
default: 'h, hpp, c, cpp, cc, java, py, cs'
ignore_below:
description: 'The minimum similarity percentage to be reported'
required: false
default: 10
fail_above:
description: 'The maximum allowed similarity percentage before the action fails'
required: false
default: 100
warn_above:
description: 'The maximum allowed similarity percentage before the action warns'
required: false
default: 100
only_code:
description: "Removes comments and docstrings from the source code before analysis"
required: false
default: false
one_comment:
description: 'Duplication report will be left as a single comment, which will be updated, instead of multiple ones'
required: false
default: false
header_message_start:
description: 'The message to be displayed at the start of the duplication report.
It is used by the bot to identify previous reports and update them, so it must be unique.
If you want to use the Action in multiple steps of the same workflow,
then you can change this message in each step to avoid conflicts'
required: false
default: '## 📌 Duplicate code detection tool report'
runs:
using: 'docker'
image: 'Dockerfile'
branding:
icon: 'check'
color: 'green'
================================================
FILE: duplicate_code_detection.py
================================================
"""
A simple Python3 tool to detect similarities between files within a repository.
Document similarity code adapted from Jonathan Mugan's tutorial:
https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
"""
import os
import sys
import argparse
import gensim
import ast
import csv
import astor
import re
import tempfile
import json
from enum import Enum
from nltk.tokenize import word_tokenize
from collections import OrderedDict
source_code_file_extensions = ["h", "c", "cpp", "cc", "java", "py", "cs"]
file_column_label = "File"
file_loc_label = ",#LoC"
similarity_column_label = "Similarity (%)"
similarity_label_length = len(similarity_column_label)
loc_label = "#LoC"
similarity_label = "Similarity"
class ReturnCode(Enum):
SUCCESS = 0
BAD_INPUT = 1
THRESHOLD_EXCEEDED = 2
class CliColors:
HEADER = "\033[95m"
OKBLUE = "\033[94m"
OKGREEN = "\033[92m"
WARNING = "\033[93m"
FAIL = "\033[91m"
ENDC = "\033[0m"
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
def get_all_source_code_from_directory(directory, file_extensions):
"""Get a list with all the source code files within the directory"""
source_code_files = list()
for dirpath, _, filenames in os.walk(directory):
for name in filenames:
_, file_extension = os.path.splitext(name)
if file_extension[1:] in file_extensions:
filename = os.path.join(dirpath, name)
source_code_files.append(filename)
return source_code_files
def conditional_print(text, machine_friendly_output):
if not machine_friendly_output:
print(text)
def remove_comments_and_docstrings(source_code: str) -> str:
"""Strip comments and docstrings from source code
.. seealso::
https://gist.github.com/phpdude/1ae6f19de213d66286c8183e9e3b9ec1
:param source_code: Raw source code as a single string
:type source_code: str
:return: Stripped source code as a single string
:rtype: str
"""
parsed = ast.parse(source_code)
for node in ast.walk(parsed):
if not isinstance(
node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef, ast.Module)
):
continue
if not len(node.body):
continue
if not isinstance(node.body[0], ast.Expr):
continue
if not hasattr(node.body[0], "value") or not isinstance(
node.body[0].value, ast.Str
):
continue
node.body = node.body[1:]
source_code_clean = astor.to_source(parsed)
return source_code_clean
def get_loc_count(file_path):
lines_count = -1
try:
with open(os.path.normpath(file_path), 'r') as the_file:
lines_count = len(the_file.readlines())
except Exception as err:
print(f"WARNING: Failed to get lines count for file {file_path}, reason: {str(err)}")
return lines_count
def get_loc_to_print(loc_count):
loc_to_print = str(loc_count) if loc_count >= 0 else ""
return loc_to_print
def main():
parser_description = (
CliColors.HEADER
+ CliColors.BOLD
+ "=== Duplicate Code Detection Tool ==="
+ CliColors.ENDC
)
parser = argparse.ArgumentParser(description=parser_description)
parser.add_argument(
"-t",
"--fail-threshold",
type=int,
default=100,
help="The maximum allowed similarity before the script exits with an error.",
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
"-d",
"--directories",
nargs="+",
help="Check for similarities between all files of the specified directories.",
)
group.add_argument(
"-f",
"--files",
nargs="+",
help="Check for similarities between specified files. \
The more files are supplied the more accurate are the results.",
)
parser.add_argument(
"--ignore-directories", nargs="+", default=list(), help="Directories to ignore."
)
parser.add_argument("--ignore-files", nargs="+", help="Files to ignore.")
    parser.add_argument(
        "-j", "--json", action="store_true", help="Print output as JSON."
    )
parser.add_argument(
"--project-root-dir",
type=str,
default=str(),
help="The relative path to the project root directory to be removed when printing out results.",
)
parser.add_argument(
"--file-extensions",
nargs="+",
default=source_code_file_extensions,
help="File extensions to check for similarities.",
)
parser.add_argument(
"--ignore-threshold",
type=int,
default=0,
help="Don't print out similarity below the ignore threshold",
)
parser.add_argument(
"--only-code",
action="store_true",
help="Removes comments and docstrings from the source code before analysis",
)
parser.add_argument(
"--csv-output",
type=str,
default=str(),
help="Outputs results as a CSV to the specified CSV path",
)
parser.add_argument(
"--show-loc",
action="store_true",
help="Add file line counts, including blank lines and comments, to all outputs.",
)
args = parser.parse_args()
result = run(
args.fail_threshold,
args.directories,
args.files,
args.ignore_directories,
args.ignore_files,
args.json,
args.project_root_dir,
args.file_extensions,
args.ignore_threshold,
args.only_code,
args.csv_output,
args.show_loc,
)
return result
def run(
fail_threshold,
directories,
files,
ignore_directories,
ignore_files,
json_output,
project_root_dir,
file_extensions,
ignore_threshold,
only_code,
csv_output,
show_loc,
):
# Determine which files to compare for similarities
source_code_files = list()
files_to_ignore = list()
if directories:
for directory in directories:
if not os.path.isdir(directory):
print("Path does not exist or is not a directory:", directory)
return (ReturnCode.BAD_INPUT, {})
source_code_files += get_all_source_code_from_directory(
directory, file_extensions
)
for directory in ignore_directories:
files_to_ignore += get_all_source_code_from_directory(
directory, file_extensions
)
else:
if len(files) < 2:
print("Too few files to compare, you need to supply at least 2")
return (ReturnCode.BAD_INPUT, {})
for supplied_file in files:
if not os.path.isfile(supplied_file):
print("Supplied file does not exist:", supplied_file)
return (ReturnCode.BAD_INPUT, {})
source_code_files = files
files_to_ignore += ignore_files if ignore_files else list()
files_to_ignore = [os.path.normpath(f) for f in files_to_ignore]
source_code_files = [os.path.normpath(f) for f in source_code_files]
source_code_files = list(set(source_code_files) - set(files_to_ignore))
if len(source_code_files) < 2:
print("Not enough source code files found")
return (ReturnCode.BAD_INPUT, {})
# Sort the sources, so the results are sorted too and are reproducible
source_code_files.sort()
source_code_files = [os.path.abspath(f) for f in source_code_files]
# Get the absolute project root directory path to remove when printing out the results
if project_root_dir:
if not os.path.isdir(project_root_dir):
print(
"The project root directory does not exist or is not a directory:",
project_root_dir,
)
return (ReturnCode.BAD_INPUT, {})
project_root_dir = os.path.abspath(project_root_dir)
project_root_dir = os.path.join(project_root_dir, "") # Add the trailing slash
# Find the largest string length to format the textual output
largest_string_length = len(
max(source_code_files, key=len).replace(project_root_dir, "")
)
# Parse the contents of all the source files
source_code = OrderedDict()
for source_code_file in source_code_files:
try:
# read file but also recover from encoding errors in source files
with open(source_code_file, "r", errors="surrogateescape") as f:
# Store source code with the file path as the key
content = f.read()
                if only_code and source_code_file.endswith(".py"):
content = remove_comments_and_docstrings(content)
source_code[source_code_file] = content
except Exception as err:
print(f"ERROR: Failed to open file {source_code_file}, reason: {str(err)}")
# Create a Similarity object of all the source code
gen_docs = [
[word.lower() for word in word_tokenize(source_code[source_file])]
for source_file in source_code
]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.Similarity(
tempfile.gettempdir() + os.sep, tf_idf[corpus], num_features=len(dictionary)
)
column_label = file_column_label
if show_loc:
column_label += file_loc_label
largest_string_length += len(file_loc_label)
exit_code = ReturnCode.SUCCESS
code_similarity = dict()
for source_file in source_code:
# Check for similarities
query_doc = [w.lower() for w in word_tokenize(source_code[source_file])]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
loc_info = ""
source_file_loc = -1
if show_loc:
source_file_loc = get_loc_count(source_file)
loc_info = "," + get_loc_to_print(source_file_loc)
short_source_file_path = source_file.replace(project_root_dir, "")
conditional_print(
"\n\n\n"
+ CliColors.HEADER
+ "Code duplication probability for "
+ short_source_file_path
+ loc_info
+ CliColors.ENDC,
json_output,
)
conditional_print(
"-" * (largest_string_length + similarity_label_length), json_output
)
conditional_print(
CliColors.BOLD
+ "%s %s"
% (column_label.center(largest_string_length), similarity_column_label)
+ CliColors.ENDC,
json_output,
)
conditional_print(
"-" * (largest_string_length + similarity_label_length), json_output
)
empty_length = 0
code_similarity[short_source_file_path] = dict()
if show_loc:
code_similarity[short_source_file_path][loc_label] = source_file_loc
empty_length = len(code_similarity[short_source_file_path])
for similarity, source in zip(sims[query_doc_tf_idf], source_code):
# Ignore similarities for the same file
if source == source_file:
continue
similarity_percentage = similarity * 100
# Ignore very low similarity
if similarity_percentage < ignore_threshold:
continue
short_source_path = source.replace(project_root_dir, "")
if show_loc:
code_similarity[short_source_file_path][short_source_path] = dict()
code_similarity[short_source_file_path][short_source_path][loc_label] = get_loc_count(
source
)
code_similarity[short_source_file_path][short_source_path][similarity_label] = round(
similarity_percentage, 2
)
else:
code_similarity[short_source_file_path][short_source_path] = round(
similarity_percentage, 2
)
if similarity_percentage > fail_threshold:
exit_code = ReturnCode.THRESHOLD_EXCEEDED
color = (
CliColors.OKGREEN
if similarity_percentage < 10
else (
CliColors.WARNING if similarity_percentage < 20 else CliColors.FAIL
)
)
info_to_print = short_source_path
if show_loc:
info_to_print += "," + get_loc_to_print(get_loc_count(source))
conditional_print(
"%s " % (info_to_print.ljust(largest_string_length))
+ color
+ "%.2f" % (similarity_percentage)
+ CliColors.ENDC,
json_output,
)
# If no similarities found for the particular file, remove it from the report
if len(code_similarity[short_source_file_path]) == empty_length:
del code_similarity[short_source_file_path]
if exit_code == ReturnCode.THRESHOLD_EXCEEDED:
conditional_print(
"Code duplication threshold exceeded. Please consult logs.", json_output
)
if json_output:
similarities_json = json.dumps(code_similarity, indent=4)
print(similarities_json)
if csv_output:
with open(csv_output, "w") as csv_file:
writer = csv.writer(csv_file)
if show_loc:
writer.writerow(["File A", "#LoC A", "File B", "#LoC B", "Similarity"])
for first_file in code_similarity:
for second_file in code_similarity[first_file]:
if second_file != loc_label:
writer.writerow(
[
first_file,
get_loc_to_print(get_loc_count(os.path.join(project_root_dir, first_file))),
second_file,
get_loc_to_print(get_loc_count(os.path.join(project_root_dir, second_file))),
code_similarity[first_file][second_file][similarity_label],
]
)
else:
writer.writerow(["File A", "File B", "Similarity"])
for first_file in code_similarity:
for second_file in code_similarity[first_file]:
writer.writerow(
[
first_file,
second_file,
code_similarity[first_file][second_file],
]
)
return (exit_code, code_similarity)
if __name__ == "__main__":
exit_code, _ = main()
sys.exit(exit_code.value)
================================================
FILE: entrypoint.sh
================================================
#!/bin/bash
set -eu
script_dir="$(dirname "$0")"
cd "$script_dir"
pull_request_id=$(cat "$GITHUB_EVENT_PATH" | jq 'if (.issue.number != null) then .issue.number else .number end')
branch_name="pull_request_branch"
if [ "$pull_request_id" == "null" ]; then
echo "Could not find a pull request ID. Is this a pull request?"
exit 1
fi
maintainer=${GITHUB_REPOSITORY%/*}
eval git clone "https://${maintainer}:${INPUT_GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" ${GITHUB_REPOSITORY}
cd "$GITHUB_REPOSITORY"
eval git config remote.origin.fetch +refs/heads/*:refs/remotes/origin/*
eval git fetch origin pull/$pull_request_id/head:$branch_name
eval git checkout $branch_name
latest_head=$(git rev-parse HEAD)
eval python3 /action/run_action.py --latest-head $latest_head --pull-request-id $pull_request_id
================================================
FILE: requirements.txt
================================================
gensim>=3.8
nltk>=3.5
astor>=0.8.1
================================================
FILE: run_action.py
================================================
#!/usr/bin/env python
import os
import sys
import json
import requests
import argparse
import duplicate_code_detection
WARNING_SUFFIX = " ⚠️"
def make_markdown_table(array):
"""Input: Python list with rows of table as lists
First element as header.
Output: String to put into a .md file
Ex Input:
[["Name", "Age", "Height"],
["Jake", 20, 5'10],
["Mary", 21, 5'7]]
    Adapted from: https://gist.github.com/m0neysha/219bad4b02d2008e0154
"""
markdown = "\n" + str("| ")
for e in array[0]:
to_add = " " + str(e) + str(" |")
markdown += to_add
markdown += "\n"
markdown += "|"
for i in range(len(array[0])):
markdown += str("-------------- | ")
markdown += "\n"
markdown_characters = 0
max_characters = 65000
for entry in array[1:]:
markdown += str("| ")
for e in entry:
to_add = str(e) + str(" | ")
markdown += to_add
markdown += "\n"
markdown_characters += len(markdown)
if markdown_characters > max_characters:
markdown += "\n" + WARNING_SUFFIX + " "
markdown += "Results were omitted because the report was too large. "
markdown += "Please consider ignoring results below a certain threshold.\n"
break
return markdown + "\n"
def get_markdown_link(file, url):
return "[%s](%s%s)" % (file, url, file)
def get_warning(similarity, warn_threshold):
return (
str(similarity)
if similarity < int(warn_threshold)
else str(similarity) + WARNING_SUFFIX
)
def similarities_to_markdown(similarities, url_prefix, warn_threshold):
markdown = str()
for checked_file in similarities.keys():
markdown += "<details><summary>%s</summary>\n\n" % checked_file
markdown += "### 📄 %s\n" % get_markdown_link(checked_file, url_prefix)
table_header = ["File", "Similarity (%)"]
table_contents = [
[get_markdown_link(f, url_prefix), get_warning(s, warn_threshold)]
for (f, s) in similarities[checked_file].items()
]
# Sort table contents based on similarity
table_contents.sort(
reverse=True, key=lambda row: float(row[1].replace(WARNING_SUFFIX, ""))
)
entire_table = [[] for _ in range(len(table_contents) + 1)]
entire_table[0] = table_header
for i in range(1, len(table_contents) + 1):
entire_table[i] = table_contents[i - 1]
markdown += make_markdown_table(entire_table)
markdown += "</details>\n"
return markdown
def split_and_trim(input_list):
return [token.strip() for token in input_list.split(",")]
def to_absolute_path(paths):
return [os.path.abspath(path) for path in paths]
def main():
parser = argparse.ArgumentParser(
description="Duplicate code detection action runner"
)
parser.add_argument(
"--latest-head",
type=str,
default="master",
help="The latest commit hash or branch",
)
parser.add_argument(
"--pull-request-id", type=str, required=True, help="The pull request id"
)
args = parser.parse_args()
fail_threshold = os.environ.get("INPUT_FAIL_ABOVE")
directories = os.environ.get("INPUT_DIRECTORIES")
ignore_directories = os.environ.get("INPUT_IGNORE_DIRECTORIES")
project_root_dir = os.environ.get("INPUT_PROJECT_ROOT_DIR")
file_extensions = os.environ.get("INPUT_FILE_EXTENSIONS")
ignore_threshold = os.environ.get("INPUT_IGNORE_BELOW")
only_code = os.environ.get("INPUT_ONLY_CODE")
directories_list = split_and_trim(directories)
directories_list = to_absolute_path(directories_list)
ignore_directories_list = (
split_and_trim(ignore_directories) if ignore_directories != "" else list()
)
ignore_directories_list = to_absolute_path(ignore_directories_list)
file_extensions_list = split_and_trim(file_extensions)
project_root_dir = os.path.abspath(project_root_dir)
files_list = None
ignore_files_list = None
json_output = True
csv_output_path = "" # No CSV output by default for now in GitHub Actions
show_loc = False
detection_result, code_similarity = duplicate_code_detection.run(
int(fail_threshold),
directories_list,
files_list,
ignore_directories_list,
ignore_files_list,
json_output,
project_root_dir,
file_extensions_list,
int(ignore_threshold),
bool(only_code),
csv_output_path,
show_loc,
)
if detection_result == duplicate_code_detection.ReturnCode.BAD_INPUT:
print("Action aborted due to bad user input")
return detection_result.value
elif detection_result == duplicate_code_detection.ReturnCode.THRESHOLD_EXCEEDED:
print(
"Action failed due to maximum similarity threshold exceeded, check the report"
)
repo = os.environ.get("GITHUB_REPOSITORY")
files_url_prefix = "https://github.com/%s/blob/%s/" % (repo, args.latest_head)
warn_threshold = os.environ.get("INPUT_WARN_ABOVE")
header_message_start = os.environ.get("INPUT_HEADER_MESSAGE_START") + "\n"
message = header_message_start
message += "The [tool](https://github.com/platisd/duplicate-code-detection-tool)"
message += " analyzed your source code and found the following degree of"
message += " similarity between the files:\n"
message += similarities_to_markdown(
code_similarity, files_url_prefix, warn_threshold
)
github_token = os.environ.get("INPUT_GITHUB_TOKEN")
github_api_url = os.environ.get("GITHUB_API_URL")
request_url = "%s/repos/%s/issues/%s/comments" % (
github_api_url,
repo,
args.pull_request_id,
)
headers = {
"Authorization": "token %s" % github_token,
}
report = {"body": message}
update_existing_comment = os.environ.get("INPUT_ONE_COMMENT", "false").lower() in (
"true",
"1",
)
comment_updated = False
if update_existing_comment:
# If the bot has posted many comments, update the last one
pr_comments = requests.get(request_url, headers=headers).json()
for pr_comment in pr_comments[::-1]:
if pr_comment["body"].startswith(header_message_start):
update_result = requests.patch(
pr_comment["url"],
json=report,
headers=headers,
)
if update_result.status_code != 200:
print(
"Updating existing comment failed with code: "
+ str(update_result.status_code)
)
print(update_result.text)
print("Attempting to post a new comment instead")
else:
comment_updated = True
break
if not comment_updated:
post_result = requests.post(
request_url,
json=report,
headers=headers,
)
if post_result.status_code != 201:
print(
"Posting results to GitHub failed with code: "
+ str(post_result.status_code)
)
print(post_result.text)
with open("message.md", "w") as f:
f.write(message)
return detection_result.value
if __name__ == "__main__":
sys.exit(main())
================================================
FILE: setup.py
================================================
from setuptools import setup
setup(
name='duplicate code detection tool',
entry_points={
'console_scripts': ['duplicate-code-detection=duplicate_code_detection:main']
},
py_modules=['duplicate_code_detection'],
package_dir={
'duplicate_code_detection': '.',
},
install_requires=[
'gensim>=3.8',
'nltk>=3.5',
'astor>=0.8.1'
],
setuptools_git_versioning={
"enabled": True,
},
setup_requires=["setuptools-git-versioning<2"],
)