Repository: Kalebu/Plagiarism-checker-Python
Branch: master
Commit: 3fb55b90c6cc
Files: 6
Total size: 3.9 KB

Directory structure:
gitextract_73gmgpcf/

├── README.md
├── app.py
├── fatma.txt
├── john.txt
├── juma.txt
└── requirements.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Plagiarism-checker-Python

This repo contains the source code of a Python script that detects plagiarism in text documents using **cosine similarity**.

[![Become a patron](pictures/become_a_patron_button.png)](https://www.patreon.com/kalebujordan)

## How is it Done?

You might be wondering how plagiarism detection on textual data is done. It isn't as complicated as you may think.

Computers are good with numbers, so to compute the similarity between two text documents, the raw text is first transformed into vectors (arrays of numbers). Basic vector math, here the cosine of the angle between two vectors, then gives the similarity between them.

This repo contains a basic example of how to do that.
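
To make the idea concrete, here is a minimal sketch in pure Python. It assumes plain word counts as the vector representation, which is a simplification: `app.py` uses TF-IDF weights from scikit-learn instead. The helper names `to_counts` and `cosine` are made up for this illustration.

```python
import math
from collections import Counter


def to_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine(counts_a, counts_b):
    """Cosine similarity of two word-count vectors stored as Counters."""
    # Dot product: only words present in both documents contribute.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty document as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = "life is short"
doc_b = "life is long"
score = cosine(to_counts(doc_a), to_counts(doc_b))
print(round(score, 4))  # 0.6667: two of the three words are shared
```

A score of 1.0 means the two vectors point in the same direction (same words in the same proportions), and 0.0 means the documents share no words at all.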


## Getting Started

To get started with the code in this repo, either *clone* or *download* it to your machine as shown below:

```bash
git clone https://github.com/Kalebu/Plagiarism-checker-Python
```

## Dependencies

Before you begin playing with the source code, install the dependencies as shown below:

```bash
pip3 install -r requirements.txt
```

## Running the App

To run this code, your text documents must be in the project directory and have the **.txt** extension. When you run the script, it automatically loads all the documents with that extension and computes the similarity between each pair, as shown below:

```bash
$ cd Plagiarism-checker-Python
$ python3 app.py
('john.txt', 'juma.txt', 0.5465972177348937)
('fatma.txt', 'john.txt', 0.14806887549598566)
('fatma.txt', 'juma.txt', 0.18643448370323362)
```
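
Each printed tuple has the shape `(file_a, file_b, score)`. As a rough sketch of how you might act on those scores, the snippet below keeps only the pairs at or above a cut-off and lists the most similar first. The `flag_pairs` helper and the `THRESHOLD` value of 0.5 are assumptions for illustration; they are not part of `app.py`, and a sensible cut-off depends on your own documents.

```python
# Tuples in the same (file_a, file_b, score) shape that app.py prints.
results = {
    ('john.txt', 'juma.txt', 0.5465972177348937),
    ('fatma.txt', 'john.txt', 0.14806887549598566),
    ('fatma.txt', 'juma.txt', 0.18643448370323362),
}

THRESHOLD = 0.5  # hypothetical cut-off, tune it for your own documents


def flag_pairs(results, threshold):
    """Return pairs scoring at or above threshold, highest score first."""
    flagged = [row for row in results if row[2] >= threshold]
    return sorted(flagged, key=lambda row: row[2], reverse=True)


for file_a, file_b, score in flag_pairs(results, THRESHOLD):
    print(f"{file_a} <-> {file_b}: {score:.2f}")
# prints: john.txt <-> juma.txt: 0.55
```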

## A Python Library?

Would you rather use a Python library to compare strings and documents without writing the vectorizers yourself? Then take a look at [Pysimilar](https://github.com/Kalebu/pysimilar).

## Explore it

Explore it and adapt it to your own use case. If you have any questions, feel free to reach me directly at *isaackeinstein@gmail.com*.

## Issues

If you run into any difficulties or issues while trying to run the script, you can raise an issue.

## Pull Requests

If you have something to add, I welcome pull requests with improvements; helpful contributions will be merged as soon as possible.

## Give it a Star

If you find this repo useful, give it a star so that more people can get to know it.

## Credits

All the credit goes to [kalebu](https://github.com/kalebu).


================================================
FILE: app.py
================================================
import os
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# requirements.txt also ends in .txt, so skip it explicitly; otherwise it
# would be scored as if it were a student document.
EXCLUDED_FILES = {'requirements.txt'}

# Sorting makes the pairing order deterministic across platforms.
student_files = sorted(doc for doc in os.listdir()
                       if doc.endswith('.txt') and doc not in EXCLUDED_FILES)


def read_text(path):
    """Return the contents of a UTF-8 text file, closing it afterwards."""
    with open(path, encoding='utf-8') as handle:
        return handle.read()


student_notes = [read_text(_file) for _file in student_files]


def vectorize(texts):
    """Turn a list of documents into dense TF-IDF vectors."""
    return TfidfVectorizer().fit_transform(texts).toarray()


def similarity(doc1, doc2):
    """Return the 2x2 cosine-similarity matrix of two vectors."""
    return cosine_similarity([doc1, doc2])


vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))


def check_plagiarism():
    """Score every unordered pair of documents exactly once."""
    results = set()
    pairs = combinations(s_vectors, 2)
    for (student_a, vector_a), (student_b, vector_b) in pairs:
        # [0][1] is the off-diagonal entry: doc1 compared against doc2.
        sim_score = similarity(vector_a, vector_b)[0][1]
        results.add((student_a, student_b, sim_score))
    return results


for data in check_plagiarism():
    print(data)


================================================
FILE: fatma.txt
================================================
Life is all about doing your best in trying to
find what works out for you and taking most time in
trying to pursue those skills 

================================================
FILE: john.txt
================================================
Life is all about finding money and spending on luxury stuffs
Coz this life is kinda short , trust 

================================================
FILE: juma.txt
================================================
Life to me is about finding money and use it on things that makes you happy
coz this life is kinda short 

================================================
FILE: requirements.txt
================================================
scikit_learn==0.24.2