Full Code of xnl-h4ck3r/urless for AI

Repository: xnl-h4ck3r/urless
Branch: main
Commit: e9bfa484ea6e
Files: 7
Total size: 67.9 KB

Directory structure:
gitextract_gotbcstc/

├── .gitignore
├── CHANGELOG.md
├── README.md
├── config.yml
├── setup.py
└── urless/
    ├── __init__.py
    └── urless.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
build/
dist/
urless.egg-info
__pycache__
test.txt

================================================
FILE: CHANGELOG.md
================================================
## Changelog

- v2.7

  - New
    - If the `config.yml` file is not found in the expected config directory (e.g. `~/.config/urless/` on Linux or `%APPDATA%/urless/` on Windows), it will be automatically created with default values. This fixes the issue where installing with `pipx` did not create the `config.yml` file.
    - Suppresses the warning about `requests` not being able to import `urllib3`.

- v2.6

  - Changed

    - BUG FIX: Correct the typo `js.ko` to `ja,ko` in `LANGUAGE` within `config.yml` and `DEFAULT_LANGUAGE` within `urless.py`
    - Set `DEFAULT_REMOVE_PARAMS` in `urless.py` and `REMOVE_PARAMS` in `config.yml` to `_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name`. There was a mismatch between the two files. Also, the Google Analytics parameters should be removed by default.

- v2.5

  - Changed

    - Fix the issue of it saying the version is outdated when it is the latest version.
    - Applied black code formatting to `__init__.py`, `setup.py`, and `urless.py` to ensure consistent code style.

- v2.4

  - Changed

    - Various optimizations to improve performance, e.g. Pre-compiled Regular Expressions, Optimized Extension Filtering and Memory-Efficient File Processing.

- v2.3

  - Fixed

    - Remove TTY-gating that silences output in non-TTY environments like Docker, CI, or cron jobs. The --no-banner flag and -o/--output already provide users control over output, so the extra TTY checks only broke non-interactive usage. Thanks to [@tavgar](https://github.com/tavgar) for the fix in [PR #15](https://github.com/xnl-h4ck3r/urless/pull/15).

- v2.2

  - New

    - Add argument `-c`/`--config` to specify a path to a custom `config.yml` file. This resolves [Issue 9](https://github.com/xnl-h4ck3r/urless/issues/9).
    - Add argument `-dp`/`--disregard-params`. There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. This resolves [Issue 11](https://github.com/xnl-h4ck3r/urless/issues/11) and [Issue 12](https://github.com/xnl-h4ck3r/urless/issues/12).

  - Changed

    - The description for argument `-khw`/`--keep-human-written` says `By default, any URL with a path part that contains 3 or more dashes (-) are removed` but this will be corrected to `contains more than 3 dashes`.
    - Correct the description for argument `-kym`/`--keep-yyyymm` on the `-h` output and `README.md`. It says `By default, any URL with a path containing 3 /YYYY/MM` but the `3` should be removed.

- v2.1

  - New

    - Add `long_description_content_type` to `setup.py` to upload to PyPi
    - Add `urless` to `PyPi` so can be installed with `pip install urless`

- v2.0

  - New

    - Add `REMOVE_PARAMS` to `config.yml`. This will be a comma separated list of case sensitive parameter names that you want removed completely from URLs. This can be useful to remove cache buster parameters, so will default to `cachebuster,cacheBuster` to show examples.
    - Add arg `-rp`/`--remove-params` which can be used to pass a comma separated list of parameter names to remove from URLs. This will override the `REMOVE_PARAMS` list in `config.yml`.
    - Show the current version of the tool in the banner, and whether it is the latest, or outdated.
    - Add arg `--version` to show the current version of the tool.
    - When installing `urless`, if the `config.yml` already exists then it will keep that one and create `config.yml.NEW` in case you need to replace the old config.

  - Changed

    - Fix a bug that meant defaults were not set correctly if `config.yml` keys are missing.

- v1.3

  - New

    - Add argument `-fnp`/`--fragment-not-param`. If passed the URL fragments `#` will NOT be treated in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link.

- v1.2

  - Changed

    - Changes to prevent `SyntaxWarning: invalid escape sequence` errors when Python 3.12 is used.

- v1.1

  - Changed

    - Add support to automatically identify file encoding.

- v1.0

  - Changed

    - Add support for quick install using pip or pipx.

- v0.9

  - Changed

    - Add i18N language codes `gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru`

- v0.8

  - New

    - Add `DEFAULT_LANGUAGE` constant and `LANGUAGE` key in `config.yml` with the most common language codes: `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,js.ko`
    - Add `-lang`/`--language` argument. If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` key of `config.yml`

  - Changed

    - A URL can have a GUID, Integer, CustomID and Language Code in the same URL and be de-cluttered properly.
    - If the Custom Regex ID doesn't start with `^` and end in `$`, those will be added.
    - Fix bug where it added the last occurrence of a regex pattern instead of the first.
    - Simplify the code in `processUrl` and `createPattern` functions... I had some strange logic that was unnecessary!
    - Make sure case is ignored when any `FILTER_EXTENSIONS` in `config.yml` or passed with `-fe` are compared with input.

- v0.7

  - New

    - Add `-rcid` / `--regex-custom-id` argument to provide a regex expression for a Custom ID that your target uses.
    - Add `-nb` / `--no-banner` argument to hide the tool banner. This is only needed if you are not piping input to `urless`.
    - Add `-khw` / `--keep-human-written` argument to prevent URLs with a path part that contains 3 or more dashes (-) from being removed (e.g. blog post). These are normally removed by default.
    - Add `-kym` / `--keep-yyyymm` argument to prevent URLs with a path part that contains a year and month in the format `/YYYY/MM` (e.g. blog or news) from being removed. These are normally removed by default.
    - Add `-iq` / `--ignore-querystring` argument to remove the query string (including URL fragments `#`) so output is unique paths only.

  - Changed

    - Fix bug where `/blah/1337` was not being treated differently to `/1337` for example.
    - When a Custom ID, GUID or Integer ID is found in a URL, and only one URL from many in the same format are returned in the output, use the first ID found in the input for that ID type.

- v0.6

  - New

    - By default, a trailing `/` will be removed from the end of a URL.
    - Added new argument `-ks`/`--keep-slash` that will ensure any links that do have a trailing slash in the input will not have the slash removed in the output, and therefore there may be identical URLs output, one with and one without a trailing slash.

- v0.5

  - Changed

    - Fixed Github Issue #3 to remove port 80 and 443 correctly

- v0.4

  - Changed

    - Various bug fixes

- v0.3

  - New

    - Add an `__init__.py` file to store the version, and move the image to a separate folder to make it cleaner.

  - Changed

    - If a line in the input throws an error due to not being a valid URL when parsed, then skip it, but output an error showing the URL if the `-v` arg is passed.

- v0.2

  - Fixed the bug `ERROR matchesPatterns 1: missing ), unterminated subpattern at position 237` by escaping the regex string before searching

- v0.1

  - Initial release. Please see README.md


================================================
FILE: README.md
================================================
<center><img src="https://github.com/xnl-h4ck3r/urless/blob/main/urless/images/title.png"></center>

## About - v2.7

This is a tool used to de-clutter a list of URLs.
As a starting point, I took the amazing tool [uro](https://github.com/s0md3v/uro/) by Somdev Sangwan. But I wanted to change a few things, make some improvements (like deal with GUIDs) and make it more customizable.

## Installation

`urless` supports **Python 3**.

Install `urless` in default (global) python environment.

```bash
pip install urless
```

OR

```bash
pip install git+https://github.com/xnl-h4ck3r/urless.git -v
```

You can upgrade with

```bash
pip install --upgrade urless
```

### pipx

Quick setup in isolated python environment using [pipx](https://pypa.github.io/pipx/)

```bash
pipx install git+https://github.com/xnl-h4ck3r/urless.git
```

## Usage

| Argument | Long Argument        | Description                                                                                                                                                                                                                                                                                                                                                                                                     |
| -------- | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -i       | --input              | A file of URLs to de-clutter.                                                                                                                                                                                                                                                                                                                                                                                   |
| -o       | --output             | The output file that will contain the de-cluttered list of URLs (default: output.txt). If piped to another program, output will be written to STDOUT instead.                                                                                                                                                                                                                                                   |
| -fk      | --filter-keywords    | A comma separated list of keywords to exclude links (if there no parameters). This will override the `FILTER_KEYWORDS` list specified in config.yml                                                                                                                                                                                                                                                             |
| -fe      | --filter-extensions  | A comma separated list of file extensions to exclude. This will override the `FILTER_EXTENSIONS` list specified in `config.yml`                                                                                                                                                                                                                                                                                 |
| -rp      | --remove-params      | A comma separated list of **case senistive** parameters to remove from ALL URLs. This will override the `REMOVE_PARAMS` list specified in `config.yml`. This can be useful to remove cache buster parameters for example.\*\*                                                                                                                                                                                   |
| -ks      | --keep-slash         | A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.                                                                                                                                                                                                                                                     |
| -khw     | --keep-human-written | By default, any URL with a path part that contains more than 3 dashes (-) is removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output.                                                                                                                                                                               |
| -kym     | --keep-yyyymm        | By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) is removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.                                                                                                                                                                                      |
| -rcid    | --regex-custom-id    | **USE WITH CAUTION!** Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the section below for more details on this.                                                                                                                                                                                                                                                        |
| -iq      | --ignore-querystring | Remove the query string (including URL fragments `#`) so output is unique paths only.                                                                                                                                                                                                                                                                                                                           |
| -fnp     | --fragment-not-param | Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link. |
| -lang    | --language           | If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` section of `config.yml`.                                                                                                                                                                                                       |
| -c       | --config             | Path to the YML config file. If not passed, it looks for file `config.yml` in the default config directory, e.g. `~/.config/urless/`.                                                                                                                                                                                                                                                                           |
| -dp      | --disregard-params   | There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.                                                                                                                                                                 |
| -nb      | --no-banner          | Hides the tool banner (it is hidden by default if you pipe input to urless) output.                                                                                                                                                                                                                                                                                                                             |
|          | --version            | Show current version number.                                                                                                                                                                                                                                                                                                                                                                                    |
| -v       | --verbose            | Verbose output                                                                                                                                                                                                                                                                                                                                                                                                  |

## What does it do exactly?

You basically pass a list of URLs in (from a file, or pipe from STDIN), and get a de-cluttered file of URLs out. But in what way are they de-cluttered?
I'll explain this below, but first here are some terms that will be used:

- **FILTER-EXTENSIONS**: This refers to the list of extensions that can either be passed with `-fe`, specified with `FILTER_EXTENSIONS` in the `config.yml`, or if neither of those exist, a default list of `.css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image`.
- **FILTER-KEYWORDS**: This refers to the list of keywords that can either be passed with `-fk`, specified with `FILTER_KEYWORDS` in the `config.yml`, or if neither of those exist, a default list of `blog,article,news,bootstrap,jquery,captcha,node_modules`
- **LANGUAGE**: This refers to the list of language codes that can be specified with `LANGUAGE` in the `config.yml`, or if it doesn't exist, a default list of the most common codes `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru`
- **UNWANTED-CONTENT**:
  - A section of the URL path contains more than 3 dashes (`-`), BUT isn't a GUID. This implies human written content, e.g. `how-to-hack-the-planet`. If arg `-khw` is passed, then this won't be removed.
  - The URL contains `/YYYY/MM/`, i.e. a year and month. This is usually static content such as a blog. If arg `-kym` is passed, then this won't be removed.
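
The two **UNWANTED-CONTENT** checks can be sketched roughly as below. This is only an illustration of the rules as described above, not the tool's actual code: the function names `looks_human_written`, `has_year_month` and `is_unwanted`, and the simplified regular expressions, are assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Simplified patterns assumed for illustration; urless defines its own in urless.py
GUID_RE = re.compile(
    r"^[({]?[a-fA-F0-9]{8}-?([a-fA-F0-9]{4}-?){3}[a-fA-F0-9]{12}[})]?$"
)
YYYYMM_RE = re.compile(r"/[12][019]\d{2}/[01]\d/")


def looks_human_written(path):
    """True if any path part has more than 3 dashes and is not a GUID."""
    for part in path.split("/"):
        if part.count("-") > 3 and not GUID_RE.match(part):
            return True
    return False


def has_year_month(path):
    """True if the path contains a /YYYY/MM/ section."""
    return YYYYMM_RE.search(path) is not None


def is_unwanted(url, keep_human_written=False, keep_yyyymm=False):
    """Apply both checks; the keep_* flags mimic -khw and -kym."""
    path = urlparse(url).path
    if not keep_human_written and looks_human_written(path):
        return True
    if not keep_yyyymm and has_year_month(path):
        return True
    return False
```

With this sketch, `https://example.com/how-to-hack-the-planet` is flagged (4 dashes), while a path part that is a GUID is not, even though it also contains 4 dashes.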

Here's what happens:

- If a URL has port 80 or 443 explicitly given, then remove it from the URL (e.g. http://example.com:80/test -> http://example.com/test)
- If the URL has any **FILTER-EXTENSIONS**, it will be removed from the output.
- If the URL has NO parameters **OR** the `-dp`/`--disregard-params` argument was passed:
  - If the URL contains a **FILTER-KEYWORDS** or **UNWANTED-CONTENT**, it will be removed.
  - If the URL query string contains unwanted parameters specified in config `REMOVE_PARAMS` (or overridden with argument `-rp`/`--remove-params`), they will be removed from all URLs before processing.
  - If `-rcid`/`--regex-custom-id` is passed and the URL path contains a Custom ID, only one match to the Custom ID regex will be included if there are multiple URLs where that is the only difference.
  - If the URL path contains a GUID, only one of the GUIDs will be included if there are multiple URLs where the GUID is the only difference.
  - If the URL path contains an Integer ID, only one of the Integer IDs will be included if there are multiple URLs where the Integer ID is the only difference.
  - If the `-lang` argument is passed and the URL contains a language code (e.g. `en-gb`), only one of the language codes will be included if there are multiple URLs where the language code is different.
- Else the URL has Parameters (or a fragment `#`) **AND** the `-dp`/`--disregard-params` argument was NOT passed:
  - If there are multiple URLs with the same parameters, then only URLs with unique parameter values are included.
  - If there are URLs with a Parameter, but no value (or a fragment), then this will be included.

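
As a rough illustration of two of the steps above (removing an explicit port 80 or 443, and keeping only one URL when an Integer ID is the only difference), here is a minimal sketch. The helper names `strip_default_port` and `dedupe_integer_ids` are hypothetical and the logic is deliberately simplified; it assumes the first URL seen for each shape is the one kept, and it does not reproduce the GUID, Custom ID, language or keyword handling.

```python
import re
from urllib.parse import urlparse

INT_RE = re.compile(r"^\d+$")


def strip_default_port(url):
    """Remove an explicit :80 or :443 from the host part of a URL."""
    parsed = urlparse(url)
    if parsed.port in (80, 443):
        # Drop everything after the last colon of the network location
        netloc = parsed.netloc.rsplit(":", 1)[0]
        parsed = parsed._replace(netloc=netloc)
    return parsed.geturl()


def dedupe_integer_ids(urls):
    """Keep the first URL seen for each path 'shape', treating integer parts as equal."""
    seen = set()
    kept = []
    for url in urls:
        url = strip_default_port(url)
        parsed = urlparse(url)
        # Replace every all-digit path part with a placeholder
        shape = "/".join(
            "{int}" if INT_RE.match(part) else part
            for part in parsed.path.split("/")
        )
        key = (parsed.scheme, parsed.netloc, shape, parsed.query)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

Note that `/blah/1337` and `/1337` produce different shapes here, so both would be kept.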
## Examples

### Basic use

```
cat target_urls.txt | urless
```

or

```
urless -i target_urls.txt
```

### Capture output

```
cat target_urls.txt | urless > output.txt
```

or

```
urless -i target_urls.txt -o output.txt
```

## config.yml

The `config.yml` file has the keys which can be updated to suit your needs:

- `FILTER_KEYWORDS` - A comma separated list of keywords (e.g. `blog,article,news` etc.) that URLs are checked against in certain circumstances.
- `FILTER_EXTENSIONS` - A comma separated list of file extensions (e.g. `.css,.jpg,.jpeg` etc.) that all URLs are checked against. If a URL includes any of the strings then it will be excluded from the output.
- `LANGUAGE` - A comma separated list of language codes (e.g. `en-gb,fr,nl` etc.) that all URLs are checked against when the `-lang` argument is passed. If there are multiple URLs with different language codes, only one version of the URL will be output.
- `REMOVE_PARAMS` - A comma separated list of **case sensitive** parameter names (e.g. `cachebuster,cacheBuster`) that will be removed from all URLs before processing.
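
To make the effect of `REMOVE_PARAMS` concrete, the sketch below strips a comma separated list of parameter names from a URL using only the standard library. The function name `remove_params` is an assumption for this example and this is not necessarily how `urless` does it internally. One caveat of this approach: the query string is re-encoded, so a parameter given without a value (e.g. `?debug`) would come back as `debug=`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_params(url, remove):
    """Drop the named query parameters from url (names are case sensitive)."""
    names = {name for name in remove.split(",") if name}
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `remove_params("https://example.com/a?id=1&utm_source=x&cacheBuster=123", "cachebuster,cacheBuster,utm_source")` leaves only the `id` parameter, whereas a parameter named `CacheBuster` would be kept because the comparison is case sensitive.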

## Custom Regex

There are currently automatic regex checks for a path part being a Globally Unique ID (GUID) and an Integer ID, but the `-rcid` / `--regex-custom-id` argument lets you provide a regular expression to identify a custom ID. For example, if a target has a specific ID format (that isn't a GUID or Integer) then you can specify a regex expression for it, and then only one of those will be returned in the output if the rest of the URL is the same. For example:

- Assume the target has a user ID in a format like `U-65241X`
- And there are multiple URLs like the following:
  ```
  https://target.com/blah/U-61723A/settings
  https://target.com/blah/U-63352B/settings
  https://target.com/blah/U-61351A/profile
  https://target.com/blah/U-61723A/settings
  https://target.com/blah/U-64135C/profile
  ```
- You can call `urless` and pass `-rcid 'U-[0-9]{5}[A-Z]'`, then the output would be:
  ```
  https://target.com/blah/U-61723A/settings
  https://target.com/blah/U-64135C/profile
  ```

**IMPORTANT REGEX NOTES:**

- Writing correct regex expressions can be difficult, and if it isn't correct, you could end up with unpredictable and incorrect output.
- Always enclose your regex expression in single quotes when passing to the `-rcid` argument.
- You don't need to add a custom regex for a GUID or Integer ID - these are dealt with already.
- The regex expression should highlight the whole part of the path. So, if your regex only identifies the start of the path, then add `[^(\?|\/|#|$)]*` to the end of your regex which will mean ALL other characters up until the end of the path part.
- You can add `^` at the start, and `$` at the end, of your regex to ensure it represents the whole part of a path between slashes. However, these will be added for you if they are left out.
- Make sure the regex only identifies the sections you are interested in, otherwise you may have unexpected results. To test your regex, you can take your input file and do `cat input.txt | grep -E 'U-[0-9]{5}[A-Z]'` for example, and see whether your expression looks correct (it should only highlight what you are interested in, and highlight the whole part of the path that is the custom ID).
- You can also test using [Regex101](https://regex101.com), entering sample URLs in the **TEST STRING** section to check if it is correct. Make sure the **REGEX FLAGS** **g**lobal and **m**ultiline are selected.
- There may be cases where you just can't supply a regex that is going to identify the Custom ID correctly without treating other values as the same. For example, if there are URLs like `https://target.com/blah/xnl/settings` where `xnl` is a User Name, you won't be able to create a regex for user name because it is not a unique enough format to distinguish it from other possible path values.
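
The note about `^` and `$` being added, and the idea of matching a whole path part, can be illustrated with the following sketch. The helpers `anchor_regex` and `path_shape` are hypothetical names for this example only. URLs that produce the same shape differ only by their Custom ID, so only one of them would need to be kept.

```python
import re
from urllib.parse import urlparse


def anchor_regex(rx):
    """Add ^ and $ if they were left out, so the regex must match a whole path part."""
    if not rx.startswith("^"):
        rx = "^" + rx
    if not rx.endswith("$"):
        rx = rx + "$"
    return rx


def path_shape(url, custom_id_regex):
    """Replace each path part that matches the Custom ID regex with a placeholder."""
    pattern = re.compile(anchor_regex(custom_id_regex))
    parts = urlparse(url).path.split("/")
    return "/".join("{id}" if pattern.match(part) else part for part in parts)
```

With `'U-[0-9]{5}[A-Z]'`, both `https://target.com/blah/U-61723A/settings` and `https://target.com/blah/U-63352B/settings` give the shape `/blah/{id}/settings`, while a part such as `U-61723AB` is left untouched because the anchored regex does not match the whole part.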

## Issues

If you come across any problems at all, or have ideas for improvements, please feel free to raise an issue on Github. If there is a problem, it will be useful if you can provide the exact command you ran and a detailed description of the problem. If possible, run with `-v` to reproduce the problem and let me know about any error messages that are given.

## TODO

None - feel free to raise a Github issue to suggest any enhancements.

## And finally...

Good luck and good hunting!
If you really love the tool (or any others), or they helped you find an awesome bounty, consider [BUYING ME A COFFEE!](https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)

🤘 /XNL-h4ck3r

<p>
<a href='https://ko-fi.com/B0B3CZKR5' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi2.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>


================================================
FILE: config.yml
================================================
FILTER_KEYWORDS: blog,article,news,bootstrap,jquery,captcha,node_modules
FILTER_EXTENSIONS: .css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image
LANGUAGE: en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru
REMOVE_PARAMS: _,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name


================================================
FILE: setup.py
================================================
#!/usr/bin/env python
import os
import shutil
from setuptools import setup, find_packages

# Define the target directory for the config.yml file
target_directory = (
    os.path.join(os.getenv("APPDATA", ""), "urless")
    if os.name == "nt"
    else (
        os.path.join(os.path.expanduser("~"), ".config", "urless")
        if os.name == "posix"
        else (
            os.path.join(
                os.path.expanduser("~"), "Library", "Application Support", "urless"
            )
            if os.name == "darwin"
            else None
        )
    )
)

# Copy the config.yml file to the target directory if it exists
configNew = False
if target_directory and os.path.isfile("config.yml"):
    os.makedirs(target_directory, exist_ok=True)
    # If file already exists, create a new one
    if os.path.isfile(target_directory + "/config.yml"):
        configNew = True
        os.rename(
            target_directory + "/config.yml", target_directory + "/config.yml.OLD"
        )
        shutil.copy("config.yml", target_directory)
        os.rename(
            target_directory + "/config.yml", target_directory + "/config.yml.NEW"
        )
        os.rename(
            target_directory + "/config.yml.OLD", target_directory + "/config.yml"
        )
    else:
        shutil.copy("config.yml", target_directory)

setup(
    name="urless",
    packages=find_packages(),
    version=__import__("urless").__version__,
    description="De-clutter a list of URLs",
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    author="@xnl-h4ck3r",
    url="https://github.com/xnl-h4ck3r/urless",
    zip_safe=False,
    install_requires=[
        "argparse",
        "pyyaml",
        "termcolor",
        "urlparse3",
        "chardet",
        "requests",
    ],
    entry_points={
        "console_scripts": [
            "urless = urless.urless:main",
        ],
    },
)

if configNew:
    print(
        "\n\033[33mIMPORTANT: The file "
        + target_directory
        + "/config.yml already exists.\nCreating config.yml.NEW but leaving existing config.\nIf you need the new file, then remove the current one and rename config.yml.NEW to config.yml\n\033[0m"
    )
else:
    print(
        "\n\033[92mThe file "
        + target_directory
        + "/config.yml has been created.\n\033[0m"
    )


================================================
FILE: urless/__init__.py
================================================
__version__ = "2.7"


================================================
FILE: urless/urless.py
================================================
#!/usr/bin/env python
# Python 3
# urless - by @Xnl-h4ck3r: De-clutter a list of URLs
# Full help here: https://github.com/xnl-h4ck3r/urless/blob/main/README.md
# Good luck and good hunting! If you really love the tool (or any others), or they helped you find an awesome bounty, consider BUYING ME A COFFEE! (https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)


import re
import os
import sys
from typing import Pattern
import yaml
import argparse
import chardet
from signal import SIGINT, signal
from urllib.parse import urlparse
from termcolor import colored
from pathlib import Path

try:
    from . import __version__
    import warnings

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        import requests
except Exception:
    pass

# Default values if config.yml not found
DEFAULT_FILTER_EXTENSIONS = ".css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image"
DEFAULT_FILTER_KEYWORDS = "blog,article,news,bootstrap,jquery,captcha,node_modules"
DEFAULT_LANGUAGE = "en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru"
DEFAULT_REMOVE_PARAMS = "_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name"

# Variables to hold config.yml values
FILTER_EXTENSIONS = ""
FILTER_KEYWORDS = ""
LANGUAGE = ""
REMOVE_PARAMS = ""
reFilterKeywords = ""
badExtensions = ()


# Regex delimiters
REGEX_START = "^"
REGEX_END = "$"

# Regex for a path folder of integer
REGEX_INTEGER = REGEX_START + r"\d+" + REGEX_END
reIntPart = re.compile(REGEX_INTEGER)
patternsInt = {}

# Regex for a path folder of GUID
REGEX_GUID = (
    REGEX_START
    + "[({]?[a-fA-F0-9]{8}[-]?([a-fA-F0-9]{4}[-]?){3}[a-fA-F0-9]{12}[})]?"
    + REGEX_END
)
reGuidPart = re.compile(REGEX_GUID)
patternsGUID = {}

# Regex fields for Custom ID
reCustomIDPart = Pattern
patternsCustomID = {}

# Regex for path of YYYY/MM
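# Note: in a raw string a single backslash is needed for \d; a doubled one would match a literal backslash
REGEX_YYYYMM = r"\/[1|2][0|1|9]\d{2}/[0|1]\d{1}\/"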
reYYYYMM = re.compile(REGEX_YYYYMM)

# Regex for path of language code
reLangPart = Pattern
patternsLang = {}

# Global variables
args = None
urlmap = {}
patternsSeen = []
outFile = None
linesOrigCount = 0
linesFinalCount = 0
usingConfigDefaults = False


def verbose():
    """
    Function used when printing messages dependent on the verbose option
    """
    return args.verbose


def write(text=""):
    """
    Always print one line to stdout.
    The --no-banner flag and -o/--output already give users
    control over noise and redirection, so extra TTY checks only
    break non-interactive usage (Docker, CI, cron).
    """
    sys.stdout.write(text + "\n")


def writerr(text=""):
    """
    Always print one line to stderr.
    """
    sys.stderr.write(text + "\n")


def showVersion():
    try:
        try:
            resp = requests.get(
                "https://raw.githubusercontent.com/xnl-h4ck3r/urless/main/urless/__init__.py",
                timeout=3,
            )
        except Exception:
            write(
                "Current urless version "
                + __version__
                + " (unable to check if latest)\n"
            )
            # Nothing to compare against, so don't fall through to the version check
            return
        if __version__ == resp.text.split("=")[1].replace('"', "").strip():
            write(
                "Current urless version "
                + __version__
                + " ("
                + colored("latest", "green")
                + ")\n"
            )
        else:
            write(
                "Current urless version "
                + __version__
                + " ("
                + colored("outdated", "red")
                + ")\n"
            )
    except Exception:
        pass


def showBanner():
    write("")
    write(colored(r"  __  _ ____  _   ___  ___ ____ ", "red"))
    write(colored(r" | | | |  _ \| | / _ \/ __/ __/ ", "yellow"))
    write(colored(r" | | | | |_) | ||  __/\__ \__ \ ", "green"))
    write(colored(r" | |_| |  _ <| |_\___/\___/___/ ", "cyan"))
    write(colored(r"  \___/|_| \_\___/", "magenta") + colored("by Xnl-h4ck3r", "white"))
    write("")
    showVersion()


def getConfig():
    """
    Try to get the values from the config file, otherwise use the defaults
    """
    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart, usingConfigDefaults, reFilterKeywords, badExtensions
    try:

        # Try to get the config file values
        try:
            # Put config in global location based on the OS.
            # os.name is never "darwin" (macOS reports "posix"), so the final branch is only a fallback
            if os.name == "nt":
                urlessPath = Path(os.path.join(os.getenv("APPDATA", ""), "urless"))
            elif os.name == "posix":
                urlessPath = Path(
                    os.path.join(os.path.expanduser("~"), ".config", "urless")
                )
            else:
                urlessPath = Path(
                    os.path.join(
                        os.path.expanduser("~"),
                        "Library",
                        "Application Support",
                        "urless",
                    )
                )

            # Use the config path passed as an argument, otherwise the default location
            if args.config is None:
                configPath = urlessPath / "config.yml"
            else:
                configPath = Path(args.config)
            with open(configPath) as configFile:
                config = yaml.safe_load(configFile)

            # If the user provided the --filter_keywords argument then it overrides the config value
            if args.filter_keywords:
                FILTER_KEYWORDS = args.filter_keywords
            else:
                try:
                    FILTER_KEYWORDS = config.get("FILTER_KEYWORDS")
                    if str(FILTER_KEYWORDS) == "None":
                        writerr(
                            colored(
                                "No value for FILTER_KEYWORDS in config.yml - default set",
                                "yellow",
                            )
                        )
                        FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
                except Exception:
                    writerr(
                        colored(
                            "Unable to read FILTER_KEYWORDS from config.yml - default set",
                            "red",
                        )
                    )
                    FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
            reFilterKeywords = re.compile(
                FILTER_KEYWORDS.replace(",", "|"), re.IGNORECASE
            )

            # If the user provided the --filter-extensions argument then it overrides the config value
            if args.filter_extensions:
                FILTER_EXTENSIONS = args.filter_extensions
            else:
                try:
                    FILTER_EXTENSIONS = config.get("FILTER_EXTENSIONS")
                    if str(FILTER_EXTENSIONS) == "None":
                        writerr(
                            colored(
                                "No value for FILTER_EXTENSIONS in config.yml - default set",
                                "yellow",
                            )
                        )
                        FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
                except Exception:
                    writerr(
                        colored(
                            "Unable to read FILTER_EXTENSIONS from config.yml - default set",
                            "red",
                        )
                    )
                    FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
            badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(","))

            # If the user provided the --language argument then create the regex for language codes
            if args.language:
                # Get the language codes
                try:
                    LANGUAGE = config.get("LANGUAGE")
                    if str(LANGUAGE) == "None":
                        writerr(
                            colored(
                                "No value for LANGUAGE in config.yml - default set",
                                "yellow",
                            )
                        )
                        LANGUAGE = DEFAULT_LANGUAGE
                except Exception:
                    writerr(
                        colored(
                            "Unable to read LANGUAGE from config.yml - default set",
                            "red",
                        )
                    )
                    LANGUAGE = DEFAULT_LANGUAGE
                # Set the language regex
                try:
                    reLangPart = re.compile(
                        REGEX_START + "(" + LANGUAGE.replace(",", "|") + ")" + REGEX_END
                    )
                except Exception as e:
                    writerr(colored("ERROR getConfig 2: " + str(e), "red"))

            # If the user provided the --remove-params argument then it overrides the config value
            if args.remove_params:
                REMOVE_PARAMS = args.remove_params
            else:
                try:
                    REMOVE_PARAMS = config.get("REMOVE_PARAMS")
                    if str(REMOVE_PARAMS) == "None":
                        if verbose():
                            writerr(
                                colored(
                                    "No value for REMOVE_PARAMS in config.yml - default set",
                                    "yellow",
                                )
                            )
                        REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
                except Exception:
                    if verbose():
                        writerr(
                            colored(
                                "Unable to read REMOVE_PARAMS from config.yml - default set",
                                "red",
                            )
                        )
                    REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS

        except Exception:
            if args.config is None:
                writerr(
                    colored(
                        'WARNING: Cannot find file "config.yml", so using default values',
                        "yellow",
                    )
                )
            else:
                writerr(
                    colored(
                        'WARNING: Cannot find file "'
                        + args.config
                        + '", so using default values',
                        "yellow",
                    )
                )
            usingConfigDefaults = True
            FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
            FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
            LANGUAGE = DEFAULT_LANGUAGE
            REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
            reFilterKeywords = re.compile(
                FILTER_KEYWORDS.replace(",", "|"), re.IGNORECASE
            )
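            badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(","))
            # The language regex is still needed if the --language argument was passed
            if args.language:
                reLangPart = re.compile(
                    REGEX_START + "(" + LANGUAGE.replace(",", "|") + ")" + REGEX_END
                )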

    except Exception as e:
        writerr(colored("ERROR getConfig 1: " + str(e), "red"))


def ensureConfig():
    """
    Ensure the config.yml file exists in the default config directory.
    If not, create the directory and write the default config.
    This is called before argument parsing so the file is created
    even when running 'urless' or 'urless -h'.
    """
    try:
        # Determine the config directory based on OS
        if os.name == "nt":
            urlessPath = Path(os.path.join(os.getenv("APPDATA", ""), "urless"))
        elif os.name == "posix":
            urlessPath = Path(
                os.path.join(os.path.expanduser("~"), ".config", "urless")
            )
        else:
            urlessPath = Path(
                os.path.join(
                    os.path.expanduser("~"),
                    "Library",
                    "Application Support",
                    "urless",
                )
            )

        configPath = urlessPath / "config.yml"

        # If the config file doesn't exist, create it with default values
        if not configPath.exists():
            try:
                urlessPath.mkdir(parents=True, exist_ok=True)
                with open(configPath, "w") as f:
                    f.write(f"FILTER_KEYWORDS: {DEFAULT_FILTER_KEYWORDS}\n")
                    f.write(f"FILTER_EXTENSIONS: {DEFAULT_FILTER_EXTENSIONS}\n")
                    f.write(f"LANGUAGE: {DEFAULT_LANGUAGE}\n")
                    f.write(f"REMOVE_PARAMS: {DEFAULT_REMOVE_PARAMS}\n")
            except Exception as e:
                writerr(
                    colored("WARNING: Could not create config.yml: " + str(e), "yellow")
                )
    except Exception as e:
        writerr(colored("ERROR ensureConfig: " + str(e), "red"))


def handler(signal_received, frame):
    """
    This function is called if Ctrl-C is pressed by the user
    An attempt will be made to try and clean up properly
    """
    writerr(colored('>>> "Oh my God, they killed Kenny... and urless!" - Kyle', "red"))
    sys.exit()


def paramsToDict(params: str) -> dict:
    """
    converts query string to dict
    """
    try:
        the_dict = {}
        if params:
            for pair in params.split("&"):
                # If there is a parameter but no = then add a value of {EMPTY}
                if pair.find("=") < 0:
                    key = pair + "{EMPTY}"
                    the_dict[key] = "{EMPTY}"
                else:
                    # Split on the first = only, so values containing = are kept intact
                    name, value = pair.split("=", 1)
                    the_dict[name] = value
        return the_dict
    except Exception as e:
        writerr(colored("ERROR paramsToDict 1: " + str(e), "red"))


def dictToParams(params: dict) -> str:
    """
    converts dict of params to query string
    """
    try:
        # If a parameter has a value of {EMPTY} then just the name will be written and no =
        stringed = [
            name if value == "{EMPTY}" else name + "=" + value
            for name, value in params.items()
        ]

        # Only add a ? at the start of parameters, unless the first starts with #
        if list(params.keys())[0][:1] == "#":
            paramString = "".join(stringed)
        else:
            paramString = "?" + "&".join(stringed)

        # If there are any parameters with {EMPTY} in the name then remove the string
        return paramString.replace("{EMPTY}", "")
    except Exception as e:
        writerr(colored("ERROR dictToParams 1: " + str(e), "red"))


def compareParams(currentParams: list, newParams: dict) -> set:
    """
    returns the set of param names in newParams
    that don't exist in any of the currentParams
    (an empty set, which is falsy, means there is nothing new)
    """
    try:
        ogSet = set()
        for each in currentParams:
            ogSet.update(each.keys())
        return set(newParams.keys()) - ogSet
    except Exception as e:
        writerr(colored("ERROR compareParams 1: " + str(e), "red"))


def isUnwantedContent(path: str) -> bool:
    """
    Checks any potentially unwanted patterns (unless specified otherwise) such as blog/news content
    """
    try:
        unwanted = False

        if not args.keep_human_written:
            # If the path has more than 3 dashes '-' AND isn't a GUID AND (if specified) isn't a Custom ID, then assume it's human written content, e.g. blog
            for part in path.split("/"):
                if part.count("-") > 3:
                    isGuid = reGuidPart.search(part) is not None
                    isCustomID = (
                        str(reCustomIDPart.pattern) != ""
                        and reCustomIDPart.search(part) is not None
                    )
                    if not isGuid and not isCustomID:
                        unwanted = True

        if not args.keep_yyyymm:
            # If it contains a year and month in the path then assume it is blog/news content, e.g. .../2019/06/...
            if reYYYYMM.search(path):
                unwanted = True

        return unwanted
    except Exception as e:
        writerr(colored("ERROR isUnwantedContent 1: " + str(e), "red"))


def createPattern(path: str) -> str:
    """
    creates patterns for urls with integers or GUIDs in them
    """
    global patternsGUID, patternsInt, patternsCustomID, patternsLang
    try:
        newParts = []

        regexInt = False
        regexGUID = False
        regexCustom = False
        regexLang = False
        for part in path.split("/"):
            if part == "":
                newParts.append(part)
            elif str(reCustomIDPart.pattern) != "" and reCustomIDPart.search(part):
                regexCustom = True
                newParts.append(reCustomIDPart.pattern)
            elif reGuidPart.search(part):
                regexGUID = True
                newParts.append(reGuidPart.pattern)
            elif reIntPart.match(part):
                regexInt = True
                newParts.append(reIntPart.pattern)
            elif args.language and reLangPart.match(part.lower()):
                regexLang = True
                newParts.append(reLangPart.pattern)
            else:
                newParts.append(part)
        createdPattern = "/".join(newParts)

        # Depending on the type of regex, add the found pattern to the dictionary if it hasn't been added already
        if regexCustom and createdPattern not in patternsCustomID:
            patternsCustomID[createdPattern] = path
        elif regexGUID and createdPattern not in patternsGUID:
            patternsGUID[createdPattern] = path
        elif regexInt and createdPattern not in patternsInt:
            patternsInt[createdPattern] = path
        elif regexLang and createdPattern not in patternsLang:
            patternsLang[createdPattern] = path

        return createdPattern
    except Exception as e:
        writerr(colored("ERROR createPattern 1: " + str(e), "red"))


def patternExists(pattern: str) -> bool:
    """
    Checks if a pattern exists
    """
    try:
        for seen_pattern in patternsSeen:
            # True if the pattern is identical to, or contains, a pattern already seen
            if seen_pattern in pattern:
                return True
        return False
    except Exception as e:
        writerr(colored("ERROR patternExists 1: " + str(e), "red"))


def matchesPatterns(path: str) -> bool:
    """
    checks if the url matches any of the regex patterns
    """
    try:
        for pattern in patternsSeen:
            if re.search(pattern, re.escape(path)) is not None:
                return True
        return False
    except Exception as e:
        writerr(colored("ERROR matchesPatterns 1: " + str(e), "red"))


def hasFilterKeyword(path: str) -> bool:
    """
    checks if the url matches the blacklist regex
    """
    global reFilterKeywords
    try:
        return reFilterKeywords.search(path)
    except Exception as e:
        writerr(colored("ERROR hasFilterKeyword 1: " + str(e), "red"))


def hasBadExtension(path: str) -> bool:
    """
    checks if a url has a blacklisted extension
    """
    global badExtensions
    try:
        return path.lower().endswith(badExtensions)
    except Exception as e:
        writerr(colored("ERROR hasBadExtension 1: " + str(e), "red"))


def removeParameters(params) -> dict:
    """
    Removes any parameters from the parameter dictionary
    """
    global REMOVE_PARAMS
    try:
        # For every parameter name in the REMOVE_PARAMS list, remove from the dictionary passed
        for param in REMOVE_PARAMS.split(","):
            if param in params:
                del params[param]
        return params
    except Exception as e:
        writerr(colored("ERROR removeParameters 1: " + str(e), "red"))


def processUrl(line):

    try:
        parsed = urlparse(line.strip())

        # Set the host
        scheme = parsed.scheme
        if scheme == "":
            host = parsed.netloc
        else:
            host = scheme + "://" + parsed.netloc

        # If the link specifies port 80 or 443, e.g. http://example.com:80, then remove the port
        if str(parsed.port) == "80":
            host = host.replace(":80", "", 1)
        if str(parsed.port) == "443":
            host = host.replace(":443", "", 1)

        # Build the path and parameters
        path, params = parsed.path, paramsToDict(parsed.query)

        # Remove any necessary parameters
        params = removeParameters(params)

        # If there is a fragment...
        #   if arg -fnp / --fragment-not-param was passed, change the path to include the hash,
        #   else, add as the last parameter with a name but with value {EMPTY} that doesn't add an = afterwards
        if parsed.fragment:
            if args.fragment_not_param:
                path = path + "#" + parsed.fragment
            else:
                params["#" + parsed.fragment] = "{EMPTY}"

        # Add the host to the map if it hasn't already been seen
        if host not in urlmap:
            urlmap[host] = {}

        # If the path has an extension we want to exclude, then just return to continue with the next line
        if hasBadExtension(path):
            return

        # If there are no parameters (or the --disregard-params argument was passed) and path isn't empty
        if (not params or args.disregard_params) and path != "":

            # If its unwanted content or has a keyword to be excluded, then just return to continue with the next line
            if isUnwantedContent(path) or hasFilterKeyword(path):
                return

            # If the current path already matches a previously saved pattern then just return to continue with the next line
            if matchesPatterns(path):
                return

        # If the path has ++ in it for any reason, then just output "as is" otherwise it will raise a regex Multiple Repeat Error
        if path.find("++") > 0:
            pattern = path
        else:
            # Create a pattern for the current path
            pattern = createPattern(path)

        # Update the url map
        if pattern not in urlmap[host]:
            urlmap[host][pattern] = [params] if params else []
        elif params and compareParams(urlmap[host][pattern], params):
            urlmap[host][pattern].append(params)

    except ValueError:
        if verbose():
            writerr(
                colored(
                    "This URL caused a Value Error and was not included: " + line, "red"
                )
            )
    except Exception as e:
        writerr(colored("ERROR processUrl 1: " + str(e), "red"))


def processLine(line):
    """
    Process a line from the input based on whether the -ks / --keep-slash argument was passed
    """
    # If the -ks / --keep-slash argument was passed, then just add all URLs,
    # else remove the trailing slash from any URLs (before any query string)
    if args.keep_slash:
        line = line.rstrip("\n")
    else:
        if line.find("/?") > 0:
            line = line.replace("/?", "?", 1)
        else:
            line = line.rstrip("\n").rstrip("/")

    # If the -iq / --ignore-querystring argument was passed, remove any querystring and fragment (unless -fnp is passed, in which case the fragment is only removed if a query string exists too)
    if args.ignore_querystring:
        if args.fragment_not_param:
            line = line.split("?")[0]
        else:
            line = line.split("?")[0].split("#")[0]
    return line


def processInput():
    global linesOrigCount
    try:
        if not sys.stdin.isatty():
            for line in sys.stdin:
                processUrl(processLine(line))
        else:
            with open(os.path.expanduser(args.input), "rb") as f:
                result = chardet.detect(f.read())  # or readline if the file is large

            try:
                linesOrigCount = 0
                with open(
                    os.path.expanduser(args.input), "r", encoding=result["encoding"]
                ) as inFile:
                    for line in inFile:
                        linesOrigCount += 1
                        processUrl(processLine(line))
            except Exception as e:
                writerr(colored("ERROR processInput 2 " + str(e), "red"))
    except Exception as e:
        writerr(colored("ERROR processInput 1: " + str(e), "red"))


def processOutput():
    global linesFinalCount, linesOrigCount, patternsGUID, patternsInt, patternsCustomID, patternsLang
    try:
        # If an output file was specified, open it
        if args.output is not None:
            try:
                outFile = open(os.path.expanduser(args.output), "w")
            except Exception as e:
                writerr(colored("ERROR processOutput 2 " + str(e), "red"))

        # Output all URLs
        for host, value in urlmap.items():
            for path, params in value.items():

                # Replace the regex pattern in the path with the first occurrence of that pattern found
                try:
                    customRegexFound = False
                    if (
                        str(reCustomIDPart.pattern) != ""
                        and path.find(str(reCustomIDPart.pattern)) > 0
                    ):
                        for pattern in patternsCustomID:
                            if pattern == path:
                                path = patternsCustomID[pattern]
                                customRegexFound = True
                    if not customRegexFound:
                        if path.find(REGEX_GUID) > 0:
                            for pattern in patternsGUID:
                                if pattern == path:
                                    path = patternsGUID[pattern]
                        elif path.find(REGEX_INTEGER) > 0:
                            for pattern in patternsInt:
                                if pattern == path:
                                    path = patternsInt[pattern]
                        elif path.find(str(reLangPart.pattern)) > 0:
                            for pattern in patternsLang:
                                if pattern == path:
                                    path = patternsLang[pattern]
                except Exception as e:
                    writerr(colored("ERROR processOutput 4: " + str(e), "red"))

                if params:
                    for param in params:
                        linesFinalCount = linesFinalCount + 1
                        # If an output file was specified, write to the file
                        if args.output is not None:
                            outFile.write(host + path + dictToParams(param) + "\n")
                        else:
                            # If output is piped or the --output argument was not specified, output to STDOUT
                            if not sys.stdin.isatty() or args.output is None:
                                write(host + path + dictToParams(param))
                else:
                    linesFinalCount = linesFinalCount + 1
                    # If an output file was specified, write to the file
                    if args.output is not None:
                        outFile.write(host + path + "\n")
                    else:
                        # If output is piped or the --output argument was not specified, output to STDOUT
                        if not sys.stdin.isatty() or args.output is None:
                            write(host + path)

        if verbose() and sys.stdin.isatty():
            writerr(
                colored(
                    "\nInput reduced from "
                    + str(linesOrigCount)
                    + " to "
                    + str(linesFinalCount)
                    + " lines 🤘",
                    "cyan",
                )
            )

        # Close the output file if it was opened
        try:
            if args.output is not None:
                write(
                    colored("Output successfully written to file: ", "cyan")
                    + colored(args.output, "white")
                )
                write()
                outFile.close()
        except Exception as e:
            writerr(colored("ERROR processOutput 3: " + str(e), "red"))

    except Exception as e:
        writerr(colored("ERROR processOutput 1: " + str(e), "red"))


def showOptionsAndConfig():
    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, usingConfigDefaults
    try:
        write(colored("Selected options and config:", "cyan"))
        write(
            colored("-i: " + args.input, "magenta")
            + colored(" The input file of URLs to de-clutter.", "white")
        )
        if args.output is not None:
            write(
                colored("-o: " + args.output, "magenta")
                + colored(
                    " The output file that the de-cluttered URL list will be written to.",
                    "white",
                )
            )
        else:
            write(
                colored("-o: <STDOUT>", "magenta")
                + colored(
                    " An output file wasn't given, so output will be written to STDOUT.",
                    "white",
                )
            )

        if args.disregard_params:
            write(
                colored("-dp: True", "magenta")
                + colored(
                    " When filtering the URLs, they will not be treated differently just because they have parameters.",
                    "white",
                )
            )

        if args.config:
            if usingConfigDefaults:
                write(
                    colored("-config: " + args.config, "magenta")
                    + colored(" The path of the YML config file.", "white")
                    + colored(" WARNING: Not found, so using default values.", "yellow")
                )
            else:
                write(
                    colored("-config: " + args.config, "magenta")
                    + colored(" The path of the YML config file.", "white")
                )

        if args.filter_keywords:
            write(
                colored("-fk (Keywords to Filter): ", "magenta")
                + colored(args.filter_keywords, "white")
            )
        else:
            write(
                colored("Filter Keywords (from Config.yml): ", "magenta")
                + colored(FILTER_KEYWORDS, "white")
            )

        if args.filter_extensions:
            write(
                colored("-fe (Extensions to Filter): ", "magenta")
                + colored(args.filter_extensions, "white")
            )
        else:
            write(
                colored("Filter Extensions (from Config.yml): ", "magenta")
                + colored(FILTER_EXTENSIONS, "white")
            )

        if args.language:
            write(
                colored("Languages (from Config.yml): ", "magenta")
                + colored(LANGUAGE, "white")
            )
            write(
                colored("-lang: True", "magenta")
                + colored(
                    " If there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output.",
                    "white",
                )
            )

        if args.remove_params:
            write(
                colored("-rp (Params to Remove): ", "magenta")
                + colored(args.remove_params, "white")
            )
        else:
            write(
                colored("Remove Params (from Config.yml): ", "magenta")
                + colored(REMOVE_PARAMS, "white")
            )

        if args.keep_slash:
            write(
                colored("-ks: True", "magenta")
                + colored(
                    " A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.",
                    "white",
                )
            )

        if args.keep_human_written:
            write(
                colored("-khw: True", "magenta")
                + colored(
                    " Prevent URLs with a path part that contains more than 3 dashes (-) from being removed (e.g. blog post).",
                    "white",
                )
            )

        if args.keep_yyyymm:
            write(
                colored("-kym: True", "magenta")
                + colored(
                    " Prevent URLs with a path that contains a year and month in the format `/YYYY/MM/` from being removed (e.g. blog or news).",
                    "white",
                )
            )

        if args.regex_custom_id:
            write(
                colored("-rcid: '" + str(reCustomIDPart.pattern) + "'", "magenta")
                + colored(" USE WITH CAUTION! ", "red")
                + colored(
                    "Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the README for more details on this.",
                    "white",
                )
            )

        if args.ignore_querystring:
            write(
                colored("-iq: True", "magenta")
                + colored(
                    " Remove the query string (including URL fragments `#`) so output is unique paths only.",
                    "white",
                )
            )

        write("")

    except Exception as e:
        writerr(colored("ERROR showOptionsAndConfig 1: " + str(e), "red"))


def argCheckRegexCustomID(value):
    global reCustomIDPart
    try:

        # If the Custom ID regex was passed, then prefix with ^ and suffix with $ if they are not there already
        if value != "":
            if value[0] != REGEX_START:
                value = REGEX_START + value
            if value[-1] != REGEX_END:
                value = value + REGEX_END

        # Try to compile the regex
        reCustomIDPart = re.compile(value)

        return value
    except Exception:
        raise argparse.ArgumentTypeError("Valid regex must be passed.")


def main():

    global args, urlmap, patternsSeen, patternsInt, patternsCustomID, patternsGUID, patternsLang

    # Ensure config.yml exists before anything else
    ensureConfig()

    # Tell Python to run the handler() function when SIGINT is received
    signal(SIGINT, handler)

    # Parse command line arguments
    parser = argparse.ArgumentParser(
        description="urless - by @Xnl-h4ck3r: De-clutter a list of URLs."
    )
    parser.add_argument(
        "-i", "--input", action="store", help="A file of URLs to de-clutter."
    )
    parser.add_argument(
        "-o",
        "--output",
        action="store",
        help="The output file that will contain the de-cluttered list of URLs (default: output.txt). If piped to another program, output will be written to STDOUT instead.",
    )
    parser.add_argument(
        "-fk",
        "--filter-keywords",
        action="store",
        help="A comma separated list of keywords used to exclude links (if there are no parameters). This will override the FILTER_KEYWORDS list specified in config.yml",
        metavar="<comma separated list>",
    )
    parser.add_argument(
        "-fe",
        "--filter-extensions",
        action="store",
        help="A comma separated list of file extensions to exclude. This will override the FILTER_EXTENSIONS list specified in config.yml",
        metavar="<comma separated list>",
    )
    parser.add_argument(
        "-rp",
        "--remove-params",
        action="store",
        help="A comma separated list of case sensitive parameters to remove from all URLs. This will override the REMOVE_PARAMS list specified in config.yml. This can be useful for cache buster parameters for example.",
        metavar="<comma separated list>",
    )
    parser.add_argument(
        "-ks",
        "--keep-slash",
        action="store_true",
        help="A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.",
    )
    parser.add_argument(
        "-khw",
        "--keep-human-written",
        action="store_true",
        help="By default, any URL with a path part that contains more than 3 dashes (-) is removed because it is assumed to be human written content (e.g. blog post) and not interesting. Passing this argument will keep them in the output.",
    )
    parser.add_argument(
        "-kym",
        "--keep-yyyymm",
        action="store_true",
        help="By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM is a month) is removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.",
    )
    parser.add_argument(
        "-rcid",
        "--regex-custom-id",
        action="store",
        help="USE WITH CAUTION! Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the README for more details on this.",
        default="",
        metavar="REGEX",
        type=argCheckRegexCustomID,
    )
    parser.add_argument(
        "-iq",
        "--ignore-querystring",
        action="store_true",
        help="Remove the query string (including URL fragments `#`) so output is unique paths only.",
    )
    parser.add_argument(
        "-fnp",
        "--fragment-not-param",
        action="store_true",
        help="Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) it is usually kept, but if this argument is passed and a link has a filter word and fragment, it will be removed.",
    )
    parser.add_argument(
        "-lang",
        "--language",
        action="store_true",
        help='If passed, and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the "LANGUAGE" section of "config.yml".',
    )
    parser.add_argument(
        "-c",
        "--config",
        action="store",
        help="Path to the YML config file. If not passed, it looks for file 'config.yml' in the default config directory, e.g. '~/.config/urless/'.",
    )
    parser.add_argument(
        "-dp",
        "--disregard-params",
        action="store_true",
        help="There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.",
    )
    parser.add_argument(
        "-nb", "--no-banner", action="store_true", help="Hides the tool banner."
    )
    parser.add_argument("--version", action="store_true", help="Show version number")
    parser.add_argument("-v", "--verbose", action="store_true", help="Verbose output.")
    args = parser.parse_args()

    # If --version was passed, display version and exit
    if args.version:
        write(colored("urless - v" + __version__, "cyan"))
        sys.exit()

    try:
        # If no input was given, raise an error
        if sys.stdin.isatty():
            if args.input is None:
                writerr(
                    colored(
                        "You need to provide an input with -i argument or through <stdin>.",
                        "red",
                    )
                )
                sys.exit()

        # Get the config settings from the config.yml file
        getConfig()

        # If input is not piped, show the banner, and if --verbose option was chosen show options and config values
        if sys.stdin.isatty():
            # Show banner unless requested to hide
            if not args.no_banner:
                showBanner()
            if verbose():
                showOptionsAndConfig()

        # Process the input given on -i (--input), or <stdin>
        processInput()

        # Output the saved urls with parameters
        processOutput()

    except Exception as e:
        writerr(colored("ERROR main 1: " + str(e), "red"))

    # Show ko-fi link if verbose and not piped
    try:
        if verbose() and sys.stdin.isatty():
            writerr(
                colored(
                    "✅ Want to buy me a coffee? ☕ https://ko-fi.com/xnlh4ck3r 🤘",
                    "green",
                )
            )
    except Exception:
        pass

    finally:  # Clean up
        urlmap = None
        patternsSeen = None
        patternsCustomID = None
        patternsGUID = None
        patternsInt = None
        patternsLang = None


if __name__ == "__main__":
    main()
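
A note on `argCheckRegexCustomID` above: it is passed to argparse as a `type=` callable, so the `-rcid` value is anchored and compiled while the arguments are parsed, and an invalid pattern is reported as an argument error. The following is a minimal standalone sketch of that idea, not the tool's own code. The name `anchored_regex` is hypothetical, and the values of `REGEX_START` and `REGEX_END` are assumed to be `^` and `$` based on the comment in the function; the real constants are defined earlier in `urless.py`.

```python
import argparse
import re

# Assumed values; the real constants are defined earlier in urless.py.
REGEX_START = "^"
REGEX_END = "$"


def anchored_regex(value: str) -> str:
    """Hypothetical argparse `type=` callable: anchor the pattern, then check it compiles."""
    if value != "":
        if not value.startswith(REGEX_START):
            value = REGEX_START + value
        if not value.endswith(REGEX_END):
            value = value + REGEX_END
    try:
        re.compile(value)
    except re.error:
        # argparse turns this into a usage error for the offending argument
        raise argparse.ArgumentTypeError("Valid regex must be passed.")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("-rcid", "--regex-custom-id", default="", type=anchored_regex)

# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = parser.parse_args(["-rcid", r"[a-z]{2}\d{4}"])
print(args.regex_custom_id)  # prints ^[a-z]{2}\d{4}$
```

Unlike the original, the sketch returns only the anchored string and does not store the compiled pattern in a global; that difference is a simplification for illustration.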
