Repository: xnl-h4ck3r/urless
Branch: main
Commit: e9bfa484ea6e
Files: 7
Total size: 67.9 KB
Directory structure:
gitextract_gotbcstc/
├── .gitignore
├── CHANGELOG.md
├── README.md
├── config.yml
├── setup.py
└── urless/
├── __init__.py
└── urless.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
build/
dist/
urless.egg-info
__pycache__
test.txt
================================================
FILE: CHANGELOG.md
================================================
## Changelog
- v2.7
- New
- If the `config.yml` file is not found in the expected config directory (e.g. `~/.config/urless/` on Linux or `%APPDATA%/urless/` on Windows), it will be automatically created with default values. This fixes the issue where installing with `pipx` did not create the `config.yml` file.
- Surpresses the warning about `requests` not being able to import `urllib3`.
- v2.6
- Changed
- BUG FIX: Change the type `js.ko` to `ja,ko` in `LANGUAGE` within `config.yml` and `DEFAULT_LANGUAGE` within `urless.py`
- Set `DEFAULT_REMOVE_PARAMS` and the `REMOVE_PARAMS` in `config.yml` file to `_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name` in `urless.py`. These was a mismatch between the two files. Also, the Google Analytics parameters should be removed by default.
- v2.5
- Changed
- Fix the issue of it saying the version is outdated when it is the latest version.
- Applied black code formatting to `__init__.py`, `setup.py`, and `urless.py` to ensure consistent code style.
- v2.4
- Changed
- Various optimizations to improve performance, e.g. Pre-compiled Regular Expressions, Optimized Extension Filtering and Memory-Efficient File Processing.
- v2.3
- Fixed
- Remove TTY-gating that silences output in non-TTY environments like Docker, CI, or cron jobs. The --no-banner flag and -o/--output already provide users control over output, so the extra TTY checks only broke non-interactive usage. Thanks to [@tavgar](https://github.com/tavgar) for the fix in [PR #15](https://github.com/xnl-h4ck3r/urless/pull/15).
- v2.2
- New
- Add argument `-c`/`--config` to specify a path to a custom `config.yml` file. This resolves [Issue 9](https://github.com/xnl-h4ck3r/urless/issues/9).
- Add argument `-dp`/`--disregard-params`. There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. This resolves [Issue 11](https://github.com/xnl-h4ck3r/urless/issues/11) and [Issue 12](https://github.com/xnl-h4ck3r/urless/issues/12).
- Changed
- The description for argument `-khw`/`--keep-human-written` says `By default, any URL with a path part that contains 3 or more dashes (-) are removed` but this will be corrected to `contains more than 3 dashes`.
- Correct the description for argument `-kym`/`--keep-yyyymm` on the `-h` output and `README.md`. It says `By default, any URL with a path containing 3 /YYYY/MM` but the `3` should be removed.
- v2.1
- New
- Add `long_description_content_type` to `setup.py` to upload to PyPi
- Add `urless` to `PyPi` so can be installed with `pip install urless`
- v2.0
- New
- Add `REMOVE_PARAMS` to `config.yml`. This will be a comma separated list of case sensitive parameter names that you want removed completely from URLs. This can be useful to remove cache buster parameters, so will default to `cachebuster,cacheBuster` to show examples.
- Add arg `-rp`/`--remove-params` which can be used to pass a comma separated list of parameter names to remove from URLs. This will override the `REMOVE_PARAMS` list in `config.yml`.
- Show the current version of the tool in the banner, and whether it is the latest, or outdated.
- Add arg `--version` to show the current version of the tool.
- When installing `urless`, if the `config.yml` already exists then it will keep that one and create `config.yml.NEW` in case you need to replace the old config.
- Changed
- Fix a bug that meant defaults were not set correctly if `config.yml` keys are missing.
- v1.3
- New
- Add argument `-fnp`/`--fragment-not-param`. If passed the URL fragments `#` will NOT be treated in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link.
- v1.2
- Changed
- Changes to prevent `SyntaxWarning: invalid escape sequence` errors when Python 3.12 is used.
- v1.1
- Changed
- Add support to automatically identify file encoding.
- v1.0
- Changed
- Add support for quick install using pip or pipx.
- v0.9
- Changed
- Add i18N language codes `gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru`
- v0.8
- New
- Add `DEFAULT_LANGUAGE` constant and `LANGUAGE` key in `config.yml` with the most common language codes: `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,js.ko`
- Add `-lang`/`--language` argument. If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specific in the `LANGUAGE` key of `config.yml`
- Changed
- A URL can have a GUID, Integer, CustomID and Language Code in the same URL and be de-cluttered properly.
- If the Custom Regex ID doesn't start with `^` and end in `$`, those will be added.
- Fix bug where it added the last occurrence of a regex pattern instead of the first.
- Simplify the code in `processUrl` and `createPattern` functions... I had some strange logic that was unnecessary!
- Make sure case is ignored when any `FILTER_EXTENSIONS` in `config.yml` or passed with `-fe` are compared with input.
- v0.7
- New
- Add `-rcid` / `--regex-custom-id` argument to provide a regex expression for a Custom ID that your target uses.
- Add `-nb` / `--no-banner` argument to hide the tool banner. This is only needed if you are not piping input to `urless`.
- Add `-khw` / `--keep-human-written` argument to prevent URLs with a path part that contains 3 or more dashes (-) from being removed (e.g. blog post). These are normally removed by default.
- Add `-kym` / `--keep-yyyymm` argument to prevent URLs with a path part that contains a year and month in the format `/YYYY/DD` (e.g. blog or news). These are normally removed by default.
- Add `-iq` / `--ignore-querystring` argument to remove the query string (including URL fragments `#`) so output is unique paths only.
- Changed
- Fix bug where `/blah/1337` was not being treated differently to `/1337` for example.
- When a Custom ID, GUID or Integer ID is found in a URL, and only one URL from many in the same format are returned in the output, use the first ID found in the input for that ID type.
- v0.6
- New
- By default, a trailing `/` will be removed from the end of a URL.
- Added new argument `-ks`/`--keep-slash` that will ensure any links that do have a trailing slash in the input will not have the slash removed in the output, and therefore there may be identical URLs output, one with and one without a trailing slash.
- v0.5
- Changed
- Fixed Github Issue #3 to remove port 80 and 443 correctly
- v0.4
- Changed
- Various bug fixes
- v0.3
- New
- Add an `__init_.py` file to store the version, and move the image to a separate folder to make it cleaner.
- Changed
- If a line in the input throws an error due to not being a valid URL when parsed, then skip it, but output an error showing the URL if the `-v` arg is passed.
- v0.2
- Fixed the bug `ERROR matchesPatterns 1: missing ), unterminated subpattern at position 237` by escaping the regex string before searching
- v0.1
- Inital release. Please see README.md
================================================
FILE: README.md
================================================
<center><img src="https://github.com/xnl-h4ck3r/urless/blob/main/urless/images/title.png"></center>
## About - v2.7
This is a tool used to de-clutter a list of URLs.
As a starting point, I took the amazing tool [uro](https://github.com/s0md3v/uro/) by Somdev Sangwan. But I wanted to change a few things, make some improvements (like deal with GUIDs) and make it more customizable.
## Installation
`urless` supports **Python 3**.
Install `urless` in default (global) python environment.
```bash
pip install urless
```
OR
```bash
pip install git+https://github.com/xnl-h4ck3r/urless.git -v
```
You can upgrade with
```bash
pip install --upgrade urless
```
### pipx
Quick setup in isolated python environment using [pipx](https://pypa.github.io/pipx/)
```bash
pipx install git+https://github.com/xnl-h4ck3r/urless.git
```
## Usage
| Argument | Long Argument | Description |
| -------- | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -i | --input | A file of URLs to de-clutter. |
| -o | --output | The output file that will contain the de-cluttered list of URLs (default: output.txt). If piped to another program, output will be written to STDOUT instead. |
| -fk | --filter-keywords | A comma separated list of keywords to exclude links (if there no parameters). This will override the `FILTER_KEYWORDS` list specified in config.yml |
| -fe | --filter-extensions | A comma separated list of file extensions to exclude. This will override the `FILTER_EXTENSIONS` list specified in `config.yml` |
| -rp | --remove-params | A comma separated list of **case senistive** parameters to remove from ALL URLs. This will override the `REMOVE_PARAMS` list specified in `config.yml`. This can be useful to remove cache buster parameters for example.\*\* |
| -ks | --keep-slash | A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash. |
| -khw | --keep-human-written | By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output. |
| -kym | --keep-yyyymm | By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output. |
| -rcid | --regex-custom-id | **USE WITH CAUTION!** Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the section below for more details on this. |
| -iq | --ignore-querystring | Remove the query string (including URL fragments `#`) so output is unique paths only. |
| -fnp | --fragment-not-param | Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link. |
| -lang | --language | If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` section of `config.yml`. |
| -c | --config | Path to the YML config file. If not passed, it looks for file `config.yml` in the default config directory, e.g. `~/.config/urless/`. |
| -dp | --disregard-params | There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. |
| -nb | --no-banner | Hides the tool banner (it is hidden by default if you pipe input to urless) output. |
| | --version | Show current version number. |
| -v | --verbose | Verbose output |
## What does it do exactly?
You basically pass a list of URLs in (from a file, or pipe from STDIN), and get a de-cluttered file or URLs out. But in what way are they de-cluttered?
I'll explain this below, but first here are some terms that will be used:
- **FILTER-EXTENSIONS**: This refers to the list of extensions that can either be passed with `-fe`, specified with `FILTER_EXTENSIONS` in the `config.yml`, or if neither of those exist, a default list of `.css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image`.
- **FILTER-KEYWORDS**: This refers to the list of keywords that can either be passed with `-fk`, specified with `FILTER_KEYWORDS` in the `config.yml`, or if neither of those exist, a default list of `blog,article,news,bootstrap,jquery,captcha,node_modules`
- **LANGUAGE**: This refers to the list of language codes that can be specified with `LANGUAGE` in the `config.yml`, or if it doesn't exist, a default list of the most common codes `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,js.ko`
- **UNWANTED-CONTENT**:
- A section of the URL path contains more than 3 dashes (`-`), BUT isn't a GUID. This implies human written content, e.g. `how-to-hack-the-planet`. If arg `-khw` is passed, then this won't be removed.
- The URL contains `/YYYY/MM/` , e.g. a year, month . This is usually static content such as a blog. If arg `-kym` is passed, then this won't be removed.
Here's what happens:
- If a URL has port 80 or 443 explicitly given, then remove it from the URL (e.g. http://example.com:80/test -> http://example.com/test)
- If the URL has any **FILTER-EXTENSIONS**, it will be removed from the output.
- If the URL has NO parameters **OR** the `-dp`/`--disregard-params` argument was passed:
- If the URL contains a **FILTER-KEYWORDS** or **UNWANTED-CONTENT**, it will be removed.
- if the URL query string contains unwanted parameters specified in config `REMOVE_PARAMS` (or overridden wit argument `-rp`/`--remove-params`), they will be removed from all URLs before processing.
- If `-rcid`/`--regex-custom-id` is passed and the URL path contains a Custom ID, only one match to the Custom ID regex will be included if there are multiple URLs where that is the only difference.
- If the URL path contains a GUID, only one of the GUIDs will be included if there are multiple URLs where the GUID is the only difference.
- If the URL path contains an Integer ID, only one of the Integer IDs will be included if there are multiple URLs where the Integer ID is the only difference.
- If the `-lang` argument is passed and the URL contains a language code (e.g. `en-gb`), only one of the language codes will be included if there are multiple URLs where the language code is different.
- Else the URL has Parameters (or a fragment `#`) **AND** the `-dp`/`--disregard-params` argument was NOT passed:
- If there are multiple URLs with the same parameters, then only URLs with unique parameter values are included.
- If there are URL's with a Parameter, but no value (or a fragment), then this will be included.
## Examples
### Basic use
```
cat target_urls.txt | urless
```
or
```
urless -i target_urls.txt
```
### Capture output
```
cat target_urls.txt | urless > output.txt
```
or
```
urless -i target_urls.txt -o output.txt
```
## config.yml
The `config.yml` file has the keys which can be updated to suit your needs:
- `FILTER_KEYWORDS` - A comma separated list of keywords (e.g. `blog,article,news` etc.) that URLs are checked against in certain circumstances.
- `FILTER_EXTENSIONS` - A comma separated list of file extensions (e.g. `.css,.jpg,.jpeg` etc.) that all URLs are checked against. If a URL includes any of the strings then it will be excluded from the output.
- `LANGUAGE` - A comma separated list of language codes (e.g. `en-gb,fr,nl` etc.) that all URLs are checked against when the `-lang` argument is passed. If there are multiple URLs with different language codes, only one version of the URL will be output.
- `REMOVE_PARAMS` - A comma separated list of **case sensitive** parameter names (e.g. `cachebuster,cacheBuster`) that will be removed from all URLs before processing.
## Custom Regex
There are currently automatic regex checks for a path part being a Globally Unique ID (GUID) and an Integer ID, but the `-rcid` / `--regex-custom-id` argument lets you provide a regular expression to identify a custom ID. For example, if a target has a specific ID format (that isn't a GUID or Integer) then you can specify a regex expression for it, and then only one of those will be returned in the output if the rest of the URL is the same. For example:
- Assume the target has a user ID in a format like `U-65241X`
- And there are multiple URLs like the following:
```
https://target.com/blah/U-61723A/settings
https://target.com/blah/U-63352B/settings
https://target.com/blah/U-61351A/profile
https://target.com/blah/U-61723A/settings
https://target.com/blah/U-64135C/profile
```
- You can call `urless` and pass `-rcid 'U-[0-9]{5}[A-Z]'`, then the output would be:
```
https://target.com/blah/U-61723A/settings
https://target.com/blah/U-64135C/profile
```
**IMPORTANT REGEX NOTES:**
- Writing correct regex expressions can be difficult, and if it isn't correct, you could end up with unpredictable and incorrect output.
- Always enclose your regex expression in single quotes when passing to the `-rcid` argument.
- You don't need to add a custom regex for a GUID or Integer ID - these are dealt with already.
- The regex expression should highlight the whole part of the path. So, if your regex only identifies the start of the path, then add `[^(\?|\/|#|$)]*` to the end of your regex which will mean ALL other characters up until the end of the path part.
- You can add `^` at the start, and `$` at the end, of your regex to ensure it represents the whole part of a path between slashes. However, these will be added for you if they are left out.
- Make sure the regex only identifies the sections you are interested in, otherwise you may have unexpected results. To test your regex, you can take your input file and do `cat input.txt | grep -E 'U-[0-9]{5}[A-Z]'` for example, and see whether your expression looks correct (it should only highlight what you are interested in, and highlight the whole part of the path that is the custom ID).
- You can also test using [Regex101](https://regex101.com), entering sample URLs in the **TEST STRING** section to check if it is correct. Make sure the **REGEX FLAGS** **g**lobal and **m**ultiline are selected.
- There maybe cases where you just can't supply a regex that is going to identify the Custom ID correctly without treating other values as the same. For example, if there are URLs like `https://target.com/blah/xnl/settings` where `xnl` is a User Name, you won't be able to create a regex for user name because it is not a unique enough format to distinguish it from other possible path values.
## Issues
If you come across any problems at all, or have ideas for improvements, please feel free to raise an issue on Github. If there is a problem, it will be useful if you can provide the exact command you ran and a detailed description of the problem. If possible, run with `-v` to reproduce the problem and let me know about any error messages that are given.
## TODO
None - feel free to raise a Github issue to suggest any enhancements.
## And finally...
Good luck and good hunting!
If you really love the tool (or any others), or they helped you find an awesome bounty, consider [BUYING ME A COFFEE!](https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)
🤘 /XNL-h4ck3r
<p>
<a href='https://ko-fi.com/B0B3CZKR5' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi2.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
================================================
FILE: config.yml
================================================
FILTER_KEYWORDS: blog,article,news,bootstrap,jquery,captcha,node_modules
FILTER_EXTENSIONS: .css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image
LANGUAGE: en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru
REMOVE_PARAMS: _,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
import os
import shutil
from setuptools import setup, find_packages
# Define the target directory for the config.yml file
target_directory = (
os.path.join(os.getenv("APPDATA", ""), "urless")
if os.name == "nt"
else (
os.path.join(os.path.expanduser("~"), ".config", "urless")
if os.name == "posix"
else (
os.path.join(
os.path.expanduser("~"), "Library", "Application Support", "urless"
)
if os.name == "darwin"
else None
)
)
)
# Copy the config.yml file to the target directory if it exists
configNew = False
if target_directory and os.path.isfile("config.yml"):
os.makedirs(target_directory, exist_ok=True)
# If file already exists, create a new one
if os.path.isfile(target_directory + "/config.yml"):
configNew = True
os.rename(
target_directory + "/config.yml", target_directory + "/config.yml.OLD"
)
shutil.copy("config.yml", target_directory)
os.rename(
target_directory + "/config.yml", target_directory + "/config.yml.NEW"
)
os.rename(
target_directory + "/config.yml.OLD", target_directory + "/config.yml"
)
else:
shutil.copy("config.yml", target_directory)
setup(
name="urless",
packages=find_packages(),
version=__import__("urless").__version__,
description="De-clutter a list of URLs",
long_description=open("README.md").read(),
long_description_content_type="text/markdown",
author="@xnl-h4ck3r",
url="https://github.com/xnl-h4ck3r/urless",
zip_safe=False,
install_requires=[
"argparse",
"pyyaml",
"termcolor",
"urlparse3",
"chardet",
"requests",
],
entry_points={
"console_scripts": [
"urless = urless.urless:main",
],
},
)
if configNew:
print(
"\n\033[33mIMPORTANT: The file "
+ target_directory
+ "/config.yml already exists.\nCreating config.yml.NEW but leaving existing config.\nIf you need the new file, then remove the current one and rename config.yml.NEW to config.yml\n\033[0m"
)
else:
print(
"\n\033[92mThe file "
+ target_directory
+ "/config.yml has been created.\n\033[0m"
)
================================================
FILE: urless/__init__.py
================================================
__version__ = "2.7"
================================================
FILE: urless/urless.py
================================================
#!/usr/bin/env python
# Python 3
# urless - by @Xnl-h4ck3r: De-clutter a list of URLs
# Full help here: https://github.com/xnl-h4ck3r/urless/blob/main/README.md
# Good luck and good hunting! If you really love the tool (or any others), or they helped you find an awesome bounty, consider BUYING ME A COFFEE! (https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)
import re
import os
import sys
from typing import Pattern
import yaml
import argparse
import chardet
from signal import SIGINT, signal
from urllib.parse import urlparse
from termcolor import colored
from pathlib import Path
try:
from . import __version__
import warnings
with warnings.catch_warnings():
warnings.simplefilter("ignore")
import requests
except Exception:
pass
# Default values if config.yml not found
DEFAULT_FILTER_EXTENSIONS = ".css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image"
DEFAULT_FILTER_KEYWORDS = "blog,article,news,bootstrap,jquery,captcha,node_modules"
DEFAULT_LANGUAGE = "en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru"
DEFAULT_REMOVE_PARAMS = "_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name"
# Variables to hold config.yml values
FILTER_EXTENSIONS = ""
FILTER_KEYWORDS = ""
LANGUAGE = ""
REMOVE_PARAMS = ""
reFilterKeywords = ""
badExtensions = ()
# Regex delimiters
REGEX_START = "^"
REGEX_END = "$"
# Regex for a path folder of integer
REGEX_INTEGER = REGEX_START + r"\d+" + REGEX_END
reIntPart = re.compile(REGEX_INTEGER)
patternsInt = {}
# Regex for a path folder of GUID
REGEX_GUID = (
REGEX_START
+ "[({]?[a-fA-F0-9]{8}[-]?([a-fA-F0-9]{4}[-]?){3}[a-fA-F0-9]{12}[})]?"
+ REGEX_END
)
reGuidPart = re.compile(REGEX_GUID)
patternsGUID = {}
# Regex fields for Custom ID
reCustomIDPart = Pattern
patternsCustomID = {}
# Regex for path of YYYY/MM
REGEX_YYYYMM = r"\/[1|2][0|1|9]\\d{2}/[0|1]\\d{1}\/"
reYYYYMM = re.compile(REGEX_YYYYMM)
# Regex for path of language code
reLangPart = Pattern
patternsLang = {}
# Global variables
args = None
urlmap = {}
patternsSeen = []
outFile = None
linesOrigCount = 0
linesFinalCount = 0
usingConfigDefaults = False
def verbose():
"""
Functions used when printing messages dependant on verbose option
"""
return args.verbose
def write(text=""):
"""
Always print one line to stdout.
The --no-banner flag and -o/--output already give users
control over noise and redirection, so extra TTY checks only
break non-interactive usage (Docker, CI, cron).
"""
sys.stdout.write(text + "\n")
def writerr(text=""):
"""
Always print one line to stderr.
"""
sys.stderr.write(text + "\n")
def showVersion():
try:
try:
resp = requests.get(
"https://raw.githubusercontent.com/xnl-h4ck3r/urless/main/urless/__init__.py",
timeout=3,
)
except Exception:
write(
"Current urless version "
+ __version__
+ " (unable to check if latest)\n"
)
if __version__ == resp.text.split("=")[1].replace('"', "").strip():
write(
"Current urless version "
+ __version__
+ " ("
+ colored("latest", "green")
+ ")\n"
)
else:
write(
"Current urless version "
+ __version__
+ " ("
+ colored("outdated", "red")
+ ")\n"
)
except Exception:
pass
def showBanner():
write("")
write(colored(r" __ _ ____ _ ___ ___ ____ ", "red"))
write(colored(r" | | | | _ \| | / _ \/ __/ __/ ", "yellow"))
write(colored(r" | | | | |_) | || __/\__ \__ \ ", "green"))
write(colored(r" | |_| | _ <| |_\___/\___/___/ ", "cyan"))
write(colored(r" \___/|_| \_\___/", "magenta") + colored("by Xnl-h4ck3r", "white"))
write("")
showVersion()
def getConfig():
"""
Try to get the values from the config file, otherwise use the defaults
"""
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart, usingConfigDefaults, reFilterKeywords, badExtensions
try:
# Try to get the config file values
try:
# Put config in global location based on the OS.
urlessPath = (
Path(os.path.join(os.getenv("APPDATA", ""), "urless"))
if os.name == "nt"
else (
Path(os.path.join(os.path.expanduser("~"), ".config", "urless"))
if os.name == "posix"
else (
Path(
os.path.join(
os.path.expanduser("~"),
"Library",
"Application Support",
"urless",
)
)
if os.name == "darwin"
else None
)
)
)
urlessPath.absolute
if args.config is None:
if urlessPath == "":
configPath = "config.yml"
else:
configPath = Path(urlessPath / "config.yml")
else:
configPath = Path(args.config)
config = yaml.safe_load(open(configPath))
# If the user provided the --filter-extensions argument then it overrides the config value
if args.filter_keywords:
FILTER_KEYWORDS = args.filter_keywords
else:
try:
FILTER_KEYWORDS = config.get("FILTER_KEYWORDS")
if str(FILTER_KEYWORDS) == "None":
writerr(
colored(
"No value for FILTER_KEYWORDS in config.yml - default set",
"yellow",
)
)
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
except Exception:
writerr(
colored(
"Unable to read FILTER_EXTENSIONS from config.yml - default set",
"red",
)
)
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
reFilterKeywords = re.compile(
FILTER_KEYWORDS.replace(",", "|"), re.IGNORECASE
)
# If the user provided the --filter-extensions argument then it overrides the config value
if args.filter_extensions:
FILTER_EXTENSIONS = args.filter_extensions
else:
try:
FILTER_EXTENSIONS = config.get("FILTER_EXTENSIONS")
if str(FILTER_EXTENSIONS) == "None":
writerr(
colored(
"No value for FILTER_EXTENSIONS in config.yml - default set",
"yellow",
)
)
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
except Exception:
writerr(
colored(
"Unable to read FILTER_EXTENSIONS from config.yml - default set",
"red",
)
)
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(","))
# If the user provided the --language argument then create the regex for language codes
if args.language:
# Get the language codes
try:
LANGUAGE = config.get("LANGUAGE")
if str(LANGUAGE) == "None":
writerr(
colored(
"No value for LANGUAGE in config.yml - default set",
"yellow",
)
)
LANGUAGE = DEFAULT_LANGUAGE
except Exception:
writerr(
colored(
"Unable to read LANGUAGE from config.yml - default set",
"red",
)
)
LANGUAGE = DEFAULT_LANGUAGE
# Set the language regex
try:
reLangPart = re.compile(
REGEX_START + "(" + LANGUAGE.replace(",", "|") + ")" + REGEX_END
)
except Exception as e:
writerr(colored("ERROR getConfig 2: " + str(e), "red"))
# If the user provided the --remove-params argument then it overrides the config value
if args.remove_params:
REMOVE_PARAMS = args.remove_params
else:
try:
REMOVE_PARAMS = config.get("REMOVE_PARAMS")
if str(REMOVE_PARAMS) == "None":
if verbose():
writerr(
colored(
"No value for REMOVE_PARAMS in config.yml - default set",
"yellow",
)
)
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
except Exception:
if verbose():
writerr(
colored(
"Unable to read REMOVE_PARAMS from config.yml - default set",
"red",
)
)
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
except Exception:
if args.config is None:
writerr(
colored(
'WARNING: Cannot find file "config.yml", so using default values',
"yellow",
)
)
else:
writerr(
colored(
'WARNING: Cannot find file "'
+ args.config
+ '", so using default values',
"yellow",
)
)
usingConfigDefaults = True
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
LANGUAGE = DEFAULT_LANGUAGE
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
reFilterKeywords = re.compile(
FILTER_KEYWORDS.replace(",", "|"), re.IGNORECASE
)
badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(","))
except Exception as e:
writerr(colored("ERROR getConfig 1: " + str(e), "red"))
def ensureConfig():
"""
Ensure the config.yml file exists in the default config directory.
If not, create the directory and write the default config.
This is called before argument parsing so the file is created
even when running 'urless' or 'urless -h'.
"""
try:
# Determine the config directory based on OS
if os.name == "nt":
urlessPath = Path(os.path.join(os.getenv("APPDATA", ""), "urless"))
elif os.name == "posix":
urlessPath = Path(
os.path.join(os.path.expanduser("~"), ".config", "urless")
)
else:
urlessPath = Path(
os.path.join(
os.path.expanduser("~"),
"Library",
"Application Support",
"urless",
)
)
configPath = urlessPath / "config.yml"
# If the config file doesn't exist, create it with default values
if not configPath.exists():
try:
urlessPath.mkdir(parents=True, exist_ok=True)
with open(configPath, "w") as f:
f.write(f"FILTER_KEYWORDS: {DEFAULT_FILTER_KEYWORDS}\n")
f.write(f"FILTER_EXTENSIONS: {DEFAULT_FILTER_EXTENSIONS}\n")
f.write(f"LANGUAGE: {DEFAULT_LANGUAGE}\n")
f.write(f"REMOVE_PARAMS: {DEFAULT_REMOVE_PARAMS}\n")
except Exception as e:
writerr(
colored("WARNING: Could not create config.yml: " + str(e), "yellow")
)
except Exception as e:
writerr(colored("ERROR ensureConfig: " + str(e), "red"))
def handler(signal_received, frame):
"""
This function is called if Ctrl-C is called by the user
An attempt will be made to try and clean up properly
"""
writerr(colored('>>> "Oh my God, they killed Kenny... and urless!" - Kyle', "red"))
sys.exit()
def paramsToDict(params: str) -> list:
"""
converts query string to dict
"""
try:
the_dict = {}
if params:
for pair in params.split("&"):
# If there is a parameter but no = then add a value of {EMPTY}
if pair.find("=") < 0:
key = pair + "{EMPTY}"
the_dict[key] = "{EMPTY}"
else:
parts = pair.split("=")
try:
the_dict[parts[0]] = parts[1]
except IndexError:
pass
return the_dict
except Exception as e:
writerr(colored("ERROR paramsToDict 1: " + str(e), "red"))
def dictToParams(params: dict) -> str:
"""
converts dict of params to query string
"""
try:
# If a parameter has a value of {EMPTY} then just the name will be written and no =
stringed = [
name if value == "{EMPTY}" else name + "=" + value
for name, value in params.items()
]
# Only add a ? at the start of parameters, unless the first starts with #
if list(params.keys())[0][:1] == "#":
paramString = "".join(stringed)
else:
paramString = "?" + "&".join(stringed)
# If a there are any parameters with {EMPTY} in the name then remove the string
return paramString.replace("{EMPTY}", "")
except Exception as e:
writerr(colored("ERROR dictToParams 1: " + str(e), "red"))
def compareParams(currentParams: list, newParams: dict) -> bool:
"""
checks if newParams contain a param
that doesn't exist in currentParams
"""
try:
ogSet = set([])
for each in currentParams:
for key in each.keys():
ogSet.add(key)
return set(newParams.keys()) - ogSet
except Exception as e:
writerr(colored("ERROR compareParams 1: " + str(e), "red"))
def isUnwantedContent(path: str) -> bool:
"""
Checks any potentially unwanted patterns (unless specified otherwise) such as blog/news content
"""
try:
unwanted = False
if not args.keep_human_written:
# If the path has more than 3 dashes '-' AND isn't a GUID AND (if specified) isn't a Custom ID, then assume it's human written content, e.g. blog
for part in path.split("/"):
if part.count("-") > 3:
if str(reCustomIDPart.pattern) == "":
if not reGuidPart.search(part) and reCustomIDPart.search(part):
unwanted = True
else:
if not reGuidPart.search(part):
unwanted = True
if not args.keep_yyyymm:
# If it contains a year and month in the path then assume like blog/news content, r.g. .../2019/06/...
if reYYYYMM.search(path):
unwanted = True
return unwanted
except Exception as e:
writerr(colored("ERROR isUnwantedContent 1: " + str(e), "red"))
def createPattern(path: str) -> str:
"""
creates patterns for urls with integers or GUIDs in them
"""
global patternsGUID, patternsInt, patternsCustomID, patternsLang
try:
newParts = []
regexInt = False
regexGUID = False
regexCustom = False
regexLang = False
for part in path.split("/"):
if part == "":
newParts.append(part)
elif str(reCustomIDPart.pattern) != "" and reCustomIDPart.search(part):
regexCustom = True
newParts.append(reCustomIDPart.pattern)
elif reGuidPart.search(part):
regexGUID = True
newParts.append(reGuidPart.pattern)
elif reIntPart.match(part):
regexInt = True
newParts.append(reIntPart.pattern)
elif args.language and reLangPart.match(part.lower()):
regexLang = True
newParts.append(reLangPart.pattern)
else:
newParts.append(part)
createdPattern = "/".join(newParts)
# Depending on the type of regex, add the found pattern to the dictionary if it hasn't been added already
if regexCustom and createdPattern not in patternsCustomID:
patternsCustomID[createdPattern] = path
elif regexGUID and createdPattern not in patternsGUID:
patternsGUID[createdPattern] = path
elif regexInt and createdPattern not in patternsInt:
patternsInt[createdPattern] = path
elif regexLang and createdPattern not in patternsLang:
patternsLang[createdPattern] = path
return createdPattern
except Exception as e:
writerr(colored("ERROR createPattern 1: " + str(e), "red"))
def patternExists(pattern: str) -> bool:
"""
Checks if a pattern exists
"""
try:
for i, seen_pattern in enumerate(patternsSeen):
if pattern == seen_pattern:
patternsSeen[i] = pattern
return True
elif seen_pattern in pattern:
return True
return False
except Exception as e:
writerr(colored("ERROR patternExists 1: " + str(e), "red"))
def matchesPatterns(path: str) -> bool:
"""
checks if the url matches any of the regex patterns
"""
try:
for pattern in patternsSeen:
if re.search(pattern, re.escape(path)) is not None:
return True
return False
except Exception as e:
writerr(colored("ERROR matchesPatterns 1: " + str(e), "red"))
def hasFilterKeyword(path: str) -> bool:
"""
checks if the url matches the blacklist regex
"""
global reFilterKeywords
try:
return reFilterKeywords.search(path)
except Exception as e:
writerr(colored("ERROR hasFilterKeyword 1: " + str(e), "red"))
def hasBadExtension(path: str) -> bool:
"""
checks if a url has a blacklisted extension
"""
global badExtensions
try:
return path.lower().endswith(badExtensions)
except Exception as e:
writerr(colored("ERROR hasBadExtension 1: " + str(e), "red"))
def removeParameters(params) -> dict:
"""
Removes any parameters from the parameter dictionary
"""
global REMOVE_PARAMS
try:
# For every parameter name in the REMOVE_PARAMS list, remove from the dictionary passed
for param in REMOVE_PARAMS.split(","):
if param in params:
del params[param]
return params
except Exception as e:
writerr(colored("ERROR removeParameters 1: " + str(e), "red"))
def processUrl(line):
try:
parsed = urlparse(line.strip())
# Set the host
scheme = parsed.scheme
if scheme == "":
host = parsed.netloc
else:
host = scheme + "://" + parsed.netloc
# If the link specifies port 80 or 443, e.g. http://example.com:80, then remove the port
if str(parsed.port) == "80":
host = host.replace(":80", "", 1)
if str(parsed.port) == "443":
host = host.replace(":443", "", 1)
# Build the path and parameters
path, params = parsed.path, paramsToDict(parsed.query)
# Remove any necessary parameters
params = removeParameters(params)
# If there is a fragment...
# if arg -fnp / --fragment-not-param was passed, change the path to include the hash,
# else, add as the last parameter with a name but with value {EMPTY} that doesn't add an = afterwards
if parsed.fragment:
if args.fragment_not_param:
path = path + "#" + parsed.fragment
else:
params["#" + parsed.fragment] = "{EMPTY}"
# Add the host to the map if it hasn't already been seen
if host not in urlmap:
urlmap[host] = {}
# If the path has an extension we want to exclude, then just return to continue with the next line
if hasBadExtension(path):
return
# If there are no parameters (or the --disregard-params argument was passed) and path isn't empty
if (not params or args.disregard_params) and path != "":
# If its unwanted content or has a keyword to be excluded, then just return to continue with the next line
if isUnwantedContent(path) or hasFilterKeyword(path):
return
# If the current path already matches a previously saved pattern then just return to continue with the next line
if matchesPatterns(path):
return
# If the path has ++ in it for any reason, then just output "as is" otherwise it will raise a regex Multiple Repeat Error
if path.find("++") > 0:
pattern = path
else:
# Create a pattern for the current path
pattern = createPattern(path)
# Update the url map
if pattern not in urlmap[host]:
urlmap[host][pattern] = [params] if params else []
elif params and compareParams(urlmap[host][pattern], params):
urlmap[host][pattern].append(params)
except ValueError:
if verbose():
writerr(
colored(
"This URL caused a Value Error and was not included: " + line, "red"
)
)
except Exception as e:
writerr(colored("ERROR processUrl 1: " + str(e), "red"))
def processLine(line):
"""
Process a line from the input based on whether the -ks / --keep-slash argument was passed
"""
# If the -ks / --keep-slash argument was passed, then just add all URLs,
# else remove the trailing slash form any URLs (before any query string)
if args.keep_slash:
line = line.rstrip("\n")
else:
if line.find("/?") > 0:
line = line.replace("/?", "?", 1)
else:
line = line.rstrip("\n").rstrip("/")
# If the -iq / --ignore-querystring argument was passed, remove any querystring and fragment (unless -fnp is passed, in which case the fragment is only removed if a query string exists too)
if args.ignore_querystring:
if args.fragment_not_param:
line = line.split("?")[0]
else:
line = line.split("?")[0].split("#")[0]
return line
def processInput():
global linesOrigCount
try:
if not sys.stdin.isatty():
for line in sys.stdin:
processUrl(processLine(line))
else:
with open(os.path.expanduser(args.input), "rb") as f:
result = chardet.detect(f.read()) # or readline if the file is large
try:
linesOrigCount = 0
with open(
os.path.expanduser(args.input), "r", encoding=result["encoding"]
) as inFile:
for line in inFile:
linesOrigCount += 1
processUrl(processLine(line))
except Exception as e:
writerr(colored("ERROR processInput 2 " + str(e), "red"))
except Exception as e:
writerr(colored("ERROR processInput 1: " + str(e), "red"))
def processOutput():
global linesFinalCount, linesOrigCount, patternsGUID, patternsInt, patternsCustomID, patternsLang
try:
# If an output file was specified, open it
if args.output is not None:
try:
outFile = open(os.path.expanduser(args.output), "w")
except Exception as e:
writerr(colored("ERROR processOutput 2 " + str(e), "red"))
# Output all URLs
for host, value in urlmap.items():
for path, params in value.items():
# Replace the regex pattern in the path with the first occurrence of that pattern found
try:
customRegexFound = False
if (
str(reCustomIDPart.pattern) != ""
and path.find(str(reCustomIDPart.pattern)) > 0
):
for pattern in patternsCustomID:
if pattern == path:
path = patternsCustomID[pattern]
customRegexFound = True
if not customRegexFound:
if path.find(REGEX_GUID) > 0:
for pattern in patternsGUID:
if pattern == path:
path = patternsGUID[pattern]
elif path.find(REGEX_INTEGER) > 0:
for pattern in patternsInt:
if pattern == path:
path = patternsInt[pattern]
elif path.find(str(reLangPart.pattern)) > 0:
for pattern in patternsLang:
if pattern == path:
path = patternsLang[pattern]
except Exception as e:
writerr(colored("ERROR processOutput 4: " + str(e), "red"))
if params:
for param in params:
linesFinalCount = linesFinalCount + 1
# If an output file was specified, write to the file
if args.output is not None:
outFile.write(host + path + dictToParams(param) + "\n")
else:
# If output is piped or the --output argument was not specified, output to STDOUT
if not sys.stdin.isatty() or args.output is None:
write(host + path + dictToParams(param))
else:
linesFinalCount = linesFinalCount + 1
# If an output file was specified, write to the file
if args.output is not None:
outFile.write(host + path + "\n")
else:
# If output is piped or the --output argument was not specified, output to STDOUT
if not sys.stdin.isatty() or args.output is None:
write(host + path)
if verbose() and sys.stdin.isatty():
writerr(
colored(
"\nInput reduced from "
+ str(linesOrigCount)
+ " to "
+ str(linesFinalCount)
+ " lines 🤘",
"cyan",
)
)
# Close the output file if it was opened
try:
if args.output is not None:
write(
colored("Output successfully written to file: ", "cyan")
+ colored(args.output, "white")
)
write()
outFile.close()
except Exception as e:
writerr(colored("ERROR processOutput 3: " + str(e), "red"))
except Exception as e:
writerr(colored("ERROR processOutput 1: " + str(e), "red"))
def showOptionsAndConfig():
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, usingConfigDefaults
try:
write(colored("Selected options and config:", "cyan"))
write(
colored("-i: " + args.input, "magenta")
+ colored(" The input file of URLs to de-clutter.", "white")
)
if args.output is not None:
write(
colored("-o: " + args.output, "magenta")
+ colored(
" The output file that the de-cluttered URL list will be written to.",
"white",
)
)
else:
write(
colored("-o: <STDOUT>", "magenta")
+ colored(
" An output file wasn't given, so output will be written to STDOUT.",
"white",
)
)
if args.disregard_params:
write(
colored("-dp: True", "magenta")
+ colored(
" When filtering the URLs, they will not be treated differently just because they have parameters.",
"white",
)
)
if args.config:
if usingConfigDefaults:
write(
colored("-config: " + args.config, "magenta")
+ colored(" The path of the YML config file.", "white")
+ colored(" WARNING: Not found, so using default values.", "yellow")
)
else:
write(
colored("-config: " + args.config, "magenta")
+ colored(" The path of the YML config file.", "white")
)
if args.filter_keywords:
write(
colored("-fk (Keywords to Filter): ", "magenta")
+ colored(args.filter_keywords, "white")
)
else:
write(
colored("Filter Keywords (from Config.yml): ", "magenta")
+ colored(FILTER_KEYWORDS, "white")
)
if args.filter_extensions:
write(
colored("-fe (Extensions to Filter): ", "magenta")
+ colored(args.filter_extensions, "white")
)
else:
write(
colored("Filter Extensions (from Config.yml): ", "magenta")
+ colored(FILTER_EXTENSIONS, "white")
)
if args.language:
write(
colored("Languages (from Config.yml): ", "magenta")
+ colored(LANGUAGE, "white")
)
write(
colored("-lang: True", "magenta")
+ colored(
"If there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output.",
"white",
)
)
if args.remove_params:
write(
colored("-rp (Params to Remove): ", "magenta")
+ colored(args.remove_params, "white")
)
else:
write(
colored("Remove Params (from Config.yml): ", "magenta")
+ colored(REMOVE_PARAMS, "white")
)
if args.keep_slash:
write(
colored("-ks: True", "magenta")
+ colored(
"A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.",
"white",
)
)
if args.keep_human_written:
write(
colored("-khw: True", "magenta")
+ colored(
"Prevent URLs with a path part that contains 3 or more dashes (-) from being removed (e.g. blog post)",
"white",
)
)
if args.keep_yyyymm:
write(
colored("-kym: True", "magenta")
+ colored(
"Prevent URLs with a path part that contains a year and month in the format `/YYYY/DD` (e.g. blog or news)",
"white",
)
)
if args.regex_custom_id:
write(
colored("-rcid: '" + str(reCustomIDPart.pattern) + "'", "magenta")
+ colored(" USE WITH CAUTION! ", "red")
+ colored(
"Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the README for more details on this.",
"white",
)
)
if args.keep_yyyymm:
write(
colored("-iq: True", "magenta")
+ colored(
" Remove the query string (including URL fragments `#`) so output is unique paths only.",
"white",
)
)
write("")
except Exception as e:
writerr(colored("ERROR showOptionsAndConfig 1: " + str(e), "red"))
def argCheckRegexCustomID(value):
global reCustomIDPart
try:
# If the Custom ID regex was passed, then prefix with ^ and suffix with $ if they are not there already
if value != "":
if value[0] != REGEX_START:
value = REGEX_START + value
if value[-1] != REGEX_END:
value = value + REGEX_END
# Try to compile the regex
reCustomIDPart = re.compile(value)
return value
except Exception:
raise argparse.ArgumentTypeError("Valid regex must be passed.")
def main():
global args, urlmap, patternsSeen, patternsInt, patternsCustomID, patternsGUID, patternsLang
# Ensure config.yml exists before anything else
ensureConfig()
# Tell Python to run the handler() function when SIGINT is received
signal(SIGINT, handler)
# Parse command line arguments
parser = argparse.ArgumentParser(
description="urless - by @Xnl-h4ck3r: De-clutter a list of URLs."
)
parser.add_argument(
"-i", "--input", action="store", help="A file of URLs to de-clutter."
)
parser.add_argument(
"-o",
"--output",
action="store",
help="The output file that will contain the de-cluttered list of URLs (default: output.txt). If piped to another program, output will be written to STDOUT instead.",
)
parser.add_argument(
"-fk",
"--filter_keywords",
action="store",
help="A comma separated list of keywords to exclude links (if there no parameters). This will override the FILTER_KEYWORDS list specified in config.yml",
metavar="<comma separated list>",
)
parser.add_argument(
"-fe",
"--filter-extensions",
action="store",
help="A comma separated list of file extensions to exclude. This will override the FILTER_EXTENSIONS list specified in config.yml",
metavar="<comma separated list>",
)
parser.add_argument(
"-rp",
"--remove-params",
action="store",
help="A comma separated list of case sensitive parameters to remove from all URLs. This will override the REMOVE_PARAMS list specified in config.yml. This can be useful for cache buster parameters for example.",
metavar="<comma separated list>",
)
parser.add_argument(
"-ks",
"--keep-slash",
action="store_true",
help="A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.",
)
parser.add_argument(
"-khw",
"--keep-human-written",
action="store_true",
help="By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post) and not interesting. Passing this argument will keep them in the output.",
)
parser.add_argument(
"-kym",
"--keep-yyyymm",
action="store_true",
help="By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.",
)
parser.add_argument(
"-rcid",
"--regex-custom-id",
action="store",
help="USE WITH CAUTION! Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the README for more details on this.",
default="",
metavar="REGEX",
type=argCheckRegexCustomID,
)
parser.add_argument(
"-iq",
"--ignore-querystring",
action="store_true",
help="Remove the query string (including URL fragments `#`) so output is unique paths only.",
)
parser.add_argument(
"-fnp",
"--fragment-not-param",
action="store_true",
help="Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) it is usually kept, but if this argument is passed and a link has a filter word and fragment, it will be removed.",
)
parser.add_argument(
"-lang",
"--language",
action="store_true",
help='If passed, and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the "LANGUAGE" section of "config.yml".',
)
parser.add_argument(
"-c",
"--config",
action="store",
help="Path to the YML config file. If not passed, it looks for file 'config.yml' in the default config directory, e.g. '~/.config/urless/'.",
)
parser.add_argument(
"-dp",
"--disregard-params",
action="store_true",
help="There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.",
)
parser.add_argument(
"-nb", "--no-banner", action="store_true", help="Hides the tool banner."
)
parser.add_argument("--version", action="store_true", help="Show version number")
parser.add_argument("-v", "--verbose", action="store_true", help="Verbose output.")
args = parser.parse_args()
# If --version was passed, display version and exit
if args.version:
write(colored("urless - v" + __version__, "cyan"))
sys.exit()
try:
# If no input was given, raise an error
if sys.stdin.isatty():
if args.input is None:
writerr(
colored(
"You need to provide an input with -i argument or through <stdin>.",
"red",
)
)
sys.exit()
# Get the config settings from the config.yml file
getConfig()
# If input is not piped, show the banner, and if --verbose option was chosen show options and config values
if sys.stdin.isatty():
# Show banner unless requested to hide
if not args.no_banner:
showBanner()
if verbose():
showOptionsAndConfig()
# Process the input given on -i (--input), or <stdin>
processInput()
# Output the saved urls with parameters
processOutput()
except Exception as e:
writerr(colored("ERROR main 1: " + str(e), "red"))
# Show ko-fi link if verbose and not piped
try:
if verbose() and sys.stdin.isatty():
writerr(
colored(
"✅ Want to buy me a coffee? ☕ https://ko-fi.com/xnlh4ck3r 🤘",
"green",
)
)
except Exception:
pass
finally: # Clean up
urlmap = None
patternsSeen = None
patternsCustomID = None
patternsGUID = None
patternsInt = None
patternsLang = None
if __name__ == "__main__":
main()
gitextract_gotbcstc/
├── .gitignore
├── CHANGELOG.md
├── README.md
├── config.yml
├── setup.py
└── urless/
├── __init__.py
└── urless.py
SYMBOL INDEX (25 symbols across 1 files) FILE: urless/urless.py function verbose (line 85) | def verbose(): function write (line 92) | def write(text=""): function writerr (line 102) | def writerr(text=""): function showVersion (line 109) | def showVersion(): function showBanner (line 142) | def showBanner(): function getConfig (line 153) | def getConfig(): function ensureConfig (line 329) | def ensureConfig(): function handler (line 373) | def handler(signal_received, frame): function paramsToDict (line 382) | def paramsToDict(params: str) -> list: function dictToParams (line 405) | def dictToParams(params: dict) -> str: function compareParams (line 428) | def compareParams(currentParams: list, newParams: dict) -> bool: function isUnwantedContent (line 443) | def isUnwantedContent(path: str) -> bool: function createPattern (line 471) | def createPattern(path: str) -> str: function patternExists (line 517) | def patternExists(pattern: str) -> bool: function matchesPatterns (line 533) | def matchesPatterns(path: str) -> bool: function hasFilterKeyword (line 546) | def hasFilterKeyword(path: str) -> bool: function hasBadExtension (line 557) | def hasBadExtension(path: str) -> bool: function removeParameters (line 568) | def removeParameters(params) -> dict: function processUrl (line 583) | def processUrl(line): function processLine (line 659) | def processLine(line): function processInput (line 682) | def processInput(): function processOutput (line 706) | def processOutput(): function showOptionsAndConfig (line 795) | def showOptionsAndConfig(): function argCheckRegexCustomID (line 940) | def argCheckRegexCustomID(value): function main (line 959) | def main():
Condensed preview — 7 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (73K chars).
[
{
"path": ".gitignore",
"chars": 53,
"preview": "build/\r\ndist/\r\nurless.egg-info\r\n__pycache__\r\ntest.txt"
},
{
"path": "CHANGELOG.md",
"chars": 7910,
"preview": "## Changelog\r\n\r\n- v2.7\r\n\r\n - New\r\n - If the `config.yml` file is not found in the expected config directory (e.g. `~"
},
{
"path": "README.md",
"chars": 17320,
"preview": "<center><img src=\"https://github.com/xnl-h4ck3r/urless/blob/main/urless/images/title.png\"></center>\r\n\r\n## About - v2.7\r\n"
},
{
"path": "config.yml",
"chars": 582,
"preview": "FILTER_KEYWORDS: blog,article,news,bootstrap,jquery,captcha,node_modules\r\nFILTER_EXTENSIONS: .css,.ico,.jpg,.jpeg,.png,."
},
{
"path": "setup.py",
"chars": 2437,
"preview": "#!/usr/bin/env python\r\nimport os\r\nimport shutil\r\nfrom setuptools import setup, find_packages\r\n\r\n# Define the target dire"
},
{
"path": "urless/__init__.py",
"chars": 20,
"preview": "__version__ = \"2.7\"\n"
},
{
"path": "urless/urless.py",
"chars": 41179,
"preview": "#!/usr/bin/env python\n# Python 3\n# urless - by @Xnl-h4ck3r: De-clutter a list of URLs\n# Full help here: https://github.c"
}
]
About this extraction
This page contains the full source code of the xnl-h4ck3r/urless GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 7 files (67.9 KB), approximately 14.9k tokens, and a symbol index with 25 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.