Repository: xnl-h4ck3r/urless
Branch: main
Commit: e9bfa484ea6e
Files: 7
Total size: 67.9 KB

Directory structure:
gitextract_gotbcstc/
├── .gitignore
├── CHANGELOG.md
├── README.md
├── config.yml
├── setup.py
└── urless/
    ├── __init__.py
    └── urless.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
build/
dist/
urless.egg-info
__pycache__
test.txt
================================================
FILE: CHANGELOG.md
================================================
## Changelog

- v2.7
  - New
    - If the `config.yml` file is not found in the expected config directory (e.g. `~/.config/urless/` on Linux or `%APPDATA%/urless/` on Windows), it will be automatically created with default values. This fixes the issue where installing with `pipx` did not create the `config.yml` file.
    - Suppresses the warning about `requests` not being able to import `urllib3`.
- v2.6
  - Changed
    - BUG FIX: Change the typo `js.ko` to `ja,ko` in `LANGUAGE` within `config.yml` and `DEFAULT_LANGUAGE` within `urless.py`
    - Set `DEFAULT_REMOVE_PARAMS` in `urless.py` and `REMOVE_PARAMS` in the `config.yml` file to `_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name`. There was a mismatch between the two files. Also, the Google Analytics parameters should be removed by default.
- v2.5
  - Changed
    - Fix the issue of it saying the version is outdated when it is the latest version.
    - Applied black code formatting to `__init__.py`, `setup.py`, and `urless.py` to ensure consistent code style.
- v2.4
  - Changed
    - Various optimizations to improve performance, e.g. Pre-compiled Regular Expressions, Optimized Extension Filtering and Memory-Efficient File Processing.
- v2.3
  - Fixed
    - Remove TTY-gating that silences output in non-TTY environments like Docker, CI, or cron jobs. The `--no-banner` flag and `-o`/`--output` already provide users control over output, so the extra TTY checks only broke non-interactive usage. Thanks to [@tavgar](https://github.com/tavgar) for the fix in [PR #15](https://github.com/xnl-h4ck3r/urless/pull/15).
- v2.2
  - New
    - Add argument `-c`/`--config` to specify a path to a custom `config.yml` file. This resolves [Issue 9](https://github.com/xnl-h4ck3r/urless/issues/9).
    - Add argument `-dp`/`--disregard-params`. There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. This resolves [Issue 11](https://github.com/xnl-h4ck3r/urless/issues/11) and [Issue 12](https://github.com/xnl-h4ck3r/urless/issues/12).
  - Changed
    - The description for argument `-khw`/`--keep-human-written` says `By default, any URL with a path part that contains 3 or more dashes (-) are removed` but this will be corrected to `contains more than 3 dashes`.
    - Correct the description for argument `-kym`/`--keep-yyyymm` on the `-h` output and `README.md`. It says `By default, any URL with a path containing 3 /YYYY/MM` but the `3` should be removed.
- v2.1
  - New
    - Add `long_description_content_type` to `setup.py` to upload to PyPi
    - Add `urless` to `PyPi` so it can be installed with `pip install urless`
- v2.0
  - New
    - Add `REMOVE_PARAMS` to `config.yml`. This will be a comma separated list of case sensitive parameter names that you want removed completely from URLs. This can be useful to remove cache buster parameters, so will default to `cachebuster,cacheBuster` to show examples.
    - Add arg `-rp`/`--remove-params` which can be used to pass a comma separated list of parameter names to remove from URLs. This will override the `REMOVE_PARAMS` list in `config.yml`.
    - Show the current version of the tool in the banner, and whether it is the latest, or outdated.
    - Add arg `--version` to show the current version of the tool.
    - When installing `urless`, if the `config.yml` already exists then it will keep that one and create `config.yml.NEW` in case you need to replace the old config.
  - Changed
    - Fix a bug that meant defaults were not set correctly if `config.yml` keys are missing.
- v1.3
  - New
    - Add argument `-fnp`/`--fragment-not-param`. If passed the URL fragments `#` will NOT be treated in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link.
- v1.2
  - Changed
    - Changes to prevent `SyntaxWarning: invalid escape sequence` errors when Python 3.12 is used.
- v1.1
  - Changed
    - Add support to automatically identify file encoding.
- v1.0
  - Changed
    - Add support for quick install using pip or pipx.
- v0.9
  - Changed
    - Add i18N language codes `gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru`
- v0.8
  - New
    - Add `DEFAULT_LANGUAGE` constant and `LANGUAGE` key in `config.yml` with the most common language codes: `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,js.ko`
    - Add `-lang`/`--language` argument. If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` key of `config.yml`
  - Changed
    - A URL can have a GUID, Integer, CustomID and Language Code in the same URL and be de-cluttered properly.
    - If the Custom Regex ID doesn't start with `^` and end in `$`, those will be added.
    - Fix bug where it added the last occurrence of a regex pattern instead of the first.
    - Simplify the code in `processUrl` and `createPattern` functions... I had some strange logic that was unnecessary!
    - Make sure case is ignored when any `FILTER_EXTENSIONS` in `config.yml` or passed with `-fe` are compared with input.
- v0.7
  - New
    - Add `-rcid` / `--regex-custom-id` argument to provide a regex expression for a Custom ID that your target uses.
    - Add `-nb` / `--no-banner` argument to hide the tool banner. This is only needed if you are not piping input to `urless`.
    - Add `-khw` / `--keep-human-written` argument to prevent URLs with a path part that contains 3 or more dashes (-) from being removed (e.g. blog post). These are normally removed by default.
    - Add `-kym` / `--keep-yyyymm` argument to prevent URLs with a path part that contains a year and month in the format `/YYYY/MM` (e.g. blog or news) from being removed. These are normally removed by default.
    - Add `-iq` / `--ignore-querystring` argument to remove the query string (including URL fragments `#`) so output is unique paths only.
  - Changed
    - Fix bug where `/blah/1337` was not being treated differently to `/1337` for example.
    - When a Custom ID, GUID or Integer ID is found in a URL, and only one URL from many in the same format is returned in the output, use the first ID found in the input for that ID type.
- v0.6
  - New
    - By default, a trailing `/` will be removed from the end of a URL.
    - Added new argument `-ks`/`--keep-slash` that will ensure any links that do have a trailing slash in the input will not have the slash removed in the output, and therefore there may be identical URLs output, one with and one without a trailing slash.
- v0.5
  - Changed
    - Fixed GitHub Issue #3 to remove port 80 and 443 correctly
- v0.4
  - Changed
    - Various bug fixes
- v0.3
  - New
    - Add an `__init__.py` file to store the version, and move the image to a separate folder to make it cleaner.
  - Changed
    - If a line in the input throws an error due to not being a valid URL when parsed, then skip it, but output an error showing the URL if the `-v` arg is passed.
- v0.2
  - Fixed the bug `ERROR matchesPatterns 1: missing ), unterminated subpattern at position 237` by escaping the regex string before searching
- v0.1
  - Initial release. Please see README.md
================================================
FILE: README.md
================================================
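As an illustration of the ID handling described in the changelog (v0.7: `/blah/1337` is treated differently from `/1337`, and the first ID seen is the one kept), the sketch below collapses paths that differ only in an integer segment. It is a simplified stand-in rather than the project's code: `collapse_integer_ids` and the `{int}` placeholder are hypothetical names, and GUIDs, custom IDs and language codes are not handled.

```python
import re

# Matches a path segment that consists only of digits (assumed ID format).
INT_PART = re.compile(r"^\d+$")


def collapse_integer_ids(paths):
    """Keep the first path seen for each 'shape'.

    A shape is the path with every all-digit segment replaced by a
    placeholder, so /user/1 and /user/2 share the shape /user/{int}.
    """
    first_seen = {}
    for path in paths:
        shape = "/".join(
            "{int}" if INT_PART.match(part) else part for part in path.split("/")
        )
        # setdefault keeps the first path recorded for a given shape.
        first_seen.setdefault(shape, path)
    return list(first_seen.values())


print(collapse_integer_ids(["/user/1", "/user/2", "/blah/1337", "/1337"]))
# → ['/user/1', '/blah/1337', '/1337']
```

A dictionary keyed by shape preserves input order, which is what makes "use the first ID found in the input" straightforward here.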

================================================
FILE: config.yml
================================================
FILTER_KEYWORDS: blog,article,news,bootstrap,jquery,captcha,node_modules
FILTER_EXTENSIONS: .css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image
LANGUAGE: en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru
REMOVE_PARAMS: _,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name
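The four keys above each hold a single comma-separated string. As a rough illustration of how such values can be turned into a keyword regex and an extension tuple, here is a minimal sketch. It assumes a hand-rolled `KEY: value` parser (`parse_flat_config` is a hypothetical helper) so that it runs with the standard library only, whereas `urless.py` loads the file with `yaml.safe_load`. It also escapes each keyword before joining, which the project code does not do.

```python
import re


def parse_flat_config(text):
    """Parse 'KEY: a,b,c' lines into a dict mapping KEY to a list of values."""
    config = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return config


sample = "FILTER_KEYWORDS: blog,article,news\nFILTER_EXTENSIONS: .css,.ICO,.jpg\n"
cfg = parse_flat_config(sample)

# Case-insensitive alternation of the keywords, escaped so that regex
# metacharacters in a keyword are matched literally.
keyword_re = re.compile(
    "|".join(re.escape(k) for k in cfg["FILTER_KEYWORDS"]), re.IGNORECASE
)

# str.endswith accepts a tuple, so lower-cased extensions can be tested in one call.
bad_exts = tuple(e.lower() for e in cfg["FILTER_EXTENSIONS"])

print(bool(keyword_re.search("/Blog/post")))       # keyword match ignores case
print("/static/favicon.ICO".lower().endswith(bad_exts))
```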
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
import os
import shutil
from setuptools import setup, find_packages
# Define the target directory for the config.yml file
target_directory = (
os.path.join(os.getenv("APPDATA", ""), "urless")
if os.name == "nt"
else (
os.path.join(os.path.expanduser("~"), ".config", "urless")
if os.name == "posix"
else (
os.path.join(
os.path.expanduser("~"), "Library", "Application Support", "urless"
)
if os.name == "darwin"
else None
)
)
)
# Copy the config.yml file to the target directory if it exists
configNew = False
if target_directory and os.path.isfile("config.yml"):
os.makedirs(target_directory, exist_ok=True)
# If file already exists, create a new one
if os.path.isfile(target_directory + "/config.yml"):
configNew = True
os.rename(
target_directory + "/config.yml", target_directory + "/config.yml.OLD"
)
shutil.copy("config.yml", target_directory)
os.rename(
target_directory + "/config.yml", target_directory + "/config.yml.NEW"
)
os.rename(
target_directory + "/config.yml.OLD", target_directory + "/config.yml"
)
else:
shutil.copy("config.yml", target_directory)
setup(
name="urless",
packages=find_packages(),
version=__import__("urless").__version__,
description="De-clutter a list of URLs",
long_description=open("README.md").read(),
long_description_content_type="text/markdown",
author="@xnl-h4ck3r",
url="https://github.com/xnl-h4ck3r/urless",
zip_safe=False,
install_requires=[
"argparse",
"pyyaml",
"termcolor",
"urlparse3",
"chardet",
"requests",
],
entry_points={
"console_scripts": [
"urless = urless.urless:main",
],
},
)
if configNew:
print(
"\n\033[33mIMPORTANT: The file "
+ target_directory
+ "/config.yml already exists.\nCreating config.yml.NEW but leaving existing config.\nIf you need the new file, then remove the current one and rename config.yml.NEW to config.yml\n\033[0m"
)
else:
print(
"\n\033[92mThe file "
+ target_directory
+ "/config.yml has been created.\n\033[0m"
)
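One detail of the directory selection above is worth illustrating: `os.name` reports `"posix"` on macOS, so the `"darwin"` branch is not reached there and macOS resolves to `~/.config/urless`. The sketch below is a hypothetical alternative, not the project's behaviour, that distinguishes the three platforms with `sys.platform` instead. `config_dir` and its parameters are names invented for this example.

```python
import os
import sys
from pathlib import Path


def config_dir(platform=sys.platform, env=os.environ, home=None):
    """Return a per-OS config directory for urless (illustrative only)."""
    home = Path(home) if home is not None else Path.home()
    if platform.startswith("win"):
        # Windows: %APPDATA%/urless
        return Path(env.get("APPDATA", "")) / "urless"
    if platform == "darwin":
        # macOS: ~/Library/Application Support/urless
        return home / "Library" / "Application Support" / "urless"
    # Linux and other POSIX systems: ~/.config/urless
    return home / ".config" / "urless"


print(config_dir("linux", {}, "/home/user"))
print(config_dir("darwin", {}, "/Users/user"))
```

Passing the platform, environment and home directory as parameters keeps the function easy to exercise without changing the real environment.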
================================================
FILE: urless/__init__.py
================================================
__version__ = "2.7"
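This one-line file is also what the version check in `urless/urless.py` downloads and parses: it splits the text on `=`, strips the quotes, and compares the result with the installed `__version__`. The sketch below reproduces that string handling offline. `extract_version` and `version_status` are hypothetical helper names, and no network request is made.

```python
def extract_version(init_py_text):
    """Pull the version out of text of the form: __version__ = "2.7"."""
    return init_py_text.split("=")[1].replace('"', "").strip()


def version_status(local_version, remote_init_text):
    """Return 'latest' when the versions match exactly, else 'outdated'."""
    if local_version == extract_version(remote_init_text):
        return "latest"
    return "outdated"


remote_text = '__version__ = "2.7"\n'
print(extract_version(remote_text))        # → 2.7
print(version_status("2.7", remote_text))  # → latest
print(version_status("2.6", remote_text))  # → outdated
```

Because the comparison is plain string equality, any difference, including a newer local version, is reported as outdated.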
================================================
FILE: urless/urless.py
================================================
#!/usr/bin/env python
# Python 3
# urless - by @Xnl-h4ck3r: De-clutter a list of URLs
# Full help here: https://github.com/xnl-h4ck3r/urless/blob/main/README.md
# Good luck and good hunting! If you really love the tool (or any others), or they helped you find an awesome bounty, consider BUYING ME A COFFEE! (https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)
import re
import os
import sys
from typing import Pattern
import yaml
import argparse
import chardet
from signal import SIGINT, signal
from urllib.parse import urlparse
from termcolor import colored
from pathlib import Path
try:
from . import __version__
import warnings
with warnings.catch_warnings():
warnings.simplefilter("ignore")
import requests
except Exception:
pass
# Default values if config.yml not found
DEFAULT_FILTER_EXTENSIONS = ".css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image"
DEFAULT_FILTER_KEYWORDS = "blog,article,news,bootstrap,jquery,captcha,node_modules"
DEFAULT_LANGUAGE = "en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru"
DEFAULT_REMOVE_PARAMS = "_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name"
# Variables to hold config.yml values
FILTER_EXTENSIONS = ""
FILTER_KEYWORDS = ""
LANGUAGE = ""
REMOVE_PARAMS = ""
reFilterKeywords = ""
badExtensions = ()
# Regex delimiters
REGEX_START = "^"
REGEX_END = "$"
# Regex for a path folder of integer
REGEX_INTEGER = REGEX_START + r"\d+" + REGEX_END
reIntPart = re.compile(REGEX_INTEGER)
patternsInt = {}
# Regex for a path folder of GUID
REGEX_GUID = (
REGEX_START
+ "[({]?[a-fA-F0-9]{8}[-]?([a-fA-F0-9]{4}[-]?){3}[a-fA-F0-9]{12}[})]?"
+ REGEX_END
)
reGuidPart = re.compile(REGEX_GUID)
patternsGUID = {}
# Regex fields for Custom ID
reCustomIDPart = Pattern
patternsCustomID = {}
# Regex for path of YYYY/MM
REGEX_YYYYMM = r"/[12][019]\d{2}/[01]\d/"
reYYYYMM = re.compile(REGEX_YYYYMM)
# Regex for path of language code
reLangPart = Pattern
patternsLang = {}
# Global variables
args = None
urlmap = {}
patternsSeen = []
outFile = None
linesOrigCount = 0
linesFinalCount = 0
usingConfigDefaults = False
def verbose():
"""
Function used when printing messages that depend on the verbose option
"""
return args.verbose
def write(text=""):
"""
Always print one line to stdout.
The --no-banner flag and -o/--output already give users
control over noise and redirection, so extra TTY checks only
break non-interactive usage (Docker, CI, cron).
"""
sys.stdout.write(text + "\n")
def writerr(text=""):
"""
Always print one line to stderr.
"""
sys.stderr.write(text + "\n")
def showVersion():
try:
try:
resp = requests.get(
"https://raw.githubusercontent.com/xnl-h4ck3r/urless/main/urless/__init__.py",
timeout=3,
)
except Exception:
write(
"Current urless version "
+ __version__
+ " (unable to check if latest)\n"
)
if __version__ == resp.text.split("=")[1].replace('"', "").strip():
write(
"Current urless version "
+ __version__
+ " ("
+ colored("latest", "green")
+ ")\n"
)
else:
write(
"Current urless version "
+ __version__
+ " ("
+ colored("outdated", "red")
+ ")\n"
)
except Exception:
pass
def showBanner():
write("")
write(colored(r" __ _ ____ _ ___ ___ ____ ", "red"))
write(colored(r" | | | | _ \| | / _ \/ __/ __/ ", "yellow"))
write(colored(r" | | | | |_) | || __/\__ \__ \ ", "green"))
write(colored(r" | |_| | _ <| |_\___/\___/___/ ", "cyan"))
write(colored(r" \___/|_| \_\___/", "magenta") + colored("by Xnl-h4ck3r", "white"))
write("")
showVersion()
def getConfig():
"""
Try to get the values from the config file, otherwise use the defaults
"""
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart, usingConfigDefaults, reFilterKeywords, badExtensions
try:
# Try to get the config file values
try:
# Put config in global location based on the OS.
urlessPath = (
Path(os.path.join(os.getenv("APPDATA", ""), "urless"))
if os.name == "nt"
else (
Path(os.path.join(os.path.expanduser("~"), ".config", "urless"))
if os.name == "posix"
else (
Path(
os.path.join(
os.path.expanduser("~"),
"Library",
"Application Support",
"urless",
)
)
if os.name == "darwin"
else None
)
)
)
urlessPath.absolute
if args.config is None:
if urlessPath == "":
configPath = "config.yml"
else:
configPath = Path(urlessPath / "config.yml")
else:
configPath = Path(args.config)
config = yaml.safe_load(open(configPath))
# If the user provided the --filter-keywords argument then it overrides the config value
if args.filter_keywords:
FILTER_KEYWORDS = args.filter_keywords
else:
try:
FILTER_KEYWORDS = config.get("FILTER_KEYWORDS")
if str(FILTER_KEYWORDS) == "None":
writerr(
colored(
"No value for FILTER_KEYWORDS in config.yml - default set",
"yellow",
)
)
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
except Exception:
writerr(
colored(
"Unable to read FILTER_KEYWORDS from config.yml - default set",
"red",
)
)
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
reFilterKeywords = re.compile(
FILTER_KEYWORDS.replace(",", "|"), re.IGNORECASE
)
# If the user provided the --filter-extensions argument then it overrides the config value
if args.filter_extensions:
FILTER_EXTENSIONS = args.filter_extensions
else:
try:
FILTER_EXTENSIONS = config.get("FILTER_EXTENSIONS")
if str(FILTER_EXTENSIONS) == "None":
writerr(
colored(
"No value for FILTER_EXTENSIONS in config.yml - default set",
"yellow",
)
)
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
except Exception:
writerr(
colored(
"Unable to read FILTER_EXTENSIONS from config.yml - default set",
"red",
)
)
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(","))
# If the user provided the --language argument then create the regex for language codes
if args.language:
# Get the language codes
try:
LANGUAGE = config.get("LANGUAGE")
if str(LANGUAGE) == "None":
writerr(
colored(
"No value for LANGUAGE in config.yml - default set",
"yellow",
)
)
LANGUAGE = DEFAULT_LANGUAGE
except Exception:
writerr(
colored(
"Unable to read LANGUAGE from config.yml - default set",
"red",
)
)
LANGUAGE = DEFAULT_LANGUAGE
# Set the language regex
try:
reLangPart = re.compile(
REGEX_START + "(" + LANGUAGE.replace(",", "|") + ")" + REGEX_END
)
except Exception as e:
writerr(colored("ERROR getConfig 2: " + str(e), "red"))
# If the user provided the --remove-params argument then it overrides the config value
if args.remove_params:
REMOVE_PARAMS = args.remove_params
else:
try:
REMOVE_PARAMS = config.get("REMOVE_PARAMS")
if str(REMOVE_PARAMS) == "None":
if verbose():
writerr(
colored(
"No value for REMOVE_PARAMS in config.yml - default set",
"yellow",
)
)
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
except Exception:
if verbose():
writerr(
colored(
"Unable to read REMOVE_PARAMS from config.yml - default set",
"red",
)
)
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
except Exception:
if args.config is None:
writerr(
colored(
'WARNING: Cannot find file "config.yml", so using default values',
"yellow",
)
)
else:
writerr(
colored(
'WARNING: Cannot find file "'
+ args.config
+ '", so using default values',
"yellow",
)
)
usingConfigDefaults = True
FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS
FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS
LANGUAGE = DEFAULT_LANGUAGE
REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS
reFilterKeywords = re.compile(
FILTER_KEYWORDS.replace(",", "|"), re.IGNORECASE
)
badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(","))
except Exception as e:
writerr(colored("ERROR getConfig 1: " + str(e), "red"))
def ensureConfig():
"""
Ensure the config.yml file exists in the default config directory.
If not, create the directory and write the default config.
This is called before argument parsing so the file is created
even when running 'urless' or 'urless -h'.
"""
try:
# Determine the config directory based on OS
if os.name == "nt":
urlessPath = Path(os.path.join(os.getenv("APPDATA", ""), "urless"))
elif os.name == "posix":
urlessPath = Path(
os.path.join(os.path.expanduser("~"), ".config", "urless")
)
else:
urlessPath = Path(
os.path.join(
os.path.expanduser("~"),
"Library",
"Application Support",
"urless",
)
)
configPath = urlessPath / "config.yml"
# If the config file doesn't exist, create it with default values
if not configPath.exists():
try:
urlessPath.mkdir(parents=True, exist_ok=True)
with open(configPath, "w") as f:
f.write(f"FILTER_KEYWORDS: {DEFAULT_FILTER_KEYWORDS}\n")
f.write(f"FILTER_EXTENSIONS: {DEFAULT_FILTER_EXTENSIONS}\n")
f.write(f"LANGUAGE: {DEFAULT_LANGUAGE}\n")
f.write(f"REMOVE_PARAMS: {DEFAULT_REMOVE_PARAMS}\n")
except Exception as e:
writerr(
colored("WARNING: Could not create config.yml: " + str(e), "yellow")
)
except Exception as e:
writerr(colored("ERROR ensureConfig: " + str(e), "red"))
def handler(signal_received, frame):
"""
This function is called if Ctrl-C is pressed by the user
An attempt will be made to try and clean up properly
"""
writerr(colored('>>> "Oh my God, they killed Kenny... and urless!" - Kyle', "red"))
sys.exit()
def paramsToDict(params: str) -> dict:
"""
converts query string to dict
"""
try:
the_dict = {}
if params:
for pair in params.split("&"):
# If there is a parameter but no = then add a value of {EMPTY}
if pair.find("=") < 0:
key = pair + "{EMPTY}"
the_dict[key] = "{EMPTY}"
else:
parts = pair.split("=")
try:
the_dict[parts[0]] = parts[1]
except IndexError:
pass
return the_dict
except Exception as e:
writerr(colored("ERROR paramsToDict 1: " + str(e), "red"))
def dictToParams(params: dict) -> str:
"""
converts dict of params to query string
"""
try:
# If a parameter has a value of {EMPTY} then just the name will be written and no =
stringed = [
name if value == "{EMPTY}" else name + "=" + value
for name, value in params.items()
]
# Only add a ? at the start of parameters, unless the first starts with #
if list(params.keys())[0][:1] == "#":
paramString = "".join(stringed)
else:
paramString = "?" + "&".join(stringed)
# If there are any parameters with {EMPTY} in the name then remove the string
return paramString.replace("{EMPTY}", "")
except Exception as e:
writerr(colored("ERROR dictToParams 1: " + str(e), "red"))
def compareParams(currentParams: list, newParams: dict) -> set:
"""
checks if newParams contain a param
that doesn't exist in currentParams
"""
try:
ogSet = set([])
for each in currentParams:
for key in each.keys():
ogSet.add(key)
return set(newParams.keys()) - ogSet
except Exception as e:
writerr(colored("ERROR compareParams 1: " + str(e), "red"))
def isUnwantedContent(path: str) -> bool:
"""
Checks any potentially unwanted patterns (unless specified otherwise) such as blog/news content
"""
try:
unwanted = False
if not args.keep_human_written:
# If the path has more than 3 dashes '-' AND isn't a GUID AND (if specified) isn't a Custom ID, then assume it's human written content, e.g. blog
for part in path.split("/"):
if part.count("-") > 3:
if str(reCustomIDPart.pattern) != "":
if not reGuidPart.search(part) and not reCustomIDPart.search(part):
unwanted = True
else:
if not reGuidPart.search(part):
unwanted = True
if not args.keep_yyyymm:
# If it contains a year and month in the path then assume it is blog/news content, e.g. .../2019/06/...
if reYYYYMM.search(path):
unwanted = True
return unwanted
except Exception as e:
writerr(colored("ERROR isUnwantedContent 1: " + str(e), "red"))
def createPattern(path: str) -> str:
"""
creates patterns for urls with integers or GUIDs in them
"""
global patternsGUID, patternsInt, patternsCustomID, patternsLang
try:
newParts = []
regexInt = False
regexGUID = False
regexCustom = False
regexLang = False
for part in path.split("/"):
if part == "":
newParts.append(part)
elif str(reCustomIDPart.pattern) != "" and reCustomIDPart.search(part):
regexCustom = True
newParts.append(reCustomIDPart.pattern)
elif reGuidPart.search(part):
regexGUID = True
newParts.append(reGuidPart.pattern)
elif reIntPart.match(part):
regexInt = True
newParts.append(reIntPart.pattern)
elif args.language and reLangPart.match(part.lower()):
regexLang = True
newParts.append(reLangPart.pattern)
else:
newParts.append(part)
createdPattern = "/".join(newParts)
# Depending on the type of regex, add the found pattern to the dictionary if it hasn't been added already
if regexCustom and createdPattern not in patternsCustomID:
patternsCustomID[createdPattern] = path
elif regexGUID and createdPattern not in patternsGUID:
patternsGUID[createdPattern] = path
elif regexInt and createdPattern not in patternsInt:
patternsInt[createdPattern] = path
elif regexLang and createdPattern not in patternsLang:
patternsLang[createdPattern] = path
return createdPattern
except Exception as e:
writerr(colored("ERROR createPattern 1: " + str(e), "red"))
def patternExists(pattern: str) -> bool:
"""
Checks if a pattern exists
"""
try:
for i, seen_pattern in enumerate(patternsSeen):
if pattern == seen_pattern:
patternsSeen[i] = pattern
return True
elif seen_pattern in pattern:
return True
return False
except Exception as e:
writerr(colored("ERROR patternExists 1: " + str(e), "red"))
def matchesPatterns(path: str) -> bool:
"""
checks if the url matches any of the regex patterns
"""
try:
for pattern in patternsSeen:
if re.search(pattern, re.escape(path)) is not None:
return True
return False
except Exception as e:
writerr(colored("ERROR matchesPatterns 1: " + str(e), "red"))
def hasFilterKeyword(path: str) -> bool:
"""
checks if the url matches the blacklist regex
"""
global reFilterKeywords
try:
return reFilterKeywords.search(path)
except Exception as e:
writerr(colored("ERROR hasFilterKeyword 1: " + str(e), "red"))
def hasBadExtension(path: str) -> bool:
"""
checks if a url has a blacklisted extension
"""
global badExtensions
try:
return path.lower().endswith(badExtensions)
except Exception as e:
writerr(colored("ERROR hasBadExtension 1: " + str(e), "red"))
def removeParameters(params) -> dict:
"""
Removes any parameters from the parameter dictionary
"""
global REMOVE_PARAMS
try:
# For every parameter name in the REMOVE_PARAMS list, remove from the dictionary passed
for param in REMOVE_PARAMS.split(","):
if param in params:
del params[param]
return params
except Exception as e:
writerr(colored("ERROR removeParameters 1: " + str(e), "red"))
def processUrl(line):
try:
parsed = urlparse(line.strip())
# Set the host
scheme = parsed.scheme
if scheme == "":
host = parsed.netloc
else:
host = scheme + "://" + parsed.netloc
# If the link specifies port 80 or 443, e.g. http://example.com:80, then remove the port
if str(parsed.port) == "80":
host = host.replace(":80", "", 1)
if str(parsed.port) == "443":
host = host.replace(":443", "", 1)
# Build the path and parameters
path, params = parsed.path, paramsToDict(parsed.query)
# Remove any necessary parameters
params = removeParameters(params)
# If there is a fragment...
# if arg -fnp / --fragment-not-param was passed, change the path to include the hash,
# else, add as the last parameter with a name but with value {EMPTY} that doesn't add an = afterwards
if parsed.fragment:
if args.fragment_not_param:
path = path + "#" + parsed.fragment
else:
params["#" + parsed.fragment] = "{EMPTY}"
# Add the host to the map if it hasn't already been seen
if host not in urlmap:
urlmap[host] = {}
# If the path has an extension we want to exclude, then just return to continue with the next line
if hasBadExtension(path):
return
# If there are no parameters (or the --disregard-params argument was passed) and path isn't empty
if (not params or args.disregard_params) and path != "":
# If its unwanted content or has a keyword to be excluded, then just return to continue with the next line
if isUnwantedContent(path) or hasFilterKeyword(path):
return
# If the current path already matches a previously saved pattern then just return to continue with the next line
if matchesPatterns(path):
return
# If the path has ++ in it for any reason, then just output "as is" otherwise it will raise a regex Multiple Repeat Error
if path.find("++") > 0:
pattern = path
else:
# Create a pattern for the current path
pattern = createPattern(path)
# Update the url map
if pattern not in urlmap[host]:
urlmap[host][pattern] = [params] if params else []
elif params and compareParams(urlmap[host][pattern], params):
urlmap[host][pattern].append(params)
except ValueError:
if verbose():
writerr(
colored(
"This URL caused a Value Error and was not included: " + line, "red"
)
)
except Exception as e:
writerr(colored("ERROR processUrl 1: " + str(e), "red"))
def processLine(line):
"""
Process a line from the input based on whether the -ks / --keep-slash argument was passed
"""
# If the -ks / --keep-slash argument was passed, then just add all URLs,
# else remove the trailing slash from any URLs (before any query string)
if args.keep_slash:
line = line.rstrip("\n")
else:
if line.find("/?") > 0:
line = line.replace("/?", "?", 1)
else:
line = line.rstrip("\n").rstrip("/")
# If the -iq / --ignore-querystring argument was passed, remove any querystring and fragment (unless -fnp is passed, in which case the fragment is only removed if a query string exists too)
if args.ignore_querystring:
if args.fragment_not_param:
line = line.split("?")[0]
else:
line = line.split("?")[0].split("#")[0]
return line
def processInput():
global linesOrigCount
try:
if not sys.stdin.isatty():
for line in sys.stdin:
processUrl(processLine(line))
else:
with open(os.path.expanduser(args.input), "rb") as f:
result = chardet.detect(f.read()) # or readline if the file is large
try:
linesOrigCount = 0
with open(
os.path.expanduser(args.input), "r", encoding=result["encoding"]
) as inFile:
for line in inFile:
linesOrigCount += 1
processUrl(processLine(line))
except Exception as e:
writerr(colored("ERROR processInput 2 " + str(e), "red"))
except Exception as e:
writerr(colored("ERROR processInput 1: " + str(e), "red"))
def processOutput():
global linesFinalCount, linesOrigCount, patternsGUID, patternsInt, patternsCustomID, patternsLang
try:
# If an output file was specified, open it
if args.output is not None:
try:
outFile = open(os.path.expanduser(args.output), "w")
except Exception as e:
writerr(colored("ERROR processOutput 2 " + str(e), "red"))
# Output all URLs
for host, value in urlmap.items():
for path, params in value.items():
# Replace the regex pattern in the path with the first occurrence of that pattern found
try:
customRegexFound = False
if (
str(reCustomIDPart.pattern) != ""
and path.find(str(reCustomIDPart.pattern)) > 0
):
for pattern in patternsCustomID:
if pattern == path:
path = patternsCustomID[pattern]
customRegexFound = True
if not customRegexFound:
if path.find(REGEX_GUID) > 0:
for pattern in patternsGUID:
if pattern == path:
path = patternsGUID[pattern]
elif path.find(REGEX_INTEGER) > 0:
for pattern in patternsInt:
if pattern == path:
path = patternsInt[pattern]
elif path.find(str(reLangPart.pattern)) > 0:
for pattern in patternsLang:
if pattern == path:
path = patternsLang[pattern]
except Exception as e:
writerr(colored("ERROR processOutput 4: " + str(e), "red"))
if params:
for param in params:
linesFinalCount = linesFinalCount + 1
# If an output file was specified, write to the file
if args.output is not None:
outFile.write(host + path + dictToParams(param) + "\n")
else:
# If output is piped or the --output argument was not specified, output to STDOUT
if not sys.stdin.isatty() or args.output is None:
write(host + path + dictToParams(param))
else:
linesFinalCount = linesFinalCount + 1
# If an output file was specified, write to the file
if args.output is not None:
outFile.write(host + path + "\n")
else:
# If output is piped or the --output argument was not specified, output to STDOUT
if not sys.stdin.isatty() or args.output is None:
write(host + path)
if verbose() and sys.stdin.isatty():
writerr(
colored(
"\nInput reduced from "
+ str(linesOrigCount)
+ " to "
+ str(linesFinalCount)
+ " lines 🤘",
"cyan",
)
)
# Close the output file if it was opened
try:
if args.output is not None:
write(
colored("Output successfully written to file: ", "cyan")
+ colored(args.output, "white")
)
write()
outFile.close()
except Exception as e:
writerr(colored("ERROR processOutput 3: " + str(e), "red"))
except Exception as e:
writerr(colored("ERROR processOutput 1: " + str(e), "red"))
def showOptionsAndConfig():
global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, usingConfigDefaults
try:
write(colored("Selected options and config:", "cyan"))
write(
colored("-i: " + args.input, "magenta")
+ colored(" The input file of URLs to de-clutter.", "white")
)
if args.output is not None:
write(
colored("-o: " + args.output, "magenta")
+ colored(
" The output file that the de-cluttered URL list will be written to.",
"white",
)
)
else:
write(
colored("-o: