[
  {
    "path": ".gitignore",
    "content": "build/\r\ndist/\r\nurless.egg-info\r\n__pycache__\r\ntest.txt"
  },
  {
    "path": "CHANGELOG.md",
    "content": "## Changelog\r\n\r\n- v2.7\r\n\r\n  - New\r\n    - If the `config.yml` file is not found in the expected config directory (e.g. `~/.config/urless/` on Linux or `%APPDATA%/urless/` on Windows), it will be automatically created with default values. This fixes the issue where installing with `pipx` did not create the `config.yml` file.\r\n    - Suppresses the warning about `requests` not being able to import `urllib3`.\r\n\r\n- v2.6\r\n\r\n  - Changed\r\n\r\n    - BUG FIX: Fix the typo `js.ko` to `ja,ko` in `LANGUAGE` within `config.yml` and `DEFAULT_LANGUAGE` within `urless.py`\r\n    - Set `DEFAULT_REMOVE_PARAMS` in `urless.py` and `REMOVE_PARAMS` in `config.yml` to `_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name`. There was a mismatch between the two files. Also, the Google Analytics parameters should be removed by default.\r\n\r\n- v2.5\r\n\r\n  - Changed\r\n\r\n    - Fix the issue of it saying the version is outdated when it is the latest version.\r\n    - Applied black code formatting to `__init__.py`, `setup.py`, and `urless.py` to ensure consistent code style.\r\n\r\n- v2.4\r\n\r\n  - Changed\r\n\r\n    - Various optimizations to improve performance, e.g. Pre-compiled Regular Expressions, Optimized Extension Filtering and Memory-Efficient File Processing.\r\n\r\n- v2.3\r\n\r\n  - Fixed\r\n\r\n    - Remove TTY-gating that silences output in non-TTY environments like Docker, CI, or cron jobs. The --no-banner flag and -o/--output already provide users control over output, so the extra TTY checks only broke non-interactive usage. Thanks to [@tavgar](https://github.com/tavgar) for the fix in [PR #15](https://github.com/xnl-h4ck3r/urless/pull/15).\r\n\r\n- v2.2\r\n\r\n  - New\r\n\r\n    - Add argument `-c`/`--config` to specify a path to a custom `config.yml` file. 
This resolves [Issue 9](https://github.com/xnl-h4ck3r/urless/issues/9).\r\n    - Add argument `-dp`/`--disregard-params`. There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters. This resolves [Issue 11](https://github.com/xnl-h4ck3r/urless/issues/11) and [Issue 12](https://github.com/xnl-h4ck3r/urless/issues/12).\r\n\r\n  - Changed\r\n\r\n    - The description for argument `-khw`/`--keep-human-written` says `By default, any URL with a path part that contains 3 or more dashes (-) are removed` but this will be corrected to `contains more than 3 dashes`.\r\n    - Correct the description for argument `-kym`/`--keep-yyyymm` on the `-h` output and `README.md`. It says `By default, any URL with a path containing 3 /YYYY/MM` but the `3` should be removed.\r\n\r\n- v2.1\r\n\r\n  - New\r\n\r\n    - Add `long_description_content_type` to `setup.py` to upload to PyPi\r\n    - Add `urless` to `PyPi` so can be installed with `pip install urless`\r\n\r\n- v2.0\r\n\r\n  - New\r\n\r\n    - Add `REMOVE_PARAMS` to `config.yml`. This will be a comma separated list of case sensitive parameter names that you want removed completely from URLs. This can be useful to remove cache buster parameters, so will default to `cachebuster,cacheBuster` to show examples.\r\n    - Add arg `-rp`/`--remove-params` which can be used to pass a comma separated list of parameter names to remove from URLs. 
This will override the `REMOVE_PARAMS` list in `config.yml`.\r\n    - Show the current version of the tool in the banner, and whether it is the latest, or outdated.\r\n    - Add arg `--version` to show the current version of the tool.\r\n    - When installing `urless`, if the `config.yml` already exists then it will keep that one and create `config.yml.NEW` in case you need to replace the old config.\r\n\r\n  - Changed\r\n\r\n    - Fix a bug that meant defaults were not set correctly if `config.yml` keys are missing.\r\n\r\n- v1.3\r\n\r\n  - New\r\n\r\n    - Add argument `-fnp`/`--fragment-not-param`. If passed, URL fragments `#` will NOT be treated in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link.\r\n\r\n- v1.2\r\n\r\n  - Changed\r\n\r\n    - Changes to prevent `SyntaxWarning: invalid escape sequence` errors when Python 3.12 is used.\r\n\r\n- v1.1\r\n\r\n  - Changed\r\n\r\n    - Add support to automatically identify file encoding.\r\n\r\n- v1.0\r\n\r\n  - Changed\r\n\r\n    - Add support for quick install using pip or pipx.\r\n\r\n- v0.9\r\n\r\n  - Changed\r\n\r\n    - Add i18n language codes `gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru`\r\n\r\n- v0.8\r\n\r\n  - New\r\n\r\n    - Add `DEFAULT_LANGUAGE` constant and `LANGUAGE` key in `config.yml` with the most common language codes: `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,js.ko`\r\n    - Add `-lang`/`--language` argument. If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. 
The codes are specified in the `LANGUAGE` key of `config.yml`\r\n\r\n  - Changed\r\n\r\n    - A URL can have a GUID, Integer, CustomID and Language Code in the same URL and be de-cluttered properly.\r\n    - If the Custom Regex ID doesn't start with `^` and end in `$`, those will be added.\r\n    - Fix bug where it added the last occurrence of a regex pattern instead of the first.\r\n    - Simplify the code in `processUrl` and `createPattern` functions... I had some strange logic that was unnecessary!\r\n    - Make sure case is ignored when any `FILTER_EXTENSIONS` in `config.yml` or passed with `-fe` are compared with input.\r\n\r\n- v0.7\r\n\r\n  - New\r\n\r\n    - Add `-rcid` / `--regex-custom-id` argument to provide a regular expression for a Custom ID that your target uses.\r\n    - Add `-nb` / `--no-banner` argument to hide the tool banner. This is only needed if you are not piping input to `urless`.\r\n    - Add `-khw` / `--keep-human-written` argument to prevent URLs with a path part that contains 3 or more dashes (-) from being removed (e.g. blog post). These are normally removed by default.\r\n    - Add `-kym` / `--keep-yyyymm` argument to prevent URLs with a path part that contains a year and month in the format `/YYYY/MM` (e.g. blog or news). 
These are normally removed by default.\r\n    - Add `-iq` / `--ignore-querystring` argument to remove the query string (including URL fragments `#`) so output is unique paths only.\r\n\r\n  - Changed\r\n\r\n    - Fix bug where `/blah/1337` was not being treated differently to `/1337` for example.\r\n    - When a Custom ID, GUID or Integer ID is found in a URL, and only one URL from many in the same format is returned in the output, use the first ID found in the input for that ID type.\r\n\r\n- v0.6\r\n\r\n  - New\r\n\r\n    - By default, a trailing `/` will be removed from the end of a URL.\r\n    - Added new argument `-ks`/`--keep-slash` that will ensure any links that do have a trailing slash in the input will not have the slash removed in the output, and therefore there may be identical URLs output, one with and one without a trailing slash.\r\n\r\n- v0.5\r\n\r\n  - Changed\r\n\r\n    - Fixed GitHub Issue #3 to remove port 80 and 443 correctly\r\n\r\n- v0.4\r\n\r\n  - Changed\r\n\r\n    - Various bug fixes\r\n\r\n- v0.3\r\n\r\n  - New\r\n\r\n    - Add an `__init__.py` file to store the version, and move the image to a separate folder to make it cleaner.\r\n\r\n  - Changed\r\n\r\n    - If a line in the input throws an error due to not being a valid URL when parsed, then skip it, but output an error showing the URL if the `-v` arg is passed.\r\n\r\n- v0.2\r\n\r\n  - Fixed the bug `ERROR matchesPatterns 1: missing ), unterminated subpattern at position 237` by escaping the regex string before searching\r\n\r\n- v0.1\r\n\r\n  - Initial release. Please see README.md\r\n"
  },
  {
    "path": "README.md",
    "content": "<center><img src=\"https://github.com/xnl-h4ck3r/urless/blob/main/urless/images/title.png\"></center>\r\n\r\n## About - v2.7\r\n\r\nThis is a tool used to de-clutter a list of URLs.\r\nAs a starting point, I took the amazing tool [uro](https://github.com/s0md3v/uro/) by Somdev Sangwan. But I wanted to change a few things, make some improvements (like deal with GUIDs) and make it more customizable.\r\n\r\n## Installation\r\n\r\n`urless` supports **Python 3**.\r\n\r\nInstall `urless` in default (global) python environment.\r\n\r\n```bash\r\npip install urless\r\n```\r\n\r\nOR\r\n\r\n```bash\r\npip install git+https://github.com/xnl-h4ck3r/urless.git -v\r\n```\r\n\r\nYou can upgrade with\r\n\r\n```bash\r\npip install --upgrade urless\r\n```\r\n\r\n### pipx\r\n\r\nQuick setup in isolated python environment using [pipx](https://pypa.github.io/pipx/)\r\n\r\n```bash\r\npipx install git+https://github.com/xnl-h4ck3r/urless.git\r\n```\r\n\r\n## Usage\r\n\r\n| Argument | Long Argument        | Description                                                                                                                                                                                                                                                                                                                                                                                                     |\r\n| -------- | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\r\n| -i       | --input              | A file of URLs to de-clutter.                                                                          
                                                                                                                                                                                                                                                                                                         |\r\n| -o       | --output             | The output file that will contain the de-cluttered list of URLs (default: output.txt). If piped to another program, output will be written to STDOUT instead.                                                                                                                                                                                                                                                   |\r\n| -fk      | --filter-keywords    | A comma separated list of keywords to exclude links (if there no parameters). This will override the `FILTER_KEYWORDS` list specified in config.yml                                                                                                                                                                                                                                                             |\r\n| -fe      | --filter-extensions  | A comma separated list of file extensions to exclude. This will override the `FILTER_EXTENSIONS` list specified in `config.yml`                                                                                                                                                                                                                                                                                 |\r\n| -rp      | --remove-params      | A comma separated list of **case senistive** parameters to remove from ALL URLs. This will override the `REMOVE_PARAMS` list specified in `config.yml`. 
This can be useful to remove cache buster parameters for example.\\*\\*                                                                                                                                                                                   |\r\n| -ks      | --keep-slash         | A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.                                                                                                                                                                                                                                                     |\r\n| -khw     | --keep-human-written | By default, any URL with a path part that contains more than 3 dashes (-) are removed because it is assumed to be human written content (e.g. blog post), and not interesting. Passing this argument will keep them in the output.                                                                                                                                                                              |\r\n| -kym     | --keep-yyyymm        | By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM month) are removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.                                                                                                                                                                                     |\r\n| -rcid    | --regex-custom-id    | **USE WITH CAUTION!** Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the section below for more details on this.                                                                                                                                                                                                                                          
              |\r\n| -iq      | --ignore-querystring | Remove the query string (including URL fragments `#`) so output is unique paths only.                                                                                                                                                                                                                                                                                                                           |\r\n| -fnp     | --fragment-not-param | Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter word and fragment, the link will be removed. Also, if this arg is passed and `-iq` / `--ignore-querystring` is used, the fragment will NOT be removed from links if no query string is in the link. |\r\n| -lang    | --language           | If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the `LANGUAGE` section of `config.yml`.                                                                                                                                                                                                       |\r\n| -c       | --config             | Path to the YML config file. If not passed, it looks for file `config.yml` in the default config directory, e.g. `~/.config/urless/`.                                                                                                                                                                                                                                                                           |\r\n| -dp      | --disregard-params   | There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. 
If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.                                                                                                                                                                 |\r\n| -nb      | --no-banner          | Hides the tool banner (it is hidden by default if you pipe input to urless) output.                                                                                                                                                                                                                                                                                                                             |\r\n|          | --version            | Show current version number.                                                                                                                                                                                                                                                                                                                                                                                    |\r\n| -v       | --verbose            | Verbose output                                                                                                                                                                                                                                                                                                                                                                                                  |\r\n\r\n## What does it do exactly?\r\n\r\nYou basically pass a list of URLs in (from a file, or pipe from STDIN), and get a de-cluttered file or URLs out. 
But in what way are they de-cluttered?\r\nI'll explain this below, but first here are some terms that will be used:\r\n\r\n- **FILTER-EXTENSIONS**: This refers to the list of extensions that can either be passed with `-fe`, specified with `FILTER_EXTENSIONS` in the `config.yml`, or if neither of those exist, a default list of `.css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image`.\r\n- **FILTER-KEYWORDS**: This refers to the list of keywords that can either be passed with `-fk`, specified with `FILTER_KEYWORDS` in the `config.yml`, or if neither of those exist, a default list of `blog,article,news,bootstrap,jquery,captcha,node_modules`\r\n- **LANGUAGE**: This refers to the list of language codes that can be specified with `LANGUAGE` in the `config.yml`, or if it doesn't exist, a default list of the most common codes `en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru`\r\n- **UNWANTED-CONTENT**:\r\n  - A section of the URL path contains more than 3 dashes (`-`), BUT isn't a GUID. This implies human written content, e.g. `how-to-hack-the-planet`. If arg `-khw` is passed, then this won't be removed.\r\n  - The URL contains `/YYYY/MM/`, i.e. a year and month. This is usually static content such as a blog. If arg `-kym` is passed, then this won't be removed.\r\n\r\nHere's what happens:\r\n\r\n- If a URL has port 80 or 443 explicitly given, then remove it from the URL (e.g. 
http://example.com:80/test -> http://example.com/test)\r\n- If the URL has any **FILTER-EXTENSIONS**, it will be removed from the output.\r\n- If the URL has NO parameters **OR** the `-dp`/`--disregard-params` argument was passed:\r\n  - If the URL contains a **FILTER-KEYWORDS** or **UNWANTED-CONTENT**, it will be removed.\r\n  - If the URL query string contains unwanted parameters specified in config `REMOVE_PARAMS` (or overridden with argument `-rp`/`--remove-params`), they will be removed from all URLs before processing.\r\n  - If `-rcid`/`--regex-custom-id` is passed and the URL path contains a Custom ID, only one match to the Custom ID regex will be included if there are multiple URLs where that is the only difference.\r\n  - If the URL path contains a GUID, only one of the GUIDs will be included if there are multiple URLs where the GUID is the only difference.\r\n  - If the URL path contains an Integer ID, only one of the Integer IDs will be included if there are multiple URLs where the Integer ID is the only difference.\r\n  - If the `-lang` argument is passed and the URL contains a language code (e.g. 
`en-gb`), only one of the language codes will be included if there are multiple URLs where the language code is different.\r\n- Else the URL has Parameters (or a fragment `#`) **AND** the `-dp`/`--disregard-params` argument was NOT passed:\r\n  - If there are multiple URLs with the same parameters, then only URLs with unique parameter values are included.\r\n  - If there are URL's with a Parameter, but no value (or a fragment), then this will be included.\r\n\r\n## Examples\r\n\r\n### Basic use\r\n\r\n```\r\ncat target_urls.txt | urless\r\n```\r\n\r\nor\r\n\r\n```\r\nurless -i target_urls.txt\r\n```\r\n\r\n### Capture output\r\n\r\n```\r\ncat target_urls.txt | urless > output.txt\r\n```\r\n\r\nor\r\n\r\n```\r\nurless -i target_urls.txt -o output.txt\r\n```\r\n\r\n## config.yml\r\n\r\nThe `config.yml` file has the keys which can be updated to suit your needs:\r\n\r\n- `FILTER_KEYWORDS` - A comma separated list of keywords (e.g. `blog,article,news` etc.) that URLs are checked against in certain circumstances.\r\n- `FILTER_EXTENSIONS` - A comma separated list of file extensions (e.g. `.css,.jpg,.jpeg` etc.) that all URLs are checked against. If a URL includes any of the strings then it will be excluded from the output.\r\n- `LANGUAGE` - A comma separated list of language codes (e.g. `en-gb,fr,nl` etc.) that all URLs are checked against when the `-lang` argument is passed. If there are multiple URLs with different language codes, only one version of the URL will be output.\r\n- `REMOVE_PARAMS` - A comma separated list of **case sensitive** parameter names (e.g. `cachebuster,cacheBuster`) that will be removed from all URLs before processing.\r\n\r\n## Custom Regex\r\n\r\nThere are currently automatic regex checks for a path part being a Globally Unique ID (GUID) and an Integer ID, but the `-rcid` / `--regex-custom-id` argument lets you provide a regular expression to identify a custom ID. 
For example, if a target has a specific ID format (that isn't a GUID or Integer) then you can specify a regex expression for it, and then only one of those will be returned in the output if the rest of the URL is the same. For example:\r\n\r\n- Assume the target has a user ID in a format like `U-65241X`\r\n- And there are multiple URLs like the following:\r\n  ```\r\n  https://target.com/blah/U-61723A/settings\r\n  https://target.com/blah/U-63352B/settings\r\n  https://target.com/blah/U-61351A/profile\r\n  https://target.com/blah/U-61723A/settings\r\n  https://target.com/blah/U-64135C/profile\r\n  ```\r\n- You can call `urless` and pass `-rcid 'U-[0-9]{5}[A-Z]'`, then the output would be:\r\n  ```\r\n  https://target.com/blah/U-61723A/settings\r\n  https://target.com/blah/U-64135C/profile\r\n  ```\r\n\r\n**IMPORTANT REGEX NOTES:**\r\n\r\n- Writing correct regex expressions can be difficult, and if it isn't correct, you could end up with unpredictable and incorrect output.\r\n- Always enclose your regex expression in single quotes when passing to the `-rcid` argument.\r\n- You don't need to add a custom regex for a GUID or Integer ID - these are dealt with already.\r\n- The regex expression should highlight the whole part of the path. So, if your regex only identifies the start of the path, then add `[^(\\?|\\/|#|$)]*` to the end of your regex which will mean ALL other characters up until the end of the path part.\r\n- You can add `^` at the start, and `$` at the end, of your regex to ensure it represents the whole part of a path between slashes. However, these will be added for you if they are left out.\r\n- Make sure the regex only identifies the sections you are interested in, otherwise you may have unexpected results. 
To test your regex, you can take your input file and do `cat input.txt | grep -E 'U-[0-9]{5}[A-Z]'` for example, and see whether your expression looks correct (it should only highlight what you are interested in, and highlight the whole part of the path that is the custom ID).\r\n- You can also test using [Regex101](https://regex101.com), entering sample URLs in the **TEST STRING** section to check if it is correct. Make sure the **REGEX FLAGS** **g**lobal and **m**ultiline are selected.\r\n- There may be cases where you just can't supply a regex that is going to identify the Custom ID correctly without treating other values as the same. For example, if there are URLs like `https://target.com/blah/xnl/settings` where `xnl` is a User Name, you won't be able to create a regex for user name because it is not a unique enough format to distinguish it from other possible path values.\r\n\r\n## Issues\r\n\r\nIf you come across any problems at all, or have ideas for improvements, please feel free to raise an issue on GitHub. If there is a problem, it will be useful if you can provide the exact command you ran and a detailed description of the problem. If possible, run with `-v` to reproduce the problem and let me know about any error messages that are given.\r\n\r\n## TODO\r\n\r\nNone - feel free to raise a GitHub issue to suggest any enhancements.\r\n\r\n## And finally...\r\n\r\nGood luck and good hunting!\r\nIf you really love the tool (or any others), or they helped you find an awesome bounty, consider [BUYING ME A COFFEE!](https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)\r\n\r\n🤘 /XNL-h4ck3r\r\n\r\n<p>\r\n<a href='https://ko-fi.com/B0B3CZKR5' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi2.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>\r\n"
  },
  {
    "path": "config.yml",
    "content": "FILTER_KEYWORDS: blog,article,news,bootstrap,jquery,captcha,node_modules\r\nFILTER_EXTENSIONS: .css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image\r\nLANGUAGE: en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru\r\nREMOVE_PARAMS: _,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name\r\n"
  },
  {
    "path": "setup.py",
    "content": "#!/usr/bin/env python\r\nimport os\r\nimport shutil\r\nfrom setuptools import setup, find_packages\r\n\r\n# Define the target directory for the config.yml file\r\ntarget_directory = (\r\n    os.path.join(os.getenv(\"APPDATA\", \"\"), \"urless\")\r\n    if os.name == \"nt\"\r\n    else (\r\n        os.path.join(os.path.expanduser(\"~\"), \".config\", \"urless\")\r\n        if os.name == \"posix\"\r\n        else (\r\n            os.path.join(\r\n                os.path.expanduser(\"~\"), \"Library\", \"Application Support\", \"urless\"\r\n            )\r\n            if os.name == \"darwin\"\r\n            else None\r\n        )\r\n    )\r\n)\r\n\r\n# Copy the config.yml file to the target directory if it exists\r\nconfigNew = False\r\nif target_directory and os.path.isfile(\"config.yml\"):\r\n    os.makedirs(target_directory, exist_ok=True)\r\n    # If file already exists, create a new one\r\n    if os.path.isfile(target_directory + \"/config.yml\"):\r\n        configNew = True\r\n        os.rename(\r\n            target_directory + \"/config.yml\", target_directory + \"/config.yml.OLD\"\r\n        )\r\n        shutil.copy(\"config.yml\", target_directory)\r\n        os.rename(\r\n            target_directory + \"/config.yml\", target_directory + \"/config.yml.NEW\"\r\n        )\r\n        os.rename(\r\n            target_directory + \"/config.yml.OLD\", target_directory + \"/config.yml\"\r\n        )\r\n    else:\r\n        shutil.copy(\"config.yml\", target_directory)\r\n\r\nsetup(\r\n    name=\"urless\",\r\n    packages=find_packages(),\r\n    version=__import__(\"urless\").__version__,\r\n    description=\"De-clutter a list of URLs\",\r\n    long_description=open(\"README.md\").read(),\r\n    long_description_content_type=\"text/markdown\",\r\n    author=\"@xnl-h4ck3r\",\r\n    url=\"https://github.com/xnl-h4ck3r/urless\",\r\n    zip_safe=False,\r\n    install_requires=[\r\n        \"argparse\",\r\n        \"pyyaml\",\r\n        
\"termcolor\",\r\n        \"urlparse3\",\r\n        \"chardet\",\r\n        \"requests\",\r\n    ],\r\n    entry_points={\r\n        \"console_scripts\": [\r\n            \"urless = urless.urless:main\",\r\n        ],\r\n    },\r\n)\r\n\r\nif configNew:\r\n    print(\r\n        \"\\n\\033[33mIMPORTANT: The file \"\r\n        + target_directory\r\n        + \"/config.yml already exists.\\nCreating config.yml.NEW but leaving existing config.\\nIf you need the new file, then remove the current one and rename config.yml.NEW to config.yml\\n\\033[0m\"\r\n    )\r\nelse:\r\n    print(\r\n        \"\\n\\033[92mThe file \"\r\n        + target_directory\r\n        + \"/config.yml has been created.\\n\\033[0m\"\r\n    )\r\n"
  },
  {
    "path": "urless/__init__.py",
    "content": "__version__ = \"2.7\"\n"
  },
  {
    "path": "urless/urless.py",
    "content": "#!/usr/bin/env python\n# Python 3\n# urless - by @Xnl-h4ck3r: De-clutter a list of URLs\n# Full help here: https://github.com/xnl-h4ck3r/urless/blob/main/README.md\n# Good luck and good hunting! If you really love the tool (or any others), or they helped you find an awesome bounty, consider BUYING ME A COFFEE! (https://ko-fi.com/xnlh4ck3r) ☕ (I could use the caffeine!)\n\n\nimport re\nimport os\nimport sys\nfrom typing import Pattern\nimport yaml\nimport argparse\nimport chardet\nfrom signal import SIGINT, signal\nfrom urllib.parse import urlparse\nfrom termcolor import colored\nfrom pathlib import Path\n\ntry:\n    from . import __version__\n    import warnings\n\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"ignore\")\n        import requests\nexcept Exception:\n    pass\n\n# Default values if config.yml not found\nDEFAULT_FILTER_EXTENSIONS = \".css,.ico,.jpg,.jpeg,.png,.bmp,.svg,.img,.gif,.mp4,.flv,.ogv,.webm,.webp,.mov,.mp3,.m4a,.m4p,.scss,.tif,.tiff,.ttf,.otf,.woff,.woff2,.bmp,.ico,.eot,.htc,.rtf,.swf,.image\"\nDEFAULT_FILTER_KEYWORDS = \"blog,article,news,bootstrap,jquery,captcha,node_modules\"\nDEFAULT_LANGUAGE = \"en,en-us,en-gb,fr,de,pl,nl,fi,sv,it,es,pt,ru,pt-br,es-mx,zh-tw,ja,ko,gb-en,ca-en,au-en,fr-fr,ca-fr,es-es,mx-es,de-de,it-it,br-pt,pt-pt,jp-ja,cn-zh,tw-zh,kr-ko,sa-ar,in-hi,ru-ru\"\nDEFAULT_REMOVE_PARAMS = \"_,cachebuster,cacheBuster,utm_source,utm_medium,utm_campaign,utm_content,utm_term,utm_adgroup,utm_custom,utm_name\"\n\n# Variables to hold config.yml values\nFILTER_EXTENSIONS = \"\"\nFILTER_KEYWORDS = \"\"\nLANGUAGE = \"\"\nREMOVE_PARAMS = \"\"\nreFilterKeywords = \"\"\nbadExtensions = ()\n\n\n# Regex delimiters\nREGEX_START = \"^\"\nREGEX_END = \"$\"\n\n# Regex for a path folder of integer\nREGEX_INTEGER = REGEX_START + r\"\\d+\" + REGEX_END\nreIntPart = re.compile(REGEX_INTEGER)\npatternsInt = {}\n\n# Regex for a path folder of GUID\nREGEX_GUID = (\n    REGEX_START\n    + 
\"[({]?[a-fA-F0-9]{8}[-]?([a-fA-F0-9]{4}[-]?){3}[a-fA-F0-9]{12}[})]?\"\n    + REGEX_END\n)\nreGuidPart = re.compile(REGEX_GUID)\npatternsGUID = {}\n\n# Regex fields for Custom ID\nreCustomIDPart = Pattern\npatternsCustomID = {}\n\n# Regex for path of YYYY/MM\nREGEX_YYYYMM = r\"\\/[1|2][0|1|9]\\d{2}/[0|1]\\d{1}\\/\"\nreYYYYMM = re.compile(REGEX_YYYYMM)\n\n# Regex for path of language code\nreLangPart = Pattern\npatternsLang = {}\n\n# Global variables\nargs = None\nurlmap = {}\npatternsSeen = []\noutFile = None\nlinesOrigCount = 0\nlinesFinalCount = 0\nusingConfigDefaults = False\n\n\ndef verbose():\n    \"\"\"\n    Function used when printing messages dependent on the verbose option\n    \"\"\"\n    return args.verbose\n\n\ndef write(text=\"\"):\n    \"\"\"\n    Always print one line to stdout.\n    The --no-banner flag and -o/--output already give users\n    control over noise and redirection, so extra TTY checks only\n    break non-interactive usage (Docker, CI, cron).\n    \"\"\"\n    sys.stdout.write(text + \"\\n\")\n\n\ndef writerr(text=\"\"):\n    \"\"\"\n    Always print one line to stderr.\n    \"\"\"\n    sys.stderr.write(text + \"\\n\")\n\n\ndef showVersion():\n    try:\n        try:\n            resp = requests.get(\n                \"https://raw.githubusercontent.com/xnl-h4ck3r/urless/main/urless/__init__.py\",\n                timeout=3,\n            )\n        except Exception:\n            write(\n                \"Current urless version \"\n                + __version__\n                + \" (unable to check if latest)\\n\"\n            )\n            return\n        if __version__ == resp.text.split(\"=\")[1].replace('\"', \"\").strip():\n            write(\n                \"Current urless version \"\n                + __version__\n                + \" (\"\n                + colored(\"latest\", \"green\")\n                + \")\\n\"\n            )\n        else:\n            write(\n                \"Current urless version \"\n                + __version__\n          
      + \" (\"\n                + colored(\"outdated\", \"red\")\n                + \")\\n\"\n            )\n    except Exception:\n        pass\n\n\ndef showBanner():\n    write(\"\")\n    write(colored(r\"  __  _ ____  _   ___  ___ ____ \", \"red\"))\n    write(colored(r\" | | | |  _ \\| | / _ \\/ __/ __/ \", \"yellow\"))\n    write(colored(r\" | | | | |_) | ||  __/\\__ \\__ \\ \", \"green\"))\n    write(colored(r\" | |_| |  _ <| |_\\___/\\___/___/ \", \"cyan\"))\n    write(colored(r\"  \\___/|_| \\_\\___/\", \"magenta\") + colored(\"by Xnl-h4ck3r\", \"white\"))\n    write(\"\")\n    showVersion()\n\n\ndef getConfig():\n    \"\"\"\n    Try to get the values from the config file, otherwise use the defaults\n    \"\"\"\n    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, reLangPart, usingConfigDefaults, reFilterKeywords, badExtensions\n    try:\n\n        # Try to get the config file values\n        try:\n            # Put config in global location based on the OS.\n            urlessPath = (\n                Path(os.path.join(os.getenv(\"APPDATA\", \"\"), \"urless\"))\n                if os.name == \"nt\"\n                else (\n                    Path(os.path.join(os.path.expanduser(\"~\"), \".config\", \"urless\"))\n                    if os.name == \"posix\"\n                    else (\n                        Path(\n                            os.path.join(\n                                os.path.expanduser(\"~\"),\n                                \"Library\",\n                                \"Application Support\",\n                                \"urless\",\n                            )\n                        )\n                        if os.name == \"darwin\"\n                        else None\n                    )\n                )\n            )\n\n            urlessPath.absolute\n            if args.config is None:\n                if urlessPath == \"\":\n                    configPath = \"config.yml\"\n                
else:\n                    configPath = Path(urlessPath / \"config.yml\")\n            else:\n                configPath = Path(args.config)\n            config = yaml.safe_load(open(configPath))\n\n            # If the user provided the --filter_keywords argument then it overrides the config value\n            if args.filter_keywords:\n                FILTER_KEYWORDS = args.filter_keywords\n            else:\n                try:\n                    FILTER_KEYWORDS = config.get(\"FILTER_KEYWORDS\")\n                    if str(FILTER_KEYWORDS) == \"None\":\n                        writerr(\n                            colored(\n                                \"No value for FILTER_KEYWORDS in config.yml - default set\",\n                                \"yellow\",\n                            )\n                        )\n                        FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS\n                except Exception:\n                    writerr(\n                        colored(\n                            \"Unable to read FILTER_KEYWORDS from config.yml - default set\",\n                            \"red\",\n                        )\n                    )\n                    FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS\n            reFilterKeywords = re.compile(\n                FILTER_KEYWORDS.replace(\",\", \"|\"), re.IGNORECASE\n            )\n\n            # If the user provided the --filter-extensions argument then it overrides the config value\n            if args.filter_extensions:\n                FILTER_EXTENSIONS = args.filter_extensions\n            else:\n                try:\n                    FILTER_EXTENSIONS = config.get(\"FILTER_EXTENSIONS\")\n                    if str(FILTER_EXTENSIONS) == \"None\":\n                        writerr(\n                            colored(\n                                \"No value for FILTER_EXTENSIONS in config.yml - default set\",\n                                \"yellow\",\n                        
    )\n                        )\n                        FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS\n                except Exception:\n                    writerr(\n                        colored(\n                            \"Unable to read FILTER_EXTENSIONS from config.yml - default set\",\n                            \"red\",\n                        )\n                    )\n                    FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS\n            badExtensions = tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(\",\"))\n\n            # If the user provided the --language argument then create the regex for language codes\n            if args.language:\n                # Get the language codes\n                try:\n                    LANGUAGE = config.get(\"LANGUAGE\")\n                    if str(LANGUAGE) == \"None\":\n                        writerr(\n                            colored(\n                                \"No value for LANGUAGE in config.yml - default set\",\n                                \"yellow\",\n                            )\n                        )\n                        LANGUAGE = DEFAULT_LANGUAGE\n                except Exception:\n                    writerr(\n                        colored(\n                            \"Unable to read LANGUAGE from config.yml - default set\",\n                            \"red\",\n                        )\n                    )\n                    LANGUAGE = DEFAULT_LANGUAGE\n                # Set the language regex\n                try:\n                    reLangPart = re.compile(\n                        REGEX_START + \"(\" + LANGUAGE.replace(\",\", \"|\") + \")\" + REGEX_END\n                    )\n                except Exception as e:\n                    writerr(colored(\"ERROR getConfig 2: \" + str(e), \"red\"))\n\n            # If the user provided the --remove-params argument then it overrides the config value\n            if args.remove_params:\n                
REMOVE_PARAMS = args.remove_params\n            else:\n                try:\n                    REMOVE_PARAMS = config.get(\"REMOVE_PARAMS\")\n                    if str(REMOVE_PARAMS) == \"None\":\n                        if verbose():\n                            writerr(\n                                colored(\n                                    \"No value for REMOVE_PARAMS in config.yml - default set\",\n                                    \"yellow\",\n                                )\n                            )\n                        REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS\n                except Exception:\n                    if verbose():\n                        writerr(\n                            colored(\n                                \"Unable to read REMOVE_PARAMS from config.yml - default set\",\n                                \"red\",\n                            )\n                        )\n                    REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS\n\n        except Exception:\n            if args.config is None:\n                writerr(\n                    colored(\n                        'WARNING: Cannot find file \"config.yml\", so using default values',\n                        \"yellow\",\n                    )\n                )\n            else:\n                writerr(\n                    colored(\n                        'WARNING: Cannot find file \"'\n                        + args.config\n                        + '\", so using default values',\n                        \"yellow\",\n                    )\n                )\n            usingConfigDefaults = True\n            FILTER_EXTENSIONS = DEFAULT_FILTER_EXTENSIONS\n            FILTER_KEYWORDS = DEFAULT_FILTER_KEYWORDS\n            LANGUAGE = DEFAULT_LANGUAGE\n            REMOVE_PARAMS = DEFAULT_REMOVE_PARAMS\n            reFilterKeywords = re.compile(\n                FILTER_KEYWORDS.replace(\",\", \"|\"), re.IGNORECASE\n            )\n            badExtensions = 
tuple(ext.lower() for ext in FILTER_EXTENSIONS.split(\",\"))\n\n    except Exception as e:\n        writerr(colored(\"ERROR getConfig 1: \" + str(e), \"red\"))\n\n\ndef ensureConfig():\n    \"\"\"\n    Ensure the config.yml file exists in the default config directory.\n    If not, create the directory and write the default config.\n    This is called before argument parsing so the file is created\n    even when running 'urless' or 'urless -h'.\n    \"\"\"\n    try:\n        # Determine the config directory based on OS\n        if os.name == \"nt\":\n            urlessPath = Path(os.path.join(os.getenv(\"APPDATA\", \"\"), \"urless\"))\n        elif os.name == \"posix\":\n            urlessPath = Path(\n                os.path.join(os.path.expanduser(\"~\"), \".config\", \"urless\")\n            )\n        else:\n            urlessPath = Path(\n                os.path.join(\n                    os.path.expanduser(\"~\"),\n                    \"Library\",\n                    \"Application Support\",\n                    \"urless\",\n                )\n            )\n\n        configPath = urlessPath / \"config.yml\"\n\n        # If the config file doesn't exist, create it with default values\n        if not configPath.exists():\n            try:\n                urlessPath.mkdir(parents=True, exist_ok=True)\n                with open(configPath, \"w\") as f:\n                    f.write(f\"FILTER_KEYWORDS: {DEFAULT_FILTER_KEYWORDS}\\n\")\n                    f.write(f\"FILTER_EXTENSIONS: {DEFAULT_FILTER_EXTENSIONS}\\n\")\n                    f.write(f\"LANGUAGE: {DEFAULT_LANGUAGE}\\n\")\n                    f.write(f\"REMOVE_PARAMS: {DEFAULT_REMOVE_PARAMS}\\n\")\n            except Exception as e:\n                writerr(\n                    colored(\"WARNING: Could not create config.yml: \" + str(e), \"yellow\")\n                )\n    except Exception as e:\n        writerr(colored(\"ERROR ensureConfig: \" + str(e), \"red\"))\n\n\ndef handler(signal_received, 
frame):\n    \"\"\"\n    This function is called if Ctrl-C is pressed by the user\n    An attempt will be made to try and clean up properly\n    \"\"\"\n    writerr(colored('>>> \"Oh my God, they killed Kenny... and urless!\" - Kyle', \"red\"))\n    sys.exit()\n\n\ndef paramsToDict(params: str) -> dict:\n    \"\"\"\n    converts query string to dict\n    \"\"\"\n    try:\n        the_dict = {}\n        if params:\n            for pair in params.split(\"&\"):\n                # If there is a parameter but no = then add a value of {EMPTY}\n                if pair.find(\"=\") < 0:\n                    key = pair + \"{EMPTY}\"\n                    the_dict[key] = \"{EMPTY}\"\n                else:\n                    # Split on the first = only, so values that contain = (e.g. base64 padding) are kept whole\n                    parts = pair.split(\"=\", 1)\n                    try:\n                        the_dict[parts[0]] = parts[1]\n                    except IndexError:\n                        pass\n        return the_dict\n    except Exception as e:\n        writerr(colored(\"ERROR paramsToDict 1: \" + str(e), \"red\"))\n\n\ndef dictToParams(params: dict) -> str:\n    \"\"\"\n    converts dict of params to query string\n    \"\"\"\n    try:\n        # If a parameter has a value of {EMPTY} then just the name will be written and no =\n        stringed = [\n            name if value == \"{EMPTY}\" else name + \"=\" + value\n            for name, value in params.items()\n        ]\n\n        # Only add a ? 
at the start of parameters, unless the first starts with #\n        if list(params.keys())[0][:1] == \"#\":\n            paramString = \"\".join(stringed)\n        else:\n            paramString = \"?\" + \"&\".join(stringed)\n\n        # If there are any parameters with {EMPTY} in the name then remove the string\n        return paramString.replace(\"{EMPTY}\", \"\")\n    except Exception as e:\n        writerr(colored(\"ERROR dictToParams 1: \" + str(e), \"red\"))\n\n\ndef compareParams(currentParams: list, newParams: dict) -> bool:\n    \"\"\"\n    checks if newParams contain a param\n    that doesn't exist in currentParams\n    \"\"\"\n    try:\n        ogSet = set()\n        for each in currentParams:\n            for key in each.keys():\n                ogSet.add(key)\n        return bool(set(newParams.keys()) - ogSet)\n    except Exception as e:\n        writerr(colored(\"ERROR compareParams 1: \" + str(e), \"red\"))\n\n\ndef isUnwantedContent(path: str) -> bool:\n    \"\"\"\n    Checks any potentially unwanted patterns (unless specified otherwise) such as blog/news content\n    \"\"\"\n    try:\n        unwanted = False\n\n        if not args.keep_human_written:\n            # If the path has more than 3 dashes '-' AND isn't a GUID AND (if specified) isn't a Custom ID, then assume it's human written content, e.g. blog\n            for part in path.split(\"/\"):\n                if part.count(\"-\") > 3:\n                    if str(reCustomIDPart.pattern) == \"\":\n                        # No Custom ID regex was given, so only the GUID check applies\n                        if not reGuidPart.search(part):\n                            unwanted = True\n                    else:\n                        if not reGuidPart.search(part) and not reCustomIDPart.search(part):\n                            unwanted = True\n\n        if not args.keep_yyyymm:\n            # If it contains a year and month in the path then assume it is blog/news content, e.g. 
.../2019/06/...\n            if reYYYYMM.search(path):\n                unwanted = True\n\n        return unwanted\n    except Exception as e:\n        writerr(colored(\"ERROR isUnwantedContent 1: \" + str(e), \"red\"))\n\n\ndef createPattern(path: str) -> str:\n    \"\"\"\n    creates patterns for urls with integers or GUIDs in them\n    \"\"\"\n    global patternsGUID, patternsInt, patternsCustomID, patternsLang\n    try:\n        newParts = []\n\n        regexInt = False\n        regexGUID = False\n        regexCustom = False\n        regexLang = False\n        for part in path.split(\"/\"):\n            if part == \"\":\n                newParts.append(part)\n            elif str(reCustomIDPart.pattern) != \"\" and reCustomIDPart.search(part):\n                regexCustom = True\n                newParts.append(reCustomIDPart.pattern)\n            elif reGuidPart.search(part):\n                regexGUID = True\n                newParts.append(reGuidPart.pattern)\n            elif reIntPart.match(part):\n                regexInt = True\n                newParts.append(reIntPart.pattern)\n            elif args.language and reLangPart.match(part.lower()):\n                regexLang = True\n                newParts.append(reLangPart.pattern)\n            else:\n                newParts.append(part)\n        createdPattern = \"/\".join(newParts)\n\n        # Depending on the type of regex, add the found pattern to the dictionary if it hasn't been added already\n        if regexCustom and createdPattern not in patternsCustomID:\n            patternsCustomID[createdPattern] = path\n        elif regexGUID and createdPattern not in patternsGUID:\n            patternsGUID[createdPattern] = path\n        elif regexInt and createdPattern not in patternsInt:\n            patternsInt[createdPattern] = path\n        elif regexLang and createdPattern not in patternsLang:\n            patternsLang[createdPattern] = path\n\n        return createdPattern\n    except Exception as 
e:\n        writerr(colored(\"ERROR createPattern 1: \" + str(e), \"red\"))\n\n\ndef patternExists(pattern: str) -> bool:\n    \"\"\"\n    Checks if a pattern exists\n    \"\"\"\n    try:\n        for i, seen_pattern in enumerate(patternsSeen):\n            if pattern == seen_pattern:\n                patternsSeen[i] = pattern\n                return True\n            elif seen_pattern in pattern:\n                return True\n        return False\n    except Exception as e:\n        writerr(colored(\"ERROR patternExists 1: \" + str(e), \"red\"))\n\n\ndef matchesPatterns(path: str) -> bool:\n    \"\"\"\n    checks if the url matches any of the regex patterns\n    \"\"\"\n    try:\n        for pattern in patternsSeen:\n            if re.search(pattern, re.escape(path)) is not None:\n                return True\n        return False\n    except Exception as e:\n        writerr(colored(\"ERROR matchesPatterns 1: \" + str(e), \"red\"))\n\n\ndef hasFilterKeyword(path: str) -> bool:\n    \"\"\"\n    checks if the url matches the blacklist regex\n    \"\"\"\n    global reFilterKeywords\n    try:\n        return reFilterKeywords.search(path)\n    except Exception as e:\n        writerr(colored(\"ERROR hasFilterKeyword 1: \" + str(e), \"red\"))\n\n\ndef hasBadExtension(path: str) -> bool:\n    \"\"\"\n    checks if a url has a blacklisted extension\n    \"\"\"\n    global badExtensions\n    try:\n        return path.lower().endswith(badExtensions)\n    except Exception as e:\n        writerr(colored(\"ERROR hasBadExtension 1: \" + str(e), \"red\"))\n\n\ndef removeParameters(params) -> dict:\n    \"\"\"\n    Removes any parameters from the parameter dictionary\n    \"\"\"\n    global REMOVE_PARAMS\n    try:\n        # For every parameter name in the REMOVE_PARAMS list, remove from the dictionary passed\n        for param in REMOVE_PARAMS.split(\",\"):\n            if param in params:\n                del params[param]\n        return params\n    except Exception as e:\n      
  writerr(colored(\"ERROR removeParameters 1: \" + str(e), \"red\"))\n\n\ndef processUrl(line):\n\n    try:\n        parsed = urlparse(line.strip())\n\n        # Set the host\n        scheme = parsed.scheme\n        if scheme == \"\":\n            host = parsed.netloc\n        else:\n            host = scheme + \"://\" + parsed.netloc\n\n        # If the link specifies port 80 or 443, e.g. http://example.com:80, then remove the port\n        if str(parsed.port) == \"80\":\n            host = host.replace(\":80\", \"\", 1)\n        if str(parsed.port) == \"443\":\n            host = host.replace(\":443\", \"\", 1)\n\n        # Build the path and parameters\n        path, params = parsed.path, paramsToDict(parsed.query)\n\n        # Remove any necessary parameters\n        params = removeParameters(params)\n\n        # If there is a fragment...\n        #   if arg -fnp / --fragment-not-param was passed, change the path to include the hash,\n        #   else, add as the last parameter with a name but with value {EMPTY} that doesn't add an = afterwards\n        if parsed.fragment:\n            if args.fragment_not_param:\n                path = path + \"#\" + parsed.fragment\n            else:\n                params[\"#\" + parsed.fragment] = \"{EMPTY}\"\n\n        # Add the host to the map if it hasn't already been seen\n        if host not in urlmap:\n            urlmap[host] = {}\n\n        # If the path has an extension we want to exclude, then just return to continue with the next line\n        if hasBadExtension(path):\n            return\n\n        # If there are no parameters (or the --disregard-params argument was passed) and path isn't empty\n        if (not params or args.disregard_params) and path != \"\":\n\n            # If its unwanted content or has a keyword to be excluded, then just return to continue with the next line\n            if isUnwantedContent(path) or hasFilterKeyword(path):\n                return\n\n            # If the current path 
already matches a previously saved pattern then just return to continue with the next line\n            if matchesPatterns(path):\n                return\n\n        # If the path has ++ in it for any reason, then just output \"as is\" otherwise it will raise a regex Multiple Repeat Error\n        if path.find(\"++\") > 0:\n            pattern = path\n        else:\n            # Create a pattern for the current path\n            pattern = createPattern(path)\n\n        # Update the url map\n        if pattern not in urlmap[host]:\n            urlmap[host][pattern] = [params] if params else []\n        elif params and compareParams(urlmap[host][pattern], params):\n            urlmap[host][pattern].append(params)\n\n    except ValueError:\n        if verbose():\n            writerr(\n                colored(\n                    \"This URL caused a Value Error and was not included: \" + line, \"red\"\n                )\n            )\n    except Exception as e:\n        writerr(colored(\"ERROR processUrl 1: \" + str(e), \"red\"))\n\n\ndef processLine(line):\n    \"\"\"\n    Process a line from the input based on whether the -ks / --keep-slash argument was passed\n    \"\"\"\n    # If the -ks / --keep-slash argument was passed, then just add all URLs,\n    # else remove the trailing slash from any URLs (before any query string)\n    if args.keep_slash:\n        line = line.rstrip(\"\\n\")\n    else:\n        if line.find(\"/?\") > 0:\n            line = line.replace(\"/?\", \"?\", 1)\n        else:\n            line = line.rstrip(\"\\n\").rstrip(\"/\")\n\n    # If the -iq / --ignore-querystring argument was passed, remove any querystring and fragment (unless -fnp is passed, in which case the fragment is only removed if a query string exists too)\n    if args.ignore_querystring:\n        if args.fragment_not_param:\n            line = line.split(\"?\")[0]\n        else:\n            line = line.split(\"?\")[0].split(\"#\")[0]\n    return line\n\n\ndef processInput():\n 
   global linesOrigCount\n    try:\n        if not sys.stdin.isatty():\n            for line in sys.stdin:\n                processUrl(processLine(line))\n        else:\n            with open(os.path.expanduser(args.input), \"rb\") as f:\n                result = chardet.detect(f.read())  # or readline if the file is large\n\n            try:\n                linesOrigCount = 0\n                with open(\n                    os.path.expanduser(args.input), \"r\", encoding=result[\"encoding\"]\n                ) as inFile:\n                    for line in inFile:\n                        linesOrigCount += 1\n                        processUrl(processLine(line))\n            except Exception as e:\n                writerr(colored(\"ERROR processInput 2 \" + str(e), \"red\"))\n    except Exception as e:\n        writerr(colored(\"ERROR processInput 1: \" + str(e), \"red\"))\n\n\ndef processOutput():\n    global linesFinalCount, linesOrigCount, patternsGUID, patternsInt, patternsCustomID, patternsLang\n    try:\n        # If an output file was specified, open it\n        if args.output is not None:\n            try:\n                outFile = open(os.path.expanduser(args.output), \"w\")\n            except Exception as e:\n                writerr(colored(\"ERROR processOutput 2 \" + str(e), \"red\"))\n\n        # Output all URLs\n        for host, value in urlmap.items():\n            for path, params in value.items():\n\n                # Replace the regex pattern in the path with the first occurrence of that pattern found\n                try:\n                    customRegexFound = False\n                    if (\n                        str(reCustomIDPart.pattern) != \"\"\n                        and path.find(str(reCustomIDPart.pattern)) > 0\n                    ):\n                        for pattern in patternsCustomID:\n                            if pattern == path:\n                                path = patternsCustomID[pattern]\n                            
    customRegexFound = True\n                    if not customRegexFound:\n                        if path.find(REGEX_GUID) > 0:\n                            for pattern in patternsGUID:\n                                if pattern == path:\n                                    path = patternsGUID[pattern]\n                        elif path.find(REGEX_INTEGER) > 0:\n                            for pattern in patternsInt:\n                                if pattern == path:\n                                    path = patternsInt[pattern]\n                        elif path.find(str(reLangPart.pattern)) > 0:\n                            for pattern in patternsLang:\n                                if pattern == path:\n                                    path = patternsLang[pattern]\n                except Exception as e:\n                    writerr(colored(\"ERROR processOutput 4: \" + str(e), \"red\"))\n\n                if params:\n                    for param in params:\n                        linesFinalCount = linesFinalCount + 1\n                        # If an output file was specified, write to the file\n                        if args.output is not None:\n                            outFile.write(host + path + dictToParams(param) + \"\\n\")\n                        else:\n                            # If output is piped or the --output argument was not specified, output to STDOUT\n                            if not sys.stdin.isatty() or args.output is None:\n                                write(host + path + dictToParams(param))\n                else:\n                    linesFinalCount = linesFinalCount + 1\n                    # If an output file was specified, write to the file\n                    if args.output is not None:\n                        outFile.write(host + path + \"\\n\")\n                    else:\n                        # If output is piped or the --output argument was not specified, output to STDOUT\n                        if not 
sys.stdin.isatty() or args.output is None:\n                            write(host + path)\n\n        if verbose() and sys.stdin.isatty():\n            writerr(\n                colored(\n                    \"\\nInput reduced from \"\n                    + str(linesOrigCount)\n                    + \" to \"\n                    + str(linesFinalCount)\n                    + \" lines 🤘\",\n                    \"cyan\",\n                )\n            )\n\n        # Close the output file if it was opened\n        try:\n            if args.output is not None:\n                write(\n                    colored(\"Output successfully written to file: \", \"cyan\")\n                    + colored(args.output, \"white\")\n                )\n                write()\n                outFile.close()\n        except Exception as e:\n            writerr(colored(\"ERROR processOutput 3: \" + str(e), \"red\"))\n\n    except Exception as e:\n        writerr(colored(\"ERROR processOutput 1: \" + str(e), \"red\"))\n\n\ndef showOptionsAndConfig():\n    global FILTER_EXTENSIONS, FILTER_KEYWORDS, LANGUAGE, REMOVE_PARAMS, usingConfigDefaults\n    try:\n        write(colored(\"Selected options and config:\", \"cyan\"))\n        write(\n            colored(\"-i: \" + args.input, \"magenta\")\n            + colored(\" The input file of URLs to de-clutter.\", \"white\")\n        )\n        if args.output is not None:\n            write(\n                colored(\"-o: \" + args.output, \"magenta\")\n                + colored(\n                    \" The output file that the de-cluttered URL list will be written to.\",\n                    \"white\",\n                )\n            )\n        else:\n            write(\n                colored(\"-o: <STDOUT>\", \"magenta\")\n                + colored(\n                    \" An output file wasn't given, so output will be written to STDOUT.\",\n                    \"white\",\n                )\n            )\n\n        if 
args.disregard_params:\n            write(\n                colored(\"-dp: True\", \"magenta\")\n                + colored(\n                    \" When filtering the URLs, they will not be treated differently just because they have parameters.\",\n                    \"white\",\n                )\n            )\n\n        if args.config:\n            if usingConfigDefaults:\n                write(\n                    colored(\"-config: \" + args.config, \"magenta\")\n                    + colored(\" The path of the YML config file.\", \"white\")\n                    + colored(\" WARNING: Not found, so using default values.\", \"yellow\")\n                )\n            else:\n                write(\n                    colored(\"-config: \" + args.config, \"magenta\")\n                    + colored(\" The path of the YML config file.\", \"white\")\n                )\n\n        if args.filter_keywords:\n            write(\n                colored(\"-fk (Keywords to Filter): \", \"magenta\")\n                + colored(args.filter_keywords, \"white\")\n            )\n        else:\n            write(\n                colored(\"Filter Keywords (from Config.yml): \", \"magenta\")\n                + colored(FILTER_KEYWORDS, \"white\")\n            )\n\n        if args.filter_extensions:\n            write(\n                colored(\"-fe (Extensions to Filter): \", \"magenta\")\n                + colored(args.filter_extensions, \"white\")\n            )\n        else:\n            write(\n                colored(\"Filter Extensions (from Config.yml): \", \"magenta\")\n                + colored(FILTER_EXTENSIONS, \"white\")\n            )\n\n        if args.language:\n            write(\n                colored(\"Languages (from Config.yml): \", \"magenta\")\n                + colored(LANGUAGE, \"white\")\n            )\n            write(\n                colored(\"-lang: True\", \"magenta\")\n                + colored(\n                    \"If there are multiple URLs 
with different language codes as a part of the path, only one version of the URL will be output.\",\n                    \"white\",\n                )\n            )\n\n        if args.remove_params:\n            write(\n                colored(\"-rp (Params to Remove): \", \"magenta\")\n                + colored(args.remove_params, \"white\")\n            )\n        else:\n            write(\n                colored(\"Remove Params (from Config.yml): \", \"magenta\")\n                + colored(REMOVE_PARAMS, \"white\")\n            )\n\n        if args.keep_slash:\n            write(\n                colored(\"-ks: True\", \"magenta\")\n                + colored(\n                    \" A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.\",\n                    \"white\",\n                )\n            )\n\n        if args.keep_human_written:\n            write(\n                colored(\"-khw: True\", \"magenta\")\n                + colored(\n                    \" Prevent URLs with a path part that contains more than 3 dashes (-) from being removed (e.g. blog post)\",\n                    \"white\",\n                )\n            )\n\n        if args.keep_yyyymm:\n            write(\n                colored(\"-kym: True\", \"magenta\")\n                + colored(\n                    \" Prevent URLs with a path that contains a year and month in the format `/YYYY/MM` from being removed (e.g. blog or news)\",\n                    \"white\",\n                )\n            )\n\n        if args.regex_custom_id:\n            write(\n                colored(\"-rcid: '\" + str(reCustomIDPart.pattern) + \"'\", \"magenta\")\n                + colored(\" USE WITH CAUTION! \", \"red\")\n                + colored(\n                    \"Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. 
See the README for more details on this.\",\n                    \"white\",\n                )\n            )\n\n        if args.ignore_querystring:\n            write(\n                colored(\"-iq: True\", \"magenta\")\n                + colored(\n                    \" Remove the query string (including URL fragments `#`) so output is unique paths only.\",\n                    \"white\",\n                )\n            )\n\n        write(\"\")\n\n    except Exception as e:\n        writerr(colored(\"ERROR showOptionsAndConfig 1: \" + str(e), \"red\"))\n\n\ndef argCheckRegexCustomID(value):\n    global reCustomIDPart\n    try:\n\n        # If the Custom ID regex was passed, then prefix with ^ and suffix with $ if they are not there already\n        if value != \"\":\n            if value[0] != REGEX_START:\n                value = REGEX_START + value\n            if value[-1] != REGEX_END:\n                value = value + REGEX_END\n\n        # Try to compile the regex\n        reCustomIDPart = re.compile(value)\n\n        return value\n    except Exception:\n        raise argparse.ArgumentTypeError(\"Valid regex must be passed.\")\n\n\ndef main():\n\n    global args, urlmap, patternsSeen, patternsInt, patternsCustomID, patternsGUID, patternsLang\n\n    # Ensure config.yml exists before anything else\n    ensureConfig()\n\n    # Tell Python to run the handler() function when SIGINT is received\n    signal(SIGINT, handler)\n\n    # Parse command line arguments\n    parser = argparse.ArgumentParser(\n        description=\"urless - by @Xnl-h4ck3r: De-clutter a list of URLs.\"\n    )\n    parser.add_argument(\n        \"-i\", \"--input\", action=\"store\", help=\"A file of URLs to de-clutter.\"\n    )\n    parser.add_argument(\n        \"-o\",\n        \"--output\",\n        action=\"store\",\n        help=\"The output file that will contain the de-cluttered list of URLs (default: output.txt). 
If piped to another program, output will be written to STDOUT instead.\",\n    )\n    parser.add_argument(\n        \"-fk\",\n        \"--filter_keywords\",\n        action=\"store\",\n        help=\"A comma separated list of keywords to exclude links (if there are no parameters). This will override the FILTER_KEYWORDS list specified in config.yml\",\n        metavar=\"<comma separated list>\",\n    )\n    parser.add_argument(\n        \"-fe\",\n        \"--filter-extensions\",\n        action=\"store\",\n        help=\"A comma separated list of file extensions to exclude. This will override the FILTER_EXTENSIONS list specified in config.yml\",\n        metavar=\"<comma separated list>\",\n    )\n    parser.add_argument(\n        \"-rp\",\n        \"--remove-params\",\n        action=\"store\",\n        help=\"A comma separated list of case sensitive parameters to remove from all URLs. This will override the REMOVE_PARAMS list specified in config.yml. This can be useful for cache buster parameters for example.\",\n        metavar=\"<comma separated list>\",\n    )\n    parser.add_argument(\n        \"-ks\",\n        \"--keep-slash\",\n        action=\"store_true\",\n        help=\"A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash.\",\n    )\n    parser.add_argument(\n        \"-khw\",\n        \"--keep-human-written\",\n        action=\"store_true\",\n        help=\"By default, any URL with a path part that contains more than 3 dashes (-) is removed because it is assumed to be human written content (e.g. blog post) and not interesting. 
Passing this argument will keep them in the output.\",\n    )\n    parser.add_argument(\n        \"-kym\",\n        \"--keep-yyyymm\",\n        action=\"store_true\",\n        help=\"By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM is a month) is removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output.\",\n    )\n    parser.add_argument(\n        \"-rcid\",\n        \"--regex-custom-id\",\n        action=\"store\",\n        help=\"USE WITH CAUTION! Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the README for more details on this.\",\n        default=\"\",\n        metavar=\"REGEX\",\n        type=argCheckRegexCustomID,\n    )\n    parser.add_argument(\n        \"-iq\",\n        \"--ignore-querystring\",\n        action=\"store_true\",\n        help=\"Remove the query string (including URL fragments `#`) so output is unique paths only.\",\n    )\n    parser.add_argument(\n        \"-fnp\",\n        \"--fragment-not-param\",\n        action=\"store_true\",\n        help=\"Don't treat URL fragments `#` in the same way as parameters, e.g. if a link has a filter keyword and a fragment (or param) it is usually kept, but if this argument is passed and a link has a filter word and fragment, it will be removed.\",\n    )\n    parser.add_argument(\n        \"-lang\",\n        \"--language\",\n        action=\"store_true\",\n        help='If passed, and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the \"LANGUAGE\" section of \"config.yml\".',\n    )\n    parser.add_argument(\n        \"-c\",\n        \"--config\",\n        action=\"store\",\n        help=\"Path to the YML config file. If not passed, it looks for file 'config.yml' in the default config directory, e.g. 
'~/.config/urless/'.\",\n    )\n    parser.add_argument(\n        \"-dp\",\n        \"--disregard-params\",\n        action=\"store_true\",\n        help=\"There is certain filtering that is not done if the URLs have parameters, because by default we want to see all possible parameters. If this argument is passed, then the filtering will be done, regardless of the existence of any parameters.\",\n    )\n    parser.add_argument(\n        \"-nb\", \"--no-banner\", action=\"store_true\", help=\"Hides the tool banner.\"\n    )\n    parser.add_argument(\"--version\", action=\"store_true\", help=\"Show version number\")\n    parser.add_argument(\"-v\", \"--verbose\", action=\"store_true\", help=\"Verbose output.\")\n    args = parser.parse_args()\n\n    # If --version was passed, display version and exit\n    if args.version:\n        write(colored(\"urless - v\" + __version__, \"cyan\"))\n        sys.exit()\n\n    try:\n        # If no input was given, raise an error\n        if sys.stdin.isatty():\n            if args.input is None:\n                writerr(\n                    colored(\n                        \"You need to provide an input with -i argument or through <stdin>.\",\n                        \"red\",\n                    )\n                )\n                sys.exit()\n\n        # Get the config settings from the config.yml file\n        getConfig()\n\n        # If input is not piped, show the banner, and if --verbose option was chosen show options and config values\n        if sys.stdin.isatty():\n            # Show banner unless requested to hide\n            if not args.no_banner:\n                showBanner()\n            if verbose():\n                showOptionsAndConfig()\n\n        # Process the input given on -i (--input), or <stdin>\n        processInput()\n\n        # Output the saved urls with parameters\n        processOutput()\n\n    except Exception as e:\n        writerr(colored(\"ERROR main 1: \" + str(e), \"red\"))\n\n    # Show ko-fi 
link if verbose and not piped\n    try:\n        if verbose() and sys.stdin.isatty():\n            writerr(\n                colored(\n                    \"✅ Want to buy me a coffee? ☕ https://ko-fi.com/xnlh4ck3r 🤘\",\n                    \"green\",\n                )\n            )\n    except Exception:\n        pass\n\n    finally:  # Clean up\n        urlmap = None\n        patternsSeen = None\n        patternsCustomID = None\n        patternsGUID = None\n        patternsInt = None\n        patternsLang = None\n\n\nif __name__ == \"__main__\":\n    main()\n"
  }
]