[
  {
    "path": ".github/FUNDING.yml",
    "content": "github: lgandx\npatreon: PythonResponder\ncustom: 'https://paypal.me/PythonResponder'\n"
  },
  {
    "path": "CCrawlDNS.py",
    "content": "#!/usr/bin/env python3\n# This file is part of an external network pentest set of tools \n# created and maintained by Laurent Gaffie.\n# email: lgaffie@secorizon.com\n# This program is free software: you can redistribute it and/or modify\n# it under the terms of the GNU General Public License as published by\n# the Free Software Foundation, either version 3 of the License, or\n# (at your option) any later version.\n#\n# This program is distributed in the hope that it will be useful,\n# but WITHOUT ANY WARRANTY; without even the implied warranty of\n# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n# GNU General Public License for more details.\n#\n# You should have received a copy of the GNU General Public License\n# along with this program.  If not, see <http://www.gnu.org/licenses/>.\nimport argparse\nimport json\nimport os\nimport re\nimport requests\nimport sqlite3\nimport sys\nimport time\nfrom urllib.parse import urlparse\n\nVERSION = \"1.0\"\nINDEX_URL = \"https://index.commoncrawl.org/collinfo.json\"\nRESULTS_DIR = \"results\"\nSESSION = requests.Session()\nSESSION.headers.update({\n    \"User-Agent\": \"CCrawlDNS/1.0 (passive reconnaissance tool)\"\n})\n\ndef color(txt, code = 1, modifier = 0):\n    if os.name == 'nt':\n        return txt\n    return \"\\033[%d;3%dm%s\\033[0m\" % (modifier, code, txt)\n\ndef Banner():\n    Banner = r\"\"\"\n   ██████╗ ██████╗ ██████╗  █████╗ ██╗    ██╗██╗     ██████╗ ███╗   ██╗███████╗\n  ██╔════╝██╔═══╗  ██╔══██╗██╔══██╗██║    ██║██║     ██╔══██╗████╗  ██║██╔════╝\n  ██║     ██║      ██████╔╝███████║██║ █╗ ██║██║     ██║  ██║██╔██╗ ██║███████╗\n  ██║     ██║      ██╔══██╗██╔══██║██║███╗██║██║     ██║  ██║██║╚██╗██║╚════██║\n  ╚██████╗╚██████╔╝██║  ██║██║  ██║╚███╔███╔╝███████╗██████╔╝██║ ╚████║███████║\n   ╚═════╝ ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝ ╚══╝╚══╝ ╚══════╝╚═════╝ ╚═╝  ╚═══╝╚══════╝\n\n                            Passive Recon from history\n                      Author: Laurent Gaffie, 
lgaffie@secorizon.com\n                                x.com/@secorizon\n\"\"\"\n    return Banner\n\n# DB handling\ndef get_db_path(domain):\n    domain_clean = re.sub(r'[^\\w\\-.]', '_', domain.lower())\n    domain_dir = os.path.join(RESULTS_DIR, domain_clean)\n    os.makedirs(domain_dir, exist_ok=True)\n    return os.path.join(domain_dir, f\"{domain_clean}.db\")\n\ndef create_db(db_path):\n    if os.path.exists(db_path):\n        os.remove(db_path)\n    conn = sqlite3.connect(db_path)\n    cur = conn.cursor()\n    cur.execute(\"\"\"\n        CREATE TABLE subdomains (\n            id INTEGER PRIMARY KEY AUTOINCREMENT,\n            subdomain TEXT UNIQUE,\n            tech_detected TEXT,\n            example_url TEXT\n        )\n    \"\"\")\n    conn.commit()\n    conn.close()\n\ndef save_subdomain(db_path, subdomain, tech=\"\", example_url=\"\"):\n    conn = sqlite3.connect(db_path)\n    cur = conn.cursor()\n    cur.execute(\"\"\"\n        INSERT OR IGNORE INTO subdomains (subdomain, tech_detected, example_url)\n        VALUES (?, ?, ?)\n    \"\"\", (subdomain.lower(), tech, example_url))\n    conn.commit()\n    conn.close()\n\ndef fetch_index_list(years_filter: set[int] | None, max_per_year: int = 3) -> list[dict]:\n    try:\n        resp = SESSION.get(INDEX_URL, timeout=30)\n        resp.raise_for_status()\n        all_indexes = resp.json()\n\n        if years_filter is None:\n            print(f\"[+] Loaded all {len(all_indexes)} Common Crawl indexes\")\n            return all_indexes\n\n        by_year = {}\n        for idx in all_indexes:\n            match = re.search(r'CC-MAIN-(\\d{4})', idx['id'])\n            if match:\n                year = int(match.group(1))\n                if year in years_filter:\n                    by_year.setdefault(year, []).append(idx)\n\n        filtered = []\n        for year in sorted(by_year.keys(), reverse=True):\n            year_indexes = sorted(by_year[year], key=lambda x: x['id'], reverse=True)\n            selected = 
year_indexes[:max_per_year]\n            filtered.extend(selected)\n            print(f\"[+] Year {year}: using {len(selected)}/{len(by_year[year])} indexes\")\n\n        print(f\"[+] Total selected: {len(filtered)} indexes from years {sorted(years_filter)}\")\n        return filtered\n\n    except Exception as e:\n        print(f\"[-] Failed to fetch index list: {e}\")\n        sys.exit(1)\n\n\ndef extract_subdomain_from_url(url: str, target_domain: str) -> str | None:\n    try:\n        parsed = urlparse(url)\n        hostname = parsed.netloc.lower()\n        if ':' in hostname:\n            hostname = hostname.split(':')[0]\n        target = target_domain.lower()\n        if hostname == target or hostname.endswith('.' + target):\n            return hostname\n    except Exception:\n        pass\n    return None\n\n\ndef detect_tech_and_example(urls: list[str]) -> tuple[str, str | None]:\n    extensions = {\n        # PHP family\n        '.php': 'PHP',\n        '.php3': 'PHP',\n        '.php4': 'PHP',\n        '.php5': 'PHP',\n        '.phtml': 'PHP',\n        '.phar': 'PHP',\n\n        # Microsoft\n        '.asp': 'Classic ASP',\n        '.aspx': 'ASP.NET',\n        '.ascx': 'ASP.NET',\n        '.asmx': 'ASP.NET Web Service',\n        '.ashx': 'ASP.NET Handler',\n        '.axd': 'ASP.NET Handler',\n        '.master': 'ASP.NET Master Page',\n\n        # Java\n        '.jsp': 'Java JSP',\n        '.jspx': 'Java JSP',\n        '.do': 'Java Struts',\n        '.action': 'Java Struts',\n\n        # ColdFusion\n        '.cfm': 'ColdFusion',\n        '.cfml': 'ColdFusion',\n        '.cfc': 'ColdFusion Component',\n\n        # Perl & CGI\n        '.pl': 'Perl',\n        '.pm': 'Perl Module',\n        '.cgi': 'CGI Script',\n\n        # Python\n        '.py': 'Python',\n\n        # Ruby\n        '.rb': 'Ruby',\n        '.erb': 'Ruby on Rails',\n\n        # Node.js / JavaScript\n        '.js': 'JavaScript',\n        '.mjs': 'JavaScript Module',\n\n        # Go\n        
'.go': 'Go',\n\n        # Rust\n        '.rs': 'Rust',\n\n        # Other\n        '.lua': 'Lua',\n        '.scala': 'Scala',\n        '.dart': 'Dart (Flutter)',\n        '.swift': 'Swift',\n    }\n\n    path_indicators = {\n        # CMS\n        '/wp-admin/': 'WordPress',\n        '/wp-content/': 'WordPress',\n        '/wp-includes/': 'WordPress',\n        '/wp-json/': 'WordPress REST API',\n        '/xmlrpc.php': 'WordPress XML-RPC',\n        '/wp-login.php': 'WordPress Login',\n\n        '/administrator/': 'Joomla/Drupal Admin',\n        '/joomla/': 'Joomla',\n        '/sites/all/': 'Drupal',\n        '/user/login': 'Drupal',\n        '/magento/': 'Magento',\n        '/downloader/': 'Magento',\n        '/skin/adminhtml/': 'Magento Admin',\n\n        '/typo3/': 'TYPO3',\n        '/typo3conf/': 'TYPO3',\n        '/typo3temp/': 'TYPO3',\n\n        '/concrete/': 'Concrete CMS',\n        '/index.php?id=': 'Generic CMS (often Joomla/WordPress)',\n\n        # Frameworks\n        '/laravel/': 'Laravel (exposed?)',\n        '/artisan': 'Laravel',\n        '/public/index.php': 'Laravel/Symfony',\n\n        '/rails/': 'Ruby on Rails',\n        '/config.ru': 'Ruby on Rails',\n\n        '/symfony/': 'Symfony',\n        '/app_dev.php': 'Symfony Dev',\n\n        '/yii/': 'Yii Framework',\n\n        # Admin Panels\n        '/admin/': 'Admin Panel',\n        '/admin.php': 'Admin Panel',\n        '/admin.html': 'Admin Panel',\n        '/login/': 'Login Page',\n        '/dashboard/': 'Dashboard',\n        '/cpanel/': 'cPanel',\n        '/webmail/': 'Webmail',\n\n        # Database/Admin Tools\n        '/phpmyadmin/': 'phpMyAdmin',\n        '/pma/': 'phpMyAdmin',\n        '/mysql/': 'MySQL Admin',\n        '/adminer.php': 'Adminer',\n        '/dbadmin/': 'Database Admin',\n\n        # API & Modern\n        '/api/': 'API Endpoint',\n        '/v1/': 'API v1',\n        '/v2/': 'API v2',\n        '/graphql': 'GraphQL',\n        '/rest/': 'REST API',\n        '/swagger/': 
'Swagger/OpenAPI',\n        '/redoc/': 'Redoc',\n\n        # Dev/Exposure\n        '/.env': '.env exposed!',\n        '/.git/': '.git exposed!',\n        '/.svn/': '.svn exposed!',\n        '/.hg/': '.hg exposed!',\n        '/config.php': 'Config exposed',\n        '/backup/': 'Backup directory',\n        '/test/': 'Test directory',\n        '/dev/': 'Development',\n        '/debug/': 'Debug mode',\n        '/node_modules/': 'Node.js (exposed)',\n        '/package.json': 'Node.js/npm',\n\n        # Common Tools\n        '/jenkins/': 'Jenkins',\n        '/hudson/': 'Jenkins/Hudson',\n        '/sonar/': 'SonarQube',\n        '/nexus/': 'Nexus Repository',\n        '/artifactory/': 'Artifactory',\n        '/gitlab/': 'GitLab',\n        '/gogs/': 'Gogs',\n        '/gitea/': 'Gitea',\n\n        # Monitoring\n        '/kibana/': 'Kibana',\n        '/grafana/': 'Grafana',\n        '/prometheus/': 'Prometheus',\n        '/zabbix/': 'Zabbix',\n\n        # E-commerce\n        '/shop/': 'Shop System',\n        '/cart/': 'E-commerce Cart',\n        '/checkout/': 'E-commerce Checkout',\n        '/opencart/': 'OpenCart',\n        '/prestashop/': 'PrestaShop',\n        '/oscommerce/': 'osCommerce',\n\n        # Forums\n        '/phpbb/': 'phpBB',\n        '/forum/': 'Forum Software',\n        '/discourse/': 'Discourse',\n        '/vanilla/': 'Vanilla Forums',\n    }\n\n\n    tech_found = set()\n    trigger_url = None\n\n    for url in urls:\n        path = urlparse(url).path.lower()\n\n        for ext, tech in extensions.items():\n            if path.endswith(ext):\n                tech_found.add(tech)\n                if not trigger_url:\n                    trigger_url = url\n\n        for pattern, tech in path_indicators.items():\n            if pattern in path:\n                tech_found.add(tech)\n                if not trigger_url:\n                    trigger_url = url\n\n        if trigger_url:\n            break\n\n    if not tech_found:\n        return \"\", urls[0] if 
urls else None\n\n    tech_str = \", \".join(sorted(tech_found))\n    return tech_str, trigger_url or urls[0]\n\n\ndef process_index(index_info: dict, target_domain: str, db_path: str):\n    cdx_api = index_info['cdx-api']\n    index_id = index_info['id']\n\n    params = {\n        'url': target_domain,\n        'matchType': 'domain',\n        'fl': 'url',\n        'output': 'json',\n        'pageSize': 2000\n    }\n\n    subdomain_data = {}\n\n    max_retries = 3\n    for attempt in range(max_retries):\n        try:\n            print(f\"[+] Querying {index_id} (attempt {attempt + 1}/{max_retries})...\")\n            resp = SESSION.get(cdx_api, params=params, timeout=40)\n            time.sleep(1)\n\n            if resp.status_code == 503:\n                # Exponential backoff: 5s, 10s, 20s\n                wait = 5 * (2 ** attempt)\n                print(f\"    [~] 503 Throttled, waiting {wait}s...\")\n                time.sleep(wait)\n                continue\n\n            if resp.status_code != 200:\n                print(f\"    [-] {index_id}: HTTP {resp.status_code}\")\n                return\n\n            # One JSON record per line; a single-record response is still a valid result\n            lines = [ln for ln in resp.text.splitlines() if ln.strip()]\n            if not lines:\n                return\n\n            count = 0\n            for line in lines:\n                try:\n                    data = json.loads(line)\n                    url = data.get('url')\n                    if not url:\n                        continue\n\n                    subdomain = extract_subdomain_from_url(url, target_domain)\n                    if not subdomain:\n                        continue\n\n                    subdomain_data.setdefault(subdomain, []).append(url)\n                    count += 1\n                except (ValueError, AttributeError):\n                    # Skip lines that are not JSON objects\n                    continue\n\n            print(f\"    [+] Extracted {count} records from {index_id}\")\n            break\n\n        except Exception as e:\n            print(f\"    [-] Error querying {index_id}: {e}\")\n            if attempt < max_retries - 1:\n                
time.sleep(5)\n\n    # Print all subdomains in original format\n    for sub in sorted(subdomain_data.keys()):\n        urls = subdomain_data[sub]\n        tech, example = detect_tech_and_example(urls)\n        tech_str = f\" --> [{tech}]\" if tech else \"\"\n        print(f\"    {sub}{tech_str}\")\n        if example and tech:\n            print(f\"       [URL]: {example}\")\n\n        # Always save subdomains (tech and example if available)\n        save_subdomain(db_path, sub, tech if tech else \"\", example if tech else \"\")\n\n\ndef main():\n    print(color(Banner(),2,1))\n    parser = argparse.ArgumentParser(\n        description=\"CCrawlDNS - Passive subdomain discovery using Common Crawl\",\n        formatter_class=argparse.RawDescriptionHelpFormatter,\n        epilog=\"\"\"\nExamples:\n  python3 CCrawlDNS.py -d yahoo.com --years last2 --max-per-year 1\n  python3 CCrawlDNS.py -d yahoo.com --years 2025 --max-per-year 3\n        \"\"\"\n    )\n    parser.add_argument('-d', '--domain', required=True, help=\"Target domain (e.g. x.com)\")\n    parser.add_argument('--years', type=str, default=\"last2\",\n                        help=\"Years to query: comma-separated, 'all', or 'last2' (default)\")\n    parser.add_argument('--max-per-year', type=int, default=3,\n                        help=\"Max number of indexes to use per year (default: 3)\")\n    args = parser.parse_args()\n\n    target_domain = args.domain.lower().strip().rstrip('.')\n\n    if args.years == \"all\":\n        years_filter = None\n    elif args.years == \"last2\":\n        current = time.localtime().tm_year\n        years_filter = {current, current - 1}\n    else:\n        try:\n            # int() raises ValueError on non-numeric entries, so bad input reaches the fallback\n            years_filter = {int(y.strip()) for y in args.years.split(',') if y.strip()}\n            if not years_filter:\n                raise ValueError(\"no years given\")\n        except ValueError:\n            print(\"[-] Invalid --years format. 
Using 'last2'.\")\n            current = time.localtime().tm_year\n            years_filter = {current, current - 1}\n\n    print(f\"[+] Starting CCrawlDNS against: {target_domain}\")\n    print(f\"[+] Years: {sorted(years_filter) if years_filter else 'all'} | max per year: {args.max_per_year}\")\n\n    # Create per-domain DB\n    db_path = get_db_path(target_domain)\n    create_db(db_path)\n\n    indexes = fetch_index_list(years_filter, args.max_per_year)\n\n    for idx in indexes:\n        process_index(idx, target_domain, db_path)\n\n    print(\"\\n[+] Enumeration complete!\")\n    print(f\"[+] Results saved in: {db_path}\")\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "README.md",
    "content": "# CCrawlDNS #\n\nCommon Crawl data set subdomain extractor.\n\nAuthor: Laurent Gaffie <lgaffie@secorizon.com> | https://secorizon.com | https://x.com/@secorizon\n\n## Intro ##\n\nCCrawlDNS is a small pentest utility that makes use of the Common Crawl data set API (petabytes of data!).\n\nThe tool is designed for pentesters. For each scan, it queries the Common Crawl index servers and collects every subdomain recorded for the target domain. You can choose which years to query (from 2008 onward) and how many datasets to use per year. All results are stored in a per-domain SQLite database at `results/<domain>/<domain>.db`, which is recreated on every run.\n\nFeatures:\n\n- ✅ Search by specific years and datasets.\n- ✅ Automatic path fingerprinting.\n- ✅ Automatic web language fingerprinting (by file extension).\n- ✅ Automatic throttling (exponential backoff on HTTP 503).\n- ✅ All results are saved in a SQLite database.\n\n## Usage ##\n\nRunning the tool:\n\n    //Search all collected subdomains for yahoo.com in the past 2 years, using 1 dataset per year (most efficient)\n    python3 CCrawlDNS.py -d yahoo.com --years last2 --max-per-year 1\n\n    //Search all collected subdomains for yahoo.com in 2025 only, using up to 3 datasets\n    python3 CCrawlDNS.py -d yahoo.com --years 2025 --max-per-year 3\n\n    //Search all collected subdomains for yahoo.com in 2025 and 2021, using 1 dataset per year (no space after the comma)\n    python3 CCrawlDNS.py -d yahoo.com --years 2025,2021 --max-per-year 1\n\n    //Search all collected subdomains for yahoo.com from 2008 to now (much slower, but complete; --max-per-year is ignored with --years all)\n    python3 CCrawlDNS.py -d yahoo.com --years all\n\n## Demo ##\n\n\nhttps://github.com/user-attachments/assets/a6cf968f-2bac-4de3-80a3-43f697b923de\n\n"
  }
]