[
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contributing guidelines\n\n* [Reporting bugs](#reporting-bugs)\n* [Development](#development)\n  * [New features](#new-features)\n  * [Bug fixes](#bug-fixes)\n* [Getting Started](#getting-started)\n  * [Pre-Requisites](#pre-requisites)\n  * [Setup](#setup)\n  * [Running the crawler](#running-the-crawler)\n  * [Checking the results](#checking-the-results)\n* [Data Model](#data-model)\n  * [full_urls](#full_urls)\n  * [https_queue](#https_queue)\n  * [https_crawl](#https_crawl)\n  * [mixed_assets](#mixed_assets)\n  * [https_response_headers](#https_response_headers)\n  * [ssl_cert_info](#ssl_cert_info)\n  * [https_crawl_aggregate](#https_crawl_aggregate)\n  * [https_upgrade_metrics](#https_upgrade_metrics)\n  * [domain_exceptions](#domain_exceptions)\n  * [upgradeable_domains](#upgradeable_domains)\n\n# Reporting bugs\n\n1. First check whether the bug has already been [reported](https://github.com/duckduckgo/smarter-encryption/issues).\n2. Create a bug report [issue](https://github.com/duckduckgo/smarter-encryption/issues/new?template=bug_report.md).\n\n# Development\n\n## New features\n\nRight now all new feature development is handled internally.\n\n## Bug fixes\n\nMost bug fixes are handled internally, but we will accept pull requests for bug fixes if you first:\n1. Create an issue describing the bug. See [Reporting bugs](#reporting-bugs).\n2. Get approval from DDG staff before working on it. Since most bug fixes and feature development are handled internally, we want to make sure that your work doesn't conflict with any current projects.\n\n## Getting Started\n\n### Pre-Requisites\n- [PostgreSQL](https://www.postgresql.org/) database\n- [PhantomJS 2.1.1](https://phantomjs.org/download.html)\n- [Perl](https://www.perl.org/get.html)\n- [compare](https://imagemagick.org/script/compare.php)\n- [pkill](https://en.wikipedia.org/wiki/Pkill)\n- Should run on many varieties of Linux/*BSD\n\n### Setup\n\n1. 
Install required Perl modules via cpanfile:\n```sh\ncpanm --installdeps .\n```\n2. Connect to PostgreSQL with psql and create the tables needed by the crawler:\n```\n\\i sql/full_urls.sql\n\\i sql/https_crawl.sql\n\\i sql/mixed_assets.sql\netc.\n```\n3. Create a copy of the crawler configuration file:\n```sh\ncp config.yml.example config.yml\n```\nEdit the settings as necessary for your system.\n\n4. If you have a source of URLs you would like crawled for a host, they can be added to the [full_urls](#full_urls) table:\n```sql\ninsert into full_urls (host, url) values ('duckduckgo.com', 'https://duckduckgo.com/?q=privacy'), ...\n```\nThe crawler will attempt to get URLs from the home page even if none are available in this table.\n\n### Running the crawler\n\n1. Add hosts to be crawled to the [https_queue](#https_queue) table:\n```sql\ninsert into https_queue (domain) values ('duckduckgo.com');\n```\n\n2. The crawler can be run as follows:\n```sh\nperl -Mlib=/path/to/smarter-encryption https_crawl.pl -c /path/to/config.yml\n```\n\n### Checking the results\n\n1. The individual HTTP and HTTPs comparisons for each URL crawled are stored in [https_crawl](#https_crawl):\n```sql\nselect * from https_crawl where domain = 'duckduckgo.com' order by id desc limit 10;\n```\nThe maximum number of URLs per crawl session, i.e. `limit`, is determined by [URLS_PER_SITE](config.yml.example#L49).\n\n2. Aggregate session data for each host is stored in [https_crawl_aggregate](#https_crawl_aggregate):\n```sql\nselect * from https_crawl_aggregate where domain = 'duckduckgo.com';\n```\nThere is also an associated view - [https_upgrade_metrics](#https_upgrade_metrics) - that calculates some additional metrics:\n```sql\nselect * from https_upgrade_metrics where domain = 'duckduckgo.com';\n```\n\n3. Additional information from the crawl can be found in:\n\n  * [ssl_cert_info](#ssl_cert_info)\n  * [mixed_assets](#mixed_assets)\n  * [https_response_headers](#https_response_headers)\n\n4. 
Hosts can be selected based on various combinations of criteria directly from the above tables or by using the [upgradeable_domains](#upgradeable_domains) function.  \n\n### Data Model\n\n#### full_urls\n\nComplete URLs for hosts that will be used in addition to those the crawler extracts from the home page.\n\n| Column | Description | Type | Key |\n| --- | --- | --- | --- |\n| host | hostname | text |unique|\n| url | Complete URL with scheme | text |unique|\n| updated | When added to table | timestamp with time zone ||\n\n#### https_queue\n\nDomains to be crawled in rank order.  Multiple crawlers can access this concurrently.\n\n| Column | Description | Type | Key |\n| --- | --- | --- | --- |\n| rank | Processing order | integer | primary |\n|domain | Domain to be crawled | character varying(500) ||\n|processing_host|Hostname of server processing domain|character varying(50)||\n|worker_pid|Process ID of crawler handling domain|integer||\n|reserved|When domain was selected for processing|timestamp with time zone||\n|started|When processing of domain started|timestamp with time zone||\n|finished|When processing of domain completed|timestamp with time zone||\n\n#### https_crawl\n\nLog table of HTTP and HTTPs comparisons made by the crawler.\n\n| Column | Description | Type | Key |\n| --- | --- | --- | --- |\n| id | Comparison ID | bigint | unique |\n|domain|Domain evaluated|text||\n|http_request_uri|Resulting URI of HTTP request|text||\n|http_response|HTTP status code for HTTP request|integer||\n|http_requests|Total requests made, including child subrequests, for HTTP request|integer||\n|http_size|Size of HTTP response (bytes)|integer||\n|https_request_uri|Resulting URI of HTTPs request|text||\n|https_response|HTTP status code for HTTPs request|integer||\n|https_requests|Total requests made, including child subrequests, for HTTPs request|integer||\n|https_size|Size of HTTPs response (bytes)|integer||\n|timestamp|When inserted|timestamp with time 
zone||\n|screenshot_diff|Percentage difference between HTTP and HTTPs screenshots after page load|real||\n|autoupgrade|Whether HTTP request was redirected to HTTPs|boolean||\n|mixed|Whether HTTPs request had HTTP child requests|boolean||\n\n#### mixed_assets\n\nHTTP child requests made during HTTPs requests.\n\n| Column         | Description                                          | Type   | Key            |\n| ---            | ---                                                  | ---    | ---            |\n| https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign |\n| asset          | URI of HTTP subrequest made during HTTPs request     | text   | unique         |\n\n\n#### https_response_headers\n\nThe response headers for HTTPs requests.\n\n| Column         | Description                                          | Type   | Key            |\n| ---            | ---                                                  | ---    | ---            |\n| https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign |\n|response_headers|key/value of all HTTPs response headers|jsonb||\n\n\n#### ssl_cert_info\n\nSSL certificate information for domains crawled.\n\n| Column         | Description                                          | Type   | Key            |\n| ---            | ---                                                  | ---    | ---            |\n| domain | Domain evaluated | text | primary |\n|issuer|Issuer of SSL certificate|text||\n|notbefore|Valid from timestamp|timestamp with time zone||\n|notafter|Valid to timestamp|timestamp with time zone||\n|host_valid|Whether the domain is covered by the SSL certificate|boolean||\n|err|Connection error|text||\n|updated|When last updated|timestamp with time zone||\n\n#### https_crawl_aggregate\n\nAggregate of [https_crawl](#https_crawl) that summarizes the latest crawl session for each domain.  
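For example, the latest session summary for a host, including any redirect hosts recorded in the redirect_hosts column below, can be queried directly (a sketch; 'duckduckgo.com' is only an illustrative value):\n```sql\nselect domain, redirects, redirect_hosts\nfrom https_crawl_aggregate\nwhere domain = 'duckduckgo.com';\n```\n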
Can also include domains that were redirected to and not directly crawled.\n\n| Column         | Description                                          | Type   | Key            |\n| ---            | ---                                                  | ---    | ---            |\n| domain | Domain evaluated | text | primary |\n|https|Comparisons where only HTTPs was supported|integer||\n|http_and_https|Comparisons where HTTP and HTTPs were supported|integer||\n|http|Comparisons where only HTTP was supported|integer||\n|https_errs|Number of non-2xx HTTPs responses|integer||\n|unknown|Comparisons where neither HTTP nor HTTPs responses were valid or the status codes differed|integer||\n|autoupgrade|Comparisons where HTTP was redirected to HTTPs|integer||\n|mixed_requests|HTTPs requests that made HTTP calls|integer||\n|max_screenshot_diff|Maximum percentage difference between HTTP and HTTPs screenshots|real||\n|redirects|Number of HTTPs requests redirected to a different host|integer||\n|requests|Number of comparison requests actually made during the crawl session|integer||\n|session_request_limit|The number of comparisons wanted for the session|integer||\n|is_redirect|Whether the domain was actually crawled or is a redirect from another host in the table that was crawled|boolean||\n|max_https_crawl_id|https_crawl.id of last comparison made during crawl session|bigint||\n|redirect_hosts|key/value pairs of hosts and the number of redirects to each|jsonb||\n\n#### https_upgrade_metrics\n\nView of [https_crawl_aggregate](#https_crawl_aggregate) that calculates crawl session percentages for easier selection based on cutoffs.\n\n| Column         | Description                                          | Type   | Key            |\n| ---            | ---                                                  | ---    | ---            |\n| domain | Domain evaluated | text | |\n| unknown_pct | Percentage of unknown|real||\n| combined_pct | Percentage that supported HTTPs|real||\n| 
https_err_rate | Percentage of HTTPs errors|real||\n| max_screenshot_diff | https_crawl_aggregate.max_screenshot_diff|real||\n| mixed_ok | Whether HTTPs requests contained mixed content requests|boolean||\n| autoupgrade_pct|Percentage of autoupgrade|real||\n\n#### domain_exceptions\n\nFor manually excluding domains that may otherwise pass specific upgrade criteria given to [upgradeable_domains](#upgradeable_domains).\n\n| Column | Description       | Type | Key     |\n| ---    | ---               | ---  | ---     |\n| domain | Domain to exclude | text | primary |\n| comment | Reason for exclusion | text ||\n|updated|When added|timestamp with time zone||\n\n#### upgradeable_domains\n\nFunction to select domains based on a variety of criteria.\n\n| Parameter | Description       | Type | Source     |\n| ---    | ---               | ---  | ---     |\n|autoupgrade_min|Minimum autoupgrade percentage|real|[https_upgrade_metrics](#https_upgrade_metrics)|\n|combined_min|Minimum percentage of HTTPs responses|real|[https_upgrade_metrics](#https_upgrade_metrics)|\n|screenshot_diff_max|Maximum observed screenshot diff allowed|real|[https_upgrade_metrics](#https_upgrade_metrics)|\n|mixed_ok|Whether to allow domains that had mixed content|boolean|[https_upgrade_metrics](#https_upgrade_metrics)|\n|max_err_rate|Maximum https_err_rate|real|[https_upgrade_metrics](#https_upgrade_metrics)|\n|unknown_max|Maximum unknown comparisons|real|[https_upgrade_metrics](#https_upgrade_metrics)|\n|ssl_cert_buffer|SSL certificate must be valid until this timestamp|timestamp with time zone|[ssl_cert_info](#ssl_cert_info)|\n|exclude_issuers|Array of SSL cert issuers to exclude|text array|[ssl_cert_info](#ssl_cert_info)|\n\nIn addition to the above parameters, the function enforces several other conditions:\n\n1. Domain must not be in [domain_exceptions](#domain_exceptions).\n2. From values in [ssl_cert_info](#ssl_cert_info):\n   1. No err.\n   2. The domain, or host, must be valid for the certificate.\n   3. 
Valid from/to and the issuer must not be null.\n"
  },
  {
    "path": "LICENSE",
    "content": "This license does not apply to any DuckDuckGo logos or marks that may be contained\nin this repo. DuckDuckGo logos and marks are licensed separately under the CC BY-NC-ND 4.0\nlicense (https://creativecommons.org/licenses/by-nc-nd/4.0/), and official up-to-date\nversions can be downloaded from https://duckduckgo.com/press.\n\nCopyright 2010 Duck Duck Go, Inc.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n   http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "# DuckDuckGo Smarter Encryption \n\nDuckDuckGo Smarter Encryption is a large list of web sites that we know support HTTPS.  The list is automatically generated and updated by using the crawler in this repository.\n\nFor more information about where the list is being used and how it compares to other solutions, see our blog post [Your Connection is Secure with DuckDuckGo Smarter Encryption](https://spreadprivacy.com/duckduckgo-smarter-encryption).\n\nThis software is licensed under the terms of the Apache License, Version 2.0 (see [LICENSE](LICENSE)). Copyright (c) 2019 [Duck Duck Go, Inc.](https://duckduckgo.com)\n\n## Contributing\n\nSee [Contributing](CONTRIBUTING.md) for more information about [Reporting bugs](CONTRIBUTING.md#reporting-bugs) and [Getting Started](CONTRIBUTING.md#getting-started) with the crawler.\n\n## Just want the list?\n\nThe list we use (as a result of running this code) is [publicly available](https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz) under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n\nIf you'd like to license the list for commercial use, [please reach out](https://help.duckduckgo.com/duckduckgo-help-pages/company/contact-us/).\n\n## Questions or help with other DuckDuckGo things?\nSee [DuckDuckGo Help Pages](https://duck.co/help).\n"
  },
  {
    "path": "SmarterEncryption/Crawl.pm",
    "content": "package SmarterEncryption::Crawl;\n\nuse Exporter::Shiny qw'\n    aggregate_crawl_session\n    check_ssl_cert\n    dupe_link\n    urls_by_path\n';\n\nuse IO::Socket::SSL;\nuse IO::Socket::SSL::Utils 'CERT_asHash';\nuse Cpanel::JSON::XS 'encode_json';\nuse List::Util 'sum';\nuse URI;\nuse List::AllUtils qw'each_arrayref';\nuse Domain::PublicSuffix;\n\nuse strict;\nuse warnings;\nno warnings 'uninitialized';\nuse feature 'state';\n\nmy $SSL_TIMEOUT = 5;\nmy $DEBUG = 0;\n\n# Fields we want to convert to int if null\nmy @CONVERT_TO_INT = qw'\n    https\n    http_s\n    https_errs\n    http\n    unknown\n    autoupgrade\n    mixed_requests\n    max_ss_diff\n    redirects\n';\n\nsub screenshot_threshold { 0.05 }\n# Number of URLs checked for each domain per run.\nsub urls_per_domain { 10 }\n\nsub check_ssl_cert {\n    my $host = shift;\n\n    my ($issuer, $not_before, $not_after, $host_valid, $err);\n\n    if(my $iossl = IO::Socket::SSL->new(\n        PeerHost => $host,\n        PeerPort => 'https',\n        SSL_hostname => $host,\n        Timeout => $SSL_TIMEOUT,\n    )){\n        $host_valid = $iossl->verify_hostname($host, 'http') || 0;\n        my $c = $iossl->peer_certificate;\n        my $cert = CERT_asHash($c);\n        $issuer = $cert->{issuer}{organizationName};\n        $not_before = gmtime($cert->{not_before}) . ' UTC';\n        $not_after = gmtime($cert->{not_after}) . 
' UTC';\n    }\n    else{\n        my $sys_err = $!;\n        $err = $SSL_ERROR;\n        if($sys_err){ $err .= \": $sys_err\"; }\n    }\n\n    return [$issuer, $not_before, $not_after, $host_valid, $err];\n}\n\nsub aggregate_crawl_session {\n    my ($domain, $session) = @_;\n\n    state $dps = Domain::PublicSuffix->new;\n    my $root_domain = $dps->get_root_domain($domain);\n\n    my %domain_stats = (is_redirect => 0);\n    my %redirects;\n    for my $comparison (@$session){\n        my ($http_request_uri,\n            $http_response,\n            $https_request_uri,\n            $https_response,\n            $autoupgrade,\n            $mixed,\n            $screenshot_diff,\n            $id\n        ) = @$comparison{qw'\n            http_request_uri\n            http_response\n            https_request_uri\n            https_response\n            autoupgrade\n            mixed\n            ss_diff\n            id\n        '};\n\n\n        my $http_valid = $http_request_uri =~ /^http:/i;\n        my $https_valid = $https_request_uri =~ /^https:/i;\n\n        my $redirect;\n        if($https_valid){\n            if(my $host = eval { URI->new($https_request_uri)->host }){\n                if($host ne $domain){\n                    my $host_root_domain = $dps->get_root_domain($host);\n                    if($root_domain eq $host_root_domain){\n                        ++$domain_stats{redirects}{$host};\n                        unless(exists $redirects{$host}){\n                            $redirects{$host} = {is_redirect => 1};\n                        }\n                        $redirect = $redirects{$host};\n                    }\n                }\n            }\n        }\n\n        ++$domain_stats{requests};\n        $redirect && ++$redirect->{requests};\n\n        $domain_stats{max_id} = $id if $domain_stats{max_id} < $id;\n        $redirect->{max_id} = $id if $redirect && ($redirect->{max_id} < $id);\n\n        if($autoupgrade){\n            
++$domain_stats{autoupgrade};\n            $redirect && ++$redirect->{autoupgrade};\n        }\n\n        if($mixed){\n            ++$domain_stats{mixed_requests};\n            $redirect && ++$redirect->{mixed_requests};\n        }\n\n        if(defined($screenshot_diff)){\n            $domain_stats{max_ss_diff} = $screenshot_diff if $domain_stats{max_ss_diff} < $screenshot_diff;\n            $redirect->{max_ss_diff} = $screenshot_diff if $redirect && ($redirect->{max_ss_diff} < $screenshot_diff)\n        }\n\n        my $http_s_same_response = $http_response == $https_response;\n        my $http_response_good = $http_valid && ( ($http_response == 200) || $http_s_same_response );\n        my $https_response_good = $https_valid && ( ($https_response == 200) || $http_s_same_response);\n\n        if($https_response_good){\n            if($http_response_good){\n                ++$domain_stats{http_s};\n                $redirect && ++$redirect->{http_s};\n            }\n            else{\n                ++$domain_stats{https};\n                $redirect && ++$redirect->{https};\n            }\n\n            if($https_response =~ /^[45]/){\n                ++$domain_stats{https_errs};\n                $redirect && ++$redirect->{https_errs};\n            }\n        }\n        elsif($http_response_good){\n            ++$domain_stats{http};\n            $redirect && ++$redirect->{http};\n        }\n        else{\n            ++$domain_stats{unknown};\n            $redirect && ++$redirect->{unknown};\n        }\n    }\n\n    my %aggs;\n    if(my $hosts = delete $domain_stats{redirects}){\n        $domain_stats{redirects} = sum values(%$hosts);\n        $domain_stats{redirect_hosts} = encode_json($hosts);\n\n        while(my ($host, $agg) = each %redirects){\n            null_to_int($agg);\n            $aggs{$host} = $agg;\n        }\n    }\n\n    null_to_int(\\%domain_stats);\n    $aggs{$domain} = \\%domain_stats;\n\n    return \\%aggs;\n}\n\nsub null_to_int {\n    my $h = 
shift;\n    $h->{$_} += 0 for @CONVERT_TO_INT;\n}\n\nsub urls_by_path {\n    my ($urls, $rr, $url_limit) = @_;\n\n    my %links;\n    for my $url (@$urls){\n        eval {\n            my @segs = URI->new($url)->path_segments;\n            push @{$links{$segs[1]}}, $url;\n        };\n    }\n\n    my @sorted_paths = sort {@{$links{$b}} <=> @{$links{$a}}} keys %links;\n\n    my @urls_by_path;\n\n    my $paths = each_arrayref @links{@sorted_paths};\n    CLICK_GROUP: while(my @urls = $paths->()){\n        for my $url (@urls){\n            next unless $url;\n            last CLICK_GROUP unless @urls_by_path < $url_limit;\n            next unless $rr->allowed($url);\n            push @urls_by_path, $url;\n        }\n    }\n\n    @$urls = @urls_by_path;\n}\n\n\nsub dupe_link {\n    my ($url, $urls) = @_;\n\n    $url =~ s{^https:}{http:}i;\n\n    for (@$urls){\n        my $u = $_ =~ s{^https:}{http:}ir;\n        return 1 if URI::eq($u, $url);\n    }\n\n    0;\n}\n\n1;\n"
  },
  {
    "path": "config.yml.example",
    "content": "---\n\n# Top-level temp directory will be created on start and removed\n# on exit.  Each crawler will have its own subdirectory with\n# PID appended\nTMP_DIR: /tmp/smarter_encryption\nCRAWLER_TMP_PREFIX: crawler_\n\n# User agent. Will use defaults if not specified\n#UA: \nVERBOSE: 1\n\n# Paths to system binaries.  If in path already, just the program\n# name should suffice.\nCOMPARE: /usr/local/bin/compare \nPKILL: /usr/bin/pkill\n\n# Database connection options.  If not specified will connect as\n# the current user.\n#DB:\n#HOST:\n#PORT:\n#USER:\n#PASS:\n\n# Number of concurrent crawlers per cpu.\nCRAWLERS_PER_CPU: 3\n# or exact number\n# MAX_CONCURRENT_CRAWLERS: 10\n\n# Path to phantomjs.  Should be v2.1.1\nPHANTOMJS: phantomjs\n\n# Path to modified netsniff.js\nNETSNIFF_SS: netsniff_screenshot.js\n\n# Timeout before killing phantomjs in seconds\nHEADLESS_ALARM: 30\n\n# Whether to continue running and polling the queue or exit when finished.\n# If specified and non-zero, it is the number of seconds to wait in\n# between polls.\nPOLL: 60\n\n# Number of sites a crawler should process before exiting\nSITES_PER_CRAWLER: 10\n\n# Desired number of URLs to check for each site \nURLS_PER_SITE: 10\n\n# Max percentage of URLS_PER_SITE included from the current home page\nHOMEPAGE_LINK_PCT: 0.5\n\n# Number of times to re-request HTTPs URL on failure\nHTTPS_RETRIES: 1\n\n# If SCREENSHOT_RETRIES is not 0, the comparison between HTTP and HTTPs\n# pages will be re-run if the diff is above SCREENSHOT_THRESHOLD.  It\n# will also introduce a delay before taking the screenshot to potentially\n# overcome slight network differences between the two. The delay will\n# remain in effect for links still to be processed for the site.\nSCREENSHOT_RETRIES: 1\nSCREENSHOT_THRESHOLD: 0.05\nPHANTOM_RENDER_DELAY: 1000\n"
  },
  {
    "path": "cpanfile",
    "content": "requires 'Cpanel::JSON::XS', '2.3310';\nrequires 'DBI', '1.631';\nrequires 'Domain::PublicSuffix', '0.10';\nrequires 'Exporter::Shiny', '0.038';\nrequires 'Exporter::Tiny', '0.038';\nrequires 'File::Copy::Recursive', '0.38';\nrequires 'IO::Socket::SSL', '2.060';\nrequires 'IO::Socket::SSL::Utils', '2.014';\nrequires 'IPC::Run', '0.92';\nrequires 'IPC::Run::Timer', '0.90';\nrequires 'LWP', '6.05';\nrequires 'List::AllUtils', '0.07';\nrequires 'List::Util', '1.52';\nrequires 'POE', '1.358';\nrequires 'POE::XS::Loop::Poll', '1.000';\nrequires 'URI', '1.71';\nrequires 'URI::Escape', '3.31';\nrequires 'WWW::Mechanize', '1.73';\nrequires 'WWW::RobotRules', '6.02';\nrequires 'YAML::XS', '0.41';\n"
  },
  {
    "path": "https_crawl.pl",
    "content": "#!/usr/bin/env perl\n\nuse LWP::UserAgent;\nuse WWW::Mechanize;\nuse POE::Kernel { loop => 'POE::XS::Loop::Poll' };\nuse POE qw(Wheel::Run Filter::Reference);\nuse DBI;\nuse Sys::Hostname 'hostname';\nuse Cpanel::JSON::XS qw'decode_json encode_json';\nuse URI;\nuse File::Copy::Recursive qw'pathmk pathrmdir';\nuse WWW::RobotRules;\nuse IPC::Run;\nuse YAML::XS 'LoadFile';\nuse List::AllUtils 'each_arrayref';\nuse SmarterEncryption::Crawl qw'\n    aggregate_crawl_session\n    check_ssl_cert\n    dupe_link\n    urls_by_path\n';\nuse Module::Load::Conditional 'can_load';\n\nuse feature 'state';\nuse strict;\nuse warnings;\nno warnings 'uninitialized';\n\nmy $DDG_INTERNAL;\nif(can_load(modules => {\n    'DDG::Util::HTTPS2' => undef,\n    'DDG::Util::Crawl' => undef\n})){\n    DDG::Util::HTTPS2->import(qw'add_stat backfill_urls');\n    DDG::Util::Crawl->import(qw'get_http_msg_sig_hdrs');\n    $DDG_INTERNAL = 1;\n}\n\nmy $HOST = hostname();\n\n# Crawler Config\nmy %CC;\n\n# Derived config values\nmy ($MAX_CONCURRENT_CRAWLERS, $PHANTOM_TIMEOUT, $HOMEPAGE_LINKS_MAX); \n\nPOE::Session->create(\n    inline_states => {\n        _start         => \\&_start,\n        _stop          => \\&normal_cleanup,\n        crawl          => \\&start_crawlers,\n        crawler_done   => \\&crawler_done,\n        crawler_debug  => \\&crawler_debug,\n        sig_child      => \\&sig_child,\n        shutdown       => \\&shutdown_now,\n        prune_tmp_dirs => \\&prune_tmp_dirs\n    }\n);\n\nPOE::Kernel->run;\nexit;\n\nsub _start {\n    my ($k, $h) = @_[KERNEL, HEAP];\n\n    parse_argv();\n\n    unless($MAX_CONCURRENT_CRAWLERS){\n        $MAX_CONCURRENT_CRAWLERS = `nproc` * $CC{CRAWLERS_PER_CPU};\n    }\n\n    $PHANTOM_TIMEOUT = $CC{HEADLESS_ALARM} * 1000; # in ms\n    $HOMEPAGE_LINKS_MAX = sprintf '%d', $CC{HOMEPAGE_LINK_PCT} * $CC{URLS_PER_SITE};\n\n    my $TMP_DIR = $CC{TMP_DIR};\n    unless(-d $TMP_DIR){\n        $CC{VERBOSE} && warn \"Creating temp dir $TMP_DIR\\n\";\n      
  pathmk($TMP_DIR) or die \"Failed to create tmp dir $TMP_DIR: $!\";\n    }\n\n    # clean up leftover junk for forced shutdown\n    while(<$TMP_DIR/$CC{CRAWLER_TMP_PREFIX}*>){\n        chomp;\n        pathrmdir($_) or warn \"Failed to remove old crawler tmp dir $_: $!\";\n    }\n\n    $k->sig($_ => 'shutdown') for qw{TERM INT};\n\n    $k->yield('crawl');\n}\n\nsub shutdown_now {\n    $_[KERNEL]->sig_handled;\n\n    # Kill crawlers\n    $_->kill() for values %{$_[HEAP]->{crawlers}};\n\n    # Make unfinished tasks available in the queue\n    my $db = prep_db('queue');\n    $db->{reset_unfinished_tasks}->execute;\n\n    normal_cleanup();\n\n    exit 1;\n}\n\nsub normal_cleanup {\n    # remove tmp dir\n    pathrmdir($CC{TMP_DIR}) if -d $CC{TMP_DIR};\n}\n\nsub start_crawlers{\n    my ($k, $h) = @_[KERNEL, HEAP];\n\n    my $db = prep_db('queue');\n\n    my $reserve_tasks = $db->{reserve_tasks};\n    while(keys %{$h->{crawlers}} < $MAX_CONCURRENT_CRAWLERS){\n\n        $reserve_tasks->execute();\n        if(my @ranks = sort map { $_->[0] } @{$reserve_tasks->fetchall_arrayref}){\n\n            my $c = POE::Wheel::Run->new(\n                Program      => \\&crawl_sites,\n                ProgramArgs  => [\\@ranks],\n                CloseOnCall  => 1,\n                NoSetSid     => 1,\n                StderrEvent  => 'crawler_debug',\n                CloseEvent   => 'crawler_done',\n                StdinFilter  => POE::Filter::Reference->new,\n                StderrFilter => POE::Filter::Line->new\n            );\n            $h->{crawlers}{$c->ID} = $c;\n            $k->sig_child($c->PID, 'sig_child');\n        }\n        else{\n            $CC{POLL} && $k->delay(crawl => $CC{POLL});\n            last;\n        }\n    }\n}\n\nsub crawl_sites{\n    my ($ranks) = @_;\n\n    my $VERBOSE = $CC{VERBOSE};\n    my $db = prep_db('crawl');\n\n    my $crawler_tmp_dir = \"$CC{TMP_DIR}/$CC{CRAWLER_TMP_PREFIX}$$\";\n    my $rm_tmp = pathmk($crawler_tmp_dir);\n\n    my 
@urls_by_domain;\n    for(my $i = 0;$i < @$ranks;++$i){\n        my $rank = $ranks->[$i];\n\n        my $domain;\n        eval {\n            $db->{start_task}->execute($$, $rank);\n            $domain = $db->{start_task}->fetchall_arrayref->[0][0];\n        }\n        or do {\n            warn \"Failed to start task for rank $rank: $@\";\n            next;\n        };\n\n        eval {\n            $domain = URI->new(\"https://$domain/\")->host;\n            1;\n        }\n        or do {\n            warn \"Failed to filter domain $domain: $@\";\n            next;\n        };\n\n        $VERBOSE && warn \"checking domain $domain\\n\";\n        my $urls = get_urls_for_domain($domain, $db);\n        my @pairs;\n        for my $url (@$urls){\n            push @pairs, [$domain, $url];\n        }\n        push @urls_by_domain, \\@pairs if @pairs;\n    }\n\n    my $ranks_str = '{' . join(',', @$ranks) . '}';\n\n    my $ea = each_arrayref @urls_by_domain;\n\n    my (%ssl_cert_checked, %domain_render_delay, %sessions);\n    while(my @urls = $ea->()){\n        for my $u (@urls){\n            next unless $u;\n            my ($domain, $url) = @$u;\n            next unless $url =~ /^http/i;\n\n            # for the command-line\n            $url =~ s/'/%27/g;\n\n            my ($http_url) = $url =~ s/^https:/http:/ri;\n            my ($https_url) = $url =~ s/^http:/https:/ri;\n\n            my $http_ss = $crawler_tmp_dir . '/http.' . $domain . '.png';\n\n            unless($ssl_cert_checked{$domain}){\n                my $ssl = check_ssl_cert($domain);\n                eval {\n                    $db->{insert_ssl}->execute($domain, @$ssl);\n                    ++$ssl_cert_checked{$domain};\n                }\n                or do {\n                    warn \"Failed to insert ssl info for $domain: $@\";\n                };\n            }\n\n            my %comparison;\n            # We will compare a URL twice max:\n            # 1. Compare HTTP vs. HTTPS\n            # 2. 
Redo if the screenshot diff is above the threshold to check for rendering problems\n            SCREENSHOT_RETRY: for (0..$CC{SCREENSHOT_RETRIES}){\n                my $redo_comparison = 0;\n\n                my %stats = (domain => $domain);\n                check_site(\\%stats, $http_url, $http_ss, $domain_render_delay{$domain}, $crawler_tmp_dir);\n                # the idea behind screenshots is:\n                # 1. Do for HTTP automatically so we don't have to make another request if it works\n                # 2. Do for HTTPS if HTTP worked and wasn't autoupgraded\n                # 3. If HTTPS worked and didn't downgrade, compare them\n                my $https_ss;\n                if( (-e $http_ss) && ($stats{http_request_uri} =~ /^http:/i) && ($stats{http_response} == 200)){\n                    $https_ss = $crawler_tmp_dir . '/https.' . $domain . '.png';\n                }\n\n                HTTPS_RETRY: for my $https_attempt (0..$CC{HTTPS_RETRIES}){\n                    my $redo_https;\n                    check_site(\\%stats, $https_url, $https_ss, $domain_render_delay{$domain}, $crawler_tmp_dir);\n                    if( ($stats{https_request_uri} =~ /^https:/i) && ($stats{https_response} == 200)){\n                        if($https_ss && (-e $https_ss)){\n                            my $out = `$CC{COMPARE} -metric mae $http_ss $https_ss /dev/null 2>&1`;\n\n                            if(my ($diff) = $out =~ /\\(([\\d\\.e\\-]+)\\)/){\n                                if($CC{SCREENSHOT_THRESHOLD} < $diff){\n                                    # Only need to redo on the first failure. 
After that, the delay\n                                    # will have already been increased by a previous URL\n                                    unless($domain_render_delay{$domain} == $CC{PHANTOM_RENDER_DELAY}){\n                                        $domain_render_delay{$domain} = $CC{PHANTOM_RENDER_DELAY};\n                                        $redo_comparison = 1;\n                                        $VERBOSE && warn \"redoing $http_url (diff: $diff)\\n\";\n                                    }\n                                }\n                                $stats{ss_diff} = $diff;\n                            }\n                            else{\n                                warn \"Failed to extract compare diff between $http_ss and $https_ss from $out\\n\";\n                            }\n                            unlink $_ for $http_ss, $https_ss;\n                        }\n\n                        if($DDG_INTERNAL && $https_attempt){\n                            add_stat(qw'increment smarter_encryption.crawl.https_retries.success');\n                        }\n                    }\n                    elsif($DDG_INTERNAL && $https_attempt){\n                        add_stat(qw'increment smarter_encryption.crawl.https_retries.failure');\n                    }\n                    elsif( ($stats{https_request_uri} !~ /^http:/) && ($stats{http_response} != $stats{https_response})){\n                        $redo_https = 1;\n                        $VERBOSE && warn \"Redoing HTTPS request for $domain: $https_url\\n\";\n                    }\n\n                    last HTTPS_RETRY unless $redo_https;\n                }\n\n                # Most should exit here\n                unless($redo_comparison){\n                    %comparison = %stats;\n                    last;\n                }\n            }\n\n            unless($db->{con}->ping){\n                $VERBOSE && warn \"Reconnecting to DB before inserting comparison\";\n      
          $db = prep_db('crawl');\n            }\n\n            if(my $host = eval { URI->new($comparison{https_request_uri})->host}){\n                unless($ssl_cert_checked{$host}){\n                    my $ssl = check_ssl_cert($host);\n                    eval {\n                        $db->{insert_ssl}->execute($host, @$ssl);\n                        ++$ssl_cert_checked{$host};\n                    }\n                    or do {\n                        warn \"Failed to insert ssl info for $host: $@\";\n                    };\n                }\n            }\n\n            if($comparison{http_request_uri} || $comparison{https_request_uri}){\n                my $log_id;\n                eval {\n                    $db->{insert_domain}->execute(@comparison{qw'\n                        domain\n                        http_request_uri\n                        http_response\n                        http_requests\n                        http_size\n                        https_request_uri\n                        https_response\n                        https_requests\n                        https_size\n                        autoupgrade\n                        mixed\n                        ss_diff'}\n                    );\n                    $log_id = $db->{insert_domain}->fetch()->[0];\n                }\n                or do {\n                   $VERBOSE && warn \"Failed to insert request for $domain: $@\";\n                };\n\n                if($log_id){\n                    if(my $hdrs = delete $comparison{https_response_headers}){\n                        eval {\n                            $db->{insert_headers}->execute($log_id, $hdrs);\n                        }\n                        or do {\n                            $VERBOSE && warn \"Failed to insert response headers for $domain ($log_id): $@\";\n                        };\n                    }\n\n                    if(my $mixed_reqs = delete $comparison{mixed_children}){\n            
            for my $m (keys %$mixed_reqs){\n                            eval{\n                                $db->{insert_mixed}->execute($log_id, $m);\n                                1;\n                            }\n                            or do {\n                                $VERBOSE && warn \"Failed to insert mixed request for $domain: $@\";\n                            };\n                        }\n                    }\n                    $comparison{id} = $log_id;\n                    push @{$sessions{$domain}}, \\%comparison;\n                }\n            }\n        }\n    }\n\n    unless($db->{con}->ping){\n        $VERBOSE && warn \"Reconnecting to DB before updating aggregate data\";\n        $db = prep_db('crawl');\n    }\n\n    while(my ($domain, $session) = each %sessions){\n        my $aggregates = aggregate_crawl_session($domain, $session);\n        while(my ($host, $agg) = each %$aggregates){\n            eval {\n                $db->{upsert_aggregate}->execute(\n                    $host, @$agg{qw'\n                        https\n                        http_s\n                        https_errs\n                        http\n                        unknown\n                        autoupgrade\n                        mixed_requests\n                        max_ss_diff\n                        redirects\n                        max_id\n                        requests\n                        is_redirect\n                        redirect_hosts'\n                    }\n                );\n                1;\n            }\n            or do {\n                warn \"Failed to upsert aggregate for $host: $@\";\n            };\n        }\n    }\n\n    eval {\n        $db->{finish_tasks}->execute($ranks_str);\n        1;\n    }\n    or do {\n        warn \"Failed to finish tasks for ranks ($ranks_str): $@\";\n    };\n\n    system \"$CC{PKILL} -9 -f '$crawler_tmp_dir '\";\n    pathrmdir($crawler_tmp_dir) if $rm_tmp;\n}\n\nsub prep_db 
{\n    my $target = shift;\n\n    my %db;\n\n    my $con = get_con();\n\n    if($target eq 'queue'){\n        $db{reserve_tasks} = $con->prepare(\"\n            update https_queue\n                set processing_host = '$HOST',\n                    reserved = now()\n            where rank in (\n                select rank from https_queue\n                    where processing_host is null\n                    order by rank\n                    limit $CC{SITES_PER_CRAWLER}\n                    for update skip locked\n            )\n            returning rank\n        \");\n        $db{reset_unfinished_tasks} = $con->prepare(\"\n            update https_queue\n                set processing_host = null,\n                worker_pid = null,\n                reserved = null,\n                started = null\n            where\n                processing_host = '$HOST' and\n                finished is null\n        \");\n        $db{complete_unfinished_worker_tasks} = $con->prepare(\"\n            update https_queue\n                set finished = now(),\n                processing_host = '$HOST (incomplete)'\n            where\n                processing_host = '$HOST' and\n                finished is null and\n                worker_pid = ?\n        \");\n    }\n    elsif($target eq 'crawl'){\n        $db{start_task} = $con->prepare('update https_queue set worker_pid = ?, started = now() where rank = ? returning domain');\n        $db{select_urls} = $con->prepare('select url from full_urls where host = ?');\n        $db{insert_domain} = $con->prepare('\n            insert into https_crawl\n              (domain, http_request_uri, http_response, http_requests, http_size, https_request_uri, https_response, https_requests, https_size, autoupgrade, mixed, screenshot_diff)\n              values (?,?,?,?,?,?,?,?,?,?,?,?) 
returning id\n        ');\n        $db{insert_mixed} = $con->prepare('insert into mixed_assets (https_crawl_id, asset) values (?,?)');\n        $db{insert_headers} = $con->prepare('insert into https_response_headers (https_crawl_id, response_headers) values (?,?)');\n        $db{finish_tasks} = $con->prepare('update https_queue set finished = now() where rank = ANY(?::integer[])');\n        $db{insert_ssl} = $con->prepare('\n            insert into ssl_cert_info (domain, issuer, notBefore, notAfter, host_valid, err) values (?,?,?,?,?,?)\n            on conflict (domain) do update set\n            issuer = EXCLUDED.issuer,\n            notBefore = EXCLUDED.notBefore,\n            notAfter = EXCLUDED.notAfter,\n            host_valid = EXCLUDED.host_valid,\n            err = EXCLUDED.err,\n            updated = now()\n        ');\n        # Note where clause:\n        # 1. Non-redirects update any, including changing a redirect to a non-redirect\n        # 2. Redirects update other redirects\n        $db{upsert_aggregate} = $con->prepare(\"\n            insert into https_crawl_aggregate (\n                domain,\n                https,\n                http_and_https,\n                https_errs, http,\n                unknown,\n                autoupgrade,\n                mixed_requests,\n                max_screenshot_diff,\n                redirects,\n                max_https_crawl_id,\n                requests,\n                is_redirect,\n                redirect_hosts,\n                session_request_limit)\n                values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,$CC{URLS_PER_SITE})\n            on conflict (domain) do update set (\n                https,\n                http_and_https,\n                https_errs,\n                http,\n                unknown,\n                autoupgrade,\n                mixed_requests,\n                max_screenshot_diff,\n                redirects,\n                max_https_crawl_id,\n                requests,\n     
           is_redirect,\n                redirect_hosts,\n                session_request_limit\n            ) = (\n                EXCLUDED.https,\n                EXCLUDED.http_and_https,\n                EXCLUDED.https_errs,\n                EXCLUDED.http,\n                EXCLUDED.unknown,\n                EXCLUDED.autoupgrade,\n                EXCLUDED.mixed_requests,\n                EXCLUDED.max_screenshot_diff,\n                EXCLUDED.redirects,\n                EXCLUDED.max_https_crawl_id,\n                EXCLUDED.requests,\n                EXCLUDED.is_redirect,\n                EXCLUDED.redirect_hosts,\n                EXCLUDED.session_request_limit)\n            where\n                EXCLUDED.is_redirect = false or\n                https_crawl_aggregate.is_redirect = true\n        \");\n    }\n\n    $db{con} = $con;\n    return \\%db;\n}\n\n# Strategy behind url selection:\n# 1. Fill queue with homepage and click urls sort by top-level path\n#    prevalence\n# 2. If necessary, get backfill_urls\nsub get_urls_for_domain {\n    my ($domain, $db) = @_;\n\n    state $rr = WWW::RobotRules->new($CC{UA});\n    state $mech = get_ua('mech');\n    state $VERBOSE = $CC{VERBOSE};\n\n    # Get latest robot rules for domain\n    my $res = $mech->get(\"http://$domain/robots.txt\");\n    if($res->is_success){\n        # the uri may be different than what we requested\n        my @doms = ($domain);\n        my $uri = $res->request->uri;\n        if(my $host = eval { URI->new($uri)->host }){\n            push @doms, $host if $host ne $domain;\n        }\n        my $robots_txt = $res->decoded_content;\n\n        # Add the rules for the:\n        # 1. The domain and redirect host if different\n        # 2. 
HTTP/HTTPS for each\n        # yes, http and https could be different\n        for my $d (@doms){\n            for my $p (qw(http https)){\n                $rr->parse(\"$p://$d/\", $robots_txt);\n            }\n        }\n    }\n\n    my @urls;\n    my $homepage = 'http://' . $domain . '/';\n\n    $res = $mech->get($homepage);\n\n    if($res->is_success){\n        # the uri may be different than what we requested\n        my $uri = $res->request->uri;\n        if(my $host = eval { URI->new($uri)->host }){\n            # all links with the same host\n            my @homepage_links;\n            if(my $l = $mech->find_all_links(url_abs_regex => qr{//\\Q$host\\E/})){\n                @homepage_links = @$l;\n            }\n\n            for my $l (@homepage_links){\n                my $abs_url = $l->url_abs;\n                $abs_url = \"$abs_url\";\n                next if dupe_link($abs_url, \\@urls);\n                push @urls, $abs_url;\n            }\n        }\n    }\n    else {\n        $VERBOSE && warn \"Failed to get homepage links for $domain: \" . 
$res->status_line;\n    }\n\n    eval {\n        my $select_urls = $db->{select_urls};\n        $select_urls->execute($domain);\n        while(my $r = $select_urls->fetchrow_arrayref){\n            my $url = $r->[0];\n            next if dupe_link($url, \\@urls);\n            push @urls, $url;\n        }\n        1;\n    }\n    or do {\n        $VERBOSE && warn \"Failed to get click urls for $domain: $@\";\n    };\n\n    state $URLS_PER_SITE = $CC{URLS_PER_SITE};\n\n    urls_by_path(\\@urls, $rr, $URLS_PER_SITE);\n\n    if($DDG_INTERNAL && (@urls < $URLS_PER_SITE)){\n        backfill_urls($domain, \\@urls, $rr, $db, $mech, $URLS_PER_SITE, $VERBOSE);\n    }\n\n    # Add home by default since it often behaves differently\n    unless(dupe_link($homepage, \\@urls)){\n        if(@urls < $URLS_PER_SITE){\n            push @urls, $homepage;\n        }\n        else{\n            splice(@urls, -1, 1, $homepage);\n        }\n    }\n\n    return \\@urls;\n}\n\nsub prune_tmp_dirs {\n    my $h = $_[HEAP];\n\n    return unless exists $h->{crawler_tmp_dirs};\n\n    my ($TMP_DIR, $CRAWLER_TMP_PREFIX) = @CC{qw'TMP_DIR CRAWLER_TMP_PREFIX'};\n    for my $pid (keys %{$h->{crawler_tmp_dirs}}){\n        my $crawler_tmp_dir = \"$TMP_DIR/$CRAWLER_TMP_PREFIX$pid\";\n        if(-d $crawler_tmp_dir){\n            next unless pathrmdir($crawler_tmp_dir);\n        }\n        delete $h->{crawler_tmp_dirs}{$pid};\n    }\n}\n\nsub check_site {\n    my ($stats, $site, $screenshot, $delay, $crawler_tmp_dir) = @_;\n\n    if(my ($request_scheme) = $site =~ /^(https?):/i){\n        $request_scheme = lc $request_scheme;\n\n        eval{\n            @ENV{qw(PHANTOM_RENDER_DELAY PHANTOM_UA PHANTOM_TIMEOUT)} =\n                ($delay, \"'$CC{UA}'\", $PHANTOM_TIMEOUT);\n\n            # Build custom headers if HTTP message signatures are enabled\n            if($DDG_INTERNAL && $CC{ENABLE_HTTP_MESSAGE_SIGNATURES}){\n                # Clear any previous custom headers first\n                delete 
$ENV{CUSTOM_HEADERS};\n\n                my $sig_headers = get_http_msg_sig_hdrs('GET', $site);\n                if($sig_headers && %$sig_headers){\n                    $ENV{CUSTOM_HEADERS} = encode_json($sig_headers);\n                }\n            }\n\n            my $out;\n            my @cmd = (\n                $CC{PHANTOMJS},\n                \"--local-storage-path=$crawler_tmp_dir\", \"--offline-storage-path=$crawler_tmp_dir\",\n                $CC{NETSNIFF_SS}, $site);\n            push @cmd, $screenshot if $screenshot;\n\n            IPC::Run::run \\@cmd,  \\undef, \\$out,\n                IPC::Run::timeout($CC{HEADLESS_ALARM}, exception => \"$site timed out after $CC{HEADLESS_ALARM} seconds\");\n            die \"PHANTOMJS $out\" if $out =~ /^FAIL/;\n\n            # Can have error messages at the end so have to extract the json\n            my ($j) = $out =~ /^(\\{\\s+\"log\".+\\})/ms;\n            my $m = decode_json($j)->{log};\n\n            my ($main_request_scheme, $check_mixed);\n            for my $e (@{$m->{entries}}){\n                my $response_status = $e->{response}{status};\n                # netsniff records the redirects to https for some sites\n                next if $response_status =~ /^3/;\n                my $url = $e->{request}{url};\n                next unless my ($scheme) = $url =~ /^(https?):/i;\n                $scheme = lc $scheme;\n\n                if($check_mixed && ($scheme eq 'http')){\n                    # Absolute links.  
Even if the same host as parent, browsers will mark\n                    # this as mixed and the extension can't upgrade them\n                    $stats->{mixed_children}{$url} = 1;\n                }\n\n                unless($main_request_scheme){\n                    $stats->{\"${request_scheme}_request_uri\"} = $url;\n                    $stats->{\"${request_scheme}_response\"} = $response_status;\n                    if($request_scheme eq 'http'){\n                        $stats->{autoupgrade} = $scheme eq 'https' ? 1 : 0;\n                    }\n                    elsif($scheme eq 'https'){\n                        $check_mixed = lc URI->new($url)->host;\n                        my $hdrs = delete $e->{response}{headers};\n                        my %response_headers;\n                        # We don't want to store an array of one-key hashes.\n                        for my $h (@$hdrs){\n                            my ($name, $value) = @$h{qw(name value)};\n                            if(exists $response_headers{$name}){\n                                # https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2\n                                $response_headers{$name} .= \",$value\";\n                            }\n                            else{\n                                $response_headers{$name} = $value;\n                            }\n                        }\n                        $stats->{https_response_headers} = encode_json(\\%response_headers);\n                    }\n                    $main_request_scheme = $scheme;\n                }\n\n                $stats->{\"${request_scheme}_size\"} += $e->{response}{bodySize};\n                ++$stats->{\"${request_scheme}_requests\"};\n\n            }\n\n            if($check_mixed){\n                $stats->{mixed} = exists $stats->{mixed_children} ? 
1 : 0;\n            }\n            1;\n        }\n        or do {\n            warn \"check_site error: $@ ($site)\";\n            system \"$CC{PKILL} -9 -f '$crawler_tmp_dir '\" if $crawler_tmp_dir =~ /\\S/;\n        };\n    }\n}\n\nsub crawler_done{\n    my ($k, $h, $id) = @_[KERNEL, HEAP, ARG0];\n\n    state $VERBOSE = $CC{VERBOSE};\n    $VERBOSE && warn \"deleting crawler $id\\n\";\n    my $c = delete $h->{crawlers}{$id};\n\n    # see if any of its domains were left unfinished\n    my $pid = $c->PID;\n    eval {\n        my $db = prep_db('queue');\n        my $unfinished = $db->{complete_unfinished_worker_tasks}->execute($pid);\n        if($unfinished > 0){\n            $VERBOSE && warn \"Marked $unfinished tasks incomplete for crawler with pid $pid\\n\";\n        }\n        1;\n    }\n    or do {\n        warn \"Failed to verify worker tasks: $@\";\n    };\n\n    # Check and clean up tmp dirs for hung crawlers\n    $h->{crawler_tmp_dirs}{$pid} = 1;\n    $k->yield('prune_tmp_dirs');\n\n    $k->yield('crawl');\n}\n\nsub crawler_debug{\n    my $msg = $_[ARG0];\n\n    $CC{VERBOSE} && warn 'crawler debug: ' . $msg. \"\\n\";\n}\n\nsub sig_child {\n    warn 'Got signal from pid ' . $_[ARG1] . ', exit status: ' . 
$_[ARG2] if $_[ARG2];\n    $_[KERNEL]->sig_handled;\n}\n\nsub get_ua {\n    my $type = shift;\n\n    my $ua = $type eq 'mech' ?\n        WWW::Mechanize->new(\n            onerror => undef, # We'll check these ourselves so we don't have to catch die in eval\n            quiet => 1\n        ) \n        :\n        LWP::UserAgent->new();\n\n    $ua->agent($CC{UA});\n    $ua->timeout(10);\n    return $ua;\n}\n\nsub get_con {\n\n    $ENV{PGDATABASE} = $CC{DB}   if exists $CC{DB};\n    $ENV{PGHOST}     = $CC{HOST} if exists $CC{HOST};\n    $ENV{PGPORT}     = $CC{PORT} if exists $CC{PORT};\n    $ENV{PGUSER}     = $CC{USER} if exists $CC{USER};\n    $ENV{PGPASSWORD} = $CC{PASS} if exists $CC{PASS};\n\n    return DBI->connect('dbi:Pg:', '', '', {\n        RaiseError => 1,\n        PrintError => 0,\n        AutoCommit => 1,\n    });\n}\n\nsub parse_argv {\n    my $usage = <<ENDOFUSAGE;\n\n     *********************************************************************\n       USAGE: https_crawl.pl -c /path/to/config.yml [-h]\n\n       -c: Path to YAML config file\n       -h: Print this help\n\n    ***********************************************************************\n\nENDOFUSAGE\n\n    my $config_file_specified;\n    for(my $i = 0;$i < @ARGV;$i++) {\n        if($ARGV[$i] =~ /^-c$/i ){\n            %CC = %{LoadFile($ARGV[++$i])};\n            $config_file_specified = 1;\n        }\n        elsif($ARGV[$i] =~ /^-h$/i ){ die \"$usage\\n\" }\n    }\n\n    die \"Config file required\\n\\n$usage\\n\" unless $config_file_specified;\n}\n"
  },
  {
    "path": "netsniff_screenshot.js",
    "content": "// Copyright 2010 Ariya Hidayat\n// \n// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:\n//\n// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.\n//\n// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.\n//\n// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.\n//\n// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n\n\"use strict\";\nif (!Date.prototype.toISOString) {\n    Date.prototype.toISOString = function () {\n        function pad(n) { return n < 10 ? '0' + n : n; }\n        function ms(n) { return n < 10 ? '00'+ n : n < 100 ? 
'0' + n : n }\n        return this.getFullYear() + '-' +\n            pad(this.getMonth() + 1) + '-' +\n            pad(this.getDate()) + 'T' +\n            pad(this.getHours()) + ':' +\n            pad(this.getMinutes()) + ':' +\n            pad(this.getSeconds()) + '.' +\n            ms(this.getMilliseconds()) + 'Z';\n    }\n}\n\nfunction createHAR(address, title, startTime, resources)\n{\n    var entries = [];\n\n    resources.forEach(function (resource) {\n        var request = resource.request,\n            startReply = resource.startReply,\n            endReply = resource.endReply;\n\n        if (!request || !startReply || !endReply) {\n            return;\n        }\n\n        // Exclude Data URI from HAR file because\n        // they aren't included in specification\n        if (request.url.match(/(^data:image\\/.*)/i)) {\n            return;\n    }\n\n        entries.push({\n            startedDateTime: request.time.toISOString(),\n            time: endReply.time - request.time,\n            request: {\n                method: request.method,\n                url: request.url,\n                httpVersion: \"HTTP/1.1\",\n                cookies: [],\n                headers: request.headers,\n                queryString: [],\n                headersSize: -1,\n                bodySize: -1\n            },\n            response: {\n                status: endReply.status,\n                statusText: endReply.statusText,\n                httpVersion: \"HTTP/1.1\",\n                cookies: [],\n                headers: endReply.headers,\n                redirectURL: \"\",\n                headersSize: -1,\n                bodySize: startReply.bodySize,\n                content: {\n                    size: startReply.bodySize,\n                    mimeType: endReply.contentType\n                }\n            },\n            cache: {},\n            timings: {\n                blocked: 0,\n                dns: -1,\n                connect: -1,\n                
send: 0,\n                wait: startReply.time - request.time,\n                receive: endReply.time - startReply.time,\n                ssl: -1\n            },\n            pageref: address\n        });\n    });\n\n    return {\n        log: {\n            version: '1.2',\n            creator: {\n                name: \"PhantomJS\",\n                version: phantom.version.major + '.' + phantom.version.minor +\n                    '.' + phantom.version.patch\n            },\n            pages: [{\n                startedDateTime: startTime.toISOString(),\n                id: address,\n                title: title,\n                pageTimings: {\n                    onLoad: page.endTime - page.startTime\n                }\n            }],\n            entries: entries\n        }\n    };\n}\n\nvar page = require('webpage').create(),\n    system = require('system');\nif(system.env['PHANTOM_UA'] !== undefined){\n    page.settings.userAgent = system.env['PHANTOM_UA'];\n}\nif(system.env['PHANTOM_TIMEOUT'] !== undefined){\n    page.settings.resourceTimeout = system.env['PHANTOM_TIMEOUT'];\n}\nvar renderDelay = 0;\nif(system.env['PHANTOM_RENDER_DELAY'] !== undefined){\n    renderDelay = system.env['PHANTOM_RENDER_DELAY'];\n}\n// Parse and apply custom headers\nvar customHeaders = {};\nif (system.env['CUSTOM_HEADERS'] !== undefined) {\n    try {\n        customHeaders = JSON.parse(system.env['CUSTOM_HEADERS']);\n    } catch (e) {\n        console.error('Failed to parse CUSTOM_HEADERS: ' + e);\n    }\n}\npage.customHeaders = customHeaders;\n\npage.viewportSize = { width: 1024, height: 768 };\npage.clipRect = { top: 0, left: 0, width: 1024, height: 768 };\n\nif (system.args.length === 1) {\n    console.log('Usage: netsniff.js <some URL> <optional: screenshot file name>');\n    phantom.exit(1);\n} else {\n\n    page.address = system.args[1];\n    page.resources = [];\n    var screenshot_file = system.args[2];\n\n    page.onLoadStarted = function () {\n        
page.startTime = new Date();\n    };\n\n    page.onResourceRequested = function (req) {\n        page.resources[req.id] = {\n            request: req,\n            startReply: null,\n            endReply: null\n        };\n    };\n\n    page.onResourceReceived = function (res) {\n        if (res.stage === 'start') {\n            page.resources[res.id].startReply = res;\n        }\n        if (res.stage === 'end') {\n            page.resources[res.id].endReply = res;\n        }\n    };\n\n    page.onResourceError = function(resourceError) {\n        page.reason = resourceError.errorString;\n        page.reason_url = resourceError.url;\n    };\n\n    page.open(page.address, function (status) {\n        var har;\n        if (status !== 'success') {\n            console.log('FAIL to load the address ' + page.reason_url + ': ' + page.reason);\n            phantom.exit(1);\n        } else {\n            window.setTimeout(function () {\n                page.endTime = new Date();\n                page.title = page.evaluate(function () {\n                return document.title;\n                });\n                har = createHAR(page.address, page.title, page.startTime, page.resources);\n                console.log(JSON.stringify(har, undefined, 4));\n                if (typeof screenshot_file !== 'undefined') {\n                    page.render(screenshot_file);\n                }\n                phantom.exit();\n            }, renderDelay);\n        }\n    });\n}\n"
  },
  {
    "path": "sql/domain_exceptions.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: domain_exceptions; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE domain_exceptions (\n    domain text NOT NULL,\n    comment text,\n    updated timestamp with time zone NOT NULL default now()\n);\n\n\n--\n-- Name: domain_exceptions_pkey; Type: CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY domain_exceptions\n    ADD CONSTRAINT domain_exceptions_pkey PRIMARY KEY (domain);\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/full_urls.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: full_urls; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE full_urls (\n    host text NOT NULL,\n    url text NOT NULL,\n    updated timestamp with time zone DEFAULT now() NOT NULL\n);\n\n\n--\n-- Name: full_urls_host_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX full_urls_host_idx ON full_urls USING btree (host);\n\n\n--\n-- Name: full_urls_unique_substrmd5_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE UNIQUE INDEX full_urls_unique_substrmd5_idx ON full_urls USING btree (host, \"left\"(md5(url), 8));\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/https_crawl.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: https_crawl; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE https_crawl (\n    domain text NOT NULL,\n    http_request_uri text,\n    http_response integer,\n    http_requests integer,\n    http_size integer,\n    https_request_uri text,\n    https_response integer,\n    https_requests integer,\n    https_size integer,\n    \"timestamp\" timestamp with time zone DEFAULT now(),\n    screenshot_diff real,\n    id bigint,\n    autoupgrade boolean,\n    mixed boolean\n);\n\n\n--\n-- Name: https_crawl_id_seq; Type: SEQUENCE; Schema: public; Owner: -\n--\n\nCREATE SEQUENCE https_crawl_id_seq\n    START WITH 1\n    INCREMENT BY 1\n    NO MINVALUE\n    NO MAXVALUE\n    CACHE 1;\n\n\n--\n-- Name: https_crawl_id_seq; Type: SEQUENCE OWNED BY; Schema: public; Owner: -\n--\n\nALTER SEQUENCE https_crawl_id_seq OWNED BY https_crawl.id;\n\n\n--\n-- Name: id; Type: DEFAULT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY https_crawl ALTER COLUMN id SET DEFAULT nextval('https_crawl_id_seq'::regclass);\n\n\n--\n-- Name: https_crawl_id_key; Type: CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY https_crawl\n    ADD CONSTRAINT https_crawl_id_key UNIQUE (id);\n\n\n--\n-- Name: https_crawl_domain_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX https_crawl_domain_idx ON https_crawl USING btree (domain);\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/https_crawl_aggregate.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: https_crawl_aggregate; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE https_crawl_aggregate (\n    domain text NOT NULL,\n    https integer DEFAULT 0 NOT NULL,\n    http_and_https integer DEFAULT 0 NOT NULL,\n    https_errs integer DEFAULT 0 NOT NULL,\n    http integer DEFAULT 0 NOT NULL,\n    unknown integer DEFAULT 0 NOT NULL,\n    autoupgrade integer DEFAULT 0 NOT NULL,\n    mixed_requests integer DEFAULT 0 NOT NULL,\n    max_screenshot_diff real DEFAULT 0 NOT NULL,\n    redirects integer DEFAULT 0 NOT NULL,\n    requests integer NOT NULL,\n    session_request_limit integer NOT NULL,\n    is_redirect boolean DEFAULT false NOT NULL,\n    max_https_crawl_id bigint NOT NULL,\n    redirect_hosts jsonb\n);\n\n\n--\n-- Name: https_upgrade_metrics; Type: MATERIALIZED VIEW; Schema: public; Owner: -\n--\n\nCREATE VIEW https_upgrade_metrics AS\n SELECT https_crawl_aggregate.domain,\n    ((https_crawl_aggregate.unknown)::real / (https_crawl_aggregate.requests)::real) AS unknown_pct,\n    ((((https_crawl_aggregate.https + https_crawl_aggregate.http_and_https)))::double precision / (https_crawl_aggregate.requests)::real) AS combined_pct,\n    coalesce(https_crawl_aggregate.https_errs::real/nullif( (https_crawl_aggregate.https + https_crawl_aggregate.http_and_https), 0), 0)::real as https_err_rate,\n    https_crawl_aggregate.max_screenshot_diff,\n    ((https_crawl_aggregate.mixed_requests = 0) OR (https_crawl_aggregate.autoupgrade = https_crawl_aggregate.requests)) AS mixed_ok,\n    
((https_crawl_aggregate.autoupgrade)::double precision / (https_crawl_aggregate.requests)::real) AS autoupgrade_pct\n   FROM https_crawl_aggregate;\n\n\n--\n-- Name: https_crawl_aggregate_pkey; Type: CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY https_crawl_aggregate\n    ADD CONSTRAINT https_crawl_aggregate_pkey PRIMARY KEY (domain);\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/https_queue.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: https_queue; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE https_queue (\n    rank integer NOT NULL,\n    domain character varying(500) NOT NULL,\n    processing_host character varying(50),\n    worker_pid integer,\n    reserved timestamp with time zone,\n    started timestamp with time zone,\n    finished timestamp with time zone,\n    CONSTRAINT domain_is_lowercase CHECK (((domain)::text = lower((domain)::text)))\n);\n\n\n--\n-- Name: https_queue_rank_seq; Type: SEQUENCE; Schema: public; Owner: -\n--\n\nCREATE SEQUENCE https_queue_rank_seq\n    START WITH 1\n    INCREMENT BY 1\n    NO MINVALUE\n    NO MAXVALUE\n    CACHE 1\n    CYCLE;\n\n\n--\n-- Name: https_queue_rank_seq; Type: SEQUENCE OWNED BY; Schema: public; Owner: -\n--\n\nALTER SEQUENCE https_queue_rank_seq OWNED BY https_queue.rank;\n\n\n--\n-- Name: rank; Type: DEFAULT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY https_queue ALTER COLUMN rank SET DEFAULT nextval('https_queue_rank_seq'::regclass);\n\n\n--\n-- Name: https_queue_pkey; Type: CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY https_queue\n    ADD CONSTRAINT https_queue_pkey PRIMARY KEY (rank);\n\n\n--\n-- Name: https_queue_domain_finished_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE UNIQUE INDEX https_queue_domain_finished_idx ON https_queue USING btree (domain, finished);\n\n\n--\n-- Name: https_queue_processing_host_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX https_queue_processing_host_idx ON https_queue USING btree 
(processing_host);\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/https_response_headers.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: https_response_headers; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE https_response_headers (\n    https_crawl_id bigint NOT NULL,\n    response_headers jsonb NOT NULL\n);\n\n\n--\n-- Name: https_response_headers_https_crawl_id_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE UNIQUE INDEX https_response_headers_https_crawl_id_idx ON https_response_headers USING btree (https_crawl_id);\n\n\n--\n-- Name: https_response_headers_response_headers_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX https_response_headers_response_headers_idx ON https_response_headers USING gin (response_headers);\n\n\n--\n-- Name: https_response_headers_https_crawl_id_fkey; Type: FK CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY https_response_headers\n    ADD CONSTRAINT https_response_headers_https_crawl_id_fkey FOREIGN KEY (https_crawl_id) REFERENCES https_crawl(id) ON DELETE CASCADE;\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/mixed_assets.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: mixed_assets; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE mixed_assets (\n    asset text NOT NULL,\n    https_crawl_id bigint NOT NULL\n);\n\n\n--\n-- Name: mixed_assets_unique_substrmd5_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE UNIQUE INDEX mixed_assets_unique_substrmd5_idx ON mixed_assets USING btree (https_crawl_id, \"left\"(md5(asset), 8));\n\n\n--\n-- Name: mixed_assets_https_crawl_id_fkey; Type: FK CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY mixed_assets\n    ADD CONSTRAINT mixed_assets_https_crawl_id_fkey FOREIGN KEY (https_crawl_id) REFERENCES https_crawl(id) ON DELETE CASCADE;\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/ssl_cert_info.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\nSET default_tablespace = '';\n\nSET default_with_oids = false;\n\n--\n-- Name: ssl_cert_info; Type: TABLE; Schema: public; Owner: -\n--\n\nCREATE TABLE ssl_cert_info (\n    domain text NOT NULL,\n    issuer text,\n    notbefore timestamp with time zone,\n    notafter timestamp with time zone,\n    host_valid boolean,\n    err text,\n    updated timestamp with time zone DEFAULT now() NOT NULL\n);\n\n\n--\n-- Name: ssl_cert_info_pkey; Type: CONSTRAINT; Schema: public; Owner: -\n--\n\nALTER TABLE ONLY ssl_cert_info\n    ADD CONSTRAINT ssl_cert_info_pkey PRIMARY KEY (domain);\n\n\n--\n-- Name: ssl_cert_info_host_valid_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX ssl_cert_info_host_valid_idx ON ssl_cert_info USING btree (host_valid);\n\n\n--\n-- Name: ssl_cert_info_issuer_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX ssl_cert_info_issuer_idx ON ssl_cert_info USING btree (issuer);\n\n\n--\n-- Name: ssl_cert_info_notafter_idx; Type: INDEX; Schema: public; Owner: -\n--\n\nCREATE INDEX ssl_cert_info_notafter_idx ON ssl_cert_info USING btree (notafter);\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "sql/upgradeable_domains_func.sql",
    "content": "--\n-- PostgreSQL database dump\n--\n\n-- Dumped from database version 9.5.9\n-- Dumped by pg_dump version 9.5.9\n\nSET statement_timeout = 0;\nSET lock_timeout = 0;\nSET client_encoding = 'UTF8';\nSET standard_conforming_strings = on;\nSET check_function_bodies = false;\nSET client_min_messages = warning;\nSET row_security = off;\n\nSET search_path = public, pg_catalog;\n\n--\n-- Name: upgradeable_domains(real, real, real, real, real); Type: FUNCTION; Schema: public; Owner: - \n--\n\nCREATE OR REPLACE FUNCTION upgradeable_domains(\n    unknown_max real,\n    combined_min real,\n    screenshot_diff_max real,\n    mixed_ok boolean DEFAULT TRUE,\n    autoupgrade_min real DEFAULT 0,\n    ssl_cert_buffer timestamp with time zone DEFAULT now(),\n    exclude_issuers text[] default '{}',\n    max_err_rate real DEFAULT 1)\n    RETURNS TABLE(domain character varying) AS\n$$\n    select domain from https_upgrade_metrics m\n        where\n        (unknown_pct <= unknown_max) and\n        (combined_min <= combined_pct) and\n        (max_screenshot_diff <= screenshot_diff_max) and\n        (upgradeable_domains.mixed_ok = m.mixed_ok) and\n        (autoupgrade_min <= autoupgrade_pct) and\n        (https_err_rate <= max_err_rate)\n    except\n    (\n        select domain from domain_exceptions\n        union\n        select domain from ssl_cert_info\n            where\n            err is not null or\n            host_valid = false or\n            notafter < ssl_cert_buffer or\n            notbefore is null or\n            notafter is null or\n            issuer is null or\n            issuer ~* ANY(exclude_issuers)\n    )\n$$ LANGUAGE sql RETURNS NULL ON NULL INPUT;\n\n\n--\n-- PostgreSQL database dump complete\n--\n\n"
  },
  {
    "path": "third-party.txt",
    "content": "Smarter Encryption includes the following third party software:\n\nSoftware Name: PhantomJS\nVersion: 2.1.1\nLicense: BSD-3-Clause\nModified by DDG: Yes\nLocation: netsniff_screenshot.js\nObtained from: https://github.com/ariya/phantomjs\n"
  }
]