Repository: duckduckgo/smarter-encryption
Branch: master
Commit: 28bd11bf6423
Files: 18
Total size: 68.8 KB

Directory structure:
gitextract__bzzy0aw/
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── SmarterEncryption/
│   └── Crawl.pm
├── config.yml.example
├── cpanfile
├── https_crawl.pl
├── netsniff_screenshot.js
├── sql/
│   ├── domain_exceptions.sql
│   ├── full_urls.sql
│   ├── https_crawl.sql
│   ├── https_crawl_aggregate.sql
│   ├── https_queue.sql
│   ├── https_response_headers.sql
│   ├── mixed_assets.sql
│   ├── ssl_cert_info.sql
│   └── upgradeable_domains_func.sql
└── third-party.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: CONTRIBUTING.md
================================================

# Contributing guidelines

* [Reporting bugs](#reporting-bugs)
* [Development](#development)
  * [New features](#new-features)
  * [Bug fixes](#bug-fixes)
* [Getting Started](#getting-started)
  * [Pre-Requisites](#pre-requisites)
  * [Setup](#setup)
  * [Running the crawler](#running-the-crawler)
  * [Checking the results](#checking-the-results)
  * [Data Model](#data-model)
    * [full_urls](#full_urls)
    * [https_queue](#https_queue)
    * [https_crawl](#https_crawl)
    * [mixed_assets](#mixed_assets)
    * [https_response_headers](#https_response_headers)
    * [ssl_cert_info](#ssl_cert_info)
    * [https_crawl_aggregate](#https_crawl_aggregate)
    * [https_upgrade_metrics](#https_upgrade_metrics)
    * [domain_exceptions](#domain_exceptions)
    * [upgradeable_domains](#upgradeable_domains)

# Reporting bugs

1. First check whether the bug has already been [reported](https://github.com/duckduckgo/smarter-encryption/issues).
2. Create a bug report [issue](https://github.com/duckduckgo/smarter-encryption/issues/new?template=bug_report.md).

# Development

## New features

Right now all new feature development is handled internally.
## Bug fixes

Most bug fixes are handled internally, but we will accept pull requests for bug fixes if you first:

1. Create an issue describing the bug. See [Reporting bugs](CONTRIBUTING.md#reporting-bugs).
2. Get approval from DDG staff before working on it. Since most bug fixes and feature development are handled internally, we want to make sure that your work doesn't conflict with any current projects.

## Getting Started

### Pre-Requisites

- [PostgreSQL](https://www.postgresql.org/) database
- [PhantomJS 2.1.1](https://phantomjs.org/download.html)
- [Perl](https://www.perl.org/get.html)
- [compare](https://imagemagick.org/script/compare.php)
- [pkill](https://en.wikipedia.org/wiki/Pkill)
- Should run on many varieties of Linux/*BSD

### Setup

1. Install the required Perl modules via cpanfile:

   ```sh
   cpanm --installdeps .
   ```

2. Connect to PostgreSQL with psql and create the tables needed by the crawler:

   ```
   \i sql/full_urls.sql
   \i sql/https_crawl.sql
   \i sql/mixed_assets.sql
   etc.
   ```

3. Create a copy of the crawler configuration file:

   ```sh
   cp config.yml.example config.yml
   ```

   Edit the settings as necessary for your system.

4. If you have a source of URLs you would like crawled for a host, they can be added to the [full_urls](#full_urls) table:

   ```sql
   insert into full_urls (host, url) values
       ('duckduckgo.com', 'https://duckduckgo.com/?q=privacy'),
       ...
   ```

   The crawler will attempt to get URLs from the home page even if none are available in this table.

### Running the crawler

1. Add hosts to be crawled to the [https_queue](#https_queue) table:

   ```sql
   insert into https_queue (domain) values ('duckduckgo.com');
   ```

2. The crawler can be run as follows:

   ```sh
   perl -Mlib=/path/to/smarter-encryption https_crawl.pl -c /path/to/config.yml
   ```

### Checking the results

1.
The individual HTTP and HTTPs comparisons for each URL crawled are stored in [https_crawl](#https_crawl):

   ```sql
   select * from https_crawl where domain = 'duckduckgo.com' order by id desc limit 10;
   ```

   The maximum number of URLs per crawl session, i.e. the `limit` above, is determined by [URLS_PER_SITE](config.yml.example#L49).

2. Aggregate session data for each host is stored in [https_crawl_aggregate](#https_crawl_aggregate):

   ```sql
   select * from https_crawl_aggregate where domain = 'duckduckgo.com';
   ```

   There is also an associated view - [https_upgrade_metrics](#https_upgrade_metrics) - that calculates some additional metrics:

   ```sql
   select * from https_upgrade_metrics where domain = 'duckduckgo.com';
   ```

3. Additional information from the crawl can be found in:
   * [ssl_cert_info](#ssl_cert_info)
   * [mixed_assets](#mixed_assets)
   * [https_response_headers](#https_response_headers)

4. Hosts can be selected based on various combinations of criteria directly from the above tables or by using the [upgradeable_domains](#upgradeable_domains) function.

### Data Model

#### full_urls

Complete URLs for hosts that will be used in addition to those the crawler extracts from the home page.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| host | hostname | text | unique |
| url | Complete URL with scheme | text | unique |
| updated | When added to table | timestamp with time zone | |

#### https_queue

Domains to be crawled in rank order. Multiple crawlers can access this concurrently.
| Column | Description | Type | Key |
| --- | --- | --- | --- |
| rank | Processing order | integer | primary |
| domain | Domain to be crawled | character varying(500) | |
| processing_host | Hostname of server processing domain | character varying(50) | |
| worker_pid | Process ID of crawler handling domain | integer | |
| reserved | When domain was selected for processing | timestamp with time zone | |
| started | When processing of domain started | timestamp with time zone | |
| finished | When processing of domain completed | timestamp with time zone | |

#### https_crawl

Log table of HTTP and HTTPs comparisons made by the crawler.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| id | Comparison ID | bigint | unique |
| domain | Domain evaluated | text | |
| http_request_uri | Resulting URI of HTTP request | text | |
| http_response | HTTP status code for HTTP request | integer | |
| http_requests | Total requests made, including child subrequests, for HTTP request | integer | |
| http_size | Size of HTTP response (bytes) | integer | |
| https_request_uri | Resulting URI of HTTPs request | text | |
| https_response | HTTP status code for HTTPs request | integer | |
| https_requests | Total requests made, including child subrequests, for HTTPs request | integer | |
| https_size | Size of HTTPs response (bytes) | integer | |
| timestamp | When inserted | timestamp with time zone | |
| screenshot_diff | Percentage difference between HTTP and HTTPs screenshots after page load | real | |
| autoupgrade | Whether HTTP request was redirected to HTTPs | boolean | |
| mixed | Whether HTTPs request had HTTP child requests | boolean | |

#### mixed_assets

HTTP child requests made for HTTPs.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign |
| asset | URI of HTTP subrequest made during HTTPs request | text | unique |

#### https_response_headers

The response headers for HTTPs requests.
| Column | Description | Type | Key |
| --- | --- | --- | --- |
| https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign |
| response_headers | key/value of all HTTPs response headers | jsonb | |

#### ssl_cert_info

SSL certificate information for domains crawled.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain evaluated | text | primary |
| issuer | Issuer of SSL certificate | text | |
| notbefore | Valid from timestamp | timestamp with time zone | |
| notafter | Valid to timestamp | timestamp with time zone | |
| host_valid | Whether the domain is covered by the SSL certificate | boolean | |
| err | Connection error, if any | text | |
| updated | When last updated | timestamp with time zone | |

#### https_crawl_aggregate

Aggregate of [https_crawl](#https_crawl) that creates latest crawl sessions based on domain. Can also include domains that were redirected to and not directly crawled.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain evaluated | text | primary |
| https | Comparisons where only HTTPs was supported | integer | |
| http_and_https | Comparisons where HTTP and HTTPs were supported | integer | |
| http | Comparisons where only HTTP was supported | integer | |
| https_errs | Number of non-2xx HTTPs responses | integer | |
| unknown | Comparisons where neither HTTP nor HTTPs responses were valid or the status codes differed | integer | |
| autoupgrade | Comparisons where HTTP was redirected to HTTPs | integer | |
| mixed_requests | HTTPs requests that made HTTP calls | integer | |
| max_screenshot_diff | Maximum percentage difference between HTTP and HTTPs screenshots | real | |
| redirects | Number of HTTPs requests redirected to a different host | integer | |
| requests | Number of comparison requests actually made during the crawl session | integer | |
| session_request_limit | The number of comparisons wanted for the session | integer | |
| is_redirect | Whether the domain was actually crawled or is a redirect from another host in the table that was crawled | boolean | |
| max_https_crawl_id | https_crawl.id of last comparison made during crawl session | bigint | |
| redirect_hosts | key/value pairs of hosts and the number of redirects to each | jsonb | |

#### https_upgrade_metrics

View of [https_crawl_aggregate](#https_crawl_aggregate) that calculates crawl session percentages for easier selection based on cutoffs.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain evaluated | text | |
| unknown_pct | Percentage of unknown | real | |
| combined_pct | Percentage that supported HTTPs | real | |
| https_err_rate | Percentage of HTTPs errors | real | |
| max_screenshot_diff | https_crawl_aggregate.max_screenshot_diff | real | |
| mixed_ok | Whether HTTPs requests contained mixed content requests | boolean | |
| autoupgrade_pct | Percentage of autoupgrade | real | |

#### domain_exceptions

For manually excluding domains that may otherwise pass specific upgrade criteria given to [upgradeable_domains](#upgradeable_domains).

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain to exclude | text | primary |
| comment | Reason for exclusion | text | |
| updated | When added | timestamp with time zone | |

#### upgradeable_domains

Function to select domains based on a variety of criteria.
| Parameter | Description | Type | Source |
| --- | --- | --- | --- |
| autoupgrade_min | Minimum autoupgrade percentage | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| combined_min | Minimum percentage of HTTPs responses | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| screenshot_diff_max | Maximum observed screenshot diff allowed | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| mixed_ok | Whether to allow domains that had mixed content | boolean | [https_upgrade_metrics](#https_upgrade_metrics) |
| max_err_rate | Maximum https_err_rate | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| unknown_max | Maximum unknown comparisons | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| ssl_cert_buffer | SSL certificate must be valid until this timestamp | timestamp with time zone | [ssl_cert_info](#ssl_cert_info) |
| exclude_issuers | Array of SSL cert issuers to exclude | text array | [ssl_cert_info](#ssl_cert_info) |

In addition to the above parameters, the function enforces several other conditions:

1. Domain must not be in [domain_exceptions](#domain_exceptions)
2. From values in [ssl_cert_info](#ssl_cert_info):
   1. No err
   2. The domain, or host, must be valid for the certificate.
   3. Valid from/to and the issuer must not be null

================================================
FILE: LICENSE
================================================

This license does not apply to any DuckDuckGo logos or marks that may be contained in this repo. DuckDuckGo logos and marks are licensed separately under the CCBY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/), and official up-to-date versions can be downloaded from https://duckduckgo.com/press.

Copyright 2010 Duck Duck Go, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: README.md
================================================

# DuckDuckGo Smarter Encryption

DuckDuckGo Smarter Encryption is a large list of web sites that we know support HTTPS. The list is automatically generated and updated by using the crawler in this repository. For more information about where the list is being used and how it compares to other solutions, see our blog post [Your Connection is Secure with DuckDuckGo Smarter Encryption](https://spreadprivacy.com/duckduckgo-smarter-encryption).

This software is licensed under the terms of the Apache License, Version 2.0 (see [LICENSE](LICENSE)). Copyright (c) 2019 [Duck Duck Go, Inc.](https://duckduckgo.com)

## Contributing

See [Contributing](CONTRIBUTING.md) for more information about [Reporting bugs](CONTRIBUTING.md#reporting-bugs) and [Getting Started](CONTRIBUTING.md#getting-started) with the crawler.

## Just want the list?

The list we use (as a result of running this code) is [publicly available](https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz) under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you'd like to license the list for commercial use, [please reach out](https://help.duckduckgo.com/duckduckgo-help-pages/company/contact-us/).

## Questions or help with other DuckDuckGo things?

See [DuckDuckGo Help Pages](https://duck.co/help).
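The download-and-lookup flow from "Just want the list?" can be sketched as a short shell session. This is only a sketch: the archive URL comes from the README, but the assumption that the extracted list is plain text with one hostname per line is ours, so a small stand-in file is used for the lookup step.

```sh
# Fetch and unpack the published list (requires network access):
#   curl -sSL https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz | tar -xz
#
# Assumed format: plain text, one hostname per line. Create a tiny
# stand-in file so the lookup can be demonstrated offline:
printf 'duckduckgo.com\nexample.com\n' > smarter_encryption_sample.txt

# Exact, whole-line match for a single host:
if grep -qxF 'duckduckgo.com' smarter_encryption_sample.txt; then
    echo 'host is on the list'   # -> prints: host is on the list
fi
```

`grep -x` avoids false positives from substring matches (e.g. `duckduckgo.com` matching `notduckduckgo.com`), and `-F` treats the host as a literal string rather than a regex.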
================================================ FILE: SmarterEncryption/Crawl.pm ================================================ package SmarterEncryption::Crawl; use Exporter::Shiny qw' aggregate_crawl_session check_ssl_cert dupe_link urls_by_path '; use IO::Socket::SSL; use IO::Socket::SSL::Utils 'CERT_asHash'; use Cpanel::JSON::XS 'encode_json'; use List::Util 'sum'; use URI; use List::AllUtils qw'each_arrayref'; use Domain::PublicSuffix; use strict; use warnings; no warnings 'uninitialized'; use feature 'state'; my $SSL_TIMEOUT = 5; my $DEBUG = 0; # Fields we want to convert to int if null my @CONVERT_TO_INT = qw' https http_s https_errs http unknown autoupgrade mixed_requests max_ss_diff redirects '; sub screenshot_threshold { 0.05 } # Number of URLs checked for each domain per run. sub urls_per_domain { 10 } sub check_ssl_cert { my $host = shift; my ($issuer, $not_before, $not_after, $host_valid, $err); if(my $iossl = IO::Socket::SSL->new( PeerHost => $host, PeerPort => 'https', SSL_hostname => $host, Timeout => $SSL_TIMEOUT, )){ $host_valid = $iossl->verify_hostname($host, 'http') || 0; my $c = $iossl->peer_certificate; my $cert = CERT_asHash($c); $issuer = $cert->{issuer}{organizationName}; $not_before = gmtime($cert->{not_before}) . ' UTC'; $not_after = gmtime($cert->{not_after}) . 
' UTC'; } else{ my $sys_err = $!; $err = $SSL_ERROR; if($sys_err){ $err .= ": $sys_err"; } } return [$issuer, $not_before, $not_after, $host_valid, $err]; } sub aggregate_crawl_session { my ($domain, $session) = @_; state $dps = Domain::PublicSuffix->new; my $root_domain = $dps->get_root_domain($domain); my %domain_stats = (is_redirect => 0); my %redirects; for my $comparison (@$session){ my ($http_request_uri, $http_response, $https_request_uri, $https_response, $autoupgrade, $mixed, $screenshot_diff, $id ) = @$comparison{qw' http_request_uri http_response https_request_uri https_response autoupgrade mixed ss_diff id '}; my $http_valid = $http_request_uri =~ /^http:/i; my $https_valid = $https_request_uri =~ /^https:/i; my $redirect; if($https_valid){ if(my $host = eval { URI->new($https_request_uri)->host }){ if($host ne $domain){ my $host_root_domain = $dps->get_root_domain($host); if($root_domain eq $host_root_domain){ ++$domain_stats{redirects}{$host}; unless(exists $redirects{$host}){ $redirects{$host} = {is_redirect => 1}; } $redirect = $redirects{$host}; } } } } ++$domain_stats{requests}; $redirect && ++$redirect->{requests}; $domain_stats{max_id} = $id if $domain_stats{max_id} < $id; $redirect->{max_id} = $id if $redirect && ($redirect->{max_id} < $id); if($autoupgrade){ ++$domain_stats{autoupgrade}; $redirect && ++$redirect->{autoupgrade}; } if($mixed){ ++$domain_stats{mixed_requests}; $redirect && ++$redirect->{mixed_requests}; } if(defined($screenshot_diff)){ $domain_stats{max_ss_diff} = $screenshot_diff if $domain_stats{max_ss_diff} < $screenshot_diff; $redirect->{max_ss_diff} = $screenshot_diff if $redirect && ($redirect->{max_ss_diff} < $screenshot_diff) } my $http_s_same_response = $http_response == $https_response; my $http_response_good = $http_valid && ( ($http_response == 200) || $http_s_same_response ); my $https_response_good = $https_valid && ( ($https_response == 200) || $http_s_same_response); if($https_response_good){ 
if($http_response_good){ ++$domain_stats{http_s}; $redirect && ++$redirect->{http_s}; } else{ ++$domain_stats{https}; $redirect && ++$redirect->{https}; } if($https_response =~ /^[45]/){ ++$domain_stats{https_errs}; $redirect && ++$redirect->{https_errs}; } } elsif($http_response_good){ ++$domain_stats{http}; $redirect && ++$redirect->{http}; } else{ ++$domain_stats{unknown}; $redirect && ++$redirect->{unknown}; } } my %aggs; if(my $hosts = delete $domain_stats{redirects}){ $domain_stats{redirects} = sum values(%$hosts); $domain_stats{redirect_hosts} = encode_json($hosts); while(my ($host, $agg) = each %redirects){ null_to_int($agg); $aggs{$host} = $agg; } } null_to_int(\%domain_stats); $aggs{$domain} = \%domain_stats; return \%aggs; } sub null_to_int { my $h = shift; $h->{$_} += 0 for @CONVERT_TO_INT; } sub urls_by_path { my ($urls, $rr, $url_limit) = @_; my %links; for my $url (@$urls){ eval { my @segs = URI->new($url)->path_segments; push @{$links{$segs[1]}}, $url; }; } my @sorted_paths = sort {@{$links{$b}} <=> @{$links{$a}}} keys %links; my @urls_by_path; my $paths = each_arrayref @links{@sorted_paths}; CLICK_GROUP: while(my @urls = $paths->()){ for my $url (@urls){ next unless $url; last CLICK_GROUP unless @urls_by_path < $url_limit; next unless $rr->allowed($url); push @urls_by_path, $url; } } @$urls = @urls_by_path; } sub dupe_link { my ($url, $urls) = @_; $url =~ s{^https:}{http:}i; for (@$urls){ my $u = $_ =~ s{^https:}{http:}ir; return 1 if URI::eq($u, $url); } 0; } 1; ================================================ FILE: config.yml.example ================================================ --- # Top-level temp directory will be created on start and removed # on exit. Each crawler will have its own subdirectory with # PID appended TMP_DIR: /tmp/smarter_encryption CRAWLER_TMP_PREFIX: crawler_ # User agent. Will use defaults if not specified #UA: VERBOSE: 1 # Paths to system binaries. If in path already, just the program # name should suffice. 
COMPARE: /usr/local/bin/compare PKILL: /usr/bin/pkill # Database connection options. If not specified will connect as # the current user. #DB: #HOST: #PORT: #USER: #PASS: # Number of concurrent crawlers per cpu. CRAWLERS_PER_CPU: 3 # or exact number # MAX_CONCURRENT_CRAWLERS: 10 # Path to phantomjs. Should be v2.1.1 PHANTOMJS: phantomjs # Path to modified netsniff.js NETSNIFF_SS: netsniff_screenshot.js # Timeout before killing phantomjs in seconds HEADLESS_ALARM: 30 # Whether to continue running and polling the queue or exit when finished. # If specified and non-zero, it is the number of seconds to wait in # between polls. POLL: 60 # Number of sites a crawler should process before exiting SITES_PER_CRAWLER: 10 # Desired number of URLs to check for each site URLS_PER_SITE: 10 # Max percentage of URLS_PER_SITE included from the current home page HOMEPAGE_LINK_PCT: 0.5 # Number of times to re-request HTTPs URL on failure HTTPS_RETRIES: 1 # If SCREENSHOT_RETRIES is not 0, the comparison between HTTP and HTTPs # pages will be re-run if the diff is above SCREENSHOT_THRESHOLD. It # will also introduce a delay before taking the screenshot to potentially # overcome slight network differences between the two. The delay will # remain in effect for links still to be processed for the site. 
SCREENSHOT_RETRIES: 1 SCREENSHOT_THRESHOLD: 0.05 PHANTOM_RENDER_DELAY: 1000 ================================================ FILE: cpanfile ================================================ requires 'Cpanel::JSON::XS', 2.3310; requires 'DBI', '1.631'; requires 'Domain::PublicSuffix', '0.10'; requires 'Exporter::Shiny', '0.038'; requires 'Exporter::Tiny', 0.038; requires 'File::Copy::Recursive', 0.38; requires 'IO::Socket::SSL', 2.060; requires 'IO::Socket::SSL::Utils', 2.014; requires 'IPC::Run', 0.92; requires 'IPC::Run::Timer', 0.90; requires 'LWP', 6.05; requires 'List::AllUtils', 0.07; requires 'List::Util', 1.52; requires 'POE', 1.358; requires 'POE::XS::Loop::Poll', 1.000; requires 'URI', 1.71; requires 'URI::Escape', 3.31; requires 'WWW::Mechanize', 1.73; requires 'WWW::RobotRules', 6.02; requires 'YAML::XS', 0.41; ================================================ FILE: https_crawl.pl ================================================ #!/usr/bin/env perl use LWP::UserAgent; use WWW::Mechanize; use POE::Kernel { loop => 'POE::XS::Loop::Poll' }; use POE qw(Wheel::Run Filter::Reference); use DBI; use Sys::Hostname 'hostname'; use Cpanel::JSON::XS qw'decode_json encode_json'; use URI; use File::Copy::Recursive qw'pathmk pathrmdir'; use WWW::RobotRules; use IPC::Run; use YAML::XS 'LoadFile'; use List::AllUtils 'each_arrayref'; use SmarterEncryption::Crawl qw' aggregate_crawl_session check_ssl_cert dupe_link urls_by_path '; use Module::Load::Conditional 'can_load'; use feature 'state'; use strict; use warnings; no warnings 'uninitialized'; my $DDG_INTERNAL; if(can_load(modules => { 'DDG::Util::HTTPS2' => undef, 'DDG::Util::Crawl' => undef })){ DDG::Util::HTTPS2->import(qw'add_stat backfill_urls'); DDG::Util::Crawl->import(qw'get_http_msg_sig_hdrs'); $DDG_INTERNAL = 1; } my $HOST = hostname(); # Crawler Config my %CC; # Derived config values my ($MAX_CONCURRENT_CRAWLERS, $PHANTOM_TIMEOUT, $HOMEPAGE_LINKS_MAX); POE::Session->create( inline_states => { _start => 
\&_start, _stop => \&normal_cleanup, crawl => \&start_crawlers, crawler_done => \&crawler_done, crawler_debug => \&crawler_debug, sig_child => \&sig_child, shutdown => \&shutdown_now, prune_tmp_dirs => \&prune_tmp_dirs } ); POE::Kernel->run; exit; sub _start { my ($k, $h) = @_[KERNEL, HEAP]; parse_argv(); unless($MAX_CONCURRENT_CRAWLERS){ $MAX_CONCURRENT_CRAWLERS = `nproc` * $CC{CRAWLERS_PER_CPU}; } $PHANTOM_TIMEOUT = $CC{HEADLESS_ALARM} * 1000; # in ms $HOMEPAGE_LINKS_MAX = sprintf '%d', $CC{HOMEPAGE_LINK_PCT} * $CC{URLS_PER_SITE}; my $TMP_DIR = $CC{TMP_DIR}; unless(-d $TMP_DIR){ $CC{VERBOSE} && warn "Creating temp dir $TMP_DIR\n"; pathmk($TMP_DIR) or die "Failed to create tmp dir $TMP_DIR: $!"; } # clean up leftover junk for forced shutdown while(<$TMP_DIR/$CC{CRAWLER_TMP_PREFIX}*>){ chomp; pathrmdir($_) or warn "Failed to remove old crawler tmp dir $_: $!"; } $k->sig($_ => 'shutdown') for qw{TERM INT}; $k->yield('crawl'); } sub shutdown_now { $_[KERNEL]->sig_handled; # Kill crawlers $_->kill() for values %{$_[HEAP]->{crawlers}}; # Make unfinished tasks available in the queue my $db = prep_db('queue'); $db->{reset_unfinished_tasks}->execute; normal_cleanup(); exit 1; } sub normal_cleanup { # remove tmp dir pathrmdir($CC{TMP_DIR}) if -d $CC{TMP_DIR}; } sub start_crawlers{ my ($k, $h) = @_[KERNEL, HEAP]; my $db = prep_db('queue'); my $reserve_tasks = $db->{reserve_tasks}; while(keys %{$h->{crawlers}} < $MAX_CONCURRENT_CRAWLERS){ $reserve_tasks->execute(); if(my @ranks = sort map { $_->[0] } @{$reserve_tasks->fetchall_arrayref}){ my $c = POE::Wheel::Run->new( Program => \&crawl_sites, ProgramArgs => [\@ranks], CloseOnCall => 1, NoSetSid => 1, StderrEvent => 'crawler_debug', CloseEvent => 'crawler_done', StdinFilter => POE::Filter::Reference->new, StderrFilter => POE::Filter::Line->new ); $h->{crawlers}{$c->ID} = $c; $k->sig_child($c->PID, 'sig_child'); } else{ $CC{POLL} && $k->delay(crawl => $CC{POLL}); last; } } } sub crawl_sites{ my ($ranks) = @_; my $VERBOSE = 
$CC{VERBOSE}; my $db = prep_db('crawl'); my $crawler_tmp_dir = "$CC{TMP_DIR}/$CC{CRAWLER_TMP_PREFIX}$$"; my $rm_tmp = pathmk($crawler_tmp_dir); my @urls_by_domain; for(my $i = 0;$i < @$ranks;++$i){ my $rank = $ranks->[$i]; my $domain; eval { $db->{start_task}->execute($$, $rank); $domain = $db->{start_task}->fetchall_arrayref->[0][0]; } or do { warn "Failed to start task for rank $rank: $@"; next; }; eval { $domain = URI->new("https://$domain/")->host; 1; } or do { warn "Failed to filter domain $domain: $@"; next; }; $VERBOSE && warn "checking domain $domain\n"; my $urls = get_urls_for_domain($domain, $db); my @pairs; for my $url (@$urls){ push @pairs, [$domain, $url]; } push @urls_by_domain, \@pairs if @pairs; } my $ranks_str = '{' . join(',', @$ranks) . '}'; my $ea = each_arrayref @urls_by_domain; my (%ssl_cert_checked, %domain_render_delay, %sessions); while(my @urls = $ea->()){ for my $u (@urls){ next unless $u; my ($domain, $url) = @$u; next unless $url =~ /^http/i; # for the command-line $url =~ s/'/%27/g; my ($http_url) = $url =~ s/^https:/http:/ri; my ($https_url) = $url =~ s/^http:/https:/ri; my $http_ss = $crawler_tmp_dir . '/http.' . $domain . '.png'; unless($ssl_cert_checked{$domain}){ my $ssl = check_ssl_cert($domain); eval { $db->{insert_ssl}->execute($domain, @$ssl); ++$ssl_cert_checked{$domain}; } or do { warn "Failed to insert ssl info for $domain: $@"; }; } my %comparison; # We will compare a URL twice max: # 1. Compare HTTP vs. HTTPS # 2. Redo if the screenshot diff is above the threshold to check for rendering problems SCREENSHOT_RETRY: for (0..$CC{SCREENSHOT_RETRIES}){ my $redo_comparison = 0; my %stats = (domain => $domain); check_site(\%stats, $http_url, $http_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); # the idea behind screenshots is: # 1. Do for HTTP automatically so we don't have to make another request if it works # 2. Do for HTTPS if HTTP worked and wasn't autoupgraded # 3.
If HTTPS worked and didn't downgrade, compare them my $https_ss; if( (-e $http_ss) && ($stats{http_request_uri} =~ /^http:/i) && ($stats{http_response} == 200)){ $https_ss = $crawler_tmp_dir . '/https.' . $domain . '.png'; } HTTPS_RETRY: for my $https_attempt (0..$CC{HTTPS_RETRIES}){ my $redo_https; check_site(\%stats, $https_url, $https_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); if( ($stats{https_request_uri} =~ /^https:/i) && ($stats{https_response} == 200)){ if($https_ss && (-e $https_ss)){ my $out = `$CC{COMPARE} -metric mae $http_ss $https_ss /dev/null 2>&1`; if(my ($diff) = $out =~ /\(([\d\.e\-]+)\)/){ if($CC{SCREENSHOT_THRESHOLD} < $diff){ # Only need to redo on the first failure. After that, the delay # will have already been increased by a previous URL unless($domain_render_delay{$domain} == $CC{PHANTOM_RENDER_DELAY}){ $domain_render_delay{$domain} = $CC{PHANTOM_RENDER_DELAY}; $redo_comparison = 1; $VERBOSE && warn "redoing $http_url (diff: $diff)\n"; } } $stats{ss_diff} = $diff; } else{ warn "Failed to extract compare diff between $http_ss and $https_ss from $out\n"; } unlink $_ for $http_ss, $https_ss; } if($DDG_INTERNAL && $https_attempt){ add_stat(qw'increment smarter_encryption.crawl.https_retries.success'); } } elsif($DDG_INTERNAL && $https_attempt){ add_stat(qw'increment smarter_encryption.crawl.https_retries.failure'); } elsif( ($stats{https_request_uri} !~ /^http:/) && ($stats{http_response} != $stats{https_response})){ $redo_https = 1; $VERBOSE && warn "Redoing HTTPS request for $domain: $https_url\n"; } last HTTPS_RETRY unless $redo_https; } # Most should exit here unless($redo_comparison){ %comparison = %stats; last; } } unless($db->{con}->ping){ $VERBOSE && warn "Reconnecting to DB before inserting comparison"; $db = prep_db('crawl'); } if(my $host = eval { URI->new($comparison{https_request_uri})->host}){ unless($ssl_cert_checked{$host}){ my $ssl = check_ssl_cert($host); eval { $db->{insert_ssl}->execute($host, @$ssl);
                        ++$ssl_cert_checked{$host};
                    } or do {
                        warn "Failed to insert ssl info for $host: $@";
                    };
                }
            }

            if($comparison{http_request_uri} || $comparison{https_request_uri}){
                my $log_id;
                eval {
                    $db->{insert_domain}->execute(@comparison{qw'
                        domain
                        http_request_uri http_response http_requests http_size
                        https_request_uri https_response https_requests https_size
                        autoupgrade mixed ss_diff'}
                    );
                    $log_id = $db->{insert_domain}->fetch()->[0];
                } or do {
                    $VERBOSE && warn "Failed to insert request for $domain: $@";
                };

                if($log_id){
                    if(my $hdrs = delete $comparison{https_response_headers}){
                        eval {
                            $db->{insert_headers}->execute($log_id, $hdrs);
                        } or do {
                            $VERBOSE && warn "Failed to insert response headers for $domain ($log_id): $@";
                        };
                    }
                    if(my $mixed_reqs = delete $comparison{mixed_children}){
                        for my $m (keys %$mixed_reqs){
                            eval{
                                $db->{insert_mixed}->execute($log_id, $m);
                                1;
                            } or do {
                                $VERBOSE && warn "Failed to insert mixed request for $domain: $@";
                            };
                        }
                    }
                    $comparison{id} = $log_id;
                    push @{$sessions{$domain}}, \%comparison;
                }
            }
        }
    }

    unless($db->{con}->ping){
        $VERBOSE && warn "Reconnecting to DB before updating aggregate data";
        $db = prep_db('crawl');
    }

    while(my ($domain, $session) = each %sessions){
        my $aggregates = aggregate_crawl_session($domain, $session);
        while(my ($host, $agg) = each %$aggregates){
            eval {
                $db->{upsert_aggregate}->execute(
                    $host,
                    @$agg{qw'
                        https http_s https_errs http unknown
                        autoupgrade mixed_requests max_ss_diff redirects
                        max_id requests is_redirect redirect_hosts'
                    }
                );
                1;
            } or do {
                warn "Failed to upsert aggregate for $host: $@";
            };
        }
    }

    eval {
        $db->{finish_tasks}->execute($ranks_str);
        1;
    } or do {
        warn "Failed to finish tasks for ranks ($ranks_str): $@";
    };

    system "$CC{PKILL} -9 -f '$crawler_tmp_dir '";
    pathrmdir($crawler_tmp_dir) if $rm_tmp;
}

sub prep_db {
    my $target = shift;

    my %db;
    my $con = get_con();

    if($target eq 'queue'){
        $db{reserve_tasks} = $con->prepare("
            update https_queue set
                processing_host = '$HOST',
                reserved = now()
            where rank in (
                select rank from https_queue
                where processing_host is null
                order by rank
                limit $CC{SITES_PER_CRAWLER}
                for update skip locked
            )
            returning rank
        ");
        $db{reset_unfinished_tasks} = $con->prepare("
            update https_queue set
                processing_host = null,
                worker_pid = null,
                reserved = null,
                started = null
            where processing_host = '$HOST'
              and finished is null
        ");
        $db{complete_unfinished_worker_tasks} = $con->prepare("
            update https_queue set
                finished = now(),
                processing_host = '$HOST (incomplete)'
            where processing_host = '$HOST'
              and finished is null
              and worker_pid = ?
        ");
    }
    elsif($target eq 'crawl'){
        $db{start_task} = $con->prepare('update https_queue set worker_pid = ?, started = now() where rank = ? returning domain');
        $db{select_urls} = $con->prepare('select url from full_urls where host = ?');
        $db{insert_domain} = $con->prepare('
            insert into https_crawl (
                domain,
                http_request_uri, http_response, http_requests, http_size,
                https_request_uri, https_response, https_requests, https_size,
                autoupgrade, mixed, screenshot_diff)
            values (?,?,?,?,?,?,?,?,?,?,?,?)
            returning id
        ');
        $db{insert_mixed} = $con->prepare('insert into mixed_assets (https_crawl_id, asset) values (?,?)');
        $db{insert_headers} = $con->prepare('insert into https_response_headers (https_crawl_id, response_headers) values (?,?)');
        $db{finish_tasks} = $con->prepare('update https_queue set finished = now() where rank = ANY(?::integer[])');
        $db{insert_ssl} = $con->prepare('
            insert into ssl_cert_info
                (domain, issuer, notBefore, notAfter, host_valid, err)
            values (?,?,?,?,?,?)
            on conflict (domain) do update set
                issuer = EXCLUDED.issuer,
                notBefore = EXCLUDED.notBefore,
                notAfter = EXCLUDED.notAfter,
                host_valid = EXCLUDED.host_valid,
                err = EXCLUDED.err,
                updated = now()
        ');
        # Note where clause:
        #   1. Non-redirects update any, including changing a redirect to a non-redirect
        #   2. Redirects update other redirects
        $db{upsert_aggregate} = $con->prepare("
            insert into https_crawl_aggregate (
                domain, https, http_and_https, https_errs, http, unknown,
                autoupgrade, mixed_requests, max_screenshot_diff, redirects,
                max_https_crawl_id, requests, is_redirect, redirect_hosts,
                session_request_limit)
            values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,$CC{URLS_PER_SITE})
            on conflict (domain) do update set (
                https, http_and_https, https_errs, http, unknown,
                autoupgrade, mixed_requests, max_screenshot_diff, redirects,
                max_https_crawl_id, requests, is_redirect, redirect_hosts,
                session_request_limit
            ) = (
                EXCLUDED.https, EXCLUDED.http_and_https, EXCLUDED.https_errs,
                EXCLUDED.http, EXCLUDED.unknown, EXCLUDED.autoupgrade,
                EXCLUDED.mixed_requests, EXCLUDED.max_screenshot_diff,
                EXCLUDED.redirects, EXCLUDED.max_https_crawl_id,
                EXCLUDED.requests, EXCLUDED.is_redirect,
                EXCLUDED.redirect_hosts, EXCLUDED.session_request_limit)
            where EXCLUDED.is_redirect = false
               or https_crawl_aggregate.is_redirect = true
        ");
    }

    $db{con} = $con;
    return \%db;
}

# Strategy behind url selection:
#   1. Fill queue with homepage and click urls sorted by top-level path
#      prevalence
#   2. If necessary, get backfill_urls
sub get_urls_for_domain {
    my ($domain, $db) = @_;

    state $rr = WWW::RobotRules->new($CC{UA});
    state $mech = get_ua('mech');
    state $VERBOSE = $CC{VERBOSE};

    # Get latest robot rules for domain
    my $res = $mech->get("http://$domain/robots.txt");
    if($res->is_success){
        # the uri may be different than what we requested
        my @doms = ($domain);
        my $uri = $res->request->uri;
        if(my $host = eval { URI->new($uri)->host }){
            push @doms, $host if $host ne $domain;
        }
        my $robots_txt = $res->decoded_content;
        # Add the rules for:
        #   1. The domain and redirect host if different
        #   2. HTTP/HTTPS for each
        # yes, http and https could be different
        for my $d (@doms){
            for my $p (qw(http https)){
                $rr->parse("$p://$d/", $robots_txt);
            }
        }
    }

    my @urls;
    my $homepage = 'http://' . $domain . '/';
    $res = $mech->get($homepage);
    if($res->is_success){
        # the uri may be different than what we requested
        my $uri = $res->request->uri;
        if(my $host = eval { URI->new($uri)->host }){
            # all links with the same host
            my @homepage_links;
            if(my $l = $mech->find_all_links(url_abs_regex => qr{//\Q$host\E/})){
                @homepage_links = @$l;
            }
            for my $l (@homepage_links){
                my $abs_url = $l->url_abs;
                $abs_url = "$abs_url";
                next if dupe_link($abs_url, \@urls);
                push @urls, $abs_url;
            }
        }
    }
    else {
        $VERBOSE && warn "Failed to get homepage links for $domain: " . $res->status_line;
    }

    eval {
        my $select_urls = $db->{select_urls};
        $select_urls->execute($domain);
        while(my $r = $select_urls->fetchrow_arrayref){
            my $url = $r->[0];
            next if dupe_link($url, \@urls);
            push @urls, $url;
        }
        1;
    } or do {
        $VERBOSE && warn "Failed to get click urls for $domain: $@";
    };

    state $URLS_PER_SITE = $CC{URLS_PER_SITE};

    urls_by_path(\@urls, $rr, $URLS_PER_SITE);

    if($DDG_INTERNAL && (@urls < $URLS_PER_SITE)){
        backfill_urls($domain, \@urls, $rr, $db, $mech, $URLS_PER_SITE, $VERBOSE);
    }

    # Add home by default since it often behaves differently
    unless(dupe_link($homepage, \@urls)){
        if(@urls < $URLS_PER_SITE){
            push @urls, $homepage;
        }
        else{
            splice(@urls, -1, 1, $homepage);
        }
    }

    return \@urls;
}

sub prune_tmp_dirs {
    my $h = $_[HEAP];

    return unless exists $h->{crawler_tmp_dirs};

    my ($TMP_DIR, $CRAWLER_TMP_PREFIX) = @CC{qw'TMP_DIR CRAWLER_TMP_PREFIX'};
    for my $pid (keys %{$h->{crawler_tmp_dirs}}){
        my $crawler_tmp_dir = "$TMP_DIR/$CRAWLER_TMP_PREFIX$pid";
        if(-d $crawler_tmp_dir){
            next unless pathrmdir($crawler_tmp_dir);
        }
        delete $h->{crawler_tmp_dirs}{$pid};
    }
}

sub check_site {
    my ($stats, $site, $screenshot, $delay, $crawler_tmp_dir) = @_;

    if(my ($request_scheme) = $site =~ /^(https?):/i){
        $request_scheme = lc $request_scheme;
        eval{
            @ENV{qw(PHANTOM_RENDER_DELAY PHANTOM_UA PHANTOM_TIMEOUT)} =
                ($delay, "'$CC{UA}'", $PHANTOM_TIMEOUT);

            # Build custom headers if HTTP message signatures are enabled
            if($DDG_INTERNAL && $CC{ENABLE_HTTP_MESSAGE_SIGNATURES}){
                # Clear any previous custom headers first
                delete $ENV{CUSTOM_HEADERS};
                my $sig_headers = get_http_msg_sig_hdrs('GET', $site);
                if($sig_headers && %$sig_headers){
                    $ENV{CUSTOM_HEADERS} = encode_json($sig_headers);
                }
            }

            my $out;
            my @cmd = (
                $CC{PHANTOMJS},
                "--local-storage-path=$crawler_tmp_dir",
                "--offline-storage-path=$crawler_tmp_dir",
                $CC{NETSNIFF_SS},
                $site);
            push @cmd, $screenshot if $screenshot;

            IPC::Run::run \@cmd, \undef, \$out,
                IPC::Run::timeout($CC{HEADLESS_ALARM},
                    exception => "$site timed out after $CC{HEADLESS_ALARM} seconds");

            die "PHANTOMJS $out" if $out =~ /^FAIL/;

            # Can have error messages at the end so have to extract the json
            my ($j) = $out =~ /^(\{\s+"log".+\})/ms;
            my $m = decode_json($j)->{log};

            my ($main_request_scheme, $check_mixed);
            for my $e (@{$m->{entries}}){
                my $response_status = $e->{response}{status};

                # netsniff records the redirects to https for some sites
                next if $response_status =~ /^3/;

                my $url = $e->{request}{url};
                next unless my ($scheme) = $url =~ /^(https?):/i;
                $scheme = lc $scheme;

                if($check_mixed && ($scheme eq 'http')){
                    # Absolute links. Even if the same host as parent, browsers will mark
                    # this as mixed and the extension can't upgrade them
                    $stats->{mixed_children}{$url} = 1;
                }

                unless($main_request_scheme){
                    $stats->{"${request_scheme}_request_uri"} = $url;
                    $stats->{"${request_scheme}_response"} = $response_status;

                    if($request_scheme eq 'http'){
                        $stats->{autoupgrade} = $scheme eq 'https' ? 1 : 0;
                    }
                    elsif($scheme eq 'https'){
                        $check_mixed = lc URI->new($url)->host;

                        my $hdrs = delete $e->{response}{headers};
                        my %response_headers;
                        # We don't want to store an array of one-key hashes.
                        for my $h (@$hdrs){
                            my ($name, $value) = @$h{qw(name value)};
                            if(exists $response_headers{$name}){
                                # https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
                                $response_headers{$name} .= ",$value";
                            }
                            else{
                                $response_headers{$name} = $value;
                            }
                        }
                        $stats->{https_response_headers} = encode_json(\%response_headers);
                    }
                    $main_request_scheme = $scheme;
                }

                $stats->{"${request_scheme}_size"} += $e->{response}{bodySize};
                ++$stats->{"${request_scheme}_requests"};
            }

            if($check_mixed){
                $stats->{mixed} = exists $stats->{mixed_children} ? 1 : 0;
            }
            1;
        } or do {
            warn "check_site error: $@ ($site)";
            system "$CC{PKILL} -9 -f '$crawler_tmp_dir '" if $crawler_tmp_dir =~ /\S/;
        };
    }
}

sub crawler_done{
    my ($k, $h, $id) = @_[KERNEL, HEAP, ARG0];

    state $VERBOSE = $CC{VERBOSE};

    $VERBOSE && warn "deleting crawler $id\n";
    my $c = delete $h->{crawlers}{$id};

    # see if any of its domains were left unfinished
    my $pid = $c->PID;
    eval {
        my $db = prep_db('queue');
        my $unfinished = $db->{complete_unfinished_worker_tasks}->execute($pid);
        if($unfinished > 0){
            $VERBOSE && warn "Marked $unfinished tasks incomplete for crawler with pid $pid\n";
        }
        1;
    } or do {
        warn "Failed to verify worker tasks: $@";
    };

    # Check and clean up tmp dirs for hung crawlers
    $h->{crawler_tmp_dirs}{$pid} = 1;
    $k->yield('prune_tmp_dirs');

    $k->yield('crawl');
}

sub crawler_debug{
    my $msg = $_[ARG0];
    $CC{VERBOSE} && warn 'crawler debug: ' . $msg . "\n";
}

sub sig_child {
    warn 'Got signal from pid ' . $_[ARG1] . ', exit status: ' . $_[ARG2] if $_[ARG2];
    $_[KERNEL]->sig_handled;
}

sub get_ua {
    my $type = shift;

    my $ua = $type eq 'mech'
        ? WWW::Mechanize->new(
              # We'll check these ourselves so we don't have to catch die in eval
              onerror => undef,
              quiet   => 1
          )
        : LWP::UserAgent->new();
    $ua->agent($CC{UA});
    $ua->timeout(10);
    return $ua;
}

sub get_con {
    $ENV{PGDATABASE} = $CC{DB}   if exists $CC{DB};
    $ENV{PGHOST}     = $CC{HOST} if exists $CC{HOST};
    $ENV{PGPORT}     = $CC{PORT} if exists $CC{PORT};
    $ENV{PGUSER}     = $CC{USER} if exists $CC{USER};
    $ENV{PGPASSWORD} = $CC{PASS} if exists $CC{PASS};

    return DBI->connect('dbi:Pg:', '', '', {
        RaiseError => 1,
        PrintError => 0,
        AutoCommit => 1,
    });
}

sub parse_argv {
    my $usage = <