Repository: duckduckgo/smarter-encryption
Branch: master
Commit: 28bd11bf6423
Files: 18
Total size: 68.8 KB

Directory structure:
gitextract__bzzy0aw/
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── SmarterEncryption/
│   └── Crawl.pm
├── config.yml.example
├── cpanfile
├── https_crawl.pl
├── netsniff_screenshot.js
├── sql/
│   ├── domain_exceptions.sql
│   ├── full_urls.sql
│   ├── https_crawl.sql
│   ├── https_crawl_aggregate.sql
│   ├── https_queue.sql
│   ├── https_response_headers.sql
│   ├── mixed_assets.sql
│   ├── ssl_cert_info.sql
│   └── upgradeable_domains_func.sql
└── third-party.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: CONTRIBUTING.md
================================================

# Contributing guidelines

* [Reporting bugs](#reporting-bugs)
* [Development](#development)
  * [New features](#new-features)
  * [Bug fixes](#bug-fixes)
* [Getting Started](#getting-started)
  * [Pre-Requisites](#pre-requisites)
  * [Setup](#setup)
  * [Running the crawler](#running-the-crawler)
  * [Checking the results](#checking-the-results)
  * [Data Model](#data-model)
    * [full_urls](#full_urls)
    * [https_queue](#https_queue)
    * [https_crawl](#https_crawl)
    * [mixed_assets](#mixed_assets)
    * [https_response_headers](#https_response_headers)
    * [ssl_cert_info](#ssl_cert_info)
    * [https_crawl_aggregate](#https_crawl_aggregate)
    * [https_upgrade_metrics](#https_upgrade_metrics)
    * [domain_exceptions](#domain_exceptions)
    * [upgradeable_domains](#upgradeable_domains)

# Reporting bugs

1. First check whether the bug has already been [reported](https://github.com/duckduckgo/smarter-encryption/issues).
2. Create a bug report [issue](https://github.com/duckduckgo/smarter-encryption/issues/new?template=bug_report.md).

# Development

## New features

Right now all new feature development is handled internally.
## Bug fixes

Most bug fixes are handled internally, but we will accept pull requests for bug fixes if you first:

1. Create an issue describing the bug. See [Reporting bugs](CONTRIBUTING.md#reporting-bugs).
2. Get approval from DDG staff before working on it. Since most bug fixes and feature development are handled internally, we want to make sure that your work doesn't conflict with any current projects.

## Getting Started

### Pre-Requisites

- [PostgreSQL](https://www.postgresql.org/) database
- [PhantomJS 2.1.1](https://phantomjs.org/download.html)
- [Perl](https://www.perl.org/get.html)
- [compare](https://imagemagick.org/script/compare.php)
- [pkill](https://en.wikipedia.org/wiki/Pkill)
- Should run on many varieties of Linux/*BSD

### Setup

1. Install the required Perl modules via cpanfile:

   ```sh
   cpanm --installdeps .
   ```

2. Connect to PostgreSQL with psql and create the tables needed by the crawler:

   ```
   \i sql/full_urls.sql
   \i sql/https_crawl.sql
   \i sql/mixed_assets.sql
   etc.
   ```

3. Create a copy of the crawler configuration file:

   ```sh
   cp config.yml.example config.yml
   ```

   Edit the settings as necessary for your system.

4. If you have a source of URLs you would like crawled for a host, they can be added to the [full_urls](#full_urls) table:

   ```sql
   insert into full_urls (host, url) values
       ('duckduckgo.com', 'https://duckduckgo.com/?q=privacy'),
       ...
   ```

   The crawler will attempt to get URLs from the home page even if none are available in this table.

### Running the crawler

1. Add hosts to be crawled to the [https_queue](#https_queue) table:

   ```sql
   insert into https_queue (domain) values ('duckduckgo.com');
   ```

2. The crawler can be run as follows:

   ```sh
   perl -Mlib=/path/to/smarter-encryption https_crawl.pl -c /path/to/config.yml
   ```

### Checking the results

1.
The individual HTTP and HTTPs comparisons for each URL crawled are stored in [https_crawl](#https_crawl):

   ```sql
   select * from https_crawl where domain = 'duckduckgo.com' order by id desc limit 10;
   ```

   The maximum number of URLs per crawl session, i.e. the `limit` above, is determined by [URLS_PER_SITE](config.yml.example#L49).

2. Aggregate session data for each host is stored in [https_crawl_aggregate](#https_crawl_aggregate):

   ```sql
   select * from https_crawl_aggregate where domain = 'duckduckgo.com';
   ```

   There is also an associated view - [https_upgrade_metrics](#https_upgrade_metrics) - that calculates some additional metrics:

   ```sql
   select * from https_upgrade_metrics where domain = 'duckduckgo.com';
   ```

3. Additional information from the crawl can be found in:
   * [ssl_cert_info](#ssl_cert_info)
   * [mixed_assets](#mixed_assets)
   * [https_response_headers](#https_response_headers)

4. Hosts can be selected based on various combinations of criteria directly from the above tables or by using the [upgradeable_domains](#upgradeable_domains) function.

### Data Model

#### full_urls

Complete URLs for hosts that will be used in addition to those the crawler extracts from the home page.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| host | hostname | text | unique |
| url | Complete URL with scheme | text | unique |
| updated | When added to table | timestamp with time zone | |

#### https_queue

Domains to be crawled in rank order. Multiple crawlers can access this concurrently.
| Column | Description | Type | Key |
| --- | --- | --- | --- |
| rank | Processing order | integer | primary |
| domain | Domain to be crawled | character varying(500) | |
| processing_host | Hostname of server processing domain | character varying(50) | |
| worker_pid | Process ID of crawler handling domain | integer | |
| reserved | When domain was selected for processing | timestamp with time zone | |
| started | When processing of domain started | timestamp with time zone | |
| finished | When processing of domain completed | timestamp with time zone | |

#### https_crawl

Log table of HTTP and HTTPs comparisons made by the crawler.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| id | Comparison ID | bigint | unique |
| domain | Domain evaluated | text | |
| http_request_uri | Resulting URI of HTTP request | text | |
| http_response | HTTP status code for HTTP request | integer | |
| http_requests | Total requests made, including child subrequests, for HTTP request | integer | |
| http_size | Size of HTTP response (bytes) | integer | |
| https_request_uri | Resulting URI of HTTPs request | text | |
| https_response | HTTP status code for HTTPs request | integer | |
| https_requests | Total requests made, including child subrequests, for HTTPs request | integer | |
| https_size | Size of HTTPs response (bytes) | integer | |
| timestamp | When inserted | timestamp with time zone | |
| screenshot_diff | Percentage difference between HTTP and HTTPs screenshots after page load | real | |
| autoupgrade | Whether HTTP request was redirected to HTTPs | boolean | |
| mixed | Whether HTTPs request had HTTP child requests | boolean | |

#### mixed_assets

HTTP child requests made for HTTPs.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign |
| asset | URI of HTTP subrequest made during HTTPs request | text | unique |

#### https_response_headers

The response headers for HTTPs requests.
| Column | Description | Type | Key |
| --- | --- | --- | --- |
| https_crawl_id | https_crawl.id, only associated with https_* columns | bigint | unique/foreign |
| response_headers | key/value of all HTTPs response headers | jsonb | |

#### ssl_cert_info

SSL certificate information for domains crawled.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain evaluated | text | primary |
| issuer | Issuer of SSL certificate | text | |
| notbefore | Valid from timestamp | timestamp with time zone | |
| notafter | Valid to timestamp | timestamp with time zone | |
| host_valid | Whether the domain is covered by the SSL certificate | boolean | |
| err | Connection error, if any | text | |
| updated | When last updated | timestamp with time zone | |

#### https_crawl_aggregate

Aggregate of [https_crawl](#https_crawl) that creates latest crawl sessions based on domain. Can also include domains that were redirected to and not directly crawled.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain evaluated | text | primary |
| https | Comparisons where only HTTPs was supported | integer | |
| http_and_https | Comparisons where HTTP and HTTPs were supported | integer | |
| http | Comparisons where only HTTP was supported | integer | |
| https_errs | Number of non-2xx HTTPs responses | integer | |
| unknown | Comparisons where neither HTTP nor HTTPs responses were valid or the status codes differed | integer | |
| autoupgrade | Comparisons where HTTP was redirected to HTTPs | integer | |
| mixed_requests | HTTPs requests that made HTTP calls | integer | |
| max_screenshot_diff | Maximum percentage difference between HTTP and HTTPs screenshots | real | |
| redirects | Number of HTTPs requests redirected to a different host | integer | |
| requests | Number of comparison requests actually made during the crawl session | integer | |
| session_request_limit | The number of comparisons wanted for the session | integer | |
| is_redirect | Whether the domain was actually crawled or is a redirect from another host in the table that was crawled | boolean | |
| max_https_crawl_id | https_crawl.id of last comparison made during crawl session | bigint | |
| redirect_hosts | key/value pairs of hosts and the number of redirects to each | jsonb | |

#### https_upgrade_metrics

View of [https_crawl_aggregate](#https_crawl_aggregate) that calculates crawl session percentages for easier selection based on cutoffs.

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain evaluated | text | |
| unknown_pct | Percentage of unknown | real | |
| combined_pct | Percentage that supported HTTPs | real | |
| https_err_rate | Percentage of HTTPs errors | real | |
| max_screenshot_diff | https_crawl_aggregate.max_screenshot_diff | real | |
| mixed_ok | Whether HTTPs requests contained mixed content requests | boolean | |
| autoupgrade_pct | Percentage of autoupgrade | real | |

#### domain_exceptions

For manually excluding domains that may otherwise pass specific upgrade criteria given to [upgradeable_domains](#upgradeable_domains).

| Column | Description | Type | Key |
| --- | --- | --- | --- |
| domain | Domain to exclude | text | primary |
| comment | Reason for exclusion | text | |
| updated | When added | timestamp with time zone | |

#### upgradeable_domains

Function to select domains based on a variety of criteria.
| Parameter | Description | Type | Source |
| --- | --- | --- | --- |
| autoupgrade_min | Minimum autoupgrade percentage | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| combined_min | Minimum percentage of HTTPs responses | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| screenshot_diff_max | Maximum observed screenshot diff allowed | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| mixed_ok | Whether to allow domains that had mixed content | boolean | [https_upgrade_metrics](#https_upgrade_metrics) |
| max_err_rate | Maximum https_err_rate | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| unknown_max | Maximum unknown comparisons | real | [https_upgrade_metrics](#https_upgrade_metrics) |
| ssl_cert_buffer | SSL certificate must be valid until this timestamp | timestamp with time zone | [ssl_cert_info](#ssl_cert_info) |
| exclude_issuers | Array of SSL cert issuers to exclude | text array | [ssl_cert_info](#ssl_cert_info) |

In addition to the above parameters, the function enforces several other conditions:

1. Domain must not be in [domain_exceptions](#domain_exceptions)
2. From values in [ssl_cert_info](#ssl_cert_info):
   1. No err
   2. The domain, or host, must be valid for the certificate.
   3. Valid from/to and the issuer must not be null

================================================
FILE: LICENSE
================================================

This license does not apply to any DuckDuckGo logos or marks that may be contained in this repo. DuckDuckGo logos and marks are licensed separately under the CCBY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/), and official up-to-date versions can be downloaded from https://duckduckgo.com/press.

Copyright 2010 Duck Duck Go, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: README.md
================================================

# DuckDuckGo Smarter Encryption

DuckDuckGo Smarter Encryption is a large list of web sites that we know support HTTPS. The list is automatically generated and updated by using the crawler in this repository. For more information about where the list is being used and how it compares to other solutions, see our blog post [Your Connection is Secure with DuckDuckGo Smarter Encryption](https://spreadprivacy.com/duckduckgo-smarter-encryption).

This software is licensed under the terms of the Apache License, Version 2.0 (see [LICENSE](LICENSE)). Copyright (c) 2019 [Duck Duck Go, Inc.](https://duckduckgo.com)

## Contributing

See [Contributing](CONTRIBUTING.md) for more information about [Reporting bugs](CONTRIBUTING.md#reporting-bugs) and [Getting Started](CONTRIBUTING.md#getting-started) with the crawler.

## Just want the list?

The list we use (as a result of running this code) is [publicly available](https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz) under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you'd like to license the list for commercial use, [please reach out](https://help.duckduckgo.com/duckduckgo-help-pages/company/contact-us/).

## Questions or help with other DuckDuckGo things?

See [DuckDuckGo Help Pages](https://duck.co/help).
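The download-and-lookup flow from "Just want the list?" can be sketched as a short shell session. This is only a sketch: the archive URL comes from the README, but the assumption that the extracted list is plain text with one hostname per line is ours, so a small stand-in file is used for the lookup step.

```sh
# Fetch and unpack the published list (requires network access):
#   curl -sSL https://staticcdn.duckduckgo.com/https/smarter_encryption_latest.tgz | tar -xz
#
# Assumed format: plain text, one hostname per line. Create a tiny
# stand-in file so the lookup can be demonstrated offline:
printf 'duckduckgo.com\nexample.com\n' > smarter_encryption_sample.txt

# Exact, whole-line match for a single host:
if grep -qxF 'duckduckgo.com' smarter_encryption_sample.txt; then
    echo 'host is on the list'   # -> prints: host is on the list
fi
```

`grep -x` avoids false positives from substring matches (e.g. `duckduckgo.com` matching `notduckduckgo.com`), and `-F` treats the host as a literal string rather than a regex.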
================================================ FILE: SmarterEncryption/Crawl.pm ================================================ package SmarterEncryption::Crawl; use Exporter::Shiny qw' aggregate_crawl_session check_ssl_cert dupe_link urls_by_path '; use IO::Socket::SSL; use IO::Socket::SSL::Utils 'CERT_asHash'; use Cpanel::JSON::XS 'encode_json'; use List::Util 'sum'; use URI; use List::AllUtils qw'each_arrayref'; use Domain::PublicSuffix; use strict; use warnings; no warnings 'uninitialized'; use feature 'state'; my $SSL_TIMEOUT = 5; my $DEBUG = 0; # Fields we want to convert to int if null my @CONVERT_TO_INT = qw' https http_s https_errs http unknown autoupgrade mixed_requests max_ss_diff redirects '; sub screenshot_threshold { 0.05 } # Number of URLs checked for each domain per run. sub urls_per_domain { 10 } sub check_ssl_cert { my $host = shift; my ($issuer, $not_before, $not_after, $host_valid, $err); if(my $iossl = IO::Socket::SSL->new( PeerHost => $host, PeerPort => 'https', SSL_hostname => $host, Timeout => $SSL_TIMEOUT, )){ $host_valid = $iossl->verify_hostname($host, 'http') || 0; my $c = $iossl->peer_certificate; my $cert = CERT_asHash($c); $issuer = $cert->{issuer}{organizationName}; $not_before = gmtime($cert->{not_before}) . ' UTC'; $not_after = gmtime($cert->{not_after}) . 
' UTC'; } else{ my $sys_err = $!; $err = $SSL_ERROR; if($sys_err){ $err .= ": $sys_err"; } } return [$issuer, $not_before, $not_after, $host_valid, $err]; } sub aggregate_crawl_session { my ($domain, $session) = @_; state $dps = Domain::PublicSuffix->new; my $root_domain = $dps->get_root_domain($domain); my %domain_stats = (is_redirect => 0); my %redirects; for my $comparison (@$session){ my ($http_request_uri, $http_response, $https_request_uri, $https_response, $autoupgrade, $mixed, $screenshot_diff, $id ) = @$comparison{qw' http_request_uri http_response https_request_uri https_response autoupgrade mixed ss_diff id '}; my $http_valid = $http_request_uri =~ /^http:/i; my $https_valid = $https_request_uri =~ /^https:/i; my $redirect; if($https_valid){ if(my $host = eval { URI->new($https_request_uri)->host }){ if($host ne $domain){ my $host_root_domain = $dps->get_root_domain($host); if($root_domain eq $host_root_domain){ ++$domain_stats{redirects}{$host}; unless(exists $redirects{$host}){ $redirects{$host} = {is_redirect => 1}; } $redirect = $redirects{$host}; } } } } ++$domain_stats{requests}; $redirect && ++$redirect->{requests}; $domain_stats{max_id} = $id if $domain_stats{max_id} < $id; $redirect->{max_id} = $id if $redirect && ($redirect->{max_id} < $id); if($autoupgrade){ ++$domain_stats{autoupgrade}; $redirect && ++$redirect->{autoupgrade}; } if($mixed){ ++$domain_stats{mixed_requests}; $redirect && ++$redirect->{mixed_requests}; } if(defined($screenshot_diff)){ $domain_stats{max_ss_diff} = $screenshot_diff if $domain_stats{max_ss_diff} < $screenshot_diff; $redirect->{max_ss_diff} = $screenshot_diff if $redirect && ($redirect->{max_ss_diff} < $screenshot_diff) } my $http_s_same_response = $http_response == $https_response; my $http_response_good = $http_valid && ( ($http_response == 200) || $http_s_same_response ); my $https_response_good = $https_valid && ( ($https_response == 200) || $http_s_same_response); if($https_response_good){ 
if($http_response_good){ ++$domain_stats{http_s}; $redirect && ++$redirect->{http_s}; } else{ ++$domain_stats{https}; $redirect && ++$redirect->{https}; } if($https_response =~ /^[45]/){ ++$domain_stats{https_errs}; $redirect && ++$redirect->{https_errs}; } } elsif($http_response_good){ ++$domain_stats{http}; $redirect && ++$redirect->{http}; } else{ ++$domain_stats{unknown}; $redirect && ++$redirect->{unknown}; } } my %aggs; if(my $hosts = delete $domain_stats{redirects}){ $domain_stats{redirects} = sum values(%$hosts); $domain_stats{redirect_hosts} = encode_json($hosts); while(my ($host, $agg) = each %redirects){ null_to_int($agg); $aggs{$host} = $agg; } } null_to_int(\%domain_stats); $aggs{$domain} = \%domain_stats; return \%aggs; } sub null_to_int { my $h = shift; $h->{$_} += 0 for @CONVERT_TO_INT; } sub urls_by_path { my ($urls, $rr, $url_limit) = @_; my %links; for my $url (@$urls){ eval { my @segs = URI->new($url)->path_segments; push @{$links{$segs[1]}}, $url; }; } my @sorted_paths = sort {@{$links{$b}} <=> @{$links{$a}}} keys %links; my @urls_by_path; my $paths = each_arrayref @links{@sorted_paths}; CLICK_GROUP: while(my @urls = $paths->()){ for my $url (@urls){ next unless $url; last CLICK_GROUP unless @urls_by_path < $url_limit; next unless $rr->allowed($url); push @urls_by_path, $url; } } @$urls = @urls_by_path; } sub dupe_link { my ($url, $urls) = @_; $url =~ s{^https:}{http:}i; for (@$urls){ my $u = $_ =~ s{^https:}{http:}ir; return 1 if URI::eq($u, $url); } 0; } 1; ================================================ FILE: config.yml.example ================================================ --- # Top-level temp directory will be created on start and removed # on exit. Each crawler will have its own subdirectory with # PID appended TMP_DIR: /tmp/smarter_encryption CRAWLER_TMP_PREFIX: crawler_ # User agent. Will use defaults if not specified #UA: VERBOSE: 1 # Paths to system binaries. If in path already, just the program # name should suffice. 
COMPARE: /usr/local/bin/compare PKILL: /usr/bin/pkill # Database connection options. If not specified will connect as # the current user. #DB: #HOST: #PORT: #USER: #PASS: # Number of concurrent crawlers per cpu. CRAWLERS_PER_CPU: 3 # or exact number # MAX_CONCURRENT_CRAWLERS: 10 # Path to phantomjs. Should be v2.1.1 PHANTOMJS: phantomjs # Path to modified netsniff.js NETSNIFF_SS: netsniff_screenshot.js # Timeout before killing phantomjs in seconds HEADLESS_ALARM: 30 # Whether to continue running and polling the queue or exit when finished. # If specified and non-zero, it is the number of seconds to wait in # between polls. POLL: 60 # Number of sites a crawler should process before exiting SITES_PER_CRAWLER: 10 # Desired number of URLs to check for each site URLS_PER_SITE: 10 # Max percentage of URLS_PER_SITE included from the current home page HOMEPAGE_LINK_PCT: 0.5 # Number of times to re-request HTTPs URL on failure HTTPS_RETRIES: 1 # If SCREENSHOT_RETRIES is not 0, the comparison between HTTP and HTTPs # pages will be re-run if the diff is above SCREENSHOT_THRESHOLD. It # will also introduce a delay before taking the screenshot to potentially # overcome slight network differences between the two. The delay will # remain in effect for links still to be processed for the site. 
SCREENSHOT_RETRIES: 1 SCREENSHOT_THRESHOLD: 0.05 PHANTOM_RENDER_DELAY: 1000 ================================================ FILE: cpanfile ================================================ requires 'Cpanel::JSON::XS', 2.3310; requires 'DBI', '1.631'; requires 'Domain::PublicSuffix', '0.10'; requires 'Exporter::Shiny', '0.038'; requires 'Exporter::Tiny', 0.038; requires 'File::Copy::Recursive', 0.38; requires 'IO::Socket::SSL', 2.060; requires 'IO::Socket::SSL::Utils', 2.014; requires 'IPC::Run', 0.92; requires 'IPC::Run::Timer', 0.90; requires 'LWP', 6.05; requires 'List::AllUtils', 0.07; requires 'List::Util', 1.52; requires 'POE', 1.358; requires 'POE::XS::Loop::Poll', 1.000; requires 'URI', 1.71; requires 'URI::Escape', 3.31; requires 'WWW::Mechanize', 1.73; requires 'WWW::RobotRules', 6.02; requires 'YAML::XS', 0.41; ================================================ FILE: https_crawl.pl ================================================ #!/usr/bin/env perl use LWP::UserAgent; use WWW::Mechanize; use POE::Kernel { loop => 'POE::XS::Loop::Poll' }; use POE qw(Wheel::Run Filter::Reference); use DBI; use Sys::Hostname 'hostname'; use Cpanel::JSON::XS qw'decode_json encode_json'; use URI; use File::Copy::Recursive qw'pathmk pathrmdir'; use WWW::RobotRules; use IPC::Run; use YAML::XS 'LoadFile'; use List::AllUtils 'each_arrayref'; use SmarterEncryption::Crawl qw' aggregate_crawl_session check_ssl_cert dupe_link urls_by_path '; use Module::Load::Conditional 'can_load'; use feature 'state'; use strict; use warnings; no warnings 'uninitialized'; my $DDG_INTERNAL; if(can_load(modules => { 'DDG::Util::HTTPS2' => undef, 'DDG::Util::Crawl' => undef })){ DDG::Util::HTTPS2->import(qw'add_stat backfill_urls'); DDG::Util::Crawl->import(qw'get_http_msg_sig_hdrs'); $DDG_INTERNAL = 1; } my $HOST = hostname(); # Crawler Config my %CC; # Derived config values my ($MAX_CONCURRENT_CRAWLERS, $PHANTOM_TIMEOUT, $HOMEPAGE_LINKS_MAX); POE::Session->create( inline_states => { _start => 
\&_start, _stop => \&normal_cleanup, crawl => \&start_crawlers, crawler_done => \&crawler_done, crawler_debug => \&crawler_debug, sig_child => \&sig_child, shutdown => \&shutdown_now, prune_tmp_dirs => \&prune_tmp_dirs } ); POE::Kernel->run; exit; sub _start { my ($k, $h) = @_[KERNEL, HEAP]; parse_argv(); unless($MAX_CONCURRENT_CRAWLERS){ $MAX_CONCURRENT_CRAWLERS = `nproc` * $CC{CRAWLERS_PER_CPU}; } $PHANTOM_TIMEOUT = $CC{HEADLESS_ALARM} * 1000; # in ms $HOMEPAGE_LINKS_MAX = sprintf '%d', $CC{HOMEPAGE_LINK_PCT} * $CC{URLS_PER_SITE}; my $TMP_DIR = $CC{TMP_DIR}; unless(-d $TMP_DIR){ $CC{VERBOSE} && warn "Creating temp dir $TMP_DIR\n"; pathmk($TMP_DIR) or die "Failed to create tmp dir $TMP_DIR: $!"; } # clean up leftover junk for forced shutdown while(<$TMP_DIR/$CC{CRAWLER_TMP_PREFIX}*>){ chomp; pathrmdir($_) or warn "Failed to remove old crawler tmp dir $_: $!"; } $k->sig($_ => 'shutdown') for qw{TERM INT}; $k->yield('crawl'); } sub shutdown_now { $_[KERNEL]->sig_handled; # Kill crawlers $_->kill() for values %{$_[HEAP]->{crawlers}}; # Make unfinished tasks available in the queue my $db = prep_db('queue'); $db->{reset_unfinished_tasks}->execute; normal_cleanup(); exit 1; } sub normal_cleanup { # remove tmp dir pathrmdir($CC{TMP_DIR}) if -d $CC{TMP_DIR}; } sub start_crawlers{ my ($k, $h) = @_[KERNEL, HEAP]; my $db = prep_db('queue'); my $reserve_tasks = $db->{reserve_tasks}; while(keys %{$h->{crawlers}} < $MAX_CONCURRENT_CRAWLERS){ $reserve_tasks->execute(); if(my @ranks = sort map { $_->[0] } @{$reserve_tasks->fetchall_arrayref}){ my $c = POE::Wheel::Run->new( Program => \&crawl_sites, ProgramArgs => [\@ranks], CloseOnCall => 1, NoSetSid => 1, StderrEvent => 'crawler_debug', CloseEvent => 'crawler_done', StdinFilter => POE::Filter::Reference->new, StderrFilter => POE::Filter::Line->new ); $h->{crawlers}{$c->ID} = $c; $k->sig_child($c->PID, 'sig_child'); } else{ $CC{POLL} && $k->delay(crawl => $CC{POLL}); last; } } } sub crawl_sites{ my ($ranks) = @_; my $VERBOSE = 
$CC{VERBOSE}; my $db = prep_db('crawl'); my $crawler_tmp_dir = "$CC{TMP_DIR}/$CC{CRAWLER_TMP_PREFIX}$$"; my $rm_tmp = pathmk($crawler_tmp_dir); my @urls_by_domain; for(my $i = 0;$i < @$ranks;++$i){ my $rank = $ranks->[$i]; my $domain; eval { $db->{start_task}->execute($$, $rank); $domain = $db->{start_task}->fetchall_arrayref->[0][0]; } or do { warn "Failed to start task for rank $rank: $@"; next; }; eval { $domain = URI->new("https://$domain/")->host; 1; } or do { warn "Failed to filter domain $domain: $@"; next; }; $VERBOSE && warn "checking domain $domain\n"; my $urls = get_urls_for_domain($domain, $db); my @pairs; for my $url (@$urls){ push @pairs, [$domain, $url]; } push @urls_by_domain, \@pairs if @pairs; } my $ranks_str = '{' . join(',', @$ranks) . '}'; my $ea = each_arrayref @urls_by_domain; my (%ssl_cert_checked, %domain_render_delay, %sessions); while(my @urls = $ea->()){ for my $u (@urls){ next unless $u; my ($domain, $url) = @$u; next unless $url =~ /^http/i; # for the command-line $url =~ s/'/%27/g; my ($http_url) = $url =~ s/^https:/http:/ri; my ($https_url) = $url =~ s/^http:/https:/ri; my $http_ss = $crawler_tmp_dir . '/http.' . $domain . '.png'; unless($ssl_cert_checked{$domain}){ my $ssl = check_ssl_cert($domain); eval { $db->{insert_ssl}->execute($domain, @$ssl); ++$ssl_cert_checked{$domain}; } or do { warn "Failed to insert ssl info for $domain: $@"; }; } my %comparison; # We will compare a URL twice max: # 1. Compare HTTP vs. HTTPS # 2. Redo if the screenshot diff is above the threshold to check for rendering problems SCREENSHOT_RETRY: for (0..$CC{SCREENSHOT_RETRIES}){ my $redo_comparison = 0; my %stats = (domain => $domain); check_site(\%stats, $http_url, $http_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); # the idea behind screenshots is: # 1. Do for HTTP automatically so we don't have to make another request if it works # 2. Do for HTTPS if HTTP worked and wasn't autoupgraded # 3.
If HTTPS worked and didn't downgrade, compare them my $https_ss; if( (-e $http_ss) && ($stats{http_request_uri} =~ /^http:/i) && ($stats{http_response} == 200)){ $https_ss = $crawler_tmp_dir . '/https.' . $domain . '.png'; } HTTPS_RETRY: for my $https_attempt (0..$CC{HTTPS_RETRIES}){ my $redo_https; check_site(\%stats, $https_url, $https_ss, $domain_render_delay{$domain}, $crawler_tmp_dir); if( ($stats{https_request_uri} =~ /^https:/i) && ($stats{https_response} == 200)){ if($https_ss && (-e $https_ss)){ my $out = `$CC{COMPARE} -metric mae $http_ss $https_ss /dev/null 2>&1`; if(my ($diff) = $out =~ /\(([\d\.e\-]+)\)/){ if($CC{SCREENSHOT_THRESHOLD} < $diff){ # Only need to redo on the first failure. After that, the delay # will have already been increased by a previous URL unless($domain_render_delay{$domain} == $CC{PHANTOM_RENDER_DELAY}){ $domain_render_delay{$domain} = $CC{PHANTOM_RENDER_DELAY}; $redo_comparison = 1; $VERBOSE && warn "redoing $http_url (diff: $diff)\n"; } } $stats{ss_diff} = $diff; } else{ warn "Failed to extract compare diff between $http_ss and $https_ss from $out\n"; } unlink $_ for $http_ss, $https_ss; } if($DDG_INTERNAL && $https_attempt){ add_stat(qw'increment smarter_encryption.crawl.https_retries.success'); } } elsif($DDG_INTERNAL && $https_attempt){ add_stat(qw'increment smarter_encryption.crawl.https_retries.failure'); } elsif( ($stats{https_request_uri} !~ /^http:/) && ($stats{http_response} != $stats{https_response})){ $redo_https = 1; $VERBOSE && warn "Redoing HTTPS request for $domain: $https_url\n"; } last HTTPS_RETRY unless $redo_https; } # Most should exit here unless($redo_comparison){ %comparison = %stats; last; } } unless($db->{con}->ping){ $VERBOSE && warn "Reconnecting to DB before inserting comparison"; $db = prep_db('crawl'); } if(my $host = eval { URI->new($comparison{https_request_uri})->host}){ unless($ssl_cert_checked{$host}){ my $ssl = check_ssl_cert($host); eval { $db->{insert_ssl}->execute($host, @$ssl);
                        ++$ssl_cert_checked{$host};
                    } or do {
                        warn "Failed to insert ssl info for $host: $@";
                    };
                }
            }

            if($comparison{http_request_uri} || $comparison{https_request_uri}){
                my $log_id;
                eval {
                    $db->{insert_domain}->execute(@comparison{qw'
                        domain
                        http_request_uri http_response http_requests http_size
                        https_request_uri https_response https_requests https_size
                        autoupgrade mixed ss_diff'}
                    );
                    $log_id = $db->{insert_domain}->fetch()->[0];
                } or do {
                    $VERBOSE && warn "Failed to insert request for $domain: $@";
                };

                if($log_id){
                    if(my $hdrs = delete $comparison{https_response_headers}){
                        eval {
                            $db->{insert_headers}->execute($log_id, $hdrs);
                        } or do {
                            $VERBOSE && warn "Failed to insert response headers for $domain ($log_id): $@";
                        };
                    }
                    if(my $mixed_reqs = delete $comparison{mixed_children}){
                        for my $m (keys %$mixed_reqs){
                            eval{
                                $db->{insert_mixed}->execute($log_id, $m);
                                1;
                            } or do {
                                $VERBOSE && warn "Failed to insert mixed request for $domain: $@";
                            };
                        }
                    }
                    $comparison{id} = $log_id;
                    push @{$sessions{$domain}}, \%comparison;
                }
            }
        }
    }

    unless($db->{con}->ping){
        $VERBOSE && warn "Reconnecting to DB before updating aggregate data";
        $db = prep_db('crawl');
    }

    while(my ($domain, $session) = each %sessions){
        my $aggregates = aggregate_crawl_session($domain, $session);
        while(my ($host, $agg) = each %$aggregates){
            eval {
                $db->{upsert_aggregate}->execute(
                    $host,
                    @$agg{qw'
                        https http_s https_errs http unknown
                        autoupgrade mixed_requests max_ss_diff redirects
                        max_id requests is_redirect redirect_hosts'
                    }
                );
                1;
            } or do {
                warn "Failed to upsert aggregate for $host: $@";
            };
        }
    }

    eval {
        $db->{finish_tasks}->execute($ranks_str);
        1;
    } or do {
        warn "Failed to finish tasks for ranks ($ranks_str): $@";
    };

    system "$CC{PKILL} -9 -f '$crawler_tmp_dir '";
    pathrmdir($crawler_tmp_dir) if $rm_tmp;
}

sub prep_db {
    my $target = shift;

    my %db;
    my $con = get_con();

    if($target eq 'queue'){
        $db{reserve_tasks} = $con->prepare("
            update https_queue set
                processing_host = '$HOST',
                reserved = now()
            where rank in (
                select rank from https_queue
                where processing_host is null
                order by rank
                limit $CC{SITES_PER_CRAWLER}
                for update skip locked
            )
            returning rank
        ");
        $db{reset_unfinished_tasks} = $con->prepare("
            update https_queue set
                processing_host = null,
                worker_pid = null,
                reserved = null,
                started = null
            where processing_host = '$HOST'
              and finished is null
        ");
        $db{complete_unfinished_worker_tasks} = $con->prepare("
            update https_queue set
                finished = now(),
                processing_host = '$HOST (incomplete)'
            where processing_host = '$HOST'
              and finished is null
              and worker_pid = ?
        ");
    }
    elsif($target eq 'crawl'){
        $db{start_task} = $con->prepare('update https_queue set worker_pid = ?, started = now() where rank = ? returning domain');
        $db{select_urls} = $con->prepare('select url from full_urls where host = ?');
        $db{insert_domain} = $con->prepare('
            insert into https_crawl (
                domain,
                http_request_uri, http_response, http_requests, http_size,
                https_request_uri, https_response, https_requests, https_size,
                autoupgrade, mixed, screenshot_diff)
            values (?,?,?,?,?,?,?,?,?,?,?,?)
            returning id
        ');
        $db{insert_mixed} = $con->prepare('insert into mixed_assets (https_crawl_id, asset) values (?,?)');
        $db{insert_headers} = $con->prepare('insert into https_response_headers (https_crawl_id, response_headers) values (?,?)');
        $db{finish_tasks} = $con->prepare('update https_queue set finished = now() where rank = ANY(?::integer[])');
        $db{insert_ssl} = $con->prepare('
            insert into ssl_cert_info
                (domain, issuer, notBefore, notAfter, host_valid, err)
            values (?,?,?,?,?,?)
            on conflict (domain) do update set
                issuer = EXCLUDED.issuer,
                notBefore = EXCLUDED.notBefore,
                notAfter = EXCLUDED.notAfter,
                host_valid = EXCLUDED.host_valid,
                err = EXCLUDED.err,
                updated = now()
        ');
        # Note where clause:
        #   1. Non-redirects update any, including changing a redirect to a non-redirect
        #   2. Redirects update other redirects
        $db{upsert_aggregate} = $con->prepare("
            insert into https_crawl_aggregate (
                domain, https, http_and_https, https_errs, http, unknown,
                autoupgrade, mixed_requests, max_screenshot_diff, redirects,
                max_https_crawl_id, requests, is_redirect, redirect_hosts,
                session_request_limit)
            values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,$CC{URLS_PER_SITE})
            on conflict (domain) do update set (
                https, http_and_https, https_errs, http, unknown,
                autoupgrade, mixed_requests, max_screenshot_diff, redirects,
                max_https_crawl_id, requests, is_redirect, redirect_hosts,
                session_request_limit
            ) = (
                EXCLUDED.https, EXCLUDED.http_and_https, EXCLUDED.https_errs,
                EXCLUDED.http, EXCLUDED.unknown, EXCLUDED.autoupgrade,
                EXCLUDED.mixed_requests, EXCLUDED.max_screenshot_diff,
                EXCLUDED.redirects, EXCLUDED.max_https_crawl_id,
                EXCLUDED.requests, EXCLUDED.is_redirect,
                EXCLUDED.redirect_hosts, EXCLUDED.session_request_limit)
            where EXCLUDED.is_redirect = false
               or https_crawl_aggregate.is_redirect = true
        ");
    }

    $db{con} = $con;
    return \%db;
}

# Strategy behind url selection:
#   1. Fill queue with homepage and click urls sorted by top-level path
#      prevalence
#   2. If necessary, get backfill_urls
sub get_urls_for_domain {
    my ($domain, $db) = @_;

    state $rr = WWW::RobotRules->new($CC{UA});
    state $mech = get_ua('mech');
    state $VERBOSE = $CC{VERBOSE};

    # Get latest robot rules for domain
    my $res = $mech->get("http://$domain/robots.txt");
    if($res->is_success){
        # the uri may be different than what we requested
        my @doms = ($domain);
        my $uri = $res->request->uri;
        if(my $host = eval { URI->new($uri)->host }){
            push @doms, $host if $host ne $domain;
        }
        my $robots_txt = $res->decoded_content;
        # Add the rules for:
        #   1. The domain and redirect host if different
        #   2. HTTP/HTTPS for each
        # yes, http and https could be different
        for my $d (@doms){
            for my $p (qw(http https)){
                $rr->parse("$p://$d/", $robots_txt);
            }
        }
    }

    my @urls;
    my $homepage = 'http://' . $domain . '/';
    $res = $mech->get($homepage);
    if($res->is_success){
        # the uri may be different than what we requested
        my $uri = $res->request->uri;
        if(my $host = eval { URI->new($uri)->host }){
            # all links with the same host
            my @homepage_links;
            if(my $l = $mech->find_all_links(url_abs_regex => qr{//\Q$host\E/})){
                @homepage_links = @$l;
            }
            for my $l (@homepage_links){
                my $abs_url = $l->url_abs;
                $abs_url = "$abs_url";
                next if dupe_link($abs_url, \@urls);
                push @urls, $abs_url;
            }
        }
    }
    else {
        $VERBOSE && warn "Failed to get homepage links for $domain: " . $res->status_line;
    }

    eval {
        my $select_urls = $db->{select_urls};
        $select_urls->execute($domain);
        while(my $r = $select_urls->fetchrow_arrayref){
            my $url = $r->[0];
            next if dupe_link($url, \@urls);
            push @urls, $url;
        }
        1;
    } or do {
        $VERBOSE && warn "Failed to get click urls for $domain: $@";
    };

    state $URLS_PER_SITE = $CC{URLS_PER_SITE};

    urls_by_path(\@urls, $rr, $URLS_PER_SITE);

    if($DDG_INTERNAL && (@urls < $URLS_PER_SITE)){
        backfill_urls($domain, \@urls, $rr, $db, $mech, $URLS_PER_SITE, $VERBOSE);
    }

    # Add home by default since it often behaves differently
    unless(dupe_link($homepage, \@urls)){
        if(@urls < $URLS_PER_SITE){
            push @urls, $homepage;
        }
        else{
            splice(@urls, -1, 1, $homepage);
        }
    }

    return \@urls;
}

sub prune_tmp_dirs {
    my $h = $_[HEAP];

    return unless exists $h->{crawler_tmp_dirs};

    my ($TMP_DIR, $CRAWLER_TMP_PREFIX) = @CC{qw'TMP_DIR CRAWLER_TMP_PREFIX'};
    for my $pid (keys %{$h->{crawler_tmp_dirs}}){
        my $crawler_tmp_dir = "$TMP_DIR/$CRAWLER_TMP_PREFIX$pid";
        if(-d $crawler_tmp_dir){
            next unless pathrmdir($crawler_tmp_dir);
        }
        delete $h->{crawler_tmp_dirs}{$pid};
    }
}

sub check_site {
    my ($stats, $site, $screenshot, $delay, $crawler_tmp_dir) = @_;

    if(my ($request_scheme) = $site =~ /^(https?):/i){
        $request_scheme = lc $request_scheme;
        eval{
            @ENV{qw(PHANTOM_RENDER_DELAY PHANTOM_UA PHANTOM_TIMEOUT)} =
                ($delay, "'$CC{UA}'", $PHANTOM_TIMEOUT);

            # Build custom headers if HTTP message signatures are enabled
            if($DDG_INTERNAL && $CC{ENABLE_HTTP_MESSAGE_SIGNATURES}){
                # Clear any previous custom headers first
                delete $ENV{CUSTOM_HEADERS};
                my $sig_headers = get_http_msg_sig_hdrs('GET', $site);
                if($sig_headers && %$sig_headers){
                    $ENV{CUSTOM_HEADERS} = encode_json($sig_headers);
                }
            }

            my $out;
            my @cmd = (
                $CC{PHANTOMJS},
                "--local-storage-path=$crawler_tmp_dir",
                "--offline-storage-path=$crawler_tmp_dir",
                $CC{NETSNIFF_SS},
                $site);
            push @cmd, $screenshot if $screenshot;

            IPC::Run::run \@cmd, \undef, \$out,
                IPC::Run::timeout($CC{HEADLESS_ALARM},
                    exception => "$site timed out after $CC{HEADLESS_ALARM} seconds");

            die "PHANTOMJS $out" if $out =~ /^FAIL/;

            # Can have error messages at the end so have to extract the json
            my ($j) = $out =~ /^(\{\s+"log".+\})/ms;
            my $m = decode_json($j)->{log};

            my ($main_request_scheme, $check_mixed);
            for my $e (@{$m->{entries}}){
                my $response_status = $e->{response}{status};

                # netsniff records the redirects to https for some sites
                next if $response_status =~ /^3/;

                my $url = $e->{request}{url};
                next unless my ($scheme) = $url =~ /^(https?):/i;
                $scheme = lc $scheme;

                if($check_mixed && ($scheme eq 'http')){
                    # Absolute links. Even if the same host as parent, browsers will mark
                    # this as mixed and the extension can't upgrade them
                    $stats->{mixed_children}{$url} = 1;
                }

                unless($main_request_scheme){
                    $stats->{"${request_scheme}_request_uri"} = $url;
                    $stats->{"${request_scheme}_response"} = $response_status;

                    if($request_scheme eq 'http'){
                        $stats->{autoupgrade} = $scheme eq 'https' ? 1 : 0;
                    }
                    elsif($scheme eq 'https'){
                        $check_mixed = lc URI->new($url)->host;

                        my $hdrs = delete $e->{response}{headers};
                        my %response_headers;
                        # We don't want to store an array of one-key hashes.
                        for my $h (@$hdrs){
                            my ($name, $value) = @$h{qw(name value)};
                            if(exists $response_headers{$name}){
                                # https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
                                $response_headers{$name} .= ",$value";
                            }
                            else{
                                $response_headers{$name} = $value;
                            }
                        }
                        $stats->{https_response_headers} = encode_json(\%response_headers);
                    }
                    $main_request_scheme = $scheme;
                }

                $stats->{"${request_scheme}_size"} += $e->{response}{bodySize};
                ++$stats->{"${request_scheme}_requests"};
            }

            if($check_mixed){
                $stats->{mixed} = exists $stats->{mixed_children} ? 1 : 0;
            }
            1;
        } or do {
            warn "check_site error: $@ ($site)";
            system "$CC{PKILL} -9 -f '$crawler_tmp_dir '" if $crawler_tmp_dir =~ /\S/;
        };
    }
}

sub crawler_done{
    my ($k, $h, $id) = @_[KERNEL, HEAP, ARG0];

    state $VERBOSE = $CC{VERBOSE};

    $VERBOSE && warn "deleting crawler $id\n";
    my $c = delete $h->{crawlers}{$id};

    # see if any of its domains were left unfinished
    my $pid = $c->PID;
    eval {
        my $db = prep_db('queue');
        my $unfinished = $db->{complete_unfinished_worker_tasks}->execute($pid);
        if($unfinished > 0){
            $VERBOSE && warn "Marked $unfinished tasks incomplete for crawler with pid $pid\n";
        }
        1;
    } or do {
        warn "Failed to verify worker tasks: $@";
    };

    # Check and clean up tmp dirs for hung crawlers
    $h->{crawler_tmp_dirs}{$pid} = 1;
    $k->yield('prune_tmp_dirs');

    $k->yield('crawl');
}

sub crawler_debug{
    my $msg = $_[ARG0];
    $CC{VERBOSE} && warn 'crawler debug: ' . $msg . "\n";
}

sub sig_child {
    warn 'Got signal from pid ' . $_[ARG1] . ', exit status: ' . $_[ARG2] if $_[ARG2];
    $_[KERNEL]->sig_handled;
}

sub get_ua {
    my $type = shift;

    my $ua = $type eq 'mech'
        ? WWW::Mechanize->new(
              # We'll check these ourselves so we don't have to catch die in eval
              onerror => undef,
              quiet   => 1
          )
        : LWP::UserAgent->new();
    $ua->agent($CC{UA});
    $ua->timeout(10);
    return $ua;
}

sub get_con {
    $ENV{PGDATABASE} = $CC{DB}   if exists $CC{DB};
    $ENV{PGHOST}     = $CC{HOST} if exists $CC{HOST};
    $ENV{PGPORT}     = $CC{PORT} if exists $CC{PORT};
    $ENV{PGUSER}     = $CC{USER} if exists $CC{USER};
    $ENV{PGPASSWORD} = $CC{PASS} if exists $CC{PASS};

    return DBI->connect('dbi:Pg:', '', '', {
        RaiseError => 1,
        PrintError => 0,
        AutoCommit => 1,
    });
}

sub parse_argv {
    my $usage = <