Repository: cliqz-oss/whotracks.me Branch: master Commit: 44f3b9290cd0 Files: 179 Total size: 17.9 MB Directory structure: gitextract_rc25zhaz/ ├── .github/ │ └── workflows/ │ └── test.yml ├── .gitignore ├── .tool-versions ├── Dockerfile ├── Jenkinsfile ├── LICENSE.md ├── README.md ├── RIGHT_TO_AMEND.md ├── blog/ │ ├── adblockers_performance_study.md │ ├── block-third-party-cookies.md │ ├── cookie-consent.md │ ├── cookies.md │ ├── dexie_transaction_bug.md │ ├── fingerprinting.md │ ├── gdpr-what-happened.md │ ├── generating_adblocker_filters.md │ ├── google_domains.md │ ├── government_websites_september.md │ ├── how_cliqz_antitracking_protects_users.md │ ├── how_facebook_knows_exactly_what_turns_you_on.md │ ├── manifest_v3_privacy.md │ ├── private_analytics.md │ ├── static_site.md │ ├── static_site_blog.md │ ├── static_site_generation.md │ ├── static_site_visualization.md │ ├── tracker-tax.md │ ├── tracker_categories.md │ ├── trackers-who-steal.md │ ├── trackers_in_your_favorite_site.md │ ├── tracking_and_ux.md │ ├── tracking_pixel.md │ ├── update_apr_2018.md │ ├── update_dec_2017.md │ ├── update_feb_2018.md │ ├── update_jan_2018.md │ ├── update_jun_2018.md │ ├── update_may_2018.md │ ├── updating_our_tracking_prevalence_metrics.md │ ├── what_is_a_tracker.md │ └── where_is_the_data_from.md ├── contrib/ │ ├── generating_adblocker_filters.py │ ├── tracker_map_notebook.ipynb │ ├── wtm_april_update.ipynb │ └── wtm_may_update.ipynb ├── deploy_to_s3.py ├── docs/ │ └── local-build.md ├── pyproject.toml ├── static/ │ ├── font-awesome-4.7.0/ │ │ ├── HELP-US-OUT.txt │ │ ├── css/ │ │ │ └── font-awesome.css │ │ ├── fonts/ │ │ │ └── FontAwesome.otf │ │ ├── less/ │ │ │ ├── animated.less │ │ │ ├── bordered-pulled.less │ │ │ ├── core.less │ │ │ ├── fixed-width.less │ │ │ ├── font-awesome.less │ │ │ ├── icons.less │ │ │ ├── larger.less │ │ │ ├── list.less │ │ │ ├── mixins.less │ │ │ ├── path.less │ │ │ ├── rotated-flipped.less │ │ │ ├── screen-reader.less │ │ │ ├── stacked.less │ │ │ 
└── variables.less │ │ └── scss/ │ │ ├── _animated.scss │ │ ├── _bordered-pulled.scss │ │ ├── _core.scss │ │ ├── _fixed-width.scss │ │ ├── _icons.scss │ │ ├── _larger.scss │ │ ├── _list.scss │ │ ├── _mixins.scss │ │ ├── _path.scss │ │ ├── _rotated-flipped.scss │ │ ├── _screen-reader.scss │ │ ├── _stacked.scss │ │ ├── _variables.scss │ │ └── font-awesome.scss │ ├── fonts/ │ │ └── RationalTWSemiBold.otf │ ├── js/ │ │ ├── bootstrap.js │ │ ├── d3.layout.cloud.js │ │ ├── explorer.js │ │ ├── ghostery.js │ │ ├── highlight.pack.js │ │ └── search.js │ └── scss/ │ ├── _colors.scss │ ├── blog/ │ │ ├── card.scss │ │ ├── github.scss │ │ └── post.scss │ ├── bootstrap.min.scss │ ├── companies/ │ │ └── reach-chart.scss │ ├── custom.scss │ ├── datatables.colReorder.min.scss │ ├── datatables.min.scss │ ├── explorer/ │ │ └── table.scss │ ├── home/ │ │ └── index.scss │ ├── trackers/ │ │ ├── list.scss │ │ └── profile.scss │ └── websites/ │ ├── overview.scss │ └── profile.scss ├── templates/ │ ├── base.html │ ├── blog-page.html │ ├── blog.html │ ├── company-page.html │ ├── components/ │ │ ├── blog-card.html │ │ ├── breadcrumb.html │ │ ├── category-item.html │ │ ├── company-card.html │ │ ├── cookies.html │ │ ├── fingerprinting.html │ │ ├── footer.html │ │ ├── home/ │ │ │ └── header.html │ │ ├── navbar.html │ │ ├── tag_cloud.html │ │ ├── top-5-info-box.html │ │ ├── top-5-trackers.html │ │ ├── tracker-list.html │ │ ├── trackers/ │ │ │ ├── category.html │ │ │ └── header.html │ │ ├── tracking-methods.html │ │ ├── unified-ui-tracker-list.html │ │ ├── website-list.html │ │ └── websites/ │ │ ├── header.html │ │ └── tracker-list.html │ ├── explorer.html │ ├── imprint.html │ ├── index.html │ ├── not-found.html │ ├── privacy-policy.html │ ├── reach-chart-page.html │ ├── tracker-not-found.html │ ├── tracker-page.html │ ├── trackers.html │ ├── website-not-found.html │ ├── website-page.html │ └── websites.html ├── tests/ │ ├── __init__.py │ ├── test_data_integrity.py │ ├── test_db_integrity.py │ ├── 
test_db_validity.py │ ├── test_site_categories.py │ └── test_sites_data.py ├── update_trackerdb.sh ├── update_trackers_preview.py └── whotracksme/ ├── __init__.py ├── data/ │ ├── Readme.md │ ├── __init__.py │ ├── assets/ │ │ ├── trackerdb.sql │ │ └── trackers-preview.json │ ├── db.py │ ├── loader.py │ └── pack.py ├── main.py ├── qa/ │ ├── __init__.py │ ├── todo.py │ └── utils.py └── website/ ├── __init__.py ├── api/ │ └── meta.py ├── build/ │ ├── __init__.py │ ├── blog.py │ ├── companies.py │ ├── data.py │ ├── explorer.py │ ├── home.py │ ├── trackers.py │ └── websites.py ├── builder.py ├── plotting/ │ ├── .vscode/ │ │ └── settings.json │ ├── __init__.py │ ├── colors.py │ ├── companies.py │ ├── plots.py │ ├── sankey.py │ ├── trackers.py │ └── utils.py ├── serve.py ├── templates.py └── utils.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/workflows/test.yml ================================================ name: Tests on: push: branches: [master] pull_request: branches: [master] jobs: test: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v4 - name: Install sass run: | sudo apt-get update sudo apt-get install --yes ruby-sass build-essential - name: Install uv uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0 with: python-version: '3.13' - name: Install dependencies run: | uv sync --locked uv run whotracksme --help - name: Fetch test data assets run: | aws --no-sign-request s3 cp --recursive s3://data.whotracks.me/2017-06 2017-06 aws --no-sign-request s3 cp --recursive s3://data.whotracks.me/2021-06 2021-06 working-directory: whotracksme/data/assets env: AWS_DEFAULT_REGION: us-east-1 - name: Run tests run: | uv run pytest - name: Check build run: | uv run whotracksme website ================================================ FILE: .gitignore ================================================ *.pyc .cache/ 
.sass-cache/ __pycache__/ _site/ dist/ whotracksme.egg-info/ .DS_Store venv/ whotracksme/data/assets/**/*.csv whotracksme.db ================================================ FILE: .tool-versions ================================================ python 3.11.6 ================================================ FILE: Dockerfile ================================================ # Set base image to build upon FROM python:3.11-slim # Set arg and env ARG VERSION ARG UID=1000 ARG GID=1000 ARG USER=jenkins ARG GROUP=jenkins # Add jenkins user and group RUN groupadd -g ${GID} ${GROUP} && \ useradd -u ${UID} -g ${GID} -m -s /bin/bash ${USER} # Set labels to identify image LABEL vendor="Ghostery GmbH" \ maintainer="chrmod@ghostery.com" \ version=${VERSION} RUN apt-get update && \ DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ build-essential \ libffi-dev \ ruby-sass \ && \ rm -rf /var/lib/apt/lists/* && \ rm -f /var/cache/apt/*.bin # Copy application python requirements COPY requirements-dev.txt /home/jenkins/ # Install python dependencies RUN pip install -r /home/jenkins/requirements-dev.txt ================================================ FILE: Jenkinsfile ================================================ def testReport = 'test-report.xml' def stagingBucket = 'internal.clyqz.com' def stagingPrefix = '/docs/whotracksme' def productionBucket = 'whotracksme' def productionPrefix = '' node('magrathea') { stage ('Checkout') { checkout([ $class: 'GitSCM', branches: [[name: 'refs/heads/'+env.BRANCH_NAME]], extensions: [[$class: 'GitLFSPull']], userRemoteConfigs: [ [refspec: '+refs/heads/*:refs/remotes/origin/* +refs/pull/*/head:refs/remotes/origin/PR-* +refs/tags/*:refs/remotes/origin/*', url: 'https://github.com/ghostery/whotracks.me.git'] ] ]) } def img stage('Download Datasets') { dir('whotracksme/data/assets') { sh('aws s3 sync --no-sign-request --no-progress s3://data.whotracks.me/ .') } } stage('Build Docker Image') { img = docker.build('whotracksme', 
'. --build-arg user=`whoami` --build-arg UID=`id -u` --build-arg GID=`id -g`') } img.inside() { try { stage('Install') { sh("python -m pip install --user -e '.[dev]'") } stage('Test') { try { sh(script: "pytest --junit-xml=${testReport}") } catch(err) { junit(testReport) currentBuild.result = "FAILURE" } } stage('Build site') { sh('/home/jenkins/.local/bin/whotracksme website') } if (env.BRANCH_NAME == 'master') { withCredentials([[ $class: 'AmazonWebServicesCredentialsBinding', accessKeyVariable: 'AWS_ACCESS_KEY_ID', credentialsId: '04e892d6-1f78-400e-9908-1e9466e238a9', secretKeyVariable: 'AWS_SECRET_ACCESS_KEY' ]]) { stage('Publish Site') { sh("python deploy_to_s3.py ${productionBucket} ${productionPrefix} --production") } } } } finally { // cleanup sh('rm -rf _site; rm -rf .sass-cache') } } junit(testReport) } ================================================ FILE: LICENSE.md ================================================ MIT License Copyright (c) 2017 - to present Ghostery GmbH Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================
Transparency
· Privacy
· Tracking landscape
· Built by Ghostery
Trackers
· Websites
· Explorer
| | 99% OF REQUESTS | MEDIAN |
|---|---|---|
| Ghostery | 0.050ms | 0.007ms |
| uBlock Origin | 0.124ms (2.5x slower) | 0.017ms (2.7x slower) |
| Adblock Plus | 0.103ms (2.1x slower) | 0.019ms (2.9x slower) |
| Brave | 1.288ms (25.9x slower) | 0.041ms (6.3x slower) |
| DuckDuckGo | 12.085ms (242.5x slower) | 8.270ms (1258.4x slower) |
| | 99% OF REQUESTS | MEDIAN |
|---|---|---|
| Ghostery | 0.049ms | 0.006ms |
| uBlock Origin | 0.112ms (2.3x slower) | 0.018ms (2.8x slower) |
| Adblock Plus | 0.105ms (2.2x slower) | 0.020ms (3.1x slower) |
| Brave | 1.270ms (26.2x slower) | 0.038ms (5.9x slower) |
| DuckDuckGo | 11.190ms (230.5x slower) | 6.781ms (1060.5x slower) |
| | 99% OF REQUESTS | MEDIAN |
|---|---|---|
| Ghostery | 0.052ms | 0.007ms |
| uBlock Origin | 0.165ms (3.1x slower) | 0.016ms (2.2x slower) |
| Adblock Plus | 0.099ms (1.9x slower) | 0.014ms (1.9x slower) |
| Brave | 1.468ms (28.0x slower) | 0.062ms (8.5x slower) |
| DuckDuckGo | 13.025ms (248.5x slower) | 8.31ms (1130.6x slower) |
Google's consent cookie lasts for 20 years; a tracking cookie on the Economist lasts for 68 years.
Third-party cookies also represent a security risk to you. [Cross-site request forgery](https://en.wikipedia.org/wiki/Cross-site_request_forgery) (CSRF) attacks are based on the idea that I can make a third-party request to a site that the browser has previously authenticated with, and the browser will send the credentials with the request. If browsers did not allow third-party cookies these attacks would be much harder to mount than they currently are. These kinds of attacks have been around for over 15 years, and methods to mitigate them are [still being proposed](https://blog.mozilla.org/security/2018/04/24/same-site-cookies-in-firefox-60/), while browser-side protections, such as [first-party isolation](https://wiki.mozilla.org/Security/FirstPartyIsolation), have very limited distribution.

Furthermore, the use-cases which legitimately use third-party cookies, like Single-Sign-On portals, or third-party authentication mechanisms, have alternatives which do not require cookies. Sites using a centralised authentication domain can obtain authentication tokens via first-party redirects, and OAuth[^2] can be used to log in to sites using third-party credentials. These mechanisms have the added bonus of transparency and implied consent: when a user logs in with Facebook on a site, the user is actively allowing this connection between the site and Facebook to proceed.

So why do we have third-party cookies? Actually, the original 1997 [RFC Specification](https://tools.ietf.org/html/rfc2109) of the cookie standard proposed that third-party cookies should not be allowed on privacy grounds:

> This restriction prevents a malicious service author from using unverifiable transactions to induce a user agent to start or continue a session with a server in a different domain. The starting or continuation of such sessions could be contrary to the privacy expectations of the user, and could also be a security problem.


and browsers should have this setting by default:

> User agents may offer configurable options that allow the user agent, or any autonomous programs that the user agent executes, to ignore the above rule, so long as these override options default to "off".

However, these recommendations were not implemented by browser developers at that time, and the default of _'allow all cookies'_ has remained since then.

Currently, almost all major browsers allow all cookies by default. The one exception is Safari, which only allows third-party cookies for domains which have been visited as a first party. This setting mitigates tracking from unknown domains, but still allows others to track, and does not prevent CSRF attacks. Mozilla also previously [attempted](https://blog.mozilla.org/netpolicy/2013/02/25/firefox-getting-smarter-about-third-party-cookies/) to change Firefox's default handling of third-party cookies in 2013, but pressure from the ad industry led to a [U-turn](https://blog.mozilla.org/blog/2013/05/10/personalization-with-respect/) before these changes went live. The failure of browsers to handle third-party cookie tracking [has arguably led](https://medium.com/the-graph/how-to-reverse-publisher-revenue-drain-c33e41bf0665) to the increase in adblocker usage since then.

The effect that this default has had over the last 20 years is that developers now assume that cookies are allowed in all contexts. This causes many workflows to break once this assumption no longer holds. This leads to a vicious cycle, where attempts to limit third-party cookies are foiled because they break too many sites. Apple's push to reduce third-party cookie tracking with their [Intelligent Tracking Prevention](https://webkit.org/blog/7675/intelligent-tracking-prevention/) technology had to include a section to explain to developers how to solve several use-cases when their cookies are limited.
This technology still allows third-party cookies from visited sites however, and this method is also recommended for implementing single sign-ons.

## Moving away from third-party cookies

In 2015 Cliqz[^1] released an anti-tracking technology which [aggressively blocks third-party cookies](./how_cliqz_antitracking_protects_users.html). Third-party cookies are blocked unless certain heuristics are triggered. These heuristics aim to mitigate common cases where cookie blocking breaks workflows, but also require user action to trigger. A Facebook button can be loaded without cookies, but if the user then clicks on it, there is an implied consent to allow the cookies in this case. This method blocks 97% of third-party cookies, with minimal breakage of pages.

| Browser | Default Cookie setting |
|---|---|
| Google Chrome | Allow all. |
| Mozilla Firefox | Allow all. |
| Apple Safari | Allow from visited; tracking cookies limited. |
| Cliqz Browser / Ghostery extension | Block all third-party, unless user interaction or compatibility exception. |
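
The defaults in the table can be read as three decision rules for whether a browser attaches cookies to a request. The following is a minimal sketch of those rules, not how any of these browsers is actually implemented: the names `Policy`, `BrowserState`, `site_of` and `sends_cookies` are hypothetical, the "site" of a hostname is approximated by its last two labels (real browsers consult the Public Suffix List), and Safari's extra limits on tracking cookies and the compatibility exceptions in Cliqz and Ghostery are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    """Simplified versions of the default cookie settings in the table."""
    ALLOW_ALL = "allow all"                    # Chrome / Firefox row
    ALLOW_FROM_VISITED = "allow from visited"  # Safari row (tracking limits omitted)
    BLOCK_THIRD_PARTY = "block third-party"    # Cliqz / Ghostery row


def site_of(hostname: str) -> str:
    """Naive approximation of a site: the last two labels of the hostname.

    Assumption for illustration only; real browsers use the Public Suffix List.
    """
    return ".".join(hostname.lower().split(".")[-2:])


@dataclass
class BrowserState:
    policy: Policy
    # Sites the user has opened as a first party.
    visited_sites: set[str] = field(default_factory=set)
    # Third-party sites whose widgets the user has actively clicked on.
    interacted_sites: set[str] = field(default_factory=set)


def sends_cookies(state: BrowserState, page_host: str, request_host: str) -> bool:
    """Return True if cookies would be attached to a request made from a page."""
    page_site, request_site = site_of(page_host), site_of(request_host)
    if page_site == request_site:
        return True  # first-party requests always carry cookies in this model
    if state.policy is Policy.ALLOW_ALL:
        return True
    if state.policy is Policy.ALLOW_FROM_VISITED:
        return request_site in state.visited_sites
    # BLOCK_THIRD_PARTY: only after an explicit user interaction.
    return request_site in state.interacted_sites
```

Under the third policy, a Facebook button embedded on a news page is loaded without cookies until the user clicks it, at which point `facebook.com` would be added to `interacted_sites` and subsequent requests would carry cookies.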
Looks like I'm logged out...
After the confirmation of a successful logout, simply navigate back to www.office.com, and one is returned to the view after login, including an up-to-date feed of recently changed documents.
Document change feed still shown after logout.
The information shown after logout is fetched via the [SharePoint REST API](https://docs.microsoft.com/en-us/sharepoint/dev/sp-add-ins/sharepoint-net-server-csom-jsom-and-rest-api-index), and the authentication token for this is neither deleted nor expired after the failed logout - hence the page can continue to access this information. The token can be collected from the developer tools and then reused for API calls, for example to list folders in this organisation's SharePoint:

```javascript
// Token copied from the browser's developer tools (truncated here).
var accessToken = "eyJ0...";
var baseUrl = 'https://org-my.sharepoint.com/_api/';

// Send the token as a bearer credential and ask for verbose OData JSON.
var headers = new Headers();
headers.append('Authorization', `Bearer ${accessToken}`);
headers.append('Accept', 'application/json;odata=verbose');

// Query the REST API and log the parsed response.
fetch(`${baseUrl}web/lists`, { headers })
  .then(resp => resp.json())
  .then(res => console.log(res))
```

The broken logout state can only be resolved by manually deleting office.com cookies. We also found the session may eventually be expired, but this only happened after multiple hours. Hence, users affected by this will 1) likely not be aware that they're not logged out properly, as the logout appears to be successful, and 2) would not be able to log out anyway if they noticed the issue.

The issues with Office continue when trying to purchase an Office365 trial from `https://products.office.com/try`. This time, the source of the problem is detected, but the user is given no choice to continue unless they compromise their security and privacy by enabling cookies. Ironically, they also imply that allowing third-party cookies is somehow safer.
It is not possible to buy Office without allowing third-party cookies.
### Pay with your Cookies

It is common practice for E-Commerce sites to embed payment systems from third-party vendors, such as PayPal, on their checkout pages. Such widgets should not require third-party cookies - usually the user can be redirected to pay at the payment provider's site. This method is preferable, as it reduces the chances of phishing: loading the payment page as a first party will make the URL and certificate status visible, and only prompting users to enter payment information on the first-party site is also good practice.

Despite this, we see examples of payment being blocked when third-party cookies are disabled. One such example is on the German E-Commerce site [Thomann.de](../websites/thomann.de.html). When attempting to check out with Amazon Pay, we get an error mentioning that third-party cookies are being blocked:
"There was an error processing the Amazon payment. A possible cause is third-party cookie blocking."
### Connect with Google? Third-party cookies required

Many sites use Google's connect SDK to allow users to log in to sites with their Google account. When testing cases on [www.tripadvisor.com](https://www.tripadvisor.com) and [www.stumbleupon.com](https://www.stumbleupon.com) with third-party cookies disabled, the 'Connect with Google' button fails to do anything when clicked. Both of these sites also offer Facebook login, which works with cookies disabled. It is not clear why the Google implementation requires third-party cookies to be allowed.
Tripadvisor signup buttons.
### Please let me track your tracking opt-out

Following GDPR, websites using third-party services which collect data about users must [acquire consent](./update_jun_2018.html) for this, as well as provide a reasonable way of opting out of data collection and processing. While many publishers have converged on a solution which [gathers consent as a first-party cookie](https://iabtechlab.com/standards/gdpr-transparency-and-consent-framework/) which can then be passed to third parties, others still rely on an older system of setting opt-out cookies for each vendor. Obviously, if third-party cookies are blocked, this mechanism will not work, as can be seen on the [Telegraph](../websites/telegraph.co.uk.html):
"You browser is currently blocking 3rd party cookies ... you will need to enabled 3rd party cookies if you want all of the opt-outs on this page to work."
In this case, users with third-party cookies disabled will be denied their right to opt out (though blocking these cookies will effectively prevent a large proportion of tracking). Third-party vendors may say that this mechanism is required in order to remember a user's consent settings. However, previous attempts to allow browsers to convey tracking consent explicitly to servers, via the ['Do Not Track'](https://www.w3.org/TR/tracking-dnt/) standard, were killed by the same vendors collectively saying they would [ignore this signal](https://blogs.harvard.edu/doc/2015/09/23/how-adtech-not-ad-blocking-breaks-the-social-contract/).

## Conclusion

When the idea of cookies was first proposed, the standard writers were concerned about the privacy implications of allowing third-party cookies, and specified that browser vendors should disable them by default. Fast-forward 20 years and the majority of browsers on the web will allow all third-party cookies. The results are significant challenges in protecting against cross-site request forgery, with countless sites and accounts compromised along the way, and pervasive privacy invasion in the form of cross-site tracking of users.

We argue that we should aim to return to a web where third-party cookies are blocked by default, and are making that possible for users of our anti-tracking technology in [Cliqz](https://cliqz.com/) and [Ghostery](https://www.ghostery.com/). However, this is made difficult by the prevailing assumption that cookies are a free-for-all, making many sites fail to function properly in this environment. In this regard we are constantly improving heuristics to mitigate the breakage issues we do find.

We showed multiple cases where the assumption that third-party cookies will be allowed leads to both benign and potentially dangerous issues for users who block cookies.
Some of these cases affect payments, so perhaps if cookie-blocking becomes more common and companies' bottom lines are affected these issues will be fixed. This is a chicken and egg problem though: if the web is broken for users blocking cookies, then we may never achieve the critical mass required to get it fixed.

**For users**, getting control over which cookies your browser sends out, and to whom, is a key part of protecting privacy online, but also something that is not universally recognised by browser privacy tools. Most adblockers, for example, do nothing to the cookies of third-party requests which are not on their blocklists. Wider adoption of the kind of cookie blocking that Cliqz and Ghostery do will help us to achieve this critical mass, and push more websites to ensure that their services still work correctly for users who choose more private browser configurations.

**Developers** have a part to play here too. By building services which do not require third-party cookies, or at least continue to function without them, it becomes easier for users to turn off third-party cookies, and the web becomes more privacy-friendly. As we have seen in this article, even the biggest tech companies are currently failing at this, but this seems to be more due to a lack of awareness than any difficulty in implementation.

[^1]: Disclosure: WhoTracks.Me is operated by Cliqz.

[^2]: Note that both of these methods also have some privacy issues. First-party redirection has been exploited for [user tracking](https://brave.com/redirection-based-tracking/), and OAuth dialogs can trick users into granting [many more permissions](https://lifehacker.com/how-to-revoke-pokemon-go-s-extensive-permissions-to-you-1783466118) than they actually need.

================================================
FILE: blog/cookie-consent.md
================================================
title: Improving Cookie Consent
subtitle: Cliqz' new feature to make consent fairer
author: privacy team
type: article
publish: True
date: 2019-11-28
tags: blog, gdpr, consent
header_img: blog/autoconsent/cookie-blocker-prompt.png
+++

Since the GDPR came into force in May last year, the Cookie-Consent Popup has become a fixture of browsing the web. These popups are ostensibly there to allow you to choose whether you agree or disagree to your data being used for certain purposes on the site, but confusing UI design and tricks mean that many users are not able to select their desired consent settings. A recent [study](https://arxiv.org/pdf/1909.02638.pdf) showed that user fatigue with consent popups and simple UI tricks are able to artificially inflate the opt-in rate. The study also showed that, when opt-out is the default, only 0.1% of users would consent to all data processing. This is in stark contrast to the over 90% opt-in rate that the [industry claims](https://www.thedrum.com/news/2018/07/31/over-90-users-consent-gdpr-requests-says-quantcast-after-enabling-1bn-them), and uses to justify that users are OK with tracking.

How can we restore balance to this situation, and allow users a fair choice about how their data is used? At Cliqz we have been developing a new feature that aims to address the difficulty of denying consent, based around 3 core principles:

1. Opt-out and opt-in should both require a maximum of one click, i.e. the time-cost should be the same, no matter which choice is made.
2. The user should not have to decide individually for every site. Their default choice can be used to give consent after their initial decision.
3. Consent banners only offering an 'OK' or 'Allow' option do not allow user choice.
They are at best a distraction for the user, and at worst drive consent fatigue and encourage the bad practice of automatically clicking away message prompts. These should be hidden.

Unfortunately, implementing an automated consent choice in the browser is made challenging by the lack of adoption or adherence to browser standards. The [Do Not Track](https://www.w3.org/blog/2018/06/do-not-track-and-the-gdpr/) standard enables users to broadcast preferences around tracking, and for sites to communicate tracking status to the browser. Before that, the [P3P Project](https://www.w3.org/P3P/) attempted to standardise privacy practices and allow automated decision making around them. Both of these standards have been rejected by the tracking industry, who prefer to present consent on their terms.

The industry have instead proposed and implemented the [Transparency and Consent Framework](https://iabeurope.eu/transparency-consent-framework/), which primarily focuses on communicating consent between vendors. It is a read-only API, so the browser can only read the consent status as set by the site, and not modify it. This means that consent can currently only be expressed by clicking through HTML forms.
Navigating a Cookie-Consent Popup manually.
Luckily, the number of vendors offering consent solutions is limited, and browser extensions can simulate clicking through forms. Thus, [autoconsent](https://github.com/cliqz-oss/autoconsent) was born - a library of rules standardising the navigation of consent forms for the most popular sites and vendors. This library is able to:

* Detect the presence of supported Consent Management Providers on a page.
* Determine whether a popup or overlay is being shown on the page.
* Execute an opt-in (allow all purposes) or opt-out (reject all purposes).
* Where available, re-open the popup to allow modification of the settings.

In practice, this allows consent popups to be rapidly dismissed when loading a new site. The speed depends on the provider and how quickly their UI can be manipulated. In all cases, however, this is faster than a user could navigate the interface.
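
The first three capabilities listed above (detect the provider, check whether its popup is showing, then opt in or out) can be illustrated with a small rule model. This is only a sketch under stated assumptions and not the autoconsent API: the library itself runs in the browser, and the `ConsentRule` fields, the `plan_clicks` function and the idea of representing a page as the set of CSS selectors that currently match are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRule:
    """Hypothetical rule describing one Consent Management Provider (CMP)."""
    name: str
    detect_selector: str             # present whenever the CMP is loaded on the page
    popup_selector: str              # present only while the popup is being shown
    opt_out_clicks: tuple[str, ...]  # elements to click, in order, to reject all purposes
    opt_in_clicks: tuple[str, ...]   # elements to click, in order, to allow all purposes


def plan_clicks(page_selectors: set[str], rules: list[ConsentRule], opt_in: bool) -> list[str]:
    """Return the ordered clicks needed to apply the user's stored choice.

    `page_selectors` stands in for the DOM: the set of selectors that
    currently match something on the page (an assumption of this sketch).
    """
    for rule in rules:
        if rule.detect_selector not in page_selectors:
            continue  # this CMP is not on the page, try the next rule
        if rule.popup_selector not in page_selectors:
            return []  # CMP detected but no popup is shown, nothing to do
        return list(rule.opt_in_clicks if opt_in else rule.opt_out_clicks)
    return []  # no supported CMP found
```

With a rule for an imaginary provider whose opt-out needs two clicks (open settings, then reject all), `plan_clicks` returns those two selectors for a user who chose to opt out, and a single accept-all selector for a user who chose to opt in.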
Automatic navigation of the Cookie-Consent Popup.
For popups that are informational only, or force affirmative consent, we apply simple cosmetic rules. These are CSS rules that define elements in the page that should be hidden. As with the consent rules, we benefit from the de facto standardisation of tools for displaying popups, such that a small number of rules can support the majority of popups shown by websites.

These elements combined mean that we now just have to ask the user once whether they want to opt in or opt out, then they will not be bothered by consent popups on the majority of sites they visit. At the same time, they will signal to these sites their approval or disapproval of their data collection practices. This signal of non-consent is important to encourage and incentivise a shift in data usage practices on the web. When sites realise they cannot just trick users into allowing invasive data collection, they will have a strong incentive to change the way they operate and respect users more.

The new Cliqz Cookie-Popup blocker is available in the latest version of the Cliqz browser. Get it at [cliqz.com](https://cliqz.com/download).

================================================
FILE: blog/cookies.md
================================================
title: Cookies
subtitle: A small piece of data sent from a website, meant to 'help', used to track.
author: privacy team
type: primer
publish: True
date: 2017-07-22
tags: primer, tracking
header_img: blog/blog-cookies.jpg
+++

An HTTP cookie (also called web cookie, Internet cookie, browser cookie, or simply cookie) is a small piece of data sent from a website and stored on the user's computer by the user's web browser while the user is browsing. Cookies were designed to be a reliable mechanism for websites to remember stateful information (such as items added in the shopping cart in an online store) or to record the user's browsing activity (including clicking particular buttons, logging in, or recording which pages were visited in the past).
They can also be used to remember arbitrary pieces of information that the user previously entered into form fields such as names, addresses, passwords, and credit card numbers.

Other kinds of cookies perform essential functions in the modern web. Perhaps most importantly, authentication cookies are the most common method used by web servers to know whether the user is logged in or not, and which account they are logged in with. Without such a mechanism, the site would not know whether to send a page containing sensitive information, or require the user to authenticate themselves by logging in. The security of an authentication cookie generally depends on the security of the issuing website and the user's web browser, and on whether the cookie data is encrypted. Security vulnerabilities may allow a cookie's data to be read by a hacker, used to gain access to user data, or used to gain access (with the user's credentials) to the website to which the cookie belongs (see cross-site scripting and cross-site request forgery for examples).[[1](http://news.cnet.com/8301-10789_3-9918582-57.html)]

## Tracking Cookies

The tracking cookies, and especially third-party tracking cookies, are commonly used as ways to compile long-term records of individuals' browsing histories – a potential privacy concern that prompted European [[2](http://webcookies.org/faq/#Directive)] and U.S. lawmakers to take action in 2011. European law [[3](http://www.bbc.co.uk/news/technology-12668552)] requires that all websites targeting European Union member states gain "informed consent" from users before storing non-essential cookies on their device.

The excerpt above has been retrieved from [Wikipedia](https://en.wikipedia.org/wiki/HTTP_cookie).

#### References:

[1] [Gmail cookie stolen via Google Spreadsheets](http://news.cnet.com/8301-10789_3-9918582-57.html)

Percentage change in traffic to Google search result pages, April 2018
We can further see the magnitude of this change by focusing on data for Germany. If we look at the relative proportion of pages loaded on `www.google.de` and `www.google.com` in Germany over the last month, we see a marked increase, with the share of traffic to `www.google.com` going up from around 5% to over 40%.
Search results pages used in Germany, April 2018
Why is Google doing this? We don't know - we're not aware of any official announcement. However, one reason could be a reaction to increased usage of restrictive cookie settings, such as allowing cookies only from visited sites, or Apple's [Intelligent Tracking Prevention](https://webkit.org/blog/7675/intelligent-tracking-prevention/). If a user rarely visits the google.com domain, these technologies can expire its cookie earlier, or prevent its use in third-party contexts. As `google.com` is the domain used to authenticate with Google services, if the browser sends `google.com` cookies in a third-party context, these visits can be directly attributed to one's Google profile. This change therefore increases the likelihood that the user will have recently visited `www.google.com`, so that Google's tracking can continue uninterrupted.
Tracking from `google.*` domains [reaches 30% of web traffic](https://whotracks.me/trackers/google.html), and the majority of this reach is contributed by the [`google.com` domain](https://github.com/ghostery/whotracks.me/blob/master/whotracksme/data/assets/2018-03/global/domains.csv#L5). As this change makes it very difficult to avoid visiting the `google.com` domain as a first party, preventing this tracking in a vanilla browser would require disabling all third-party cookies. Alternatively, [Cliqz](https://cliqz.com/) and [Ghostery's](https://www.ghostery.com/) AI anti-tracking technologies block all third-party tracking cookies (Disclosure: the author works on this product). [Privacy Badger](https://www.eff.org/privacybadger) is also able to block third-party tracking cookies.
================================================
FILE: blog/government_websites_september.md
================================================
title: Government websites
subtitle: If you are not the product, you're the taxpayer
author: privacy team
type: article
publish: True
date: 2018-10-10
tags: trackers, government
header_img: blog/gov_trackers/gov.png
redirect_url: https://www.ghostery.com/blog/government-websites-trackers
+++
_This post is one of our regular monthly blogs accompanying an update to the data displayed on WhoTracks.Me. In these posts we introduce what data has been added as well as point out interesting trends and case-studies we found in the last month._
Average number of trackers seen on selected government websites from the WhoTracks.Me September dataset.
Here's a list of the government websites ending up in this month's release:
| Country | Site | Notable trackers |
|---|---|---|
| Australia | bom.gov.au | Google Analytics, Doubleclick |
| Europe | europa.eu | Google Analytics, Google, Twitter |
| France | ants.gouv.fr | Google Analytics, Doubleclick |
| France | legifrance.gouv.fr | AT Internet |
| France | impots.gouv.fr | AT Internet |
| Russia | zakupki.gov.ru | Yandex |
| UK | tax.service.gov.uk | Google Analytics, Optimizely |
| US | ca.gov | Google Analytics, Google, AddThis |
| US | dhs.gov | Google Analytics, Doubleclick |
| US | irs.gov | Google Analytics, New Relic, AddToAny, Youtube, Foresee |
| US | nih.gov | Google Analytics, Doubleclick, Google |
| US | noaa.gov | Google Analytics |
| US | state.gov | Google Analytics, Google, Youtube, Qualtrics |
| US | weather.gov | Google Analytics, AddThis |
Now, upon visiting a site which has a Facebook widget, in this case bild.de, we can see a request to facebook.com. As third-party cookies are enabled in the browser (the default setting in all major browsers), we will send the cookie we got on the previous page along with the request. The Referer header of this request will also contain the site we are visiting: www.bild.de.
Here we can see many parameters are sent in the request, and many values match across both requests. We cannot know for sure whether these represent uids or just other values used legitimately for the service. However, the qn value is suspicious, as it is a long, cryptic value which remains the same when visiting different sites.
We now try opening the same sites in a different browser:
Again, pixels are generated with various parameters set in the request URL. Some are the same as we saw in the first test, for example the qq parameter. However, looking at the qn value we see that it is again the same on both web pages, but different from the value we saw on Mac. We can hypothesise that this is a fingerprint of this browser which functions as a uid; however, we would need more examples from more unique browsers to properly test this.
Finally, we test the qn value in a private tab in the first browser. As shown below, we see that the same fingerprint is generated. Moat are therefore also able to tag page views in private tabs with the same uid as in a normal window, suggesting that they can bypass this protection for their tracking purposes.
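The manual comparison above (same value across sites, different value across browsers) can be sketched in a few lines of Python. This is only an illustration of the reasoning: the `tracker.example` URLs and parameter values are made up, the function names are hypothetical, and the `min_length` cut-off is an arbitrary assumption.

```python
from urllib.parse import urlsplit, parse_qs


def shared_param_values(urls):
    """Return {param: value} for query parameters with the same value in every URL."""
    per_url = [
        {name: values[0] for name, values in parse_qs(urlsplit(u).query).items()}
        for u in urls
    ]
    if not per_url:
        return {}
    first = per_url[0]
    common = set(first).intersection(*per_url[1:])
    return {k: first[k] for k in common if all(p[k] == first[k] for p in per_url)}


def uid_candidates(browser_a_urls, browser_b_urls, min_length=10):
    """Parameters stable across sites within one browser, but differing between browsers."""
    a = shared_param_values(browser_a_urls)
    b = shared_param_values(browser_b_urls)
    return sorted(
        k for k in a.keys() & b.keys()
        if a[k] != b[k] and len(a[k]) >= min_length  # long, browser-specific values
    )


# Hypothetical pixel requests: two sites visited in each of two browsers.
browser_a = [
    "https://tracker.example/pixel?qq=1&qn=a1b2c3d4e5f6a7b8&site=news",
    "https://tracker.example/pixel?qq=1&qn=a1b2c3d4e5f6a7b8&site=shop",
]
browser_b = [
    "https://tracker.example/pixel?qq=1&qn=ffee001122334455&site=news",
    "https://tracker.example/pixel?qq=1&qn=ffee001122334455&site=shop",
]
print(uid_candidates(browser_a, browser_b))
```

Here `qq` is ignored because it is identical in both browsers, `site` because it changes from page to page, and only `qn` is flagged. As noted above, two browsers are far too few to draw firm conclusions.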
# Conclusion
In this post we’ve given a general description of how online tracking works, and looked at the extent of tracker companies’ reach across the web. In the next post we will look at how we can stop this tracking, and give an in-depth description of how our Cliqz Anti-tracking technology works to prevent tracking without an adverse effect on user experience.
================================================
FILE: blog/manifest_v3_privacy.md
================================================
title: Chrome's Manifest V3 - Improving Privacy?
subtitle: How Chrome's changes will reduce user privacy
author: privacy team
type: article
publish: True
date: 2019-06-18
tags: blog, extensions, privacy, chrome
header_img: blog/adblocker-perf-study.jpg
redirect_url: https://www.ghostery.com/blog/manifest-v3-privacy
+++
The Chrome team's proposed changes to browser extension APIs, known as Manifest v3, have proven controversial due to their expected impact on adblockers and privacy extensions. Of particular concern are the changes to the `webRequest` API, whose blocking capabilities are being replaced by the `declarativeNetRequest` API. In repeated posts the Chrome team claim that these changes are required to improve the *performance*, *security* and *privacy* of extensions. In a [previous post](./adblockers_performance_study.html) we showed that, for the most popular adblocker engines, performance is already very good, and these changes are unlikely to improve much. In this post we assess the privacy argument for the changes to request handling, whether the proposed changes do improve privacy, and how Ghostery specifically will be affected. We find that:
* The Chrome team have only belatedly stated specific privacy concerns with the `webRequest` API, and these are still not included in the design document.
* The proposed changes do not provide any protections against the stated privacy issues.
* Privacy extensions like Ghostery will be negatively impacted by the changes, reducing their ability to keep users safe online.
## Extension privacy
Browser extensions have the potential to cause many privacy problems - when granted permissions, they can see every page you visit in the browser, view their contents, read and write form data, and send requests to any server on the internet. These powers are required for some of the valuable features extensions provide. Therefore, as the Chrome team rightly [point out](https://blog.chromium.org/2019/05/taking-action-on-deceptive-installation.html), ensuring extensions are installed with the user's consent is the first step to address privacy.
The Manifest v3 changes, however, primarily address extensions' capabilities post install. As privacy at this point is also a stated goal, what are the privacy concerns and attacks that the changes seek to address? In the Manifest V3 [design document](https://docs.google.com/document/d/1nPu6Wy4LWR66EFLeYInl3NzzhHzc-qnk4w4PX-0XMw8/edit#heading=h.9lwe237fxtp2) this goal is stated as follows:
> Users should have increased control over their extensions. A user should be able to determine what information is available to an extension, and be able to control that privilege.
Later in the document the changes to the `webRequest` API are described, but only using a performance-based reasoning:
> … the extension then performs arbitrary (and potentially very slow) JavaScript, and returns the result back to the browser process. This can have a significant effect on every single network request, ...
They also acknowledge that the `webRequest` API should remain in place for observation.
> The non-blocking implementation of the webRequest API, which allows extensions to observe network requests, but not modify, redirect, or block them (and thus doesn't prevent Chrome from continuing to process the request) will not be discouraged.
This implies that the potential privacy impact of extensions being able to observe all requests going out of the browser is not a concern for these API changes. While the `webRequest` API remains, the switch to allow blocking only via the `declarativeNetRequest` API does nothing for the stated privacy goal of increasing user control over the information extensions can access.
Despite this, since [our study](./adblockers_performance_study.html) showed that the performance cost of `webRequest` blocking for leading adblockers was not an issue, the Chrome team have focused on privacy reasons for the changes. In their [recent blog](https://blog.chromium.org/2019/06/web-request-and-declarative-net-request.html) about web request and declarative net request changes, they state:
> In order to improve the security and privacy guarantees of the extensions platform, we are rethinking some of the extension platform's core APIs. That's why we're planning to replace the blocking Web Request API with the Declarative Net Request API.
This shift in angle has also come up in public statements by Chrome devs:
> "… The big problem with webRequest is unfixable privacy and security holes. …" @justinschuh ([Source](https://twitter.com/justinschuh/status/1134060703231254528))
In the blog post they also mention one potential malicious use of webRequest:
> Because all of the request data is exposed to the extension, it makes it very easy for a malicious developer to abuse that access to a user’s credentials, accounts, or personal information.
If this is the single privacy loophole the `webRequest` changes are targeting, then it seems strange that the solution is to remove the blocking capabilities of `webRequest` and leave the observational ones. Post Manifest V3, the exact same malicious extension will be possible. We can imagine that the Chrome team's strategy may be that, by providing a simple alternative API for blocking use-cases, the extension review process can be tougher for extensions asking for `webRequest` permissions. This, however, would also be possible by just introducing the new API, leaving `webRequest` as it is, and providing developer incentives to switch unless they really need `webRequest` for their use-case.
It is strange that this privacy issue was not stated in the original design document, and the proposed change to `webRequest` is seemingly just collateral damage that does not address the stated goals. More transparency is needed on what the strategy is here, and why keeping `webRequest` observation with blocking removed should be the solution.
To summarise:
- The stated privacy improvements of Manifest V3 are addressed elsewhere in the proposals.
- The privacy and security issues with `webRequest` blocking have not been fully articulated by the Chrome team, with only a brief mention of malicious behaviour in a blog post last month.
- The removal of `webRequest` blocking does not improve the privacy of extensions.
Therefore at this point, the primary impact on privacy from the proposed changes will be the neutering of the capabilities of several privacy extensions. Privacy Badger devs [expect their core functionality to be broken](https://github.com/EFForg/privacybadger/issues/2273) by the changes. Similarly, we expect it to be difficult to provide the same level of protection in Ghostery should these changes come into effect, and we will describe why in the rest of this post.
It is ironic that a change ostensibly aimed at improving user privacy will actually reduce it for many users who rely on privacy extensions to protect them online. Some have suggested that the changes simply align Chrome with Apple's Safari, which provides a similar declarative blocking API for extensions. This overlooks the fact that Safari comes with significant privacy protections by default, having blocked most third-party cookies by default for years, and recently bringing in advanced anti-tracking measures in the form of [ITP](https://webkit.org/blog/8613/intelligent-tracking-prevention-2-1/). Chrome, on the other hand, ships with zero tracking protection by default, and is now hindering extensions which try to provide protections comparable to those of other browsers.
## How removing webRequest blocking affects Ghostery
This analysis is based on the `declarativeNetRequest` [API documentation](https://developer.chrome.com/extensions/declarativeNetRequest) as of 17th June 2019. The primary features of the API are:
1. A matching grammar for specifying rules that will trigger blocking, header modification or redirects.
2. Up to 30,000 static rules per extension
3. The ability to add _dynamic_ rules at runtime, up to a maximum of 5,000 rules.
4. Rules can have a white- or black-list of first-party sites, to control triggering.
5. Individual sites can be dynamically whitelisted, up to a maximum of 100 per extension.
Ghostery contains the following components which will be affected by the webRequest API changes:
### 1. Tracker matching and blocking
Ghostery contains a blocklist of over 4,000 filters which are used to detect and block trackers. The extension allows users fine-grained control over these, allowing or blocking specific trackers on specific sites or globally. The list of detected trackers is shown in the Ghostery UI for each page visited.
To support `declarativeNetRequest`, these 4,000 filters would have to be rewritten in the new filter grammar that Chrome offers. We are likely to lose some filters in the process, as certain types of matching rule, for example regular expressions, likely cannot be implemented in the more restrictive grammar.
The more challenging issue, however, is maintaining Ghostery's rich configurability with the low threshold of dynamic rules allowed. As every rule should be toggleable, all 4,000 filters would have to be _dynamic_ rules. This means that we are already using 80% of our allowance from the start, before we have even started adding supplementary rules for adblocking and cookie blocking.
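The rule budget is simple arithmetic, spelled out here using only the figures quoted in this post. The constant names are ours, and "over 4,000" filters is rounded down to 4,000, so the real headroom is slightly smaller.

```python
DYNAMIC_RULE_LIMIT = 5_000    # dynamic rules per extension (API docs, 17th June 2019)
TRACKER_FILTERS = 4_000       # Ghostery tracker filters, all needing to be toggleable
COOKIE_BLOCKLIST_MIN = 2_000  # lower bound of the cookie blocklist size quoted in this post

used_share = TRACKER_FILTERS / DYNAMIC_RULE_LIMIT
remaining = DYNAMIC_RULE_LIMIT - TRACKER_FILTERS
cookie_list_fits = COOKIE_BLOCKLIST_MIN <= remaining

print(f"{used_share:.0%} of the dynamic rule budget used, {remaining} rules left")
print(f"Smallest cookie blocklist fits in what is left: {cookie_list_fits}")
```

Even the smallest version of the cookie blocklist does not fit into the remaining allowance, before any adblocking rules are considered.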
Likewise, the limit of 100 whitelisted sites is prohibitively low, as many users may use the Ghostery 'Trust Site' feature for more sites than this. It is unclear how to handle hitting this limit, as to the user it will seem like the feature is broken if they trust a site, but it does not get saved.
Furthermore, the new API, in its current form, does not report the results of blocking back to the extension. This means that we will still have to run our filters on all URLs via the `webRequest` API anyway, in order to display the list of trackers seen and blocked. The user therefore pays the cost of keeping the block list loaded in memory and of matching each URL twice.
### 2. Cookie blocking
The Ghostery extension uses a heuristic third-party cookie blocker as part of the 'Enhanced Anti-Tracking' feature. This feature blocks third-party cookies in most cases, using a set of heuristics to decide when cookies should be allowed. It is currently not clear if these heuristics will be able to work correctly without the webRequest API, nor if the dynamic filter cap is sufficient to even hold the basic cookie blocklist.
Our cookie heuristics respond to user input, for example clicking on a Facebook like button or Google login form, in order to trigger a temporary cookie whitelist for a specific domain. To implement this with `declarativeNetRequest`, we would have to add or modify our cookie blocking rule temporarily. As the API for this is asynchronous, we introduce a race condition that we did not have before. If the rule is not added before the request we want to whitelist, the mechanism will fail. This can, for example, break Google logins on third-party sites.
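The race can be illustrated with a toy simulation. It does not call any browser API: the two coroutines merely stand in for "an asynchronous rule update" and "the request we wanted to whitelist", and the delays and the `accounts.example` domain are arbitrary assumptions.

```python
import asyncio


async def demo(rule_update_delay, request_delay):
    """Simulate an async whitelist update racing against the request it should cover."""
    allowed = set()  # domains whose cookies are temporarily allowed
    outcome = {}

    async def add_allow_rule(domain):
        await asyncio.sleep(rule_update_delay)  # stand-in for the asynchronous rule update
        allowed.add(domain)

    async def send_request(domain):
        await asyncio.sleep(request_delay)  # the page fires its request on its own schedule
        outcome[domain] = "cookie sent" if domain in allowed else "cookie blocked"

    await asyncio.gather(
        add_allow_rule("accounts.example"),
        send_request("accounts.example"),
    )
    return outcome["accounts.example"]


# The request is issued before the rule update completes: the login cookie is lost.
print(asyncio.run(demo(rule_update_delay=0.05, request_delay=0.01)))
# The rule update completes first: the mechanism works as intended.
print(asyncio.run(demo(rule_update_delay=0.01, request_delay=0.05)))
```

With a synchronous, blocking `webRequest` listener the decision is made while the request is being handled, so this ordering problem does not arise.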
The cookie blocking is done based on a dynamically generated list of tracker domains of between 2,000 and 3,000 entries. For these domains, third-party cookies should be blocked, unless a heuristic allows it. Again, the limited rule threshold of the `declarativeNetRequest` API means that this list would have to be reduced.
Another concern is that the [Rule condition specification](https://developer.chrome.com/extensions/declarativeNetRequest#type-Rule) can distinguish between `firstParty` and `thirdParty` contexts for a request, but this is done on a frame level, rather than relative to the page document. This means that we would not, for example, be able to block Google cookies inside a Google Ads iFrame, as in this context the API would consider requests from the frame as first party.
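The difference between the two views can be sketched as follows. This is a simplified illustration, not the browser's logic: the URLs are hypothetical, and `site_of` naively takes the last two hostname labels, whereas real code would need the Public Suffix List.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Naive 'site' of a URL: the last two labels of its hostname (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_third_party(request_url, context_url):
    """A request is third-party if its site differs from that of the given context."""
    return site_of(request_url) != site_of(context_url)


page = "https://www.news.example/article"          # top-level document
frame = "https://ads.tracker.example/frame.html"   # ad iframe embedded in the page
request = "https://static.tracker.example/pixel.gif"  # request made from inside the iframe

print("frame-level view, third party:", is_third_party(request, frame))
print("page-level view, third party:", is_third_party(request, page))
```

Judged against the frame, the request looks first-party and a `thirdParty` rule would not apply; judged against the page the user actually visited, it is clearly third-party.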
### 3. Removing private data points
The other component of Ghostery's 'Enhanced Anti-Tracking' feature is the dynamic removal of url parameters seen to be used for cross-site tracking. This uses a [k-anonymity](./how_cliqz_antitracking_protects_users.html) based algorithm, using anonymously contributed data from our users.
As the `declarativeNetRequest` API does not support dynamic redirects, this component cannot be implemented with it.
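To give an idea of why this needs dynamic request rewriting, here is a minimal sketch of k-anonymity based parameter removal. It is not Ghostery's implementation: the threshold `K`, the `seen_by_users` counts and the example URL are all hypothetical, and the real algorithm is described in the post linked above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

K = 10  # hypothetical threshold: a value seen by fewer than K users may identify someone


def strip_rare_values(url, seen_by_users, k=K):
    """Drop query parameters whose value has been reported by fewer than k users."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if seen_by_users.get(value, 0) >= k
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical counts of how many distinct users reported each value.
counts = {"en": 5400, "728x90": 1200, "a1b2c3d4e5f6": 1}
url = "https://tracker.example/pixel?lang=en&size=728x90&uid=a1b2c3d4e5f6"
print(strip_rare_values(url, counts))
```

Which parameters to remove depends on values observed at runtime, so the rewrite cannot be expressed as a fixed list of declarative rules.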
### 4. Adblocker
Ghostery includes an additional adblocker component which is able to further block ads based on standard blocklists. As this feature should also be toggleable at runtime, we would need to use _dynamic_ rules for these filters. With only 1,000 rules available after adding the Ghostery tracker matching, the coverage of this feature would be drastically reduced.
### 5. WhoTracks.Me Data
Ghostery is the primary source of data for this website, using our [anonymised telemetry system](https://arxiv.org/abs/1804.08959) to report on global tracker trends. This largely relies on the webRequest API in order to observe which trackers are on which page. Changes caused by the introduction of `declarativeNetRequest` will reduce the quality of this data. Namely, cookies blocked by the declarative API will not be visible to webRequest listeners. This means that we will not be able to distinguish between trackers setting cookies, which are then blocked, and those who do not set cookies.
### Summary
To summarise, the Manifest V3 changes to the webRequest API will require a significant re-write of the Ghostery extension to be able to fit the existing features into the constraints of the `declarativeNetRequest` API. The result will be:
- Slower: URL matching will have to be done twice in order to show tracker counts in the UI.
- Less configurable: Configuration may have to be limited to fit within the very low dynamic rule limit.
- Break sites more often: We will have to evaluate the trade-offs of relaxing the third-party cookie blocking vs. breaking sites.
- Less private: As the private data removal feature will have to be removed.
## Conclusion
In this post we have shown that the current proposed changes to the webRequest API by Chrome do not improve privacy, and in fact reduce it, by severely hindering the operation of privacy extensions like Ghostery. The limitations on dynamic rules in the new `declarativeNetRequest` API are particularly taxing for extensions which aim to give the user control over what is blocked and what is not.
This forces extensions into a 'dumb blocker' model, where block lists are fixed, and the only controls are an on/off toggle. At the same time, the changes increase the difficulty and reduce the practicality of implementing dynamic heuristic mechanisms for detecting and blocking tracking.
The webRequest API powers much innovation in browser extensions; however, it does implicitly provide access to private user data. While the Chrome team state that privacy is a reason for the proposed changes to this API, they have not stated which specific concerns they aim to address. The Manifest V3 changes do not prevent extensions accessing private user data via webRequest, nor have other potentially dangerous APIs like content scripts been limited. Therefore the claims that this change improves extension privacy are misleading and disingenuous.
The fact that very few of the initial concerns regarding Manifest V3 have been addressed in the months since the original announcement means that it currently looks like the changes will be forced through, despite community objections. This means that Chrome users will become second-class web citizens with regard to their access to tracking protection. This is, however, just a continuation of a trend where Chrome stands still or actively reduces privacy while the rest of the competition have been pushing forward. At this point we recommend considering switching away from Chrome, if you haven't done so already, to browsers with privacy built-in by default. For example, the [Cliqz Browser](https://cliqz.com/en/download) has Anti-tracking built in and enabled by default, and Firefox now ships with [tracking protection on by default](https://blog.mozilla.org/blog/2019/06/04/firefox-now-available-with-enhanced-tracking-protection-by-default/).
_Disclosure: WhoTracks.Me is a joint effort by Cliqz and Ghostery._
================================================
FILE: blog/private_analytics.md
================================================
title: Tracking visits without tracking people
subtitle: A privacy-by-design approach.
author: privacy team
type: article
publish: True
date: 2018-05-03
tags: analytics, privacy-by-design
header_img: blog/analytics/analytics.png
redirect_url: https://www.ghostery.com/blog/private-analytics
+++
Analytics are one of the most common use-cases on the web. You want to know how many people are
visiting your website, whether anyone actually clicked the link you posted on social media, or who
is sending traffic to your website. For most sites, the solution is to just drop a
[Google Analytics](../trackers/google_analytics.html) script into the page - it's free, after all...
This has led us to the current situation, where Google Analytics is present on 87%
of the top half a million websites, and, despite using reasonably short-lived identifiers, the way the data is collected can be used to
[track users across these sites](https://www.slideshare.net/jmpujol/data-collection-without-privacy-sideeffects-at-big2016-www-2016#13).
Is counting page visits such a difficult problem that only Google has solved it? No, there are
[paid](https://get.gaug.es/) and [open source](https://matomo.org/) alternatives available, but
why pay when you can use a free version which does more, and why host a server with the extra
costs that entails, when you don't have to?
But is Google Analytics actually better than the competition? We would argue that, at least among
privacy conscious users (i.e. those
[who contribute to the WhoTracks.Me dataset](../blog/where_is_the_data_from.html)), Google
Analytics will report vastly incorrect figures, for two main reasons:
1. Our data shows that on 29% of pages with Google Analytics some of the requests will be blocked
due to Ghostery blocking settings.
2. On 19% of pages with Google Analytics, Cliqz and Ghostery's AI anti-tracking will remove
potential identifiers from the request, often causing unique visitors and conversions to be
incorrectly measured.
## Analytics without tracking
So how can we _accurately_ measure the traffic coming to our site without exposing the user to
tracking and privacy side-effects? This was a problem we faced when we created the WhoTracks.Me
website. We wanted to have _some_ analytics so that we can measure if we are being successful in
engaging people with the information we are providing on the site. However, we had a few
constraints:
1. No tracking. We [define tracking](../blog/what_is_a_tracker.html) as when a service is able to
collect and correlate data across multiple sites. Unfortunately, as server-side aggregation is the
norm amongst third-party analytics providers, privacy cannot be guaranteed.
[Client side alternatives](http://josepmpujol.net/public/papers/big_green_tracker.pdf) have been
proposed, but unfortunately [the implementation](https://github.com/cliqz-oss/green-analytics) only
reached a proof-of-concept state. This means we have to roll our own service.
2. Minimal Ops. WhoTracks.Me is a statically generated site, which is simply hosted on a CDN. This
decision was made to minimise costs, make it fast, and eliminate the need to deploy and monitor
hosting infrastructure. Having done this, it does not make sense to have to deploy infrastructure
in order to host a [Matomo](https://matomo.org/) or similar service.
3. Respect Privacy. The system should not store any personal information from users (e.g. IP
addresses), nor be able to correlate visits for an individual user over a long time frame. Apart from
the obvious reasons for this, it makes regulatory compliance easy: If we do not hold IP addresses,
it is not possible for us to extract data on an individual user for data access or deletion
requests (as per GDPR).
Our analytics implementation satisfies these three constraints, using probably the oldest technique
on the Internet: server log parsing. Daily analytics for the WhoTracks.Me site are generated as
follows:
1. Visits to the site are logged via [CloudFront's logging mechanism](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html).
2. Each day, a script processes these logs, to obfuscate personal data such as IP addresses. This
script generates a random key for the day, and encrypts all IP addresses with this key. The
anonymised logs are copied to a new bucket, and the key is destroyed once the job completes. This
method allows us to count unique visits from an IP address during a single day, but no day-to-day
correlations can be made, nor can the IP address ever be recovered from the anonymised value.
3. The original CloudFront logs (with IP addresses) are removed.
4. We can then parse the clean logs and filter out requests to static resources and those by bots
in order to see requests to actual pages. We can count unique visitors within single days, using a
combination of user-agent and anonymised IP; we can see where incoming traffic is coming from via
HTTP referrers (which we also strip of potentially revealing parameters) and so on.
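The obfuscation and counting steps above could look roughly like the following. This is an illustrative sketch rather than the production script: it swaps the encryption described in step 2 for a keyed hash (HMAC-SHA256), which is likewise irreversible once the key is discarded, and the record field names (`ip`, `user_agent`, `path`, `referrer`) are hypothetical rather than CloudFront's actual log format.

```python
import hashlib
import hmac
import secrets
from urllib.parse import urlsplit, urlunsplit


def anonymise_day(records):
    """Replace IPs with a per-day keyed hash and strip parameters from referrers."""
    day_key = secrets.token_bytes(32)  # fresh random key for this batch, never stored
    cleaned = []
    for record in records:
        visitor = hmac.new(day_key, record["ip"].encode(), hashlib.sha256).hexdigest()
        ref = urlsplit(record.get("referrer") or "")
        cleaned.append({
            "visitor": visitor,
            "user_agent": record["user_agent"],
            "path": record["path"],
            # keep scheme, host and path only; drop query string and fragment
            "referrer": urlunsplit((ref.scheme, ref.netloc, ref.path, "", "")),
        })
    # day_key goes out of scope here, so tokens from different days cannot be linked
    return cleaned


def unique_visitors(cleaned):
    """Count distinct (anonymised IP, user-agent) pairs within one day's logs."""
    return len({(r["visitor"], r["user_agent"]) for r in cleaned})


# Hypothetical log records (documentation-range IP addresses).
records = [
    {"ip": "203.0.113.7", "user_agent": "UA-1", "path": "/",
     "referrer": "https://example.org/a?q=secret"},
    {"ip": "203.0.113.7", "user_agent": "UA-1", "path": "/trackers.html", "referrer": ""},
    {"ip": "198.51.100.9", "user_agent": "UA-1", "path": "/", "referrer": None},
]
day = anonymise_day(records)
print(unique_visitors(day), "unique visitors;", "referrer kept as", day[0]["referrer"])
```

Running `anonymise_day` twice on the same records yields different `visitor` tokens, which is what prevents day-to-day correlation.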

Processing of raw CloudFront logs to remove potential personal data.
This workflow allows us to keep track of how much traffic we are getting to the WhoTracks.Me website. There is also no reason that this method could not be scaled up to the more complex use-cases which services like Google Analytics provide, like conversion counting - provided the time frame in which these conversions can occur is shorter than the time the IP encryption key is used for.
The method is also safe with respect to privacy regulations and user preferences. As IPs are stored for a maximum of 1 day (and this is only because CloudFront's logging does not obfuscate IPs for us), no other personal information is collected, and message linkage is limited to 1 day, there are no additional obligations regarding the usage of this data under GDPR. Furthermore, as tracking is time limited and context limited (this data can only be used for usage on whotracks.me), it respects [Do Not Track](https://en.wikipedia.org/wiki/Do_Not_Track) automatically (using the standard's own [tracking definition](https://www.w3.org/TR/tracking-dnt/#terminology.activity)).
## Conclusion
We rolled our own analytics for this site because there was no off-the-shelf solution providing the (very basic) analytics we wanted without significant extra overhead, or potential privacy implications for users of the site. Our system leverages CloudFront logging with a data obfuscation step in order to collect privacy-safe server logs which can then be analysed for basic insights. This technique could be extended to provide most of the richer features of existing web analytics tools.
The lack of privacy-preserving tools in the web analytics ecosystem is a worrying trend. Google Analytics dominates as they provide an extremely feature-rich product at zero cost to the webmaster. It is difficult to see how a service can compete with free without selling analytics data. Existing competitors mostly aim for businesses who will pay for a premium product, and leave bloggers and smaller sites to Google.
While increasing use of adblockers is a more fundamental threat to Google's Ad business, a side effect may be a loss of trust in Google Analytics, as we measure [29%](https://github.com/ghostery/whotracks.me/blob/master/whotracksme/data/assets/2018-03/global/trackers.csv#L2) of pages with Google Analytics being affected by blocking. We already see companies which rely on analytics for core business activities (for example advertisers using affiliate schemes) deploying multiple analytics scripts and averaging the results. If the trust in analytics breaks down, then this whole ecosystem may unravel.
================================================
FILE: blog/static_site.md
================================================
title: Building whotracks.me
subtitle: Adding search, data, plots and blog to 1000+ pages of tracker profiles and top domains.
author: privacy team
type: article
publish: False
date: 2017-11-03
tags: tracker-free, lightweight
header_img: blog/blog-site.jpg
+++
At Cliqz and Ghostery, we [collect anonymous data about trackers](/blog/where_is_the_data_from.html) to power our [anti-tracking](blog/how_cliqz_antitracking_protects_users.html) technology. We see our anti-tracking as a community effort and as such we want to share a structured representation of this data to cast some light on the tracker landscape.
Out of the three main entities involved in a page load: **users**, **websites** and **trackers**, we have data only on the last two. We'll start with:
* Profiles of the [top 500 trackers](/trackers.html)
* Tracker data on the [top 500 domains](/websites.html).
With these out of the way, a blog space would be needed, for two reasons. The first is that we realised there was a need for a learning space where we explain concepts referred to in the site. We call these **primers**. These define what we call a [tracker](/blog/what_is_a_tracker.html), what [cookies](/blog/cookies.html) and [fingerprinting](/blog/fingerprinting.html) are, or [where this data comes from](/blog/where_is_the_data_from.html). Hopefully over time it will become a space for curious readers to be introduced to tracking technologies.
The second reason is to have a space where we'll be writing about particular trackers, technologies, papers, engineering, and other interesting topics.
## Going static
Through whotracks.me, we want to cast some light on the tracking landscape, but also make a point about trackers and **privacy by design**, hence the choice of this being a static site was pretty obvious. This meant that we could build the whole site offline, put it in a folder and serve it through a CDN.
Given this will be updated a few times a month, build performance was not really a big issue for us. But stumbling upon [a discussion](https://news.ycombinator.com/item?id=15507538) about site generators' performance, some comments read:
- *"with Hugo + Pygments was taking ~20s for ~20 pages at the time"*
- *"92 pages in 1s (full rebuild, No CSS magic tooling though)"*
- *"Rust: ~10k pages in ~60s"*
The assumption would be that most of this time is spent parsing the markdown files. To build this site however, with the exception of the blog, the rest of the pages are mainly about instantiating a template, plugging some content, and writing to disk. So most likely a comparison between site generators and this would be unfair.
At the time of writing, whotracks.me has roughly 1020 pages. On 1000 of these pages there are offline generated plots, quite some data and a fair amount of tooling with respect to styling. On a `Thinkpad x230` with an Intel `i3 processor`:
```bash
(venv) ➜ whotracks.me git:(master) ✗ time python build.py site
Home page ............................... done
Tracker list ............................ done
Website list ............................ done
Blog List ............................... done
Blog Posts .............................. done
Website pages ........................... done
Tracker Pages ........................... done
python build.py site 13.86s user 1.08s system 158% cpu 9.400 total
```
This will be a 5-part series dedicated to:
1. [Generating a static site (part 1)](/blog/static_site_generation.html)
2. [Visualization (part 2)](/blog/static_site_visualization.html)
3. [Building a blog (part 3)](/blog/static_site_blog.html)
4. Search, for some definition of search.
5. No third-party trackers, and fast.
The code and data to generate this site is open-sourced at [`https://github.com/ghostery/whotracks.me`](https://github.com/ghostery/whotracks.me).
Figure 2: Mode bar on top right corner of plotly plots.
`include_plotlyjs` is set to `False` to avoid `plotly.js` being loaded inline with the `div` output for every plot. Inlining it is not necessary, as the library is already linked in [`base.html`](https://github.com/ghostery/whotracks.me/blob/master/templates/base.html).

## Bar Chart

On the main page of this site, you will see this:
Figure 3: Horizontal bar chart on tracking reach of top 10 companies
The code to generate this can be found in [`plotting/companies`](https://github.com/ghostery/whotracks.me/blob/master/plotting/companies.py). Let's write a simpler function for a horizontal bar plot to get the idea:

```python
import plotly.graph_objs as go


def horizontal_bar_plot(x, y):
    '''
    x: values
    y: names
    '''
    c_purple = "#A069AB"
    c_gray = "#BCC4CE"
    trace = go.Bar(
        x=x,
        y=y,
        orientation='h',
        marker=dict(
            # highlight the first two bars, grey out the remaining eight
            color=[c_purple]*2 + [c_gray]*8
        ),
    )
    data = [trace]
    layout = go.Layout(
        dict(
            showlegend=False,
            xaxis=dict(
                # CliqzColors is the site's colour palette, defined elsewhere in the repository
                color=CliqzColors["gray_blue"]
            )
        )
    )
    fig = dict(data=data, layout=layout)
    # div_output (from plotting.utils) renders the figure to an HTML div
    return div_output(fig)
```

## Tracker Reach - trend Line

This chart, like many others, was inspired by Edward Tufte's sparkline [2], drawn without axes or coordinates.
Figure 4: Trend line of tracker reach.
```python
def sparkline(ts, t):
    """
    Sparkline for plotting line
    Args:
        ts: timeseries data
        t: x-axis (time)

    Returns: html output of an interactive timeseries plot
    """
    y = list(map(lambda x: x * 100, ts))  # scaling percentages
    # `line` is a trace helper from the repository's plotting code (not shown here)
    trace0 = line(
        x=t,
        y=y,
        color="#A069AB"  # purple
    )
    # a single marker on the most recent data point
    trace1 = line(
        x=[t[-1]],
        y=[y[-1]],
        color="#A069AB",
        mode='markers'
    )
    layout = go.Layout(
        dict(
            showlegend=False,
            height=100,
            width=153,
            hoverlabel=dict(
                bgcolor="#1A1A25",
                bordercolor="#00000000",  # transparent
                font=dict(
                    family=WTMFonts.mono,
                    size=13,
                    color="#BFCBD6"
                )
            ),
            xaxis=dict(
                autorange=True,
                showgrid=False,
                zeroline=False,
                showline=False,
                autotick=True,
                hoverformat="%b %y",
                ticks='',
                showticklabels=False
            ),
            yaxis=dict(
                # providing some padding for the sparkline
                range=[min(y)*0.90, max(y)*1.05 if max(y) != y[-1] else max(y)*1.15],
                showgrid=False,
                zeroline=False,
                showline=False,
                autotick=True,
                ticks='',
                showticklabels=False
            )
        )
    )
    data = [trace0, trace1]
    fig = dict(data=data, layout=layout)
    return div_output(fig)
```

The code used to plot the sparkline seen in tracker profiles is defined in [`plotting/trackers.py`](https://github.com/ghostery/whotracks.me/blob/master/plotting/trackers.py).

## Sankey Diagrams

Sankey diagrams are good at visualizing flow volume metrics. Sometimes they are found under the name alluvial diagrams, although the two were originally different types of flow diagram.

Figure 1: Sankey diagram used to represent a [tracker map](../websites/upornia.com.html)
On this site we use sankey diagrams in website profile pages like [bahn.de](/websites/www.bahn.de.html) to map companies and the trackers they operate to the category of the tracker. The thickness of the link is a function of the frequency of appearance of the tracker per page load in the given domain. So looking at the diagram above, we know that the dominant tracker category is advertising, and that Google operates the most trackers and has the highest frequency of appearance.

Our Sankey Diagram function in Python looks like this:

```python
from plotting.utils import div_output


def sankey_plot(input_data):
    data_trace = dict(
        type='sankey',
        domain=dict(
            x=[0, 1],
            y=[0, 1]
        ),
        hoverinfo="none",
        orientation="h",
        node=dict(
            pad=10,
            thickness=30,
            # "site_analytics" -> "Site analytics"
            label=list(map(lambda x: x.replace("_", " ").capitalize(), input_data['node']['label'])),
            color=input_data['node']['color']
        ),
        link=dict(
            source=input_data['link']['source'],
            target=input_data['link']['target'],
            value=input_data['link']['value'],
            label=input_data['link']['label'],
            # every link is drawn in the same light gray
            color=["#dedede" for _ in range(len(input_data['link']['source']))]
        )
    )
    layout = dict(
        autosize=True,
        font=dict(
            size=12
        )
    )
    fig = dict(data=[data_trace], layout=layout)
    return div_output(fig)
```

Having looked at a lot of examples of sankey plots, we noticed a recurrent pattern: they do a great job at explaining the plot aesthetics, but take the structure of the input data as given. This is a bit of a problem, because in most examples the input data is a huge JSON file, and figuring out the structure of such a file can become tedious. Here is how `input_data` is structured:

```python
input_data = {
    "node": {
        "label": [],
        "color": []
    },
    "link": {
        "source": [],
        "target": [],
        "value": [],
        "label": [],
        "color": []
    }
}
```

As you can see, `input_data` has two main parts, `node` and `link`:

**NODE**: `input_data["node"]` is responsible for building nodes. In our example these nodes are either categories of trackers or companies that operate them.
Each node has two attributes: `label` and `color`. These are both lists of strings. The lists must have equal length, because each label is mapped to a color by its index in the list.

**LINK**: `input_data["link"]` is responsible for linking two nodes together. Each link has the following attributes: `source`, `target`, `value`, `label` and `color`. This is where the index of `input_data["node"]["label"]` becomes very important, given the way sankey plots have been implemented in plotly. The `source` and `target` are lists of equal length, where the index is used to link.
Figure 5: Node label illustration
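
As a rough illustration of the index-based layout, here is a minimal sketch that assembles an `input_data` dictionary from plain `(source, target, value)` triples. The helper name `build_sankey_input`, the sample company and category names, the values and the default colours are assumptions made for this example; it is not the code in `utils/companies.py`.

```python
def build_sankey_input(links, node_color="#A069AB", link_color="#dedede"):
    """Build a sankey input dict from (source_name, target_name, value) triples.

    Hypothetical helper for illustration only.
    """
    labels = []   # node labels, in order of first appearance
    index = {}    # node name -> position in `labels`

    def node_id(name):
        # Register a node the first time it is seen and return its index.
        if name not in index:
            index[name] = len(labels)
            labels.append(name)
        return index[name]

    source, target, value, label = [], [], [], []
    for src, dst, val in links:
        source.append(node_id(src))   # index of the source node
        target.append(node_id(dst))   # index of the target node
        value.append(val)             # link thickness
        label.append(f"{src} -> {dst}")

    return {
        "node": {"label": labels, "color": [node_color] * len(labels)},
        "link": {
            "source": source,
            "target": target,
            "value": value,
            "label": label,
            "color": [link_color] * len(source),
        },
    }


# Made-up sample data: company -> tracker category, with an arbitrary weight.
example = build_sankey_input([
    ("Google", "advertising", 0.8),
    ("Google", "site_analytics", 0.6),
    ("Facebook", "advertising", 0.3),
])
print(example["node"]["label"])
print(example["link"]["source"], example["link"]["target"])
```

Because nodes are numbered in order of first appearance, every entry in `source` and `target` is guaranteed to point at a valid position in the label list, and all `link` lists end up with the same length.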
The elements in `source` and `target` are in fact the indexes of the source and target nodes in `input_data["node"]["label"]`. So if we were to refer to the illustration in the figure above, to render our sankey diagram we would have:

```python
source = [1, 1, 1, ... ]
target = [0, 2, len-2, ... ]
```

With that out of the way, the remaining attributes are intuitive: `value` represents how thick the link should be, `label` what name it has and `color` its color. All the `link` attributes are lists of equal length, and the matching is done based on index.

For details, have a look at the actual implementation of the `input_data` generation in [`utils/companies.py`](https://github.com/ghostery/whotracks.me/blob/master/utils/companies.py).

## References

[1] [Adding Words to the Brain's Visual Dictionary](http://www.jneurosci.org/content/35/12/4965.short)

Figure 1: Distribution of the number of trackers
There are several differences between our two studies which may explain the increase in tracker dominance seen in this study. Firstly, this study’s sample contains the 500 most popular websites in the US, while our previous study analyzed 144 million page loads across more than 12 countries. By only considering the most popular websites and neglecting the long tail of more obscure ones, it is not surprising that this study saw a larger proportion of sites with a tracker. Additionally, the data for this study was synthetically generated using a custom crawler, whereas our previous study used data gathered from users of the Ghostery browser extension who had opted in to the collection of information about trackers on pages they visit. While the methodologies differ, both studies verify tracker pervasiveness throughout the web.

### Trackers and Page Latency

Without blocking trackers, only 17% of all the pages in the study loaded within 5 seconds. All other pages loaded much more slowly: it took more than 10 seconds to load nearly 60% of the pages, more than 30 seconds for 18% of the pages, and nearly 5% of the pages took over a minute to load. This long tail cannot be ignored and suggests Internet users waste a lot of time every day simply waiting for websites to load.

Figure 2: Average time to load trackers
While we found that websites are generally slow to load, can any of this page latency be explained by the number of third-party trackers on that site? To answer this question, we calculated the average page load time for each tracker count. We excluded both tracker volumes with fewer than five observations and page latency outliers within each tracker count (identified using the interquartile range rule).

To quantify the relationship between the number of trackers on a website and the average time it took that page to load, we ran a simple linear regression (`adj-R2 0.802`) which suggested that each additional tracker adds, on average, 0.5 seconds to the overall page load. The next model we fitted, which included a quadratic term (`adj-R2 0.836`), suggests that trackers have an increasing impact on page load times. However, these linear models both exhibit heteroscedasticity (uneven variance of the error terms) and thus violate linear regression assumptions.

Figure 3: Log Latency as a function of the number of trackers
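
The filtering and the simple linear fit described above can be sketched roughly as follows. This is an illustration only: the latencies are made up (chosen so that the fitted slope mirrors the reported 0.5 seconds per tracker), the function names are hypothetical, and the choice of the inclusive quartile method and the usual 1.5 × IQR fences is an assumption, since the study does not state which variant it used.

```python
from statistics import mean, quantiles


def drop_iqr_outliers(values):
    """Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up page latencies in seconds, keyed by tracker count.
latencies = {
    0: [1.5, 1.8, 2.0, 2.2, 2.5, 30.0],    # 30.0 is an obvious outlier
    10: [6.5, 6.8, 7.0, 7.2, 7.5],
    20: [11.5, 11.8, 12.0, 12.2, 12.5],
    30: [17.0, 17.1, 16.9],                # fewer than five observations
}

MIN_OBSERVATIONS = 5
points = {
    count: mean(drop_iqr_outliers(values))
    for count, values in latencies.items()
    if len(values) >= MIN_OBSERVATIONS
}

xs = sorted(points)
ys = [points[count] for count in xs]
intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}s, slope={slope:.2f}s per tracker")
```

A real analysis would also inspect the residuals, which is how the heteroscedasticity mentioned above shows up.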
A Box-Cox test showed that log-transforming the response variable would realize the best fitting model, and also act as a variance-stabilizing transformation. The log-linear model (`adj-R2 0.885`) on the transformed data indicates a compounding effect: if the tracker count increases by 1, we expect the page load time to increase by 2.5%.

### Protection from Trackers

We also assessed the difference in page latency when trackers are blocked rather than allowed. The data showed that the average page load time was twice as long when trackers are not blocked: the mean page latency with no trackers blocked and with all trackers blocked was 19.3 seconds and 8.6 seconds, respectively. These time savings from blocking trackers are even more drastic when only considering the 10 slowest domains in the sample. We saw that average load times were 10x faster, and blocking trackers saved an average of 84 seconds per page load.

Figure 4: Latencies for certain domains
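
To put the figures above into numbers, here is a small worked example. It assumes the 2.5% estimate can be read as a multiplicative factor of 1.025 per additional tracker; the function name is hypothetical and the tracker counts are arbitrary.

```python
GROWTH_PER_TRACKER = 0.025  # 2.5% per additional tracker (log-linear estimate)


def relative_load_time(extra_trackers, growth=GROWTH_PER_TRACKER):
    """Factor by which load time grows after adding `extra_trackers` trackers."""
    return (1 + growth) ** extra_trackers


for n in (1, 10, 20):
    print(f"{n:>2} extra trackers -> load time x{relative_load_time(n):.2f}")

# Mean page latency reported in the study, in seconds.
unblocked, blocked = 19.3, 8.6
ratio = unblocked / blocked
saved = unblocked - blocked
print(f"unblocked/blocked ratio: {ratio:.2f}, saved per page load: {saved:.1f}s")
```

Under this reading, ten additional trackers correspond to a page that takes roughly 28% longer to load, rather than the 25% a purely additive interpretation would suggest.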
The term “piggybacking” describes the practice of one tracker that is placed directly on a website giving access to other “piggybacking” trackers that are not originally on the site. We observed this phenomenon in our data: page load times were not the only metric significantly reduced when trackers were blocked; there were also fewer trackers detected on the page. We saw significantly more trackers per page when trackers were unblocked compared to blocked. In fact, among the domains with the highest average volume of trackers, there were on average 93 fewer trackers present per page load when tracker blocking was enabled.

Piggybacking can create a snowball effect, where trackers bring in more trackers that can then bring in even more trackers; and as suggested above, each additional tracker slows down a website more than previous ones. This not only has notable performance implications, but also raises profound privacy concerns: since these trackers are not directly on the site, site owners may not be aware such intrusion is occurring.

## Future Implications

The data in our study clearly showed the pervasiveness of online tracking, as nearly 90% of the most popular sites in the US had at least one third-party tracker present. Our study also confirmed the strong, positive link between the number of trackers on a page and the time it takes that page to load. Generally, the more trackers on a site, the longer the user will have to wait for that site to load. Quantifying this relationship depends on the model used; however, the optimal model we found shows a compounding effect: for every extra tracker on the page, the time it takes for the page to load increases by 2.5%.

While our current study focuses on only the most popular domains in the United States, it would be valuable to apply this framework to other regions to see if similar trends persist elsewhere. Additionally, future work may include measuring additional performance implications of trackers including data transferred.
This data transfer, which occurs when trackers make requests to other servers, bears real monetary costs to the user, particularly on a mobile device where data plans are typically based on data used. Expanding this study to assess data transfer on mobile could be translated to the out-of-pocket expense suffered by the user, in addition to the more subjective dollar value of the user’s wasted time waiting for pages to load.

Other future work may also include looking at the relationship between bounce rates and page load speeds, to calculate a hypothetical tracker value measure. Given the additional time trackers add to page loads, and research suggesting that slower pages lead to a loss in site traffic, a tracker should provide the same value as the site traffic it costs. As bounce rates are likely influenced by other factors besides page load speed, like funnel page and domain category, this potential future research involves several additional considerations.

Moreover, the tracker tax may have even more pronounced implications in the United States following the recent repeal of net neutrality. In a time without such net neutrality regulations, users and their browsing speeds may be squeezed from both sides: by the ISP and by the online tracking ecosystem. We may then start to see more of a two-pronged tracker tax: the direct monetary impact imposed by the ISP, and the more subjective dollar value to the user of longer load times, and therefore more unproductive time, imposed by trackers.

In the wake of the net neutrality repeal, now more than ever users must consider the performance implications of browsing online without protection from trackers. The added waiting times incurred by not blocking trackers are not trivial, especially as the population is spending increasingly more time online. Luckily, various tracker blocking tools are available so users can not only protect their privacy, but also speed up their browsing experience by avoiding the tracker tax.
## References

[^1]: [Using Passive Measurements to Demystify Online Trackers](https://www.telematica.polito.it/users/mellia/papers/metwalleyComsi.pdf)
[^2]: [WhoTracks.Me: Monitoring the online tracking landscape at scale](https://arxiv.org/abs/1804.08959)
[^3]: [Tracking The Trackers](https://pdfs.semanticscholar.org/2bfb/b6b8da453f91f5860ea936588fddef6c80e0.pdf)
[^4]: [Window.performance](https://developer.mozilla.org/en-US/docs/Web/API/Window/performance) API
[^5]: [alexa.com](https://alexa.com)
[^6]: Ghostery Study: [Tracking the Trackers](https://www.ghostery.com/wp-content/themes/ghostery/images/campaigns/tracker-study/Ghostery_Study_-_Tracking_the_Trackers.pdf)

================================================
FILE: blog/tracker_categories.md
================================================
title: Tracker Categories
subtitle: Definitions for different types of trackers
author: privacy team
type: primer
publish: True
date: 2017-07-22
tags: primer, categories
header_img: blog/blog-tracker-categories.jpg
+++

Trackers differ both in the technologies they use and in the purpose they serve. Based on the service they provide to the site owner, we have categorized trackers as follows:

Advertising
: Provides advertising or advertising-related services such as data collection, behavioral analysis or re-targeting.

Comments
: Enables comments sections for articles and product reviews

Customer Interaction
: Includes chat, email messaging, customer support, and other interaction tools

Essential
: Includes tag managers, privacy notices, and technologies that are critical to the functionality of a website

Pornvertising
: Delivers advertisements that generally appear on sites with adult content

Site Analytics
: Collects and analyzes data related to site usage and performance.
Social Media
: Integrates features related to social media sites

Audio Video Player
: Enables websites to publish, distribute, and optimize video and audio content

CDN (Content Delivery Network)
: Content delivery network that delivers resources for different site utilities and usually for many different customers.

Misc (Miscellaneous)
: This tracker does not fit in other categories.

Hosting
: This is a service used by the content provider or site owner

Unknown
: This tracker has either not been labelled yet, or we do not have enough information to label it.

================================================
FILE: blog/trackers-who-steal.md
================================================
title: The Trackers Who Steal
subtitle: How WhoTracks.Me caught the trail of the MageCart hackers
author: privacy team
type: article
publish: True
date: 2018-11-23
tags: tracking, hacking
header_img: blog/blog-cc-stealing.png
+++

We're all aware of the trackers siphoning off information about you as you browse the web. These trackers are mostly doing this for some business-intelligence-related reason: websites use these services to try to 'better understand' their customers, or to target them in order to attract their attention in a way which will benefit the website owner, be it increasing the value of products customers put into their shopping cart, or increasing the likelihood that they click an ad.

However, there is another kind of tracker which is more nefarious than these. These are hidden scripts placed by hackers on E-commerce sites which try to steal your credit-card details as you enter them.
In the last year a string of attacks, dubbed 'Magecart', has affected major sites, including [British Airways](https://www.riskiq.com/blog/labs/magecart-british-airways-breach/), [Ticketmaster](https://www.riskiq.com/blog/labs/magecart-ticketmaster-breach/), [NewEgg](https://www.riskiq.com/blog/labs/magecart-newegg/) and [VisionDirect](https://twitter.com/troyhunt/status/1064069833967337472), stealing payment information from millions of consumers.

At WhoTracks.Me we are monitoring the third-parties loaded on millions of pages per day. Therefore, once we know the domains that these hackers are using to send their stolen data, we can analyse the extent and impact of these operations. In this article we provide a post-analysis of the four big breaches this year, plus some insights our data gives into ongoing attacks.

## Four high-profile breaches

### British Airways

In September 2018, British Airways announced that a security breach had led to a large theft of customer data. [RiskIQ's](https://www.riskiq.com/blog/labs/magecart-british-airways-breach/) write-up of the breach explains how the attackers compromised a script on the payment page, such that it would send credit card information to a domain owned by the hackers: `baways.com`.

With this information we can query our data to look for page loads where `baways.com` was a third-party. This allows us to verify the extent of the breach, and how many users were affected. Our data shows that:

- `www.britishairways.com` was affected between August 22nd and September 5th. 193 pages in our data were affected[^1].
- We also see two page-loads on `hotline.ba.com` on August 30th where data was sent to the attackers.

This data corroborates the [statement by BA](https://www.britishairways.com/en-gb/information/incident/data-theft/latest-information) on the breach, that users entering card details "between 22:58 BST August 21 2018 until 21:45 BST September 5 2018" would have been affected.
### Ticketmaster

In June 2018, Ticketmaster declared a hack of customer information. Again, [RiskIQ's](https://www.riskiq.com/blog/labs/magecart-ticketmaster-breach/) analysis tells us how it happened, this time involving a breach of the third-party supplier Inbenta. Compromised Inbenta scripts were then loaded on Ticketmaster payment pages, and these scripts then skimmed credit card data input by customers and sent it to `webfotce.me`. Unlike the British Airways case, this was not a targeted attack on Ticketmaster, but rather a generic hacking program which affected many other sites.

We can assess the extent of these hacks by looking for the `webfotce.me` domain in our data:

- Ticketmaster's UK, Irish and New Zealand sites were first affected on February 10th. The malicious script remained in place until June 23rd, and we saw over 2,500 page loads making requests to the hackers during this time.
- Their German and Australian sites appear to have been affected earlier, with our first observations on December 10th 2017 for ticketmaster.de and December 20th for ticketmaster.com.au. These sites were fixed at the same time as the others.

In their [disclosure](https://security.ticketmaster.co.uk/), Ticketmaster say that UK customers were affected between February and June 23rd, and international customers could have been affected from September 2017. Again this matches up with our data, though we have no observations for international sites before December 2017.

Our data also shows several other sites affected by this attack:

| Site | Affected from | Affected to | Pages |
|---|---|---|---|
| otel.com | 10/12/2017 | 21/06/2018 | 125 |
| www.cheaperthandirt.com | 19/01/2018 | 03/06/2018 | 42 |
| www.printninja.com | 16/02/2018 | 20/11/2018 | 45 |
| www.vitacost.com | 26/02/2018 | 04/06/2018 | 35 |
| thehungryjpeg.com | 12/03/2018 | 12/06/2018 | 19 |
| www.klook.com | 14/03/2018 | 12/06/2018 | 28 |
| www.steinmart.com | 15/03/2018 | 09/07/2018 | 12 |
| www.marveloptics.com | 28/03/2018 | 22/09/2018 | 15 |
Table 1: Sites affected by webfotce.me attack.
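
A table like the one above could be derived by filtering page loads for a known collection domain and aggregating per site. The sketch below assumes a simplified record layout of `(date, first-party site, set of third-party domains)`; the record format, the function name `summarise_breach` and the sample sites are hypothetical and only illustrate the idea, not the actual WhoTracks.Me pipeline.

```python
from collections import defaultdict
from datetime import date


def summarise_breach(page_loads, collection_domain):
    """Per site: first and last day the domain was seen, and affected page loads."""
    summary = defaultdict(lambda: {"from": None, "to": None, "pages": 0})
    for day, site, third_parties in page_loads:
        if collection_domain not in third_parties:
            continue  # this page load did not contact the collection server
        row = summary[site]
        row["from"] = day if row["from"] is None else min(row["from"], day)
        row["to"] = day if row["to"] is None else max(row["to"], day)
        row["pages"] += 1
    return dict(summary)


# Made-up page loads for illustration.
page_loads = [
    (date(2018, 3, 1), "shop.example", {"webfotce.me", "cdn.example"}),
    (date(2018, 3, 9), "shop.example", {"webfotce.me"}),
    (date(2018, 3, 5), "shop.example", {"cdn.example"}),
    (date(2018, 4, 2), "other.example", {"webfotce.me"}),
]

result = summarise_breach(page_loads, "webfotce.me")
for site, row in sorted(result.items()):
    print(site, row["from"], row["to"], row["pages"])
```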
Compared to Ticketmaster, the impact of the breach on these sites is much smaller. Correlations between the dates of infection indicate that these sites were probably infected via a shared third-party (i.e. Inbenta) which was compromised. This shows how hackers can quickly achieve much greater scale by going for third-party services whose scripts will be loaded on many different sites.

### NewEgg

Like British Airways, NewEgg were hit by a [targeted attack](https://www.riskiq.com/blog/labs/magecart-newegg/). In this case the collection server was specific to the target. The hackers registered `neweggstats.com` in order to have a legitimate-looking domain so that they could avoid suspicion for as long as possible.

Looking at our data, we see that pages on `secure.newegg.com` were sending requests to `neweggstats.com` for just over a month, between 15th August and 18th September, with 90 pages affected in our dataset[^1].

### VisionDirect

On November 19th, VisionDirect, a large UK-based glasses retailer, [announced](https://www.visiondirect.co.uk/customer-data-theft) that their sites had been compromised between November 3rd and 8th. In this case, a script pretending to be a Google Analytics script was injected into the page from `g-analytics.com`. The difference is that it also sends credit-card numbers back when it sees them in the page.

Our analysis shows that VisionDirect's European sites (`.fr`, `.it`, `.es`, `.co.uk`, `.eu` and `.nl`) were all affected from November 3rd. On the `.nl` and `.ie` sites we still observed pages contacting the attacker's server on the 9th of November, suggesting that the malicious code may not have been completely removed as early as the press release suggests.

Compared to the other collection servers, `g-analytics.com` is currently much more active, with 36 sites infected during November.
We have, however, observed a shift in traffic since November 20th, with almost all sites which were previously infected with `g-analytics.com` switching to loading a script from `google-analytics.is` instead. This indicates that the attackers have ongoing access to these sites, allowing them to update their attack code.

| Collection Server | Sites Infected |
|---|---|
| g-analytics.com | 36 |
| googletagmanager.eu | 29 |
| magento.name | 19 |
| google-analytics.is | 15 |
| trafficanalyzer.biz | 5 |
| web-stats.cc | 5 |
| bandagesplus.com | 5 |
| nearart.com | 4 |
Table 2: Collection servers still active in November 2018
A full list of sites affected during November is available at the end of this post.

## Breach detection

While WhoTracks.Me was originally conceived as a transparency tool to show trackers directly or indirectly placed by site owners, this investigation has opened up another angle on this data. We can now effectively track the spread of malicious code being used to defraud web consumers. This capability can be taken in several directions:

1. Once the collection servers (or drop servers) are known, we can quickly find and notify websites that are compromised. This can reduce the exposure time of websites, and thus reduce the risk to the average web user. (Thanks to the work RiskIQ have done here to collate a list of active drop servers).
2. We can audit breaches that have occurred, and make sure websites properly notify their users. The GDPR requires that companies notify users and authorities when user data is compromised. This data can be used to hold companies accountable if they try to dodge these responsibilities.
3. Given the set of collection servers we already know, we can develop algorithms to automatically detect third-parties in pages which are similar. This would then allow us to detect and block these servers even earlier. We are very excited to start exploring this direction for our data[^2].

## Third-party scripts: A security liability

In all of these hacking cases, the entry point has been a malicious script which is loaded in the main document of the page. When this happens on a payment page, the attacker can read all of the information entered: credit card number, CVV, etc. Therefore, any script loaded on a payment page is potentially a critical security weakness. With this in mind, we should be critical of the current careless way that scripts are scattered onto what should be secure webpages.
In the case of the four big breaches we have outlined here, now-standard browser security features could have prevented or limited the amount of data stolen:

- [Subresource Integrity](https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity) on script tags would prevent surreptitious changes to first- and third-party scripts (provided the site's own webserver is not also compromised). If a script were changed to add the attacker's code, the browser would refuse to load it. In the British Airways case one JavaScript file was edited to add the attack payload; for Ticketmaster the attack payload came via a third-party script. This technique provides some protection when loading content from less-trusted origins in pages which require high security.
- The [Content Security Policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy) (CSP) header can be used to prevent requests to unknown origins. In all of these cases the credit card information was sent to a third-party collection server. A CSP header would have prevented this request, thus preventing the malicious script from exfiltrating data.

As well as these high-profile cases, many of the other sites affected by these attacks are smaller E-Commerce sites using off-the-shelf software to run their business. It is therefore difficult for these sites to deploy these more advanced protection methods, even more so because the loading of 20 or 30 different untrusted third-parties on a webpage has become normalised, so users or even developers would not be able to detect unexpected third-parties appearing on a page.

Related to this is another tactic employed by these hacking groups: registering domains very similar to common third-party trackers so that developers do not notice that the site is compromised.
Some examples:

- `g-analytics.com`, pretending to be [Google Analytics](../trackers/google_analytics.html);
- `googletagmanager.eu` -> [Google Tag Manager](../trackers/google_tag_manager.html);
- `slripe.com` -> [Stripe](../trackers/stripe.com.html);
- `typeklt.com` -> [Adobe Typekit](../trackers/typekit_by_adobe.html);
- `crtteo.com` -> [Criteo](../trackers/criteo.html);
- `jsdellvr.com` -> [JSDelivr](../trackers/jsdelivr.html).

## Protecting users

As we can assume that sites will continue to get hacked, we need a way to protect users from having their data stolen without relying on site owners. This is where the browser comes in: as the user-agent it should be able to protect the user from attacks like this, much like it already does with phishing and malware sites.

Luckily, as all of these attacks rely on collection servers to receive the stolen data, once we know of a server address we can use blocklists to prevent the browser from contacting these servers. Therefore, even when sites are compromised with malicious JavaScript, this code will not be able to contact the hacker's server. For Cliqz and Ghostery users we have already distributed a blocklist to block these domains and protect them from credit-card theft.

Blocking is just a reactive measure though. Domains are cheap, and sites are getting hacked all the time, so these hackers could easily turn over their domains faster to mitigate our blocking. Therefore, a more robust solution has to incorporate fast detection of these drop servers in order to minimise the effective lifetime of each attack. We hope to incorporate the WhoTracks.Me data in the hunt for these domains, and to emulate the speed with which we are already [able to detect phishing sites](https://cliqz.com/en/whycliqz/anti-phishing).
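
One simple way to hunt for lookalike domains such as `slripe.com` is to flag third-party names that are within a small edit distance of a well-known tracker name without matching it exactly. The following is a minimal sketch of that idea, not the detection method used by WhoTracks.Me; the list of known names, the threshold and the function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


# Illustrative list of well-known third-party names (assumption, not a real blocklist).
KNOWN_NAMES = ["stripe", "typekit", "criteo", "jsdelivr"]


def lookalike_of(domain, known_names=KNOWN_NAMES, max_distance=1):
    """Return the known name that `domain` imitates, or None.

    Simplification: only the first label of the domain is compared, so
    subdomains such as "www." would need to be stripped beforehand.
    """
    name = domain.split(".")[0]
    for known in known_names:
        distance = edit_distance(name, known)
        if 0 < distance <= max_distance:
            return known
    return None


for candidate in ["slripe.com", "typeklt.com", "crtteo.com", "jsdellvr.com", "stripe.com"]:
    print(candidate, "->", lookalike_of(candidate))
```

An exact match is deliberately not flagged, since it is the legitimate service. Imitations that add or change whole words, like `g-analytics.com`, are further away in edit distance and would need a different similarity measure.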
## Conclusion

In this post we've shown a new angle on the data we publish on WhoTracks.Me. As well as providing transparency on which companies are tracking you online, we are also able to turn this transparency on web criminals who are stealing from web users. This transparency can be used to:

* ensure that breaches are reported promptly when they occur;
* assess the impact of breaches, in terms of the timespan when sites were affected and how many users may have been affected; and
* develop new techniques to catch these operations faster and reduce the number of users who suffer from them.

[^1]: By "Pages Affected" we mean the number of page loads where we saw some third-party call to a server associated with MageCart operations.
[^2]: Reach out to privacy@cliqz.com if you have suggestions, or would simply like to get in touch.

### Appendix: List of Magecart affected sites during November 2018

| Collection Server | Site | Infected from | Infected to | Number of pages |
|---|---|---|---|---|
| google-analytics.is | www.groworganic.com | 2018-11-22 | 2018-11-28 | 64 |
| googletagmanager.eu | www.wdrshop.de | 2018-11-03 | 2018-11-28 | 74 |
| google-analytics.is | www.directmaterial.com | 2018-11-28 | 2018-11-28 | 1 |
| google-analytics.is | www.harriscomm.com | 2018-11-23 | 2018-11-28 | 9 |
| google-analytics.is | www.prospin.com.br | 2018-11-24 | 2018-11-28 | 5 |
| google-analytics.is | www.electroactiva.com | 2018-11-28 | 2018-11-28 | 1 |
| magento.name | www.gamesquest.co.uk | 2018-11-13 | 2018-11-28 | 21 |
| google-analytics.is | www.drakegeneralstore.ca | 2018-11-23 | 2018-11-28 | 23 |
| google-analytics.is | shop.tokidoki.it | 2018-11-26 | 2018-11-28 | 3 |
| magento.name | store.curiousinventor.com | 2018-11-01 | 2018-11-28 | 13 |
| webfotce.me | www.printninja.com | 2018-11-05 | 2018-11-28 | 7 |
| vuserjs.com | www.medelita.com | 2018-11-01 | 2018-11-27 | 63 |
| googletagmanager.eu | www.aneros.com | 2018-11-01 | 2018-11-27 | 49 |
| magento.name | www.arrazofashion.com.br | 2018-11-27 | 2018-11-27 | 1 |
| fastproxycdn.com | lessthan10pounds.com | 2018-11-09 | 2018-11-27 | 19 |
| magento.name | chebdveri.ru | 2018-11-02 | 2018-11-27 | 2 |
| googletagmanager.eu | www.onegreekstore.com | 2018-11-02 | 2018-11-27 | 8 |
| vmaxjs.com | www.artistsnetwork.com | 2018-11-01 | 2018-11-27 | 105 |
| googletagmanager.eu | slf24.pl | 2018-11-08 | 2018-11-26 | 26 |
| google-analytics.is | www.pvcfittingsonline.com | 2018-11-22 | 2018-11-26 | 42 |
| googletagmanager.eu | www.bestkiteboarding.com | 2018-11-09 | 2018-11-26 | 24 |
| qsxjs.com | vapenw.com | 2018-11-02 | 2018-11-26 | 86 |
| magento.name | www.compremake.com.br | 2018-11-20 | 2018-11-26 | 3 |
| valdatecode.com | www.carnivalbkk.com | 2018-11-03 | 2018-11-26 | 52 |
| googletagmanager.eu | www.mobileparadise.de | 2018-11-21 | 2018-11-26 | 4 |
| googletagmanager.eu | www.cht-cottbus.de | 2018-11-03 | 2018-11-26 | 79 |
| privatejs.com | www.bydubai.com | 2018-11-01 | 2018-11-26 | 55 |
| google-analytics.is | www.scojo.com | 2018-11-24 | 2018-11-26 | 10 |
| googletagmanager.eu | www.nordhandel.de | 2018-11-01 | 2018-11-25 | 106 |
| alfcdn.com | www.softstarshoes.com | 2018-11-25 | 2018-11-25 | 2 |
| magento.name | www.prestigioplaza.com | 2018-11-25 | 2018-11-25 | 3 |
| googletagmanager.eu | www.wslstore.com | 2018-11-18 | 2018-11-25 | 2 |
| magento.name | www.herve-leger-shop.com | 2018-11-25 | 2018-11-25 | 1 |
| googletagmanager.eu | amsducati.com | 2018-11-05 | 2018-11-25 | 6 |
| google-analytics.is | www.ozarksource.com | 2018-11-24 | 2018-11-24 | 1 |
| g-analytics.com | geissele.com | 2018-11-10 | 2018-11-24 | 85 |
| crtteo.com | www.accessorygeeks.com | 2018-11-01 | 2018-11-23 | 31 |
| google-analytics.is | drdennisgross.com | 2018-11-22 | 2018-11-23 | 4 |
| googletagmanager.eu | www.everbestshoes.com | 2018-11-23 | 2018-11-23 | 1 |
| googletagmanager.eu | unitedsalonsupplies.com | 2018-11-23 | 2018-11-23 | 2 |
| trafficanalyzer.biz | www.oaknyc.com | 2018-11-19 | 2018-11-23 | 2 |
| googletagmanager.eu | dampoteket.no | 2018-11-07 | 2018-11-23 | 10 |
| magento.name | www.ikonmotorsports.com | 2018-11-05 | 2018-11-23 | 8 |
| google-analytics.is | www.dreamduffel.com | 2018-11-23 | 2018-11-23 | 4 |
| nearart.com | www.westcottbrand.com | 2018-11-04 | 2018-11-23 | 13 |
| magento.name | oramaoptics.gr | 2018-11-21 | 2018-11-23 | 3 |
| google-analytics.is | www.cruyffclassics.com | 2018-11-23 | 2018-11-23 | 3 |
| magento.name | www.weldingsuppliesdirect.co.uk | 2018-11-05 | 2018-11-22 | 9 |
| googletagmanager.eu | hk.ap-nutrition.com | 2018-11-13 | 2018-11-22 | 3 |
| nearart.com | www.camillusknives.com | 2018-11-03 | 2018-11-22 | 27 |
| google-analytics.is | www.softballfans.com | 2018-11-22 | 2018-11-22 | 3 |
| magento.name | www.ammerer.com | 2018-11-02 | 2018-11-22 | 4 |
| g-analytics.com | www.candent.ca | 2018-11-22 | 2018-11-22 | 1 |
| googletagmanager.eu | www.autosiliconehoses.com | 2018-11-03 | 2018-11-21 | 29 |
| google-analytics.is | temptu.com | 2018-11-21 | 2018-11-21 | 4 |
| googletagmanager.eu | www.lampen-line.de | 2018-11-04 | 2018-11-21 | 16 |
| googletagmanager.eu | www.airagestore.com | 2018-11-05 | 2018-11-21 | 6 |
| g-analytics.com | drdennisgross.com | 2018-11-11 | 2018-11-20 | 8 |
| g-analytics.com | pvcpipesupplies.com | 2018-11-12 | 2018-11-20 | 5 |
| g-analytics.com | www.cruyffclassics.com | 2018-11-08 | 2018-11-20 | 13 |
| g-analytics.com | www.pvcfittingsonline.com | 2018-11-08 | 2018-11-20 | 77 |
| g-analytics.com | www.ahmadtea.com | 2018-11-09 | 2018-11-20 | 10 |
| g-analytics.com | www.groworganic.com | 2018-11-04 | 2018-11-20 | 87 |
| web-stats.cc | www.kingfishertapes.co.uk | 2018-11-20 | 2018-11-20 | 3 |
| g-analytics.com | www.fabglassandmirror.com | 2018-11-10 | 2018-11-20 | 9 |
| statsdot.eu | www.punkstuff.com | 2018-11-20 | 2018-11-20 | 14 |
| onefromeu.com | www.joyfolie.com | 2018-11-03 | 2018-11-20 | 16 |
| listrakb.com | www.skistart.com | 2018-11-02 | 2018-11-19 | 4 |
| g-analytics.com | www.energymuse.com | 2018-11-06 | 2018-11-19 | 72 |
| googletagmanager.eu | www.casinhabonita.com.br | 2018-11-06 | 2018-11-19 | 20 |
| g-analytics.com | www.frightprops.com | 2018-11-15 | 2018-11-19 | 3 |
| statsdot.eu | storeinfinity.com | 2018-11-07 | 2018-11-19 | 10 |
| g-analytics.com | www.especialneeds.com | 2018-11-12 | 2018-11-19 | 21 |
| g-analytics.com | www.stmgoods.com.au | 2018-11-09 | 2018-11-18 | 7 |
| onefromeu.com | www.poshshop.com | 2018-11-13 | 2018-11-18 | 39 |
| googletagmanager.eu | deanzelinsky.com | 2018-11-07 | 2018-11-18 | 11 |
| googletagmanager.eu | nativetreasuresnm.com | 2018-11-10 | 2018-11-18 | 8 |
| g-analytics.com | vapage.com | 2018-11-13 | 2018-11-18 | 23 |
| magento.name | www.hydraulicsonline.co.uk | 2018-11-02 | 2018-11-18 | 2 |
| nearart.com | mitchellssalon.com | 2018-11-18 | 2018-11-18 | 1 |
| g-analytics.com | altheatsupply.com | 2018-11-14 | 2018-11-18 | 5 |
| scriptsfyou.com | adamspolishes.com | 2018-11-01 | 2018-11-17 | 55 |
| googletagmanager.eu | www.recifeingressos.com | 2018-11-16 | 2018-11-17 | 3 |
| g-analytics.com | www.stmgoods.com | 2018-11-10 | 2018-11-16 | 12 |
| g-analytics.com | temptu.com | 2018-11-06 | 2018-11-16 | 7 |
| g-analytics.com | www.drakegeneralstore.ca | 2018-11-16 | 2018-11-16 | 1 |
| g-analytics.com | shop.tokidoki.it | 2018-11-15 | 2018-11-15 | 3 |
| g-analytics.com | medmartonline.com | 2018-11-13 | 2018-11-15 | 4 |
| g-analytics.com | intl.drdennisgross.com | 2018-11-15 | 2018-11-15 | 2 |
| googletagmanager.eu | ikiegeszitok.hu | 2018-11-08 | 2018-11-15 | 11 |
| g-analytics.com | www.weareverincontinence.com | 2018-11-12 | 2018-11-14 | 3 |
| cdnscriptx.com | www.cartouchesarabais.com | 2018-11-11 | 2018-11-14 | 14 |
| g-analytics.com | cig2o.com | 2018-11-14 | 2018-11-14 | 1 |
| fastproxycdn.com | tilebar.com | 2018-11-03 | 2018-11-14 | 120 |
| g-analytics.com | www.curediva.com | 2018-11-07 | 2018-11-13 | 6 |
| typeklt.com | www.mariatash.com | 2018-11-02 | 2018-11-13 | 49 |
| g-analytics.com | www.lucerooliveoil.com | 2018-11-13 | 2018-11-13 | 5 |
| g-analytics.com | www.plumbingsupplynow.com | 2018-11-13 | 2018-11-13 | 1 |
| magento.name | www.grafipronto.pt | 2018-11-12 | 2018-11-12 | 1 |
| checkercarts.com | www.shambhala.com | 2018-11-01 | 2018-11-12 | 19 |
| scriptsenvoir.com | www.heatpressnation.com | 2018-11-01 | 2018-11-12 | 48 |
| typeklt.com | www.cabletiesunlimited.com | 2018-11-09 | 2018-11-12 | 6 |
| web-stats.cc | www.costway.de | 2018-11-07 | 2018-11-10 | 2 |
| g-analytics.com | www.visiondirect.ie | 2018-11-05 | 2018-11-09 | 4 |
| web-stats.cc | www.rincondidactico.cl | 2018-11-09 | 2018-11-09 | 1 |
| g-analytics.com | www.visiondirect.nl | 2018-11-04 | 2018-11-09 | 41 |
| magento.name | patbo.com.br | 2018-11-05 | 2018-11-09 | 3 |
| googletagmanager.eu | professional.imageskincare.nl | 2018-11-09 | 2018-11-09 | 2 |
| googletagmanager.eu | consument.imageskincare.nl | 2018-11-09 | 2018-11-09 | 2 |
| magento.name | eaccesoriigsm.ro | 2018-11-08 | 2018-11-08 | 1 |
| jspoi.com | www.padini.com | 2018-11-04 | 2018-11-08 | 3 |
| g-analytics.com | www.visiondirect.co.uk | 2018-11-03 | 2018-11-08 | 112 |
| googletagmanager.eu | www.oddbins.com | 2018-11-01 | 2018-11-08 | 9 |
| g-analytics.com | www.visiondirect.fr | 2018-11-03 | 2018-11-07 | 53 |
| magento.name | upmarketpets.com | 2018-11-07 | 2018-11-07 | 1 |
| g-analytics.com | www.visiondirect.it | 2018-11-04 | 2018-11-07 | 2 |
| g-analytics.com | www.visiondirect.es | 2018-11-05 | 2018-11-07 | 26 |
| upgradenstore.com | www.armysurplusworld.com | 2018-11-06 | 2018-11-06 | 1 |
| g-analytics.com | www.ozarksource.com | 2018-11-06 | 2018-11-06 | 1 |
| upgradenstore.com | www.princesspolly.com | 2018-11-01 | 2018-11-06 | 3 |
| locatefyou.com | www.jjroofingsupplies.co.uk | 2018-11-01 | 2018-11-06 | 10 |
| g-analytics.com | www.prospin.com.br | 2018-11-06 | 2018-11-06 | 1 |
| web-stats.cc | www.baleyo.com | 2018-11-06 | 2018-11-06 | 1 |
| maxijs.com | copperlab.com | 2018-11-05 | 2018-11-05 | 9 |
| gamacdn.com | csvape.com | 2018-11-03 | 2018-11-05 | 2 |
| valdatecode.com | www.pfiwestern.com | 2018-11-01 | 2018-11-05 | 15 |
| googletagmanager.eu | erecycleronline.com | 2018-11-05 | 2018-11-05 | 1 |
| magento.name | nicoman.co.uk | 2018-11-01 | 2018-11-05 | 2 |
| minifyscripts.com | shop.bombingscience.com | 2018-11-03 | 2018-11-04 | 4 |
| web-stats.cc | shelfadditions.com | 2018-11-04 | 2018-11-04 | 2 |
| jspoi.com | store.asqgrp.com | 2018-11-01 | 2018-11-04 | 3 |
| trafficanalyzer.biz | www.irishnewsarchive.com | 2018-11-03 | 2018-11-03 | 1 |
| magento.name | www.cochesdemetal.es | 2018-11-01 | 2018-11-03 | 2 |
| magento.name | originalnye-zapchasti.com | 2018-11-02 | 2018-11-02 | 1 |
| googletagmanager.eu | www.exeltek.com.au | 2018-11-02 | 2018-11-02 | 2 |
| g-analytics.com | www.hyperparapharmacie.com | 2018-11-02 | 2018-11-02 | 1 |
| amasty.biz | www.decantshop.com | 2018-11-01 | 2018-11-01 | 1 |
| jspoi.com | massivejoes.com | 2018-11-01 | 2018-11-01 | 4 |
| cdnrfv.com | www.versare.com | 2018-11-01 | 2018-11-01 | 18 |
| magento.name | www.yourdezire.co.uk | 2018-11-01 | 2018-11-01 | 2 |
| allacarts.com | www.plumprettysugar.com | 2018-11-01 | 2018-11-01 | 6 |
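
The timespan and volume figures in the table above lend themselves to simple aggregation. The following is a minimal sketch, assuming a handful of rows copied by hand from the table into a pandas DataFrame. The column names (`server`, `site`, `first_seen`, `last_seen`, `pages`) are our own choice for illustration and are not part of the published data.

```python
import pandas as pd

# A few rows copied from the table above; column names are illustrative only.
rows = [
    ("google-analytics.is", "www.groworganic.com", "2018-11-22", "2018-11-28", 64),
    ("g-analytics.com", "www.groworganic.com", "2018-11-04", "2018-11-20", 87),
    ("googletagmanager.eu", "www.wdrshop.de", "2018-11-03", "2018-11-28", 74),
    ("fastproxycdn.com", "tilebar.com", "2018-11-03", "2018-11-14", 120),
]
infections = pd.DataFrame(
    rows, columns=["server", "site", "first_seen", "last_seen", "pages"]
)
infections["first_seen"] = pd.to_datetime(infections["first_seen"])
infections["last_seen"] = pd.to_datetime(infections["last_seen"])

# Days between first and last sighting, counting both endpoints.
infections["days_observed"] = (
    infections["last_seen"] - infections["first_seen"]
).dt.days + 1

# Per-site exposure window and page loads, merging rows where the same site
# was seen talking to more than one collection server.
per_site = infections.groupby("site").agg(
    first_seen=("first_seen", "min"),
    last_seen=("last_seen", "max"),
    pages=("pages", "sum"),
)
```

Grouping by site rather than by collection server matters for sites such as www.groworganic.com, which appear in the table twice because the collection domain changed during the month.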
Figure 1: Sankey diagram used to represent a [tracker map](../websites/tumblr.com.html)
Sankey diagrams are great at visualizing flow volume metrics. Sometimes they are found under the name alluvial diagrams, although they originally are different types of flow diagrams [1]. We wanted to use the sankey diagram supported in plotly [2], the visualisation library of choice used in whotracks.me.

The function itself is pretty simple, as you will see in a bit when we define `sankey_diagram()`. The challenge in creating sankey diagrams with Plotly is understanding the structure of the input data required by the plotting function. Hopefully the following example will make it easier for those reading this post, should they ever decide to try sankey diagrams.

The goal here is to show a very small dataset, structured in a way that the plotly diagram (and other plotting solutions, e.g. d3.js) understand. We will be mapping cities to the countries they are part of. The value of each link will be the city population (in millions).

```python
city_data = dict(
    nodes = dict(
        label=["Germany", "Berlin", "Munich", "Cologne", "France", "Paris", "Lyon", "Bordeaux"],
        color=["beige", "black", "red", "yellow", "beige", "blue", "white", "red"]
    ),
    links = dict(
        source=[0, 0, 0, 4, 4, 4],
        target=[1, 2, 3, 5, 6, 7],
        value= [3.5, 1.5, 1, 2.2, 0.5, 0.2],
        label=["capital", "city", "city", "capital", "city", "city"],
        color=["black", "red", "yellow", "blue", "whitesmoke", "red"]
    )
)
```

Note how there are two keys in the dictionary, `nodes` and `links`, and each has some attributes. Let's go over them.

Each node has a `label` (e.g. Germany) and a corresponding `color` (in this case `beige`). Note that labels and colors are stored in lists of equal length, and the pairing is done by position: the label and the color at the same index belong to the same node.

Links contain information about how to link nodes. Each has a `source`, `target`, `value`, `label` and `color`. Source contains the index in the list of the source node, whereas target contains the index in the list of the target node.
Value determines how thick the link should be (in our case it is the population of each city). Label and color, as the names suggest, specify the label and color of the link. Links, too, are paired based on index.

## Plotting a sankey diagram

Now let's write a simple function to plot this data nicely. Most of the work has already been done, given we're feeding the data in a format that's easy to parse.

```python
from plotly.offline import iplot


def sankey_diagram(sndata, title):
    # First part of a plotly plot is the `trace`
    data_trace = dict(
        type='sankey',
        node=dict(
            pad=10,
            thickness=30,
            # label could simply be sndata['nodes']['label']; the following is just cosmetics
            label=list(map(lambda x: x.replace("_", " ").capitalize(), sndata['nodes']['label'])),
            color=sndata['nodes']['color']
        ),
        link=sndata["links"],
        # configuration options for the diagram
        domain=dict(
            x=[0, 1],
            y=[0, 1]
        ),
        hoverinfo="none",
        orientation="h"
    )

    # Second part of a plotly plot is the `layout`
    layout = dict(
        title=title,
        font=dict(
            size=12
        )
    )

    fig = dict(data=[data_trace], layout=layout)
    return iplot(fig)
```

## Sankey diagram for a few German and French cities

All that is left now is feeding the `city_data` to the `sankey_diagram` function and we're done.

Figure 1: Simple example of a sankey diagram for cities
Trying to create the flags of these countries did not end up being such an aesthetically good idea.

# From Cities to Trackers

Doing Sankey diagrams for cities may have been fun. The result of doing the same for trackers on your favourite sites might not be as fun, and may in fact be terrifying. We'll be using public data from whotracks.me to map tracker categories to companies present on a particular site. Each link will be a tracker the company owns. This gives immediate visual insights on who's watching you and why.

## Terse intro to the API

The data and API for whotracksme is available on PyPI and you can easily install it by running `pip install whotracksme`.

```python
from whotracksme.data.loader import DataSource
from whotracksme.website.plotting.colors import tracker_categoryColors, cliqz_colors
```

DataSource is a class that provides access to trackers, companies that own them, and popular websites. The functionality of DataSource is something we'll be constantly trying to improve and expand. Online tracking is messy enough to analyze, so at least the tooling should be as simple as possible.

We will be looking at Reddit. If you are not familiar with Reddit, check it out - there are some great communities there. Now we'll look at the tracking landscape on reddit. To do that, we only need to know the reddit `site_id`, which is `reddit.com`. Each site has a `site_id`, most often its url.

## Preparing reddit tracker data for sankey diagram

Here we will be mapping companies and the trackers they operate to the category of the tracker. The thickness of the link is a function of the frequency of appearance of the tracker per page load in the given domain.
```python
def sankey_data(site_id, data_source):
    nodes = []
    link_source = []
    link_target = []
    link_value = []
    link_label = []

    for (tracker, category, company) in data_source.sites.trackers_on_site(site_id, data_source.trackers, data_source.companies):
        # index of this category in nodes
        if category in nodes:
            cat_idx = nodes.index(category)
        else:
            nodes.append(category)
            cat_idx = len(nodes) - 1

        # index of this company in nodes
        if company in nodes:
            com_idx = nodes.index(company)
        else:
            nodes.append(company)
            com_idx = len(nodes) - 1

        link_source.append(cat_idx)
        link_target.append(com_idx)
        link_label.append(tracker["name"])
        link_value.append(100.0 * tracker["frequency"])

    # categories get their category color, companies fall back to purple
    label_colors = [tracker_categoryColors[l] if l in tracker_categoryColors else cliqz_colors["purple"] for l in nodes]

    return dict(
        nodes = dict(
            label=nodes,
            color=label_colors
        ),
        links = dict(
            source=link_source,
            target=link_target,
            value=link_value,
            label=link_label,
            color=["#dedede"] * len(link_label)
        )
    )
```

Now that we have a function to generate the data in the format we need it, let's run it for reddit and plot the sankey diagram to investigate the tracking landscape:

```python
input_data = sankey_data('reddit.com', data_source=DataSource())
sankey_diagram(input_data, 'Tracker Map on reddit.com')
```

Figure 1: Tracking landscape on reddit.com
We see that most tracking happens for advertising reasons. Although it does not seem like it, Reddit keeps the set of advertisers it exposes its users to somewhat limited compared to other portals and news sites. In terms of number of trackers, Google has the most eyes on reddit users. For more details on the tracking landscape on reddit, head over to reddit's [profile page](https://whotracks.me/websites/reddit.com.html) on whotracks.me.

## References

[[1] Sankey Diagrams](https://en.wikipedia.org/wiki/Sankey_diagram) - Wikipedia

Figure 1: banner ads are intrusive and distracting. In this example they are placed right next to the content to get the users’ attention.
### Deception

Another reason for annoyance is the use of native advertising: ads that are designed to resemble content as much as possible (Figure 2). The main goal is to maximize click-through rates on ads by deliberately misleading users on the nature of the content. Although users are less likely to notice the presence of native advertising compared to traditional banner ads [9], it is more difficult for them to distinguish organic content from paid ads.

Figure 2: native advertising is deceptive as it makes it difficult for users to distinguish organic content from paid ads.
### Page Breakage

Last but not least, ads and tracking increase page loading times: users have to wait substantially longer for content to appear, which degrades the online user experience. The average data usage by trackers amounts to more than 6MB per page load [2]. In a Mozilla study, researchers further found that the average number of reported problems with web pages was higher for users with tracker blocking disabled than for those with it enabled [3]. These users reported more often that web pages felt slow, laggy, or unresponsive. This is surprising because tracking protection is often the reason for such page breakage.

### Moments of embarrassment

The facets discussed so far all relate to functional problems (i.e. web pages do not work, users cannot complete their tasks, etc.). However, there is also another dimension to the visual effects of tracking: social implications. In order to deliver the most relevant ads to the user, ads are often targeted based on previous online behaviors, such as page visits. For example, a user would see ads for sports shoes on a news page after having searched for them on a shopping site.

While the majority of users are opposed to behavioral targeting and are concerned about their privacy [4, 10], behavioral targeting can affect the user experience in a much more direct way, in particular when sharing a computer: imagine an online purchase for your loved one popping up on a web page visited by the future gift recipient, and the surprise is ruined. Or imagine a friend looking over your shoulder and getting a glimpse of an ad about something that you find embarrassing.

## The Hidden Effects of Tracking

Yet, a large part of tracking takes place behind the scenes of the shiny web surface. It's not obvious that users are being observed, yet trackers record all their page visits [6]. This is not only a privacy problem, it also affects the user experience.
### Lack of Transparency

Most users are aware that their searches and interactions are recorded when using services like Facebook or Amazon. After all, they are explicitly registered and logged in. Users understand that such services need to know certain things to provide their services, for example, to show interesting posts or to suggest new friends.

However, a large part of tracking takes place via third-party trackers, scripts that are embedded on pages around the web or are part of a browser add-on without the users' awareness. These scripts call home to report on each user's behavior, often without having asked for permission. It is not transparent to users that their oftentimes personal data is shared, with whom it is shared, and where it is stored. For example, users were surprised to learn that browsing history is used to target ads [10].

### Lack of Control

Even if users knew about the extent of tracking taking place, there is still a lack of control. Once the data is out on some servers, users do not have the option to audit or delete the data stored about them. Current approaches for giving control to users are not understood by users [10].

### Transparency and Control are Critical

Why are transparency and control so important? Data collected by trackers reveal more about a person than you might think. One page visit may not tell who you are, but visits to multiple pages do. Trackers connect these visits through unique identifiers. Suddenly the virtual self turns into a real person: profile pictures from social networks reveal the visual appearance, location sharing exposes home and workplace, and shopping behavior hints at personal preferences. All this happens without the awareness of the user: the user experience on the surface does not reveal the operating network of trackers underneath it.

## The UX Challenges

Both visible and hidden effects of trackers on user experience are non-trivial to address.
Numerous applications or add-ons exist to remove ads from web pages or to reduce the effects of tracking. Adblock Plus and NoScript are two popular examples.

Figure 3: an example of an ad block wall encouraging users to whitelist trackers.
However, removing ads leads to page breakage, just as ads do in the first place. News sites, for example, use adblock detection to put up ad block walls, asking users to whitelist ads in order to access the content (Figure 3). Ad block walls not only degrade the user experience but also reduce traffic to the underlying pages: a recent survey found that 74% of American adblock users choose to leave sites with adblock walls [1]. Another example is pages that show no visible ads but use scripts for tracking. Blocking all scripts offers protection, but makes modern web pages unusable as many features rely on scripting.

Overall, tracking is a complex topic. Its technical foundation is hard to grasp for most users. Users build their mental models about how trackers work based on their own experiences. This leads to wrong beliefs, such as that Facebook cannot track users once they are logged out of the platform.

On the other hand, advertising is often the only revenue stream for web sites. Users benefit from it as websites can run without charging their users. Nonetheless, revenue should never come at the cost of the users’ privacy. It is not easy, but targeted advertising does not have to rely on tracking [5]. Users should always be in control over their data. This is the paradigm that Cliqz follows in their products [11].

The challenges, from a user experience point of view, lie in educating users in a simple enough way so that they understand the effects of tracking, and in allowing users to decide which data they want to share or not to share.

We would love to hear your thoughts on this topic.

## References

[1] [2017 Adblock Report](https://pagefair.com/blog/2017/adblockreport/)

Figure 1: Page loads per country, March 2018
This volume of data will also enable us to publish WhoTracks.Me content for individual countries, something we plan to add later this month.

## Data restructuring

We have updated the structure of data which we publish in our [repository](https://github.com/ghostery/whotracks.me/) to make it both easier to use and more scalable as we add more data. We now publish CSV files each month for each of the following:

* `domains.csv`: Top third-party domains seen tracking.
* `trackers.csv`: Top trackers - this combines domains known to be operated by the same tracker.
* `companies.csv`: Top companies - aggregates the stats for trackers owned by the same company.
* `sites.csv`: Stats for number of trackers seen on popular websites.
* `site_trackers.csv`: Stats for each tracker on each site.

These files can then be loaded with popular data-analysis tools such as [Pandas](https://pandas.pydata.org/). We have also rewritten the code to render the site to take advantage of Pandas. We expose the dataframes via the `DataSource` class which loads data from all CSV files:

```python
from whotracksme.data.loader import DataSource

data = DataSource()
len(data.trackers.df)
# >> 7928
```

We have also updated the criteria by which we include trackers and sites on the main site. We now 'rollover' entries, so once they have been included once, we will keep publishing data (until they completely disappear from the data). This has the effect of naturally growing the number of trackers and sites we publish. We currently have data on 868 trackers and 748 websites published:

```python
import pandas as pd

# number of trackers and sites published per month
pd.DataFrame({
    'trackers': data.trackers.df.groupby('month').count()['tracker'],
    'sites': data.sites.df.groupby('month').count()['site']
}).plot()
```

Figure 2: Growth of trackers and sites
The average number of trackers per site continues a slight downward trend, although the average is still high at 9 trackers per page. There are several possible reasons for this; it is not necessarily that sites are using fewer trackers! The proportion of data from Ghostery users continues to increase, and these users will disproportionately block many trackers. This has an effect on the average number of trackers, because it prevents the blocked trackers from loading others. The data also shows that the average incidence of blocking for trackers increased to 25% in March, up from 20% in February.

```python
import seaborn as sns

# distribution of trackers per site, one box per month
sns.boxplot(
    data=data.sites.df[data.sites.df.month >= '2018-01'],
    x='month',
    y='trackers'
)
```

Figure 3: Average trackers per page since January
```python
(data.trackers.df[data.trackers.df.month >= '2018-01']
 .groupby('month')['has_blocking']
 .mean() * 100).plot()
```

Figure 4: Blocking Trend since January
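
To make the blocking-trend aggregation above easier to follow without downloading the data, here is a minimal sketch on a toy frame. The frame `trackers_df` is a hypothetical stand-in for `data.trackers.df`, and its tracker names and `has_blocking` values are invented for illustration; only the column names `month` and `has_blocking` are taken from the snippet above.

```python
import pandas as pd

# Toy stand-in for `data.trackers.df`; names and values are made up.
trackers_df = pd.DataFrame({
    "month": ["2018-02", "2018-02", "2018-03", "2018-03"],
    "tracker": ["tracker_a", "tracker_b", "tracker_a", "tracker_b"],
    "has_blocking": [0.10, 0.30, 0.20, 0.30],
})

# Months are "YYYY-MM" strings, so a plain string comparison orders them
# chronologically. The result is the mean share of page loads affected by
# blocking across trackers, per month, expressed as a percentage.
blocking_trend = (
    trackers_df[trackers_df.month >= "2018-01"]
    .groupby("month")["has_blocking"]
    .mean() * 100
)
```

Note that this is an unweighted mean over trackers: a rarely seen tracker counts as much as a very common one.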
As in previous months, we look at sites changing their trackers. [fewo-direkt.de](../websites/fewo-direkt.de.html), [brigitte.de](../websites/brigitte.de.html) and [gutefrage.net](../websites/gutefrage.net.html) all had more than 5 fewer trackers on average per page this month. However, each of these still has over 50 trackers with some kind of presence, showing that this is more likely a side-effect of increased blocking than an active effort to reduce tracking on their sites. [klingel.de](../websites/klingel.de.html) and [informationvine.com](../websites/informationvine.com.html) see the largest increase in tracking of the sites we currently monitor.

| Site | Trackers | Change since February |
|---|---|---|
| informationvine.com | 18.3 | +6.4 |
| klingel.de | 26.7 | +5.3 |
| gutefrage.net | 13.0 | -5.6 |
| brigitte.de | 19.5 | -5.8 |
| fewo-direkt.de | 16.0 | -6.6 |
Table 1: Websites Tracking Trends
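
A month-over-month change like the one in Table 1 can be computed by pivoting the per-site data so that each month becomes a column. The following is a minimal sketch on a toy frame, not the code used to build the table: `sites_df` is a hypothetical stand-in for `data.sites.df`, the March values are taken from Table 1, and the February values are back-calculated from its "Change since February" column.

```python
import pandas as pd

# Toy stand-in for `data.sites.df`. March values come from Table 1; February
# values are derived from the table's "Change since February" column.
sites_df = pd.DataFrame({
    "month": ["2018-02", "2018-02", "2018-03", "2018-03"],
    "site": ["klingel.de", "brigitte.de", "klingel.de", "brigitte.de"],
    "trackers": [21.4, 25.3, 26.7, 19.5],
})

# One row per site, one column per month.
by_month = sites_df.pivot(index="site", columns="month", values="trackers")

# Month-over-month change, largest increase first.
change = (
    (by_month["2018-03"] - by_month["2018-02"])
    .round(1)
    .sort_values(ascending=False)
)
```

The head of `change` then gives the sites with the largest increases and its tail the largest decreases.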
A side-effect of the filtering we added in this new data pipeline is that the site reach for top trackers has increased. In the previous analysis a long-tail of very rarely visited sites reduced effective site reach. With this factor reduced, we get a real sense of the coverage of the largest trackers, with Google Analytics reaching 85% of popular sites, and Facebook almost 60%. The data can easily be retrieved as shown below:

```python
df = data.trackers.get_snapshot().sort_values(by='site_reach', ascending=False).head(10)
df['name'] = df.id.apply(func=lambda x: data.app_info[x]['name'])
```

Figure 5: Reach of top 10 trackers across popular websites
If you want to delve deeper into our data, it is available on the [WhoTracks.Me Github Repository](https://github.com/ghostery/whotracks.me/tree/master/whotracksme/data), and as a [pip package](https://pypi.python.org/pypi/whotracksme/).

_NB: The code snippets here will not generate the presented plots. Full code snippets for the plots in this post are available in this [Jupyter Notebook](https://nbviewer.jupyter.org/github/ghostery/whotracks.me/blob/master/contrib/wtm_april_update.ipynb)._

================================================
FILE: blog/update_dec_2017.md
================================================
title: WhoTracks.me December Update
subtitle: New data and trackers in our monthly update.
author: privacy team
type: article
publish: True
date: 2017-12-08
tags: blog, update
header_img: blog/blog-data-dec17.png
+++

We're happy to update the site today with data from November 2017 - based on data from 100 million page loads. We're also expanding the amount of data we show, up to 600 top websites and 600 top trackers.

## New Trackers in the database

Increasing the number of trackers displayed meant that we needed to add tracker information for a new batch of tracker domains, as well as new entrants appearing in the top 500. Here are the 3 most interesting entrants:

* [Tru Optik](../trackers/truoptik.html), a company offering targeted advertising for Smart TVs, and claiming 70 Million US households in their 'Household Graph'. Their presence across major German sites suggests they might be using online ad networks in order to harvest user information and link it to active Smart TVs, where they can then push targeted adverts.
* [Digitrust](../trackers/digitrust.html), a non-profit aiming to reduce the number of third-party requests per page. Their solution, however, is to create a unified user identifier, intended to prevent the need for trackers to synchronise pixels and tracking tokens on each page.
  Notably, they state that they [do not support](http://www.digitru.st/faqs/) the [Do Not Track](https://en.wikipedia.org/wiki/Do_Not_Track) standard, so their claims to be working in consumers' interests are, at best, suspect.
* [ORC International](../trackers/orc_international.html), the registered owner of the domain `emxdgt.com`, and a subsidiary of [Engine](http://www.enginegroup.com/), an advertising agency. Despite only appearing in our data recently, they have quickly risen up to the top 300 trackers, and are listed in the [ads.txt](https://iabtechlab.com/ads-txt/) files as a reseller for several major US publications, such as [The Atlantic](https://www.theatlantic.com/ads.txt) and [CNET](https://www.cnet.com/ads.txt). Their ownership and their data collection policy are, however, not transparently disclosed.

## Month-to-month trends

The average number of trackers on top websites increased to 10, an increase of 3%. [Heine.de](../websites/heine.de.html), [gutefrage.net](../websites/gutefrage.net.html), [sportscheck.com](../websites/sportscheck.com.html) and [bild.de](../websites/bild.de.html) increased their number of trackers the most, each adding on average 5 more trackers per page load. At the other end of the spectrum, [paket.de](../websites/paket.de.html), [jackpot.de](../websites/jackpot.de.html) and [hurriyet.com.tr](../websites/hurriyet.com.tr.html) had on average 5 fewer trackers per page.

On the tracker side, the biggest gain was by [pmddby.com](../trackers/pmddby.com.html), which increased its reach by 9 times since October. Its profile is that of spyware that injects ads into webpages for affected users; however, at this time we were not able to determine the source, as the WHOIS data for the domain is private.

## Additions to the dataset

This month we added two new signals to the data which attempt to show the effect of ad-blockers on the trackers in our database.
These signals are:

* `has_blocking` - the proportion of pages on which this tracker was affected by some kind of blocking.
* `requests_failed` - the average number of failed requests per page load (for comparison with `requests` to get an idea of how aggressive the blocking is).

These signals should be able to tell us something about the impact of blocking on different trackers in the ecosystem. For example, we see evidence of blocking 40% of the time for Google Analytics and Facebook, and between 10% and 20% of requests failing. Thus, anyone using these services to measure activity and conversions on their sites must reckon with error rates of this order. We also can see how new entrants can initially avoid the effects of blocking - for [Tru Optik](../trackers/truoptik.html) and [Digitrust](../trackers/digitrust.html), which we mentioned earlier, we measure only 5% and 1% of pages which may be affected by blocking.

These stats are currently only available in the raw data, but we will be looking at incorporating them in the site in due course.

================================================
FILE: blog/update_feb_2018.md
================================================
title: February Update - The Tracking Shell Game
subtitle: How mergers and acquisitions are hiding who actually is tracking us.
author: privacy team
type: article
publish: True
date: 2018-02-06
tags: blog, update
header_img: blog/blog-data-feb18-2.png
+++

_This post is one of our regular monthly blogs accompanying an update to the data displayed on WhoTracks.Me. In these posts we introduce what data has been added as well as point out interesting trends and case-studies we found in the last month. Previous month's posts can be found here: [January 2018](./update_jan_2018.html), [December 2017](./update_dec_2017.html)._

We've updated the site today with data collected during January 2018.
Due to increased distribution, we have over 115 million page loads this month, an increase of 15% over previous months (see [Where does the data come from?](./where_is_the_data_from.html) for more background on our data collection). The regions from which we are getting data are also diversifying. While 70% of the data still comes from German users, we now have more significant US and international data. We plan to have sufficient data in the coming months to provide region-specific tracking breakdowns.

## The tracking shell game

Picking out some of the biggest movers in the rankings this month, we first find [Nexage](../trackers/nexage.html) down 262 places, and at one tenth of the reach it had in May last year. This is probably simply a winding down of the operation, which was acquired by Millennial Media in 2014, who were acquired by AOL in 2015, who were in turn acquired by Verizon, also in 2015. Their landing page now redirects to [One by AOL](https://www.onebyaol.com/).

One of the challenges for us on whotracks.me is to make the link between tracker domain names, tracking products, and tracking companies. Nexage is an example of how many trackers lead you down a 'rabbit hole' of mergers and acquisitions until you find the company above it all. If we expand out the web of companies underneath Verizon which are also present on whotracks.me, we find 10 different trackers which can be linked: [Adap.tv](../trackers/adap.tv.html), [ADTECH](../trackers/adtech.html), [Advertising.com](../trackers/advertising.com.html), [alephD](../trackers/alephd.com.html), [Convertro](../trackers/convertro.html), [Nexage](../trackers/nexage.html) and [Vidible](../trackers/vidible.html) under AOL, and [Brightroll](../trackers/brightroll.html) and [Flickr](../trackers/flickr_badge.html) under [Yahoo](../trackers/yahoo.html).
Furthermore, Yahoo and AOL both have popular web portals ([yahoo.com](../websites/yahoo.com.html), [aol.com](../websites/aol.com.html) and [aol.de](../websites/aol.de.html)) which drive more traffic that they can track. As a result, Verizon is able to track at least 6% of web traffic, the 11th highest reach of any company in our dataset. You can now check the full list of companies sorted by their trackers' combined reach [here](../companies/reach-chart.html).
Verizon's trackers - © WhoTracks.Me 2018
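
The 'rabbit hole' described above can be sketched as a walk up a parent map until no further owner is known. The snippet below is a minimal illustration only: the `PARENT` dictionary and the `ownership_chain` helper are hypothetical names, filled in by hand from the acquisitions mentioned in this post, and do not reflect how whotracks.me actually stores ownership data.

```python
# Hypothetical parent map, transcribed from the acquisitions named in this post.
# Illustrative only; this is not the whotracks.me data model.
PARENT = {
    "Nexage": "Millennial Media",
    "Millennial Media": "AOL",
    "AOL": "Verizon",
    "Brightroll": "Yahoo",
    "Flickr": "Yahoo",
    "Yahoo": "Verizon",
}


def ownership_chain(name, parents=PARENT):
    """Return the list of owners from `name` up to the top-level company."""
    chain = [name]
    while chain[-1] in parents:
        owner = parents[chain[-1]]
        if owner in chain:  # guard against accidental cycles in the map
            raise ValueError(f"ownership cycle at {owner!r}")
        chain.append(owner)
    return chain


print(" -> ".join(ownership_chain("Nexage")))
# prints: Nexage -> Millennial Media -> AOL -> Verizon
```

Keeping the whole chain, rather than only the final owner, preserves the intermediate companies, which is often what a reader needs in order to recognise a tracker.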
A new entry at 546, [Smarter Travel Media](../trackers/smarter_travel.html) is another example of a smaller company with giants hiding behind it. The tracker is primarily present on Tripadvisor and other travel sites, and in fact they are a [Tripadvisor](../trackers/tripadvisor.html) company. Tripadvisor in turn is owned by [Expedia](../trackers/expedia.html). Above all of this stands [InterActiveCorp (IAC)](http://iac.com/), who own several other web brands, including [Vimeo](../trackers/vimeo.html) and [Mindspark](../trackers/mindspark.html).
IAC - © WhoTracks.Me 2018
The final movers we would like to highlight this month are [davebestdeals.com](../trackers/davebestdeals.com.html) and [eshopcomp.com](../trackers/eshopcomp.com.html), up 201 and 194 places respectively. Unfortunately we cannot yet trace the owners of these trackers: they are both registered with PrivacyGuard in [Panama](https://who.is/whois/eshopcomp.com) and have no visible landing page. In fact, they are likely operated by the same entity, as their domains point to the same CloudFront endpoints, for example on the `istatic` subdomain of both domains.

The reason for the lack of transparency in this case is that they are malware. Looking at their profile pages, we can see that they have a small presence across many sites, including sites which we know for certain would not have trackers like this in the page (e.g. Google sites, which will only ever contain Google's own trackers). These are likely browser extensions which inject their tracking code into all of the pages the user visits, and send this information back to their servers. The user browsing history which they collect can then be sold on. The Web Of Trust extension was [caught doing this](https://www.forbes.com/sites/leemathews/2016/11/07/web-of-trust-browser-add-on-blasted-for-breaking-user-trust/#5029a0a53ef5) in 2016, and our data shows that it is still a common practice (look for the 'Extensions' tracker category on this site).

This style of user history harvesting also has the advantage that it is not blocked by the majority of ad-blocking and privacy tools. These domains are not on the tools' blocklists, because the list maintainers will not encounter them unless they happen to install the malware themselves. Therefore, currently only Cliqz and Ghostery 8's AI anti-tracking detect these trackers and prevent them from gathering user sessions, because they use the same data to find trackers that whotracks.me uses.
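
The heuristic described above (a small presence on many sites, including sites that should only carry first-party trackers) can be sketched roughly as follows. Everything here is an assumption made for illustration: the `OBSERVATIONS` rows, the `FIRST_PARTY_ONLY` allow-list, the `likely_injected` helper and the 1% threshold are invented and are not part of the whotracks.me pipeline.

```python
from collections import defaultdict

# Hypothetical observations: (site, tracker, share of page loads on that site).
OBSERVATIONS = [
    ("google.com", "google_analytics", 0.40),
    ("google.com", "davebestdeals.com", 0.002),
    ("bild.de", "davebestdeals.com", 0.003),
    ("bild.de", "google_analytics", 0.90),
    ("reddit.com", "davebestdeals.com", 0.001),
    ("reddit.com", "google_analytics", 0.98),
]

# Hypothetical allow-list: sites assumed to embed only these trackers.
FIRST_PARTY_ONLY = {"google.com": {"google_analytics"}}


def likely_injected(observations, first_party_only, max_frequency=0.01):
    """Flag trackers seen on a first-party-only site that stay rare everywhere."""
    frequencies = defaultdict(list)
    seen_where_unexpected = set()
    for site, tracker, frequency in observations:
        frequencies[tracker].append(frequency)
        allowed = first_party_only.get(site)
        if allowed is not None and tracker not in allowed:
            seen_where_unexpected.add(tracker)
    # Keep only trackers whose presence is small on every site they appear on.
    return sorted(
        tracker
        for tracker in seen_where_unexpected
        if max(frequencies[tracker]) <= max_frequency
    )


print(likely_injected(OBSERVATIONS, FIRST_PARTY_ONLY))
# prints: ['davebestdeals.com']
```

A flag from a sketch like this is only a hint that something is being injected client-side; confirming that a domain belongs to an extension or malware still needs manual investigation.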
## New data points This month we add data about the content-types loaded by trackers. This is based on values reported by the [webRequest 'type' property](https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/ResourceType). By reporting these values we can further characterise tracker behaviours, and quantify risks, such as which trackers are being permitted to load scripts on certain pages. We add the following new columns for trackers, reported as the proportion of pages where the specific tracker or company loaded particular resource type(s) into the page: * `script`: Javascript code (via a `" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting imports\n", "from plotly.offline import init_notebook_mode, iplot\n", "init_notebook_mode()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Sankey Data\n", "When building the tracker maps that you see on popular site profiles on whotracks.me, sankey diagrams seemed like a good fit to map categories of tracking to companies that own the trackers. Each link would be a tracker, going from a category to a company. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given we had decided to use plottly.offline to generate the interactive images, I wanted to use the sankey diagram supported in plotly. The fuction itself is pretty straightforward, as you can see in `sankey_diagram()`, but figuring out how the structure of the input data took a bit. Hopefully the following example will make it easier for those reading this post, should they ever decided to try sankey diagrams.\n", "\n", "The goal here is to show some very small dataset, structured in a way that the plotly diagram (and other plotting solutions e.g.: d3.js) understand. We will be mapping cities to the countries they are part of. 
The value of each link, will be the city population (in millions)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "city_data = dict(\n", " nodes = dict(\n", " label=[\"Germany\", \"Berlin\", \"Munich\", \"Cologne\", \"France\", \"Paris\", \"Lyon\", \"Bordeaux\"],\n", " color=[\"beige\", \"black\", \"red\", \"yellow\", \"beige\", \"blue\", \"white\", \"red\"]\n", " ),\n", " links = dict(\n", " source=[0, 0, 0, 4, 4, 4],\n", " target=[1, 2, 3, 5, 6, 7],\n", " value= [3.5, 1.5, 1, 2.2, 0.5, 0.2],\n", " label=[\"capital\", \"city\", \"city\", \"capital\", \"city\", \"city\"],\n", " color=[\"black\", \"red\", \"yellow\", \"blue\", \"whitesmoke\", \"red\"]\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how there are two keys in the `dictionary`, `nodes` and `links`, and each has some attributes. Let's go over them. Each node has a label (e.g. `Germany`) and a corresponding color (in this case `beige`). Note than `labels` and `colors` are stored in lists of equal length, and the pairing is done based on the index. \n", "\n", "Links contain information about how to link nodes. Eeach has a `source`, `target`, `value`, `label` and `color`. Source cointains the index in the list of the source node, whereas target the index in the list of the target node. Value determines how thick the link should be (in our case it will be the population of each link, hence each city), Label and color, as the name suggests, specify the label and color of the link. Links too, are paired based on index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting a sankey diagram\n", "\n", "Now let's write a simple function to plot these data nicely. Most of the work has already been done, given we're feeding the data in a format that's easy to parse." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def sankey_diagram(sndata, title):\n", " # First part of a plotly plot is the `trace`\n", " data_trace = dict(\n", " type='sankey',\n", " node=dict(\n", " pad=10,\n", " thickness=30,\n", " # label could easily be equal to sndatap['node]['label']. The following is just cosmetics\n", " label=list(map(lambda x: x.replace(\"_\", \" \").capitalize(), sndata['nodes']['label'])),\n", " color=sndata['nodes']['color']\n", " ),\n", " link=sndata[\"links\"],\n", " \n", " # configuration options for the diagram\n", " domain=dict(\n", " x=[0, 1],\n", " y=[0, 1]\n", " ),\n", " hoverinfo=\"none\",\n", " orientation=\"h\"\n", " )\n", " # Second part of a plotly plot is the `layout`\n", " layout = dict(\n", " title=title,\n", " font=dict(\n", " size=12\n", " )\n", " )\n", " fig = dict(data=[data_trace], layout=layout)\n", " return iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sankey diagram for a few German and French citites\n", "All that is left now, is feeding the city_data to the sankey_diagram function and we're done." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "hoverinfo": "none", "link": { "color": [ "black", "red", "yellow", "blue", "whitesmoke", "red" ], "label": [ "capital", "city", "city", "capital", "city", "city" ], "source": [ 0, 0, 0, 4, 4, 4 ], "target": [ 1, 2, 3, 5, 6, 7 ], "value": [ 3.5, 1.5, 1, 2.2, 0.5, 0.2 ] }, "node": { "color": [ "beige", "black", "red", "yellow", "beige", "blue", "white", "red" ], "label": [ "Germany", "Berlin", "Munich", "Cologne", "France", "Paris", "Lyon", "Bordeaux" ], "pad": 10, "thickness": 30 }, "orientation": "h", "type": "sankey" } ], "layout": { "font": { "size": 12 }, "title": "A few European Cities" } }, "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sankey_diagram(city_data, \"A few European Cities\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# From Cities to Trackers\n", "\n", "Doing Sankey diagrams for cities may have been fun. I am not sure the result of doing the same for trackers on your favorite sites will be equally fun. In fact it may be terrifying. We'll be using public data from whotracks.me to map tracker categories to Companies present on a particular site. Each link will be a tracker the company owns. This gives imediate visual insights on who's watching you an why. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Utils from `whotracksme`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from whotracksme.data.loader import DataSource\n", "from whotracksme.website.plotting.colors import tracker_category_colors, cliqz_colors\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Data\n", "`DataSource` is a class that provides access to trackers, websites and companies. 
The functionality of `DataSource` is something we'll be constantly trying to improve and expand. Online tracking is messy enough to analyze, so the tooling should be not." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "DATA = DataSource()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available entities\n", "\n", "These entities are loaded into DataSource, but an API is provided for some common operations on each of them. For more details, have a look at `whotracksme.data.loader`. As far as we're concerned, we can load them like this:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "trackers = DATA.trackers\n", "sites = DATA.sites\n", "companies = DATA.companies\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at reddit.com\n", "Most people know what reddit is. For you that don't, check it out - there are some great communities there. Now we'll look at the tracking landscape in reddit. To do that, we only need to know the reddit `site_id`, which is `reddit.com`. Each site has a `site_id`, most often its `url`. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['apps', 'category', 'history', 'name', 'overview', 'subdomains'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reddit_id = \"reddit.com\"\n", "reddit_data = DATA.sites.get_site(reddit_id)\n", "\n", "# reddit_data is a dictionary. And a site object has the following keys: \n", "reddit_data.keys()\n", "\n", "# apps refers to trackers. Naming is hard, but it'll soon be changed to trackers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing tracker data for sankey diagram\n", "Here we will be mapping the trackers on reddit to the category they belong to (on the left) and to the companies that own them (on the right). 
This means each link is a tracker, nodes on the left are categories, and nodes on the right are companies. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def sankey_data(site_id, data=DATA):\n", "\n", " nodes = []\n", " link_source = []\n", " link_target = []\n", " link_value = []\n", " link_label = []\n", "\n", " for (tracker, category, company) in data.sites.trackers_on_site(site_id, data.trackers, data.companies):\n", "\n", " # index of this category in nodes\n", " if category in nodes:\n", " cat_idx = nodes.index(category)\n", " else:\n", " nodes.append(category)\n", " cat_idx = len(nodes) - 1 \n", " \n", " # index of this company in nodes\n", " if company in nodes:\n", " com_idx = nodes.index(company)\n", " else:\n", " nodes.append(company)\n", " com_idx = len(nodes) - 1 \n", " \n", " link_source.append(cat_idx)\n", " link_target.append(com_idx)\n", " link_label.append(tracker[\"name\"])\n", " link_value.append(100.0 * tracker[\"frequency\"])\n", "\n", " label_colors = [tracker_category_colors[l] if l in tracker_category_colors else cliqz_colors[\"purple\"] for l in nodes]\n", "\n", " return dict(\n", " nodes = dict(\n", " label=nodes,\n", " color=label_colors\n", " ),\n", " links = dict(\n", " source=link_source,\n", " target=link_target,\n", " value=link_value,\n", " label=link_label,\n", " color=[\"#dedede\"] * len(link_label)\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "hoverinfo": "none", "link": { "color": [ "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede" ], "label": [ "Reddit", "Google Tag Manager", "Google Analytics", "Amazon Associates", 
"Quantcast", "ScoreCard Research Beacon", "Google", "DoubleClick", "Quantcount", "Google AdServices", "Moat", "Google Syndication", "Google APIs", "OpenX", "Imgur", "Google CDN", "YouTube", "Alexa Metrics", "WikiMedia", "Amazon Web Services", "AppNexus", "InsightExpress", "Advertising.com" ], "source": [ 0, 2, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 10, 5, 12, 10, 14, 4, 10, 17, 5, 4, 5 ], "target": [ 1, 3, 3, 6, 7, 8, 3, 3, 7, 3, 9, 3, 3, 11, 13, 3, 3, 15, 16, 6, 18, 19, 20 ], "value": [ 99.58731289613593, 99.0402000255409, 98.2706125110061, 91.97209321082666, 46.30429960814889, 45.09245132106922, 44.410240554909564, 43.818095052459654, 37.652657261343855, 32.872476996390674, 20.00053770306692, 16.99612181662981, 16.054469320679388, 8.754478058354225, 5.226473810499996, 4.598705479866381, 2.9412357760735577, 2.19920554371862, 1.9437965869297826, 1.2777169127778412, 1.2004220969075352, 1.1789139742305805, 1.1110289620314422 ] }, "node": { "color": [ "#87BCEF", "#A069AB", "#FC9834", "#A069AB", "#84D7F0", "#BF90D2", "#A069AB", "#A069AB", "#A069AB", "#A069AB", "#C0BB61", "#A069AB", "#80C87D", "#A069AB", "#F86D4F", "#A069AB", "#A069AB", "#444", "#A069AB", "#A069AB", "#A069AB" ], "label": [ "Social media", "Reddit", "Essential", "Google", "Site analytics", "Advertising", "Amazon", "Quantcast", "Comscore", "Oracle", "Cdn", "Openx", "Misc", "Imgur", "Audio video player", "Alexa", "Wikimedia", "Hosting", "Appnexus", "Millward brown", "Aol" ], "pad": 10, "thickness": 30 }, "orientation": "h", "type": "sankey" } ], "layout": { "font": { "size": 12 }, "title": "reddit.com" } }, "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "input_data = sankey_data(reddit_id, data=DATA)\n", "sankey_diagram(input_data, reddit_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't forget to check out the article on https://whotracks.me/blog/trackers_in_your_favorite_site.html" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: contrib/wtm_april_update.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Whotracks.me April Update\n", "\n", "This month we have a big update to the site. We have restructured the data we publish to make it easier to use, increased the number of entries we publish, and we have laid the groundwork for internationalised versions of WhoTracks.Me - that means you can see how tracking differs between different countries.\n", "\n", "Thanks to integration with Ghostery 8 we collected significantly more tracker data this month, covering 360 million page loads. This is spread over countries across the world, with Germany and the USA the most represented." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from plotly.offline import init_notebook_mode, iplot, offline\n", "import plotly.graph_objs as go\n", "from whotracksme.website.plotting.colors import cliqz_colors, palette\n", "from whotracksme.website.plotting.utils import (\n", " WTMFonts,\n", " div_output,\n", " set_margins,\n", " annotation,\n", " set_line_style,\n", " set_category_colors\n", ")\n", "\n", "import pandas as pd\n", "init_notebook_mode()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "hole": 0.45, "hoverinfo": "label+percent", "labels": [ "Germany", "USA", "France", "Other", "Russia", "UK", "Poland", "Netherlands", "Canada", "Ukraine", "Austria", "Italy", "Spain", "Switzerland", "Belgium" ], "name": "Data origin", "pull": 0.07, "textfont": { "color": "#1A1A25", "family": "sans-serif", "size": 15 }, "textinfo": "label", "textposition": "outside", "type": "pie", "values": [ 87124064, 78216572, 40282874, 32326828, 24384449, 16317893, 10554555, 10291928, 10054367, 6268086, 6261035, 6094486, 5753209, 4732324, 4048089 ] } ], "layout": { "annotations": [ { "align": "center", "ax": 0, "ay": 0, "bgcolor": "#1A1A25", "bordercolor": "#1A1A25", "borderpad": 5, "borderwidth": 1, "font": { "color": "white", "family": "sans-serif", "size": 15 }, "showarrow": true, "text": "DATA ORIGIN", "width": 100, "x": 0.5, "xref": "x", "y": 0.5, "yref": "y" } ], "margin": { "b": 30, "l": 60, "pad": 5, "r": 60, "t": 30 }, "paper_bgcolor": "#00000000", "plot_bgcolor": "#FFFFFF", "showlegend": false, "xaxis": { "showgrid": false, "showline": false, "showticklabels": false, "zeroline": false }, "yaxis": { "showgrid": false, "showline": false, "showticklabels": false, "zeroline": false } } }, "text/html": [ "" ], 
"text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def doughnut_chart(values, labels, name):\n", " trace = go.Pie(\n", " values=values,\n", " labels=labels,\n", " name=str(name),\n", " hoverinfo=\"label+percent\",\n", " textposition=\"outside\",\n", " hole=0.45,\n", " pull=0.07,\n", " textinfo=\"label\",\n", " textfont=dict(\n", " family=WTMFonts.regular,\n", " color=cliqz_colors[\"black\"],\n", " size=15\n", " ) \n", " )\n", " data = [trace]\n", " layout = dict(\n", " showlegend=False,\n", " paper_bgcolor=cliqz_colors[\"transparent\"],\n", " plot_bgcolor=cliqz_colors[\"white\"],\n", " xaxis=dict(showgrid=False, showline=False, showticklabels=False, zeroline=False),\n", " yaxis=dict(showgrid=False, showline=False, showticklabels=False, zeroline=False),\n", " # autosize=True,\n", " margin=set_margins(t=30, b=30),\n", " annotations=[\n", " annotation(\n", " text=str(name).upper(),\n", " x=0.5,\n", " y=0.5,\n", " background_color=cliqz_colors[\"black\"],\n", " shift_x=0,\n", " text_size=15\n", " )\n", " ]\n", " )\n", " fig = dict(data=data, layout=layout)\n", " # NB: saving plot requires a manual step, plotly is does not support it yet\n", " # source: https://github.com/plotly/plotly.py/issues/880\n", " offline.plot(fig, image='svg')\n", "\n", " return iplot(fig)\n", "\n", "countries = ['Germany', 'USA', 'France', 'Other', 'Russia', 'UK', 'Poland', 'Netherlands', 'Canada', 'Ukraine', 'Austria', 'Italy', 'Spain', 'Switzerland', 'Belgium']\n", "page_loads = [87124064, 78216572, 40282874, 32326828, 24384449, 16317893, 10554555, 10291928, 10054367, 6268086, 6261035, 6094486, 5753209, 4732324, 4048089]\n", "\n", "doughnut_chart(values=page_loads, labels=countries, name='Data origin')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This volume of data will also enable us to publish separate rankings for individual countries, something we plan to add later this month." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data restructure\n", "\n", "We have updated the struture of data which we publish in our [respository](https://github.com/ghostery/whotracks.me/) to make it both easier to use and more scalable as we add more data. We now publish CSV files each month for each of the following:\n", "\n", " * `domains.csv`: Top third-party domains seen tracking.\n", " * `trackers.csv`: Top trackers - this combines domains known be operated by the same tracker.\n", " * `companies.csv`: Top companies - aggregates the stats for trackers owned by the same company.\n", " * `sites.csv`: Stats for number of trackers seen on popular websites.\n", " * `site_trackers.csv`: Stats for each tracker on each site.\n", "\n", "These files can then be loaded with popular data-analysis tools such as [Pandas](https://pandas.pydata.org/). We have also rewritten the code to render the site to take advantage of Pandas. We expose the dataframes via the `DataSource` class which loads data from all CSV files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from whotracksme.data.loader import DataSource\n", "data = DataSource()\n", "len(data.trackers.df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have also updated the criteria by which we include trackers and sites on the main site. We now 'rollover' entries, so once they have been included once, we will keep publishing data (until they completely dissappear from the data). This has the effect of naturally growing the number of trackers and sites we publish. 
We currently have data on 868 trackers and 748 websites published:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_ts():\n", " df = pd.DataFrame({\n", " 'trackers': data.trackers.df.groupby('month').count()['tracker'], \n", " 'sites': data.sites.df.groupby('month').count()['site']\n", " })\n", " sites_trace = go.Scatter(\n", " x=df.index, \n", " y=df.sites, \n", " name='Sites',\n", " line=dict(width=4, color='#9ebcda'),\n", " )\n", " trackers_trace = go.Scatter(\n", " x=df.index, \n", " y=df.trackers, \n", " name='Trackers',\n", " line=dict(width=4, color='#A069AB'),\n", " )\n", " \n", " layout=dict(\n", " margin=set_margins(t=0,b=30),\n", " legend=dict(\n", " x=0.05, y=1,\n", " bgcolor='#E2E2E2',\n", " orientation='h'\n", " )\n", " )\n", " fig = dict(data=[sites_trace, trackers_trace], layout=layout)\n", " offline.plot(fig, image='svg')\n", "\n", " iplot(fig)\n", "\n", "plot_ts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The per site trend for average number of trackers continues a slightly downward trend, but the average is still above 9. There are several possible reasons for this, it is not necessarily that sites are using fewer trackers. The proportion of data from Ghostery users continues to increase, and these users will disproportionately block many trackers. This has an effect on the average number of trackers, because it prevents the blocked trackers from loading others. The data shows also that the average indcidence of blocking for trackers increased to 25% in March, up from 20% in February. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "traces = [\n", " go.Box(\n", " y=data.sites.df[data.sites.df.month == '2018-01'].trackers, \n", " name='Jan 2018',\n", " marker=dict(\n", " color='#c44e52',\n", " line=dict(\n", " color='#c44e52',\n", " width=3\n", " ),\n", " )\n", " ),\n", " go.Box(\n", " y=data.sites.df[data.sites.df.month == '2018-02'].trackers, \n", " name='Feb 2018',\n", " marker=dict(\n", " color='#55a868',\n", " line=dict(\n", " color='#55a868',\n", " width=3\n", " ),\n", " )\n", " ),\n", " go.Box(\n", " y=data.sites.df[data.sites.df.month == '2018-03'].trackers, \n", " name='Mar 2018',\n", " marker=dict(\n", " color='#4c72b0',\n", " line=dict(\n", " color='#4c72b0',\n", " width=3\n", " ),\n", " )\n", " )\n", "]\n", "fig = dict(data=traces, layout=dict(showlegend=False, margin=set_margins(t=0, b=30)))\n", "offline.plot(fig, image='svg')\n", "iplot(fig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Mean occurrence of Blocking per page\n", "traces = [\n", " go.Bar(\n", " x=['Jan 2018', 'Feb 2018', 'Mar 2018'],\n", " y=[\n", " data.trackers.df[data.trackers.df.month == '2018-01'].has_blocking.mean()*100,\n", " data.trackers.df[data.trackers.df.month == '2018-02'].has_blocking.mean()*100,\n", " data.trackers.df[data.trackers.df.month == '2018-03'].has_blocking.mean()*100\n", " ],\n", " marker=dict(\n", " color=['#A069AB', '#9564c4', '#6564c4'],\n", " line=dict(\n", " color='#222',\n", " width=2\n", " ),\n", " )\n", " )\n", "]\n", "fig = dict(data=traces, layout=dict(margin=set_margins(t=0, b=30)))\n", "offline.plot(fig, image_height=200, image_width=800, image='svg', output_type='file')\n", "iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in previous months, we look at sites' changing the trackers. 
[fewo-direct.de](../websites/fewo-direkt.de.html), [brigitte.de](../websites/brigitte.de.html) and [gutefrage.net](../websites/gutefrage.net.html) all had 5 fewer trackers on average per page this month. However, each of these still has over 50 trackers with some kind of presence, showing that this is more likely a side-effect of increased blocking than an active effort to reduce tracking on their sites. [klingel.de](../websites/klingel.de.html) and [informationvine.com](../websites/informationvine.com.html) see the largest increase in tracking of the sites we currently monitor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mar_trackers = data.sites.get_snapshot('2018-03').set_index('site')['trackers']\n", "feb_trackers = data.sites.get_snapshot('2018-02').set_index('site')['trackers']\n", "site_diffs = pd.DataFrame({\n", " 'trackers': mar_trackers,\n", " 'change': (mar_trackers - feb_trackers)\n", "})\n", "site_diffs[(site_diffs.change > 5) | (site_diffs.change < -5.5)].sort_values('change')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A side-effect of the filtering we added in this new data pipeline is that the site reach for top trackers has increased. In the previous analysis a long-tail of very rarely visited sites reduced effective site reach. With this factor reduced, we get a real sense of the coverage of the largest trackers, with Google Analytics reaching 85% of popular sites, and Facebook almost 60%." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = data.trackers.get_snapshot().sort_values(by='site_reach', ascending=False).head(10)\n", "df['name'] = df.id.apply(func=lambda x: data.app_info[x]['name'])\n", "\n", "traces = [\n", " go.Bar(\n", " x=df.site_reach[::-1]*100,\n", " y=df.name[::-1],\n", " orientation='h',\n", " marker=dict(\n", " color=palette('#9ebcda', '#A069AB', 10),\n", " line=dict(\n", " color='#333',\n", " width=2\n", " ),\n", " )\n", " )\n", "]\n", "layout=dict(margin=set_margins(l=200))\n", "fig = dict(data=traces, layout=layout)\n", "offline.plot(fig, image='svg')\n", "iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to delve deeper into our data, it is available on the [Whotracks.me Github Repository](https://github.com/ghostery/whotracks.me/tree/master/whotracksme/data), and as a [pip package](https://pypi.python.org/pypi/whotracksme/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: contrib/wtm_may_update.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Whotracks.me May Update\n", "\n", "*This post is one of our regular monthly blogs accompanying an update to the data\n", "displayed on WhoTracks.Me. In these posts we introduce what data has been added as well\n", "as point out interesting trends and case-studies we found in the last month. 
Previous\n", "month's posts can be found here: [April 2018](./update_apr_2018.html),\n", "[February 2018](./update_feb_2018.html), [January 2018](./update_jan_2018.html),\n", "[December 2017](./update_dec_2017.html).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "This month we update the site with data from 340 million page loads during April 2018. We expand\n", "the number of trackers shown to 951, and the number of websites to 1330. As this will be the last\n", "full month before the [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation)\n", "comes into force for European users, this will provide a benchmark to assess whether there is an\n", "observable difference on the tracking ecosystem.\n", "\n", "This month also saw our new paper **\"WhoTracks.Me: Monitoring the online tracking landscape at scale\"**\n", "published on [Arxiv](https://arxiv.org/abs/1804.08959). This paper covers the methodology behind\n", "the data we collect here, and how we ensure no private information can be leaked during this\n", "process.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from plotly.offline import init_notebook_mode, iplot, offline\n", "\n", "import pandas as pd\n", "import cufflinks as cf\n", "\n", "init_notebook_mode()\n", "cf.set_config_file(offline=False, world_readable=True, theme='pearl')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data available for months: ['2017-05', '2017-06', '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04']\n" ] } ], "source": [ "from whotracksme.data.loader import DataSource\n", "data = DataSource()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notable Changes" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "As is customary, below are the sites with the most notable changes this month. The\n", "largest increase in the average number of trackers per page load was measured in\n", "[markt.de](https://whotracks.me/websites/markt.de.html), and the largest decrease in\n", "[babbel.com](https://whotracks.me/websites/babbel.com.html)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [
| site | change | trackers |
|---|---|---|
| babbel.com | -8.111454 | 12.722951 |
| bento.de | -3.611723 | 19.215815 |
| klingel.de | -3.492893 | 26.706119 |
| tvnow.de | -3.151073 | 25.500678 |
| sheego.de | 4.633526 | 11.616530 |
| markt.de | 10.795911 | 17.783326 |
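
The `change` column in the table above is the month-over-month difference in the average number of trackers per page load. The notebook code that produced it is not shown here, so the following is only a minimal sketch of that calculation in plain Python. The function name `tracker_changes`, the `*.example` site names, and all numbers are hypothetical and chosen purely for illustration.

```python
def tracker_changes(previous, current):
    """Return (site, change, trackers) rows sorted by change, ascending.

    `previous` and `current` map a site name to its average number of
    trackers per page load in two consecutive months. Sites missing from
    either month are skipped, since no change can be computed for them.
    """
    rows = []
    for site, trackers in current.items():
        if site in previous:
            rows.append((site, trackers - previous[site], trackers))
    return sorted(rows, key=lambda row: row[1])


# Made-up averages for two consecutive months (not WhoTracks.Me data).
last_month = {"a.example": 20.0, "b.example": 7.0, "c.example": 5.0}
this_month = {"a.example": 12.0, "b.example": 18.0, "c.example": 5.5, "d.example": 3.0}

changes = tracker_changes(last_month, this_month)
largest_decrease = changes[0]
largest_increase = changes[-1]

print("largest decrease:", largest_decrease)
print("largest increase:", largest_increase)
```

Sorting by the change value puts the largest decrease first and the largest increase last, which matches the ordering of the rows in the table.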
{{ blog_post.subtitle }}
{{ demographics.website_url }}
{% if demographics.description != "None" %} {{ demographics.description }} {% endif %}
COOKIES are files placed by the website and stored in the browser that are used to identify you to the website.
================================================
FILE: templates/components/fingerprinting.html
================================================
FINGERPRINTING is a unique digital signature derived from the properties of your device.
================================================
FILE: templates/components/footer.html
================================================

================================================
FILE: templates/components/home/header.html
================================================
Trackers using cookies.
{{header_stats.by_fingerprinting|to_percentage}}% Trackers using fingerprinting.
{{header_stats.data|b_to_mb}}MB average data usage by trackers
{{tracker.reach|to_percentage}}% of web traffic is tracked by {{ tracker.id|get_app_name }}
Owned by {{app.company_id|get_company_name}}
{% endif %} {% if profile.website_url is not none %} {% endif %}TRACKER RANK
{{ profile.overview.reach_rank }} /{{ trackers }}
{{profile.id|rank_label}}
OPERATES UNDER
of web traffic is tracked by {{ profile.name }}
of the top 10,000 sites seen loading the {{ profile.name }} tracker
{% if profile.date_range[0] == profile.date_range[1] %} Data from {{ profile.date_range[0].strftime('%B %Y') }}. {% else %} Data from {{ profile.date_range[0].strftime('%B %Y') }} to {{ profile.date_range[1].strftime('%B %Y') }}. {% endif %}
No tracking detected at present
{% endif %}No tracking detected on this site at present.
{% endif %}

================================================
FILE: templates/components/unified-ui-tracker-list.html
================================================
{{ site.overview.content_length|b_to_mb }}MB of user data from trackers per page
{{ site.overview.trackers|round2 }} TRACKERS
On average per page
Page loads from {{ profile.website_url | normalize_domain_name }} on which tracking occurred
Tracking requests per page load
WhoTracks.Me is a project of Ghostery GmbH, Arabellastraße 23, 81925 Munich, which owns and operates the website www.whotracks.me.
Ghostery GmbH is registered under the registration HRB 230794 at the Registration Court Amtsgericht Munich – Registergericht –, Infanteriestraße 5, 80325 Munich.
Registered Directors (vertretungsberechtigte Geschäftsführer) are Jean Paul Schmetz and Heinz Spengler.
The VAT ID Number is DE313473689.
For communication you can either visit our get help page or write an e-mail to support@ghostery.com
Proportion of the web traffic tracked by these companies.
See Full Chart
That is more than the next 4 biggest trackers combined.
Facebook knows more than what you just do on Facebook
Which of these have you heard of?
{{ websites.gt10 }}
out of {{ websites.count }} top websites have more than 10 trackers per page.
{{ (websites.data / 1024 / 1024)|round|int }}MB
of data per page load on average required by trackers
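
The two figures above rest on simple arithmetic: counting the sites whose average exceeds 10 trackers per page, and dividing a byte count by 1024 twice to get megabytes. A minimal sketch of both follows. The helper names `count_above` and `bytes_to_whole_mb` and the sample values are hypothetical, and this is not the project's own `b_to_mb` filter, whose implementation is not shown here.

```python
def count_above(trackers_per_site, threshold=10):
    """Count sites whose average trackers per page is strictly above `threshold`."""
    return sum(1 for average in trackers_per_site.values() if average > threshold)


def bytes_to_whole_mb(num_bytes):
    """Convert bytes to megabytes (1 MB = 1024 * 1024 bytes) as a whole number.

    Note: Python's round() rounds exact halves to the nearest even integer,
    which can differ from a template engine's rounding at values like 2.5.
    """
    return int(round(num_bytes / 1024 / 1024))


# Made-up per-site averages and byte count, for illustration only.
sample_sites = {"a.example": 12.7, "b.example": 9.9, "c.example": 10.0}
gt10 = count_above(sample_sites)
megabytes = bytes_to_whole_mb(1_800_000)

print(f"{gt10} out of {len(sample_sites)} sample sites have more than 10 trackers per page.")
print(f"{megabytes}MB of data per page load")
```

A site averaging exactly 10 trackers is not counted, because the wording "more than 10" implies a strict comparison.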
We could not find the page you are looking for. We have lots of other interesting things for you to see though. Did you check our trackers page?
Proportion of the web traffic tracked by their trackers.
Generated from Ghostery Anti-Tracking data
This tracker is not very popular, hence we don't have a profile on it yet. We are constantly adding more data on trackers, so make sure you check again in the future. Meanwhile, have you seen our list of trackers?
{{CATEGORY_DESC[profile.category]}}
{% else %} misc{{ CATEGORY_DESC["misc"]}}
{% endif %}We don't have a profile of this site yet. We are constantly adding more data on popular sites, so make sure you check again in the future. Meanwhile, have you seen our list of websites?
{% include "components/websites/tracker-list.html" %}
Proportion of traffic to top {{ website_list|count }} sites containing trackers
Average number of trackers present on a site
Average number of requests per page that track you
WhoTracks.Me Privacy Policy
Please find below our statement on the processing of personal data by our company in accordance with the legal requirements, particularly the EU General Data Protection Regulation (GDPR - available here).
Content
I. General information
II. Details of data processing
III. Rights of data subjects
IV. California
I. General information
This section of the data privacy statement contains information on the scope of validity, the person responsible for data processing (controller), the data protection officer and data security. It also begins with a list of definitions of important terms used in the data privacy statement.
1. Definition of main terms
Browser: Computer program used to display websites (e.g. Chrome, Firefox, Safari)
Cookies: Text files placed on the user’s computer by the web server by means of the browser which is used. The stored cookie information may contain both an identifier (cookie ID) for recognition purposes and content data, such as login status or information about websites visited. The browser sends the cookie information back to the web server with each new request upon subsequent repeat visits to these sites. Most browsers accept cookies automatically. Cookies can be managed using the browser functions (usually under “Options” or “Settings”). The storage of cookies may be disabled in this way or it may be made dependent on the user’s approval in any given case or otherwise restricted. Cookies may also be deleted at any time.
Third countries: Countries outside of the European Union (EU)
GDPR: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), available here.
Personal data: Any information relating to an identified or identifiable natural person. An identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier, such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
Profiling: Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements.
Services: Our offers to which this data privacy statement applies (cf. Scope of validity).
Tracking: The collection of data and their evaluation regarding the behaviour of visitors in response to our services.
Tracking technologies: Actions can be tracked either via the activity records stored on our web servers (log files) or by collecting data from end devices via pixels, cookies or similar tracking technologies.
Processing: Any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.
Pixel: Pixels are also called tracking pixels, web beacons or web bugs. These are small, invisible graphics in HTML emails or on websites. When a document is opened, this small image is downloaded from a server on the Internet and the download is registered there. This allows the operator of the server to see if and when an email has been opened or a website has been visited. This function is usually carried out by calling up a small program (JavaScript). Certain types of information can be detected on your computer system in this way and shared, such as the content of cookies, the time and date of the visit, and a description of the page on which the tracking pixel is located.
2. Scope of validity
This data privacy statement applies to the following services:
All of these offers are also collectively referred to as “services”.
3. Controller
The following party is responsible for the processing of data in relation to the services, i.e. the role of controller which involves determining the purposes and means of processing personal data:
4. Data protection officer
The contact details of our data protection officer are given in paragraph 3. Messages should be marked for the attention of the data privacy department or sent via privacy@ghostery.com.
II. Details of data processing
1. General information about the data processing operations
The following applies to all the processing operations listed below, unless stated otherwise:
a) No obligation to provide personal data & consequences of failure to provide such data
The provision of personal data is not required by law or contract, and you are under no obligation to provide any data. We will inform you during the data entry process when personal information must be provided for the relevant service (e.g. by indicating “mandatory field”). In cases where the provision of data is required, the consequence of not providing data will be that the service in question cannot be provided. Otherwise, failure to provide data may result in our inability to provide our services in the same form and quality.
b) Consent
In various cases, you may also grant us your consent to the further processing of data (or some of the data, where applicable) in connection with the operations listed below. In this case, we will inform you separately in connection with the submission of the respective declaration of consent about all the procedures and the scope of the consent and concerning the purposes which we pursue in these processing operations. The processing operations based on your consent are therefore not listed again here (Art. 13 (4) GDPR).
c) Transfer of personal data to third countries
When we send data to third countries, i.e. countries outside of the European Union, the data are then transmitted strictly in compliance with the statutory conditions of admissibility. If the transmission of the data to a third country does not serve the purpose of fulfilling our contract with you, if we do not have your consent, if the transmission is not required for asserting, exercising or defending legal claims, and if no other exemption applies under Art. 49 GDPR, we will only transmit your data to a third country if in possession of an adequacy decision pursuant to Art. 45 GDPR or appropriate guarantees under Art. 46 GDPR. In order to ensure an adequate level of data protection, we provide appropriate safeguards pursuant to Art. 46 (2) c) GDPR by the conclusion of EU standard data protection clauses adopted by the European Commission with the receiving body. Copies of the standard EU data protection clauses are available on the website of the European Commission here.
d) Hosting at external service providers
Our data processing work is carried out to a large extent with the involvement of hosting service providers who provide us with storage space and processing capacities at their data centres and who also process personal data on our behalf according to our instructions. It may be the case that personal data are transmitted to hosting service providers in respect of all of the functions listed below. These service providers process data either exclusively in the EU or subject to guaranteed levels of data protection which we have put in place based on the standard EU data protection clauses (cf. subsection c).
e) Transmission to government authorities
In principle, we do not transmit any data to government authorities. We only send personal information to government authorities (including law enforcement agencies) when required to fulfil a legal obligation to which we are subject (legal basis: Art. 6 (1) c) GDPR) or when it is necessary for the assertion, exercise or defence of legal claims (legal basis: Art. 6 (1) f) GDPR).
f) Period of storage
The time specified in the “period of storage” paragraph indicates how long we use the data for the relevant purposes in any given case. At the end of this period, the data will no longer be processed by us but will be erased at regular intervals, unless continued processing and storage are required by law (mainly because it is necessary to fulfil a legal obligation or for the establishment, exercise or defence of legal claims) or unless you grant us extended consent.
g) Data categories
The category names listed below are used for specific types of data in the following sections:
2. Accessing our services
The passages below set out how your personal data are processed when you access our services (e.g., loading and viewing the website, opening the mobile app and navigating within the app). We would point out that it is impossible not to send access data to external content providers (cf. subsection b) due to the technical processes involved in transmitting information over the Internet. The third-party providers are themselves responsible for the privacy-compliant operation of the IT systems which they use. The service providers are required to decide how long the data will be stored.
a) Purposes of data processing, legal basis, legitimate interests (where applicable), and period of storage
Data category:
Access data
Purpose:
Establishing connection; presenting contents of the service; detecting attacks on our site due to unusual activities; fault diagnosis
Legal basis:
Art. 6 (1) f) GDPR Our legitimate interest: Proper functioning of the services; security of data and business processes; prevention of misuse; prevention of damage through interference in information systems
Period of storage:
Four weeks
b) Recipients of the personal data
Recipient category:
External content providers who provide content which is needed to display the service (e.g. images, videos, embedded postings from social networks, banner ads, fonts, update information, shortened links) as well as IT Security Service Provider
Data concerned:
Access data
Legal Basis:
Art. 6 (1) f) GDPR Our legitimate interest: Proper functioning of the services; (accelerated) display of content; Prevention of attacks through exploitation of security gaps/vulnerabilities
Email address; Personal master data; Newsletter usage profile data
Purpose:
Verification of the registration process (“double opt-in”) including traceability of registrations and unsubscriptions (“logging”); sending and designing the newsletter according to interests; measurement of opening and click rates for the purpose of optimising our newsletter service.
Period of storage:
Personal data is deleted as soon as its further processing is no longer necessary for the respective purpose and legal retention periods do not prevent deletion. This is regularly the case upon receipt of your withdrawal. In the event of your withdrawal, however, we reserve the right to store your e-mail address for the purpose of proving that you have previously given your consent. This storage is solely for the purpose of defending possible legal claims.
3. Downloads (tracker data)
Our complete data set on the largest and longest measurement of online tracking can be downloaded from GitHub. For this purpose we link directly to our repository.
4. Contacting WhoTracks.Me
We invite our visitors to send us an email for any question or concern they might have. The tables below show how your personal data are processed when you contact our customer support.
a) Purposes of data processing, legal basis, legitimate interests (where applicable), and period of storage
Data category:
Personal master data; contact details; e-mail address; contents of enquiries/complaints
Purpose:
Processing of customer feedback, enquiries, and user complaints
Legal basis:
Art. 6 (1) b) and f) GDPR
Our legitimate interest: Improvement of our service; increase in customer loyalty
Period of storage:
We retain any personal data related to user-submitted email during the processing of the inquiry. We delete these tickets after 6 months of inactivity.
b) Recipients of the personal data
Recipient category:
IT service providers
Data concerned:
All data listed under (a) in this section
Legal Basis:
Art. 28 GDPR
5. Web Analytics
This website uses no analytics.
III. Rights of data subjects
1. Right to object
If we process your personal data for direct marketing purposes, you have the right to object at any time to the processing of your personal data for such marketing with future effect.
You also have the right, at any time with future effect and for reasons pertinent to your particular situation, to object to the processing of your personal data in accordance with Art. 6 (1) e) or f) GDPR; this also applies to any profiling based on these provisions. The right to object may be exercised free of charge. In order to be able to process your request faster, please reach us by emailing us at privacy@ghostery.com.
2. Right of access
You have the right to obtain confirmation from us as to whether or not personal data concerning you are being processed and, where that is the case, to access the personal data and the other information listed in Art. 15 GDPR.
3. Right to rectification
You have the right to obtain from us without undue delay the rectification of incorrect personal data concerning you (Art. 16 GDPR). Taking into account the purposes of the processing, you have the right to have incomplete personal data completed, including by means of providing a supplementary statement.
4. Right to erasure (“right to be forgotten”)
You have the right to obtain from us the erasure of personal data concerning you without undue delay if one of the grounds listed in Art. 17 (1) GDPR is applicable and the processing operations are not required for one of the purposes approved in Art. 17 (3) GDPR.
5. Right to restriction of processing
You are entitled to obtain from us the restriction of the processing of your personal data where one of the conditions laid down in Art. 18 (1) a) to d) GDPR is met.
6. Right to data portability
You have the right, in respect of the personal data which you have given us, to be provided with these data in a structured, commonly used and machine-readable format and the right to send these data to another controller without any hindrance on our part, insofar as the requirements set out in Art. 20 (1) GDPR are met. In exercising your right to data portability, you have the right to have the personal data transmitted directly by us to another controller where technically feasible.
7. Right to withdraw consent
If the processing is based on your consent, you have the right to revoke your consent at any time. This will not affect the legality of the processing operations on the basis of the consent until such time as the revocation takes effect.
8. Right to lodge a complaint
You have the right to lodge a complaint with the supervisory authority responsible for our company. The supervisory authority responsible for our company is as follows:
IV. California
For residents of California, please see our Privacy Policy Supplemental Notice – California.