Repository: ziplokk1/incapsula-cracker-py3 Branch: master Commit: be1738d0e649 Files: 26 Total size: 342.4 KB Directory structure: gitextract_b7gw_h1l/ ├── .gitignore ├── LICENSE.txt ├── README.md ├── circle.yml ├── docs/ │ ├── Makefile │ ├── conf.py │ ├── index.rst │ ├── make.bat │ └── source/ │ ├── incapsula.rst │ └── modules.rst ├── incapsula/ │ ├── __init__.py │ ├── errors.py │ ├── parsers.py │ └── session.py ├── requirements.txt ├── setup.cfg ├── setup.py ├── tests/ │ ├── __init__.py │ ├── helpers.py │ ├── test_IframeResourceParser.py │ └── whoscored/ │ ├── __init__.py │ ├── index.html │ ├── jsTest.html │ ├── test_whoscored.py │ └── whoscored-index_unblocked.html └── tools.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # Created by .ignore support plugin (hsz.mobi) ### Python template # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. 
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *,cover .hypothesis/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # pyenv .python-version # celery beat schedule file celerybeat-schedule # SageMath parsed files *.sage.py # dotenv .env # virtualenv .venv venv/ ENV/ # Spyder project settings .spyderproject # Rope project settings .ropeproject ### JetBrains template # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 # User-specific stuff: .idea/**/workspace.xml .idea/**/tasks.xml .idea/dictionaries # Sensitive or high-churn files: .idea/**/dataSources/ .idea/**/dataSources.ids .idea/**/dataSources.xml .idea/**/dataSources.local.xml .idea/**/sqlDataSources.xml .idea/**/dynamic.xml .idea/**/uiDesigner.xml # Gradle: .idea/**/gradle.xml .idea/**/libraries # Mongo Explorer plugin: .idea/**/mongoSettings.xml ## File-based project format: *.iws ## Plugin-specific files: # IntelliJ /out/ # mpeltonen/sbt-idea plugin .idea_modules/ # JIRA plugin atlassian-ide-plugin.xml # Crashlytics plugin (for Android Studio and IntelliJ) com_crashlytics_export_strings.xml crashlytics.properties crashlytics-build.properties fabric.properties *.idea ================================================ FILE: LICENSE.txt ================================================ This is free and unencumbered software released into the public domain. Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means. 
In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to

================================================
FILE: README.md
================================================
[![CircleCI](https://circleci.com/gh/ziplokk1/incapsula-cracker-py3.svg?style=shield)](https://circleci.com/gh/ziplokk1/incapsula-cracker-py3)

# Description
This module wraps any request to a webpage blocked by incapsula. Despite the name, this library should also work with Python 2.7.

Incapsula has begun serving a re-captcha after too many requests that may seem malicious. As of now, there is no way around it. To detect that case, the session raises a `RecaptchaBlocked` error (a subclass of `IncapBlocked`) when the page is blocked by re-captcha.

## Documentation can be found [here](https://ziplokk1.github.io/incapsula-cracker-py3/).

# Usage
```python
from incapsula import IncapSession

session = IncapSession()
response = session.get('http://example.com')  # url is not blocked by incapsula
```

```python
# Sometimes incapsula will block based on user agent.
from incapsula import IncapSession

session = IncapSession(user_agent='any-user-agent-string')
response = session.get('http://example.com')

# This can also be done after instantiation.
session.headers['User-Agent'] = 'some-other-user-agent-string'

# It can also be done on a per request basis, just like requests.
response = session.get('http://example.com', headers={'User-Agent': 'another-user-agent-string'})
```

```python
# Since IncapSession inherits from requests.Session, you can pass all the same arguments to it.
# See the requests documentation here (http://docs.python-requests.org/en/master/user/advanced/#session-objects)
from __future__ import print_function
from incapsula import IncapSession

session = IncapSession()
session.cookies.set('cookie-key', 'cookie-value')
response = session.get('http://example.com', headers={'Referer': 'http://other-example.com'})
print(session.cookies)
```

```python
# Handling re-captcha blocks.
from incapsula import IncapSession, RecaptchaBlocked

session = IncapSession()
try:
    response = session.get('http://example.com')
except RecaptchaBlocked as e:
    raise
```

```python
# Sending a request to a page which is not blocked by incapsula
from incapsula import IncapSession

session = IncapSession()
# When using the bypass_crack param, the IncapSession will not send out extra requests to bypass incapsula.
# This will speed up the requests significantly so if you're making a scraper which
# accesses multiple sites and some don't use incapsula, you can just bypass the crack.
response = session.get('http://example.com', bypass_crack=True)
```

# Setup
`pip install incapsula-cracker-py3`

# Notes
* As of now, this is only proven to work with the following sites:
    * whoscored.com
    * coursehero.com
    * offerup.com
    * dollargeneral.com
* I understand that there's minimal commenting and that's because I'm not sure exactly why incapsula is sending requests to certain pages other than to obtain cookies.
This is just a literal reverse engineering of incapsula's JavaScript code.
* Feel free to contribute. Unfortunately, web scraping is such a dynamic field that I can't always put out updates and make changes for specific sites. So I turn to the community to help with those issues. Thank you for your understanding.

For anyone who is using this library and it works for your site, please send me a note so I can add it to the list.

# How it works:
Let's start with how incapsula works first:

1. When you navigate to a webpage, incapsula runs some JavaScript code which tests your browser to see if it's using selenium, phantomJS, mechanize, etc.
2. A cookie is created which holds the results of this test.
3. A request is then sent out which "applies" the cookie and now sends back a few other cookies necessary to obtain access to the site.
4. Any subsequent request is now authorized to access the site until the cookie expires.
5. If there are too many requests being made to the site despite the cookie authentication, incapsula will serve back a re-captcha instead.

When detecting whether a resource is blocked, by default, we look for two elements: the robots `meta` tag and the incapsula `iframe`.
```html
```
Finding both of these elements is necessary because, from what I have seen, the resource is not blocked unless both tags are present.

Once we have determined that the resource is blocked, we send a GET request to the `src` of the `iframe` to determine its contents. If the contents contain a re-captcha, then there's nothing we can do and we raise a `RecaptchaBlocked` exception (a subclass of `IncapBlocked`). Otherwise we set the `___utvmc` cookie, send the GET request to apply the cookie, and send a new request to the original url.

# Customizing
### The iframe src isn't contained in what I have coded already, here is how to expand the list to search.
```html
```
```python
from incapsula import IncapSession, WebsiteResourceParser

class MyResourceParser(WebsiteResourceParser):
    # List of arguments to pass into BeautifulSoup().find() method.
    extra_find_iframe_args = [
        ('iframe', {'src': 'http://some-site-i-havent-added.com'})
    ]

incap_session = IncapSession(resource_parser=MyResourceParser)
# more code here
```

### The resource is blocked by incapsula but there's no robots `meta` tag, so this library isn't detecting that it's blocked.
```html
```
```python
from incapsula import IncapSession, WebsiteResourceParser

class NoRobotsMetaResourceParser(WebsiteResourceParser):

    def is_blocked(self):
        return bool(self.incapsula_iframe)

incap_session = IncapSession(resource_parser=NoRobotsMetaResourceParser)
# More code here
```

### The iframe contents have a captcha, but this library isn't detecting that.
```html
```
```python
from incapsula import IncapSession, IframeResourceParser

class MyIframeResourceParser(IframeResourceParser):
    # List of arguments to pass into BeautifulSoup().find() method.
    extra_find_recaptcha_args = [
        ('div', {'class': 'some-recaptcha-class', 'id': 'recaptcha-div'})
    ]

incap_session = IncapSession(iframe_parser=MyIframeResourceParser)
```

### Since I've tried to keep this pretty site agnostic, it's not always going to work with some sites. I've tried to keep it as extensible as possible so that it's easy to tailor it to a specific site.

## As Always:
Scrape responsibly, obey timeouts, and obey the robots.txt. ;)

Feel free to contact me at sdscdeveloper@gmail.com

================================================
FILE: circle.yml
================================================
defaults: &defaults
  working_directory: ~/incapsula-cracker-py3
  docker:
    - image: ubuntu:14.04
  steps:
    - run:
        name: "Update package manager..."
        command: "apt-get -y update"
    - run:
        name: "Install git..."
        command: "apt-get install -y git"
    - checkout

python_defaults: &python_defaults
  working_directory: ~/incapsula-cracker-py3
  steps:
    - run:
        name: "Update package manager"
        command: "apt-get -y update"
    - run:
        name: "Install git"
        command: "apt-get install -y git"
    - checkout
    - run:
        name: "Install dependencies"
        command: "python -m pip install -r requirements.txt"
    - run:
        name: "Install nose"
        command: "python -m pip install nose"
    - run:
        name: "Python version"
        command: "printf Using- && python --version"
    - run:
        name: "Nosetests"
        command: "nosetests -v"

version: 2
jobs:
  build:
    <<: *defaults
  build-p27:
    <<: *python_defaults
    docker:
      - image: python:2.7
  build-p34:
    <<: *python_defaults
    docker:
      - image: python:3.4

workflows:
  version: 2
  build_and_test:
    jobs:
      - build-p27:
          filters:
            branches:
              ignore: gh-pages
      - build-p34:
          filters:
            branches:
              ignore: gh-pages

================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx
documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = python -msphinx SPHINXPROJ = IncapsulaCracker SOURCEDIR = . BUILDDIR = ../../incapsula-cracker-py3-docs # Put it first so that "make" without argument is like "make help". help: @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) .PHONY: help Makefile # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). %: Makefile @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) ================================================ FILE: docs/conf.py ================================================ #!/usr/bin/env python3 # -*- coding: utf-8 -*- # # Incapsula Cracker documentation build configuration file, created by # sphinx-quickstart on Sat Jun 17 21:53:21 2017. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. # import os import sys sys.path.insert(0, os.path.abspath('.')) sys.path.insert(0, os.path.abspath('../')) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. # # needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = ['sphinx.ext.autodoc'] # Add any paths that contain templates here, relative to this directory. 
templates_path = ['_templates'] # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # # source_suffix = ['.rst', '.md'] source_suffix = '.rst' # The master toctree document. master_doc = 'index' # General information about the project. project = 'Incapsula Cracker' copyright = '2017, Mark Sanders' author = 'Mark Sanders' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '0.1.8.1' # The full version, including alpha/beta/rc tags. release = '0.1.8.1' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = None # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This patterns also effect to html_static_path and html_extra_path exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # html_theme = 'alabaster' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. # # html_theme_options = {} # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". 
html_static_path = ['_static'] # -- Options for HTMLHelp output ------------------------------------------ # Output file base name for HTML help builder. htmlhelp_basename = 'IncapsulaCrackerdoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). # # 'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). # # 'pointsize': '10pt', # Additional stuff for the LaTeX preamble. # # 'preamble': '', # Latex figure (float) alignment # # 'figure_align': 'htbp', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ (master_doc, 'IncapsulaCracker.tex', 'Incapsula Cracker Documentation', 'Mark Sanders', 'manual'), ] # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ (master_doc, 'incapsulacracker', 'Incapsula Cracker Documentation', [author], 1) ] # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ (master_doc, 'IncapsulaCracker', 'Incapsula Cracker Documentation', author, 'IncapsulaCracker', 'One line description of project.', 'Miscellaneous'), ] ================================================ FILE: docs/index.rst ================================================ .. Incapsula Cracker documentation master file, created by sphinx-quickstart on Sat Jun 17 21:53:21 2017. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to Incapsula Cracker's documentation! ============================================= .. 
toctree:: :maxdepth: 2 :caption: Contents: Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` ================================================ FILE: docs/make.bat ================================================ @ECHO OFF pushd %~dp0 REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=python -msphinx ) set SOURCEDIR=. set BUILDDIR=_build set SPHINXPROJ=IncapsulaCracker if "%1" == "" goto help %SPHINXBUILD% >NUL 2>NUL if errorlevel 9009 ( echo. echo.The Sphinx module was not found. Make sure you have Sphinx installed, echo.then set the SPHINXBUILD environment variable to point to the full echo.path of the 'sphinx-build' executable. Alternatively you may add the echo.Sphinx directory to PATH. echo. echo.If you don't have Sphinx installed, grab it from echo.http://sphinx-doc.org/ exit /b 1 ) %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% goto end :help %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% :end popd ================================================ FILE: docs/source/incapsula.rst ================================================ incapsula package ================= Submodules ---------- incapsula\.errors module ------------------------ .. automodule:: incapsula.errors :members: :undoc-members: :show-inheritance: incapsula\.parsers module ------------------------- .. automodule:: incapsula.parsers :members: :undoc-members: :show-inheritance: incapsula\.session module ------------------------- .. automodule:: incapsula.session :members: :undoc-members: :show-inheritance: Module contents --------------- .. automodule:: incapsula :members: :undoc-members: :show-inheritance: ================================================ FILE: docs/source/modules.rst ================================================ incapsula ========= .. 
toctree::
   :maxdepth: 4

   incapsula

================================================
FILE: incapsula/__init__.py
================================================
from .errors import IncapBlocked, MaxRetriesExceeded, RecaptchaBlocked
from .parsers import ResourceParser, WebsiteResourceParser, IframeResourceParser
from .session import IncapSession

================================================
FILE: incapsula/errors.py
================================================
class IncapBlocked(ValueError):
    """
    Base exception for exceptions in this module.

    :param response: The response which was being processed when this error was raised.
    :type response: requests.Response
    :param *args: Additional arguments to pass to :class:`ValueError`.
    """
    def __init__(self, response, *args):
        self.response = response
        super(IncapBlocked, self).__init__(*args)


class MaxRetriesExceeded(IncapBlocked):
    """
    Raised when the number of attempts to bypass incapsula exceeds the amount specified.

    :param response: The response which was being processed when this error was raised.
    :type response: requests.Response
    :param *args: Additional arguments to pass to :class:`ValueError`.
    """
    pass


class RecaptchaBlocked(IncapBlocked):
    """
    Raised when re-captcha is encountered.

    :param response: The response which contains the re-captcha.
    :type response: requests.Response
    :param *args: Additional arguments to pass to :class:`ValueError`.
    """
    pass

================================================
FILE: incapsula/parsers.py
================================================
import re

from six.moves.urllib.parse import urlsplit
from bs4 import BeautifulSoup


class ResourceParser(object):
    """
    Superclass for all other parser objects.

    :param response: Response from GET request.
    :type response: requests.Response
    """

    def __init__(self, response):
        """
        :param response: Response from GET request.
        :type response: requests.Response
        """
        self.response = response
        split = urlsplit(response.url)
        self.scheme = split.scheme
        self.host = split.netloc
        self.soup = BeautifulSoup(self.response.content, 'html.parser')

    def is_blocked(self):
        """
        Override this method to determine whether or not the resource is blocked.

        .. note::

            If this class is passed into :class:`IncapSession` as the ``resource_parser`` parameter then this method will be used to determine whether to attempt to bypass incapsula or raise a :class:`MaxRetriesExceeded` error on too many retries.

        .. note::

            If this class is passed into :class:`IncapSession` as the ``iframe_parser`` parameter then this method will be used to determine whether to raise a :class:`RecaptchaBlocked` error when a re-captcha is encountered.

        :return: True if resource is blocked otherwise False
        """
        raise NotImplementedError('`is_blocked()` is not implemented')


class IframeResourceParser(ResourceParser):
    """
    Parser object to obtain the contents of the incapsula iframe.

    :param response: The response of the request sent to the incapsula iframe url.
    :type response: requests.Response
    """

    # Standard args to use with soup.find() when searching for the element which contains a recaptcha.
    default_find_recaptcha_args = [
        ('form', {'id': 'captcha-form'}),
        ('div', {'class': 'g-recaptcha'})
    ]

    # Extra find recaptcha args to use when subclassing.
    # Note: when searching for the recaptcha, it will search for the recaptcha using the values in this list first then
    # it will search using the default_find_recaptcha_args list.
    extra_find_recaptcha_args = []

    def __init__(self, response):
        """
        :param response: The response of the request sent to the incapsula iframe url.
        :type response: requests.Response
        """
        super(IframeResourceParser, self).__init__(response)

    @property
    def recaptcha_element(self):
        """
        Recaptcha element in the document.

        :rtype: bs4.element.Tag
        """
        # Iterate over user defined list first.
        for element in self.extra_find_recaptcha_args:
            elem = self.soup.find(*element)
            if elem:
                return elem
        # Then iterate over defaults.
        for element in self.default_find_recaptcha_args:
            elem = self.soup.find(*element)
            if elem:
                return elem

    def is_blocked(self):
        """
        Determine whether the iframe contents are a google recaptcha.

        This is determined by simply iterating over the combined results of default_find_recaptcha_args and extra_find_recaptcha_args then seeing if the element is found in the document.

        :return: True if the iframe contains a google recaptcha.
        :rtype: bool
        """
        return bool(self.recaptcha_element)


class WebsiteResourceParser(ResourceParser):
    """
    Parser object to extract the robots meta element, incapsula iframe element, and the incapsula iframe url.

    :param response: The response of the request sent to the targeted host.
    :type response: requests.Response
    """

    # Standard args to use with soup.find() when searching for the iframe.
    # Raw strings keep the escaped dots from being treated as invalid string escapes.
    default_find_iframe_args = [
        ('iframe', {'src': re.compile(r'^/_Incapsula_Resource.*')}),
        ('iframe', {'src': re.compile(r'^//content\.incapsula\.com.*')})
    ]

    # Extra find iframe args to use when subclassing.
    # Note: when searching for the iframe, it will search for the iframe using the values in this list first then
    # it will search using the default_find_iframe_args list.
    extra_find_iframe_args = []

    def __init__(self, response):
        """
        :param response: The response of the request sent to the targeted host.
        :type response: requests.Response
        """
        super(WebsiteResourceParser, self).__init__(response)

    @property
    def incapsula_script_url(self):
        """
        The script url to get the b var value

        :rtype: str
        """
        scripts = self.soup.find_all('script')
        if len(scripts) > 1:
            return scripts[0].get('src')
        return None

    @property
    def robots_meta(self):
        """
        The meta robots tag which is so commonly found in incapsula blocked resources.
        :rtype: bs4.element.Tag
        """
        return self.soup.find('meta', {'name': re.compile('^robots$', re.IGNORECASE)})

    @property
    def incapsula_iframe(self):
        """
        The iframe which contains the javascript code that runs on browser load.

        :rtype: bs4.element.Tag
        """
        # Iterate over user defined args first.
        for element in self.extra_find_iframe_args:
            iframe = self.soup.find(*element)
            if iframe:
                return iframe
        # Then iterate over defaults.
        for element in self.default_find_iframe_args:
            iframe = self.soup.find(*element)
            if iframe:
                return iframe

    @property
    def incapsula_iframe_url(self):
        """
        The src attribute value of the incapsula iframe.

        :rtype: str
        """
        if self.incapsula_iframe:
            uri = self.incapsula_iframe.get('src')
            # Case when uri isn't actually a uri, but an external resource.
            if uri.startswith('//'):
                return self.scheme + ':' + uri
            return self.scheme + '://' + self.host + uri

    def is_blocked(self):
        """
        Determine whether the resource is blocked by incapsula or not.

        If the resource has the robots meta tag and the incapsula iframe then we can assume the resource is blocked.

        :return: True if the robots meta tag and the incapsula iframe are both found in the document.
        :rtype: bool
        """
        return bool(self.robots_meta) and bool(self.incapsula_iframe)

================================================
FILE: incapsula/session.py
================================================
from __future__ import absolute_import

import re
import time
import logging
import datetime
import random

from six.moves.urllib.parse import quote, urlsplit
from requests import Session

from .parsers import WebsiteResourceParser, IframeResourceParser
from .errors import RecaptchaBlocked, MaxRetriesExceeded

logger = logging.getLogger('incapsula')

# A list of valid values which are tested in the incapsula test method.
# These values are pulled straight from my browser and should be enough to spoof
# the robot check when setting the cookie.
o = [
    ('navigator', 'true'),
    ('navigator.vendor', 'Google Inc.'),
    ('navigator.appName', 'Netscape'),
    ('navigator.plugins.length==0', 'false'),
    ('navigator.platform', 'Linux x86_64'),
    ('navigator.webdriver', 'undefined'),
    ('plugin_ext', 'no extention'),
    ('plugin_ext', 'so'),
    ('ActiveXObject', 'false'),
    ('webkitURL', 'true'),
    ('_phantom', 'false'),
    ('callPhantom', 'false'),
    ('chrome', 'true'),
    ('yandex', 'false'),
    ('opera', 'false'),
    ('opr', 'false'),
    ('safari', 'false'),
    ('awesomium', 'false'),
    ('puffinDevice', 'false'),
    ('__nightmare', 'false'),
    ('_Selenium_IDE_Recorder', 'false'),
    ('document.__webdriver_script_fn', 'false'),
    ('document.$cdc_asdjflasutopfhvcZLmcfl_', 'false'),
    ('process.version', 'false'),
    ('navigator.cpuClass', 'false'),
    ('navigator.oscpu', 'false'),
    ('navigator.connection', 'false'),
    ('window.outerWidth==0', 'false'),
    ('window.outerHeight==0', 'false'),
    ('window.WebGLRenderingContext', 'true'),
    ('document.documentMode', 'undefined'),
    ('eval.toString().length', '33')
]


def test():
    """
    Quote each value in the tuple list and return a comma delimited string of the parameters.

    This method is a shortened version of incapsula's test method. What the original method does is check for specific plugins in your browser and set a cookie based on which extensions you have installed. The list of the values is taken from my own browser after running the test method so they are all valid.

    This is just more of a shortcut method instead of trying to reverse engineer the entire code that they had.

    :return: Comma delimited string of the quoted ``key=value`` pairs.
    """
    # safe param set to () for the single parameter with the key of "eval.toString().length".
    # This is needed to match the cookie value exactly with what is expected from incapsula.
    r = [quote('='.join(x), safe='()') for x in o]
    return ','.join(r)


def simple_digest(s):
    """
    Create a sum of the ordinal values of the characters passed in from s.

    .. code-block:: javascript

        // The original javascript code.
function simpleDigest(mystr) { var res = 0; for (var i = 0; i < mystr.length; i++) { res += mystr.charCodeAt(i); } return res; } :param s: The string to calculate the digest from. :return: Sum of ordinal values converted to a string. """ res = 0 for ch in s: res += ord(ch) return str(res) class IncapSession(Session): """ Session object to bypass sites which are guarded by incapsula. :param max_retries: The number of times to attempt to get the incapsula resource before raising a :class:`MaxRetriesExceeded` error. Set this to `None` to never give up. :param user_agent: Change the default user agent when sending requests. :param cookie_domain: Use this param to change the domain which is set in the cookie. Sometimes the domain set for the cookie isn't the same as the actual host. i.e. .domain.com instead of www.domain.com. :param resource_parser: :class:`ResourceParser` to use when checking whether the website served back a page which is blocked by incapsula. Default: :class:`WebsiteResourceParser`. :param iframe_parser: :class:`ResourceParser` class (not instance) to use when checking whether the iframe contains a captcha. Default: :class:`IframeResourceParser`. """ def __init__(self, max_retries=3, user_agent=None, cookie_domain='', resource_parser=WebsiteResourceParser, iframe_parser=IframeResourceParser): super(IncapSession, self).__init__() default_useragent = 'IncapUnblockSession (https://github.com/ziplokk1/incapsula-cracker-py3)' user_agent = user_agent or default_useragent self.max_retries = max_retries self.cookie_domain = cookie_domain self.headers['User-Agent'] = user_agent self.ResourceParser = resource_parser self.IframeParser = iframe_parser def _get_session_cookies(self): """ Get a list of cookies needed for making the simple digest when setting the ___utvmc cookie. .. 
note:: Translated from: function getSessionCookies() { var cookieArray = new Array(); var cName = /^\s?incap_ses_/; var c = document.cookie.split(";"); for (var i = 0; i < c.length; i++) { var key = c[i].substr(0, c[i].indexOf("=")); var value = c[i].substr(c[i].indexOf("=") + 1, c[i].length); if (cName.test(key)) { cookieArray[cookieArray.length] = value; } } return cookieArray; } :return: List of cookies where the cookie name starts with "incap_ses_". """ return [cookie.value for cookie in self.cookies if cookie.name.startswith('incap_ses_')] def _create_cookie(self, name, value, seconds, domain=''): """ Set the incapsula cookie needed to make verification request. :param name: Cookie name. :param value: Cookie value. :param seconds: Cookie expiry seconds from the current time. :param domain: Cookie domain. :return: """ expires = None if seconds: d = datetime.datetime.now() d += datetime.timedelta(seconds=seconds) expires = round((d - datetime.datetime.utcfromtimestamp(0)).total_seconds() * 1000) self.cookies.set(name, value, domain=domain, path='/', expires=expires) def _set_incap_cookie(self, v_array, domain='', sl=None): """ Calculate the final value for the cookie needed to bypass incapsula. .. note:: Translated from: function setIncapCookie(vArray) { var res; try { var cookies = getSessionCookies(); var digests = new Array(cookies.length); for (var i = 0; i < cookies.length; i++) { digests[i] = simpleDigest((vArray) + cookies[i]); } var sl = "jcMQV+ffvh2BmAcW8nq2a1HZRZcsB5poBUV2Ew=="; var dd = digests.join(); var asl = ''; for (var i=0;i= self.max_retries: raise MaxRetriesExceeded(resp, 'max retries exceeded when attempting to crack incapsula') resource = self.ResourceParser(resp) if resource.is_blocked(): logger.debug('Resource is blocked. attempt={} url={}'.format(tries, resp.url)) # Raise if the response content's iframe contains a recaptcha. self._raise_for_recaptcha(resource) # Apply cookies and send GET request to apply them. 
            self._apply_cookies(org.url, resource.incapsula_script_url)

            # Call crack() again recursively: if the request is no longer blocked after the cookies
            # were set and re-sent, the call simply returns the unblocked resource.
            return self.crack(self.get(org.url, bypass_crack=True), org=org, tries=tries + 1)

        return resp

    def get(self, url, bypass_crack=False, **kwargs):
        """
        Override :class:`Session`.:func:`get`

        :param url: URL for the new :class:`Request` object.
        :param bypass_crack: Use when sending a request that you don't want to go through the incapsula crack.
        :param kwargs: Optional arguments that ``request`` takes.

        ``bypass_crack`` is used inside this class so that a GET request sent from this instance doesn't
        create an infinite loop in which .get() calls .crack(), which calls .get() again, and so on.
        Requests made to fetch incapsula resources don't need to be cracked either.

        :rtype: requests.Response
        """
        kwargs.setdefault('allow_redirects', True)
        # If the request is to get the incapsula resources, then we don't call crack().
        if bypass_crack:
            return self.request('GET', url, **kwargs)
        return self.crack(self.request('GET', url, **kwargs))

================================================
FILE: requirements.txt
================================================
beautifulsoup4==4.6.0
requests==2.14.2
six==1.10.0

================================================
FILE: setup.cfg
================================================
[metadata]
description-file = README.md

================================================
FILE: setup.py
================================================
from __future__ import unicode_literals
from setuptools import setup

version = '0.1.8.1'

REQUIREMENTS = [
    'requests',
    'beautifulsoup4',
    'six'
]

setup(
    name='incapsula-cracker-py3',
    version=version,
    packages=['incapsula'],
    url='https://github.com/ziplokk1/incapsula-cracker-py3',
    license='Unlicense',
    author='Mark Sanders',
    author_email='sdscdeveloper@gmail.com',
    install_requires=REQUIREMENTS,
    description='A way to bypass incapsula robot checks when using requests.',
    include_package_data=True
)

================================================
FILE: tests/__init__.py
================================================

================================================
FILE: tests/helpers.py
================================================
from requests import Response


def make_response(url, content, status_code=200):
    response = Response()
    response.url = url
    response._content = content
    response.status_code = status_code
    return response

================================================
FILE: tests/test_IframeResourceParser.py
================================================
from __future__ import absolute_import

import unittest

from tests.helpers import make_response
from incapsula import IframeResourceParser


class TestIframeResourceParserReCaptcha(unittest.TestCase):

    body = """

dollargeneral.com -

Additional security check is required

Powered by Incapsula
What is this page?

The web site you are visiting is protected and accelerated by Incapsula. Your computer might have been infected by some kind of malware and flagged by Incapsula network. This page is presented by Incapsula to verify that a human is behind the traffic to this site and not malicious software.

What should i do?

Simply enter the two words in the image above to pass the security check, once you do that we will remember your answer and will not show this page again. You should run a virus and malware scan on your computer to remove any infection.

"""

    def setUp(self):
        self.parser = IframeResourceParser(make_response('http://dollargeneral.com/iframe-fake-url', self.body))

    def test_is_blocked(self):
        self.assertTrue(self.parser.is_blocked())

================================================
FILE: tests/whoscored/__init__.py
================================================

================================================
FILE: tests/whoscored/index.html
================================================

================================================
FILE: tests/whoscored/jsTest.html
================================================
Hello, I am a java script test analytics page

================================================
FILE: tests/whoscored/test_whoscored.py
================================================
from __future__ import absolute_import

import os
import unittest

from bs4 import BeautifulSoup

from incapsula import WebsiteResourceParser, IframeResourceParser
from tests.helpers import make_response

test_root = os.path.dirname(os.path.abspath(__file__))
blocked_index = os.path.join(test_root, 'index.html')
unblocked_index = os.path.join(test_root, 'whoscored-index_unblocked.html')
iframe = os.path.join(test_root, 'jsTest.html')


class TestWhoScoredIndexBlocked(unittest.TestCase):

    def setUp(self):
        with open(blocked_index, 'rb') as f:
            content = f.read()
        self.parser = WebsiteResourceParser(make_response('http://whoscored.com', content))

    def test_robots_meta(self):
        robots_meta = BeautifulSoup('', 'html.parser').find('meta')
        self.assertEqual(self.parser.robots_meta, robots_meta)

    def test_incapsula_iframe(self):
        incapsula_iframe = BeautifulSoup('', 'html.parser').find('iframe')
        self.assertEqual(self.parser.incapsula_iframe, incapsula_iframe)

    def test_incapsula_iframe_url(self):
        url = 'http://content.incapsula.com/jsTest.html'
        self.assertEqual(self.parser.incapsula_iframe_url, url)

    def test_is_blocked(self):
        self.assertTrue(self.parser.is_blocked())


class TestWhoScoredIndexUnblocked(unittest.TestCase):

    def setUp(self):
        with open(unblocked_index, 'rb') as f:
            content = f.read()
        self.parser = WebsiteResourceParser(make_response('http://whoscored.com', content))

    # We don't care about whether the iframe or robots tag exist in this case as long as is_blocked is false.
    def test_is_blocked(self):
        self.assertFalse(self.parser.is_blocked())


class TestWhoScoredIframeContentsNoRecaptcha(unittest.TestCase):

    def setUp(self):
        with open(iframe, 'rb') as f:
            content = f.read()
        self.parser = IframeResourceParser(make_response('http://content.incapsula.com/jsTest.html', content))

    def test_is_blocked(self):
        self.assertFalse(self.parser.is_blocked())

================================================
FILE: tests/whoscored/whoscored-index_unblocked.html
================================================
Football Statistics | Football Live Scores | WhoScored.com
Language:

Live Scores Summary

News

Top Player Statistics

View:
Overall
Home
Away

Top Team Statistics

View:
Overall
Home
Away
* Statistics from top 5 leagues only
Possession
Bayern Munich64.5%
Barcelona62.1%
Paris Saint Germain61.5%
Manchester City60.9%
Napoli59%
Aggression
Alaves1165
Malaga1087
Granada1096
Sporting Gijon1144
Valencia1086
Aerial Duels Won
Roma59.6%
Real Madrid57.8%
Paris Saint Germain57.6%
Angers57%
Montpellier56.9%
Dribbles per Game
Barcelona14.8
Chelsea13.3
Manchester City13.2
Lyon12.7
Paris Saint Germain12.6
Tackles per Game
Valencia22.8
Atletico Madrid22.3
Nantes22.1
Hamburger SV21.8
Espanyol21.6
Interceptions per Game
Eintracht Frankfurt24.4
Augsburg23.3
Schalke 0423
Mainz 0522.9
RasenBallsport Leipzig22.3
Shots per Game
Bayern Munich18.3
Roma17.8
Napoli17.7
Tottenham17.6
Real Madrid17.4
Pass Accuracy
Paris Saint Germain88.9%
Nice87.8%
Bayern Munich87.3%
Napoli87.2%
Barcelona86.7%
Ratings
Bayern Munich7.23
Barcelona7.19
Real Madrid7.14
Roma7.14
Monaco7.14

Statistical Best XI

WhoScored Elsewhere

================================================ FILE: tools.py ================================================ def chunks(l, n): """Yield successive n-sized chunks from l.""" for i in range(0, len(l), n): yield l[i:i + n] def decrypt_obfuscated_js(obfuscated_javascript): """ Parse out the obfuscated javascript code which is normally contained in the incapsula iframe. :param obfuscated_javascript: :return: """ return ''.join([chr(int(x, 16)) for x in chunks(obfuscated_javascript, 2)]) if __name__ == '__main__': print(decrypt_obfuscated_js('7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636
F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D353830353233353534373331373739393237392C31333939363439393632303537303139383231342C313536343932363633303635393535393036332C343133383138222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B'))
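
The hex decoding performed by `decrypt_obfuscated_js` can be sanity-checked with a round trip. The following is a minimal, self-contained sketch rather than part of the repository: `decode_hex_js` mirrors the logic of `decrypt_obfuscated_js`, while `encode_hex_js` is a hypothetical inverse helper added purely for illustration and assumes ASCII-only input.

```python
def chunks(seq, n):
    """Yield successive n-sized chunks from seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def decode_hex_js(hex_text):
    """Turn a string of two-digit hex character codes back into text."""
    return ''.join(chr(int(pair, 16)) for pair in chunks(hex_text, 2))


def encode_hex_js(text):
    """Hypothetical inverse: write each character as two uppercase hex digits.

    Assumes every character has a code point below 256 (e.g. ASCII).
    """
    return ''.join('{:02X}'.format(ord(ch)) for ch in text)


if __name__ == '__main__':
    # The first eight hex digits of the payload above decode to the start of the script.
    print(decode_hex_js('7472797B'))  # try{

    # Encoding and then decoding should give back the original text.
    sample = 'var t=new Date().getTime();'
    assert decode_hex_js(encode_hex_js(sample)) == sample
```

Decoding only a short prefix like this is a quick way to confirm that a captured iframe payload really is plain hex-encoded JavaScript before decoding the whole string.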