Repository: LinkedInLearning/web-scraping-with-python-2848331
Branch: master
Commit: 841cdd162f40
Files: 202
Total size: 32.9 MB

Directory structure:
gitextract_jwpdz1w4/

├── .github/
│   ├── CODEOWNERS
│   ├── ISSUE_TEMPLATE.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       └── main.yml
├── .gitignore
├── 01_03/
│   └── ietf_scraper/
│       ├── ietf_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── ietf.py
│       └── scrapy.cfg
├── 01_04_b/
│   └── ietf_scraper/
│       ├── ietf_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── ietf.py
│       └── scrapy.cfg
├── 01_04_e/
│   └── ietf_scraper/
│       ├── ietf_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── ietf.py
│       └── scrapy.cfg
├── 02_01/
│   └── article_scraper/
│       ├── article_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_02_b/
│   └── article_crawler/
│       ├── article_crawler/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_02_e/
│   └── article_crawler/
│       ├── article_crawler/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── articles.csv
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_03_b/
│   └── article_crawler/
│       ├── article_crawler/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── articles.csv
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_03_e/
│   └── article_crawler/
│       ├── article_crawler/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── articles.csv
│       │       ├── articles.json
│       │       ├── articles.xml
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_04_b/
│   └── article_crawler/
│       ├── article_crawler/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── articles.csv
│       │       ├── articles.json
│       │       ├── articles.xml
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_04_e/
│   └── article_crawler/
│       ├── article_crawler/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── articles.csv
│       │       ├── articles.json
│       │       ├── articles.xml
│       │       └── wikipedia.py
│       └── scrapy.cfg
├── 02_05/
│   └── news_scraper/
│       ├── news_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── associated_press.py
│       │       ├── cnn.py
│       │       ├── news_articles.json
│       │       └── yahoo.py
│       └── scrapy.cfg
├── 03_01_b/
│   └── form/
│       ├── form/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── get_form.py
│       └── scrapy.cfg
├── 03_01_e/
│   └── form/
│       ├── form/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── get_form.py
│       │       └── post_form.py
│       └── scrapy.cfg
├── 03_03_b/
│   └── news_scraper/
│       ├── news_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── associated_press.py
│       │       ├── cnn.py
│       │       ├── news_articles.json
│       │       └── yahoo.py
│       └── scrapy.cfg
├── 03_03_e/
│   └── news_scraper/
│       ├── news_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── associated_press.py
│       │       ├── cnn.py
│       │       ├── news_articles.json
│       │       └── yahoo.py
│       └── scrapy.cfg
├── 03_04/
│   └── news_scraper/
│       ├── news_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── cnn.py
│       │       └── counts.csv
│       └── scrapy.cfg
├── 03_05/
│   └── news_scraper/
│       ├── news_scraper/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       ├── cnn.py
│       │       └── counts.csv
│       └── scrapy.cfg
├── 04_01_b/
│   └── profiles/
│       ├── profiles/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── pythonscraping.py
│       └── scrapy.cfg
├── 04_01_e/
│   └── profiles/
│       ├── profiles/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── pythonscraping.py
│       └── scrapy.cfg
├── 04_02_b/
│   ├── chromedriver
│   └── locations/
│       ├── locations/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── dunkin.py
│       └── scrapy.cfg
├── 04_02_e/
│   ├── chromedriver
│   └── locations/
│       ├── locations/
│       │   ├── __init__.py
│       │   ├── items.py
│       │   ├── middlewares.py
│       │   ├── pipelines.py
│       │   ├── settings.py
│       │   └── spiders/
│       │       ├── __init__.py
│       │       └── dunkin.py
│       └── scrapy.cfg
├── CONTRIBUTING.md
├── LICENSE
├── NOTICE
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/CODEOWNERS
================================================
# Codeowners for these exercise files:
# * (asterisk) denotes "all files and folders"
# Example: * @producer @instructor


================================================
FILE: .github/ISSUE_TEMPLATE.md
================================================
<!--
BEFORE POSTING YOUR ISSUE:
- These comments won't show up when you submit the issue.
- Please use the sections below to provide information about the issue.
- Be specific: Add as much detail as possible.
-->

## Issue Overview
<!-- A brief overview of the issue -->

## Describe your environment
<!-- Provide details about your environment: what editor, browser, and other software you are using and any other specifics to your setup -->

## Steps to Reproduce
<!-- Provide an unambiguous set of steps to reproduce this bug. Include code to reproduce, if relevant. Include a live link if available. -->
1.
2.
3.
4.

## Expected Behavior
<!-- What behavior did you expect? -->

## Current Behavior
<!-- What happened instead of the expected behavior? Describe the difference. -->

## Possible Solution
<!-- Optional: Do you have a fix or a suggestion on how to fix the issue? -->

## Screenshots / Video
<!-- Optional: Add any screenshots or video of the issue if available. -->

## Related Issues
<!-- List related issues -->


================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
<!-- This repository *does not* accept pull requests (PRs). All pull requests will be closed. See CONTRIBUTING.md for further details. -->


================================================
FILE: .github/workflows/main.yml
================================================
name: Copy To Branches
on:
  workflow_dispatch:
jobs:
  copy-to-branches:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: Copy To Branches Action
        uses: planetoftheweb/copy-to-branches@v1


================================================
FILE: .gitignore
================================================
.DS_Store
node_modules
.tmp
npm-debug.log


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/__init__.py
================================================


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IetfScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class IetfScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this method to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class IetfScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this method to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
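
# The generated process_spider_output() above passes every result through
# unchanged. A minimal sketch of a variant that filters results, assuming the
# spider yields plain dicts with a "title" key. The class name
# DropEmptyTitleMiddleware is hypothetical and not part of the exercise files;
# it would still need an entry in SPIDER_MIDDLEWARES to take effect.

```python
class DropEmptyTitleMiddleware:
    """Hypothetical spider middleware that filters the spider's output."""

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Drop dict items whose title is missing or empty; pass
            # everything else (including non-dict results) through as-is.
            if isinstance(entry, dict) and not entry.get("title"):
                continue
            yield entry


if __name__ == "__main__":
    # Called directly here for illustration; Scrapy normally makes this call.
    middleware = DropEmptyTitleMiddleware()
    results = [{"title": "Kept"}, {"title": None}]
    print(list(middleware.process_spider_output(None, results, None)))
```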


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class IetfScraperPipeline:
    def process_item(self, item, spider):
        return item
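
# The generated pipeline above returns each item untouched. A minimal sketch
# of a process_item() that does some work, assuming items are plain dicts with
# a "title" key like the one the ietf spider returns. TitleCleanupPipeline is
# a hypothetical name, not part of the exercise files, and it would need to be
# listed in ITEM_PIPELINES before Scrapy would call it.

```python
class TitleCleanupPipeline:
    """Hypothetical pipeline that normalizes whitespace in the title."""

    def process_item(self, item, spider):
        title = item.get("title")
        if isinstance(title, str):
            # Collapse runs of whitespace and strip both ends.
            item["title"] = " ".join(title.split())
        return item


if __name__ == "__main__":
    # Called directly here for illustration; Scrapy normally makes this call.
    pipeline = TitleCleanupPipeline()
    print(pipeline.process_item({"title": "  Some \n Title "}, spider=None))
```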


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for ietf_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ietf_scraper'

SPIDER_MODULES = ['ietf_scraper.spiders']
NEWSPIDER_MODULE = 'ietf_scraper.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ietf_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ietf_scraper.middlewares.IetfScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ietf_scraper.middlewares.IetfScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'ietf_scraper.pipelines.IetfScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
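
# The integers in ITEM_PIPELINES (and in the middleware settings) set the
# order: items pass through pipelines from lower values to higher ones. A
# small sketch of that ordering using a made-up mapping; "CleanupPipeline" is
# hypothetical and does not exist in this project.

```python
# Hypothetical mapping of pipeline paths to priorities.
example_pipelines = {
    "ietf_scraper.pipelines.IetfScraperPipeline": 300,
    "ietf_scraper.pipelines.CleanupPipeline": 100,
}


def run_order(pipelines):
    """Return pipeline paths sorted from lowest to highest priority value."""
    return sorted(pipelines, key=pipelines.get)


if __name__ == "__main__":
    for path in run_order(example_pipelines):
        print(example_pipelines[path], path)
```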


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 01_03/ietf_scraper/ietf_scraper/spiders/ietf.py
================================================
# -*- coding: utf-8 -*-
import scrapy


class IetfSpider(scrapy.Spider):
    name = 'ietf'
    allowed_domains = ['pythonscraping.com']
    start_urls = ['http://pythonscraping.com/linkedin/ietf.html']

    def parse(self, response):
        return {'title': response.xpath('//span[@class="title"]/text()').get()}
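
# To see what the XPath in parse() selects without running a crawl, here is a
# minimal offline sketch. It uses the standard library's ElementTree, which
# supports only a subset of XPath, rather than Scrapy's selectors. The SAMPLE
# markup is a hypothetical stand-in for the real page, and extract_title() is
# an illustrative helper, not part of the exercise files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page the spider fetches.
SAMPLE = '<html><body><span class="title">Example Title</span></body></html>'


def extract_title(markup):
    """Return {'title': text} for the first span whose class is 'title'."""
    root = ET.fromstring(markup)
    # ".//span[@class='title']" plays the role of '//span[@class="title"]'.
    node = root.find(".//span[@class='title']")
    # Like .get() in the spider, fall back to None when nothing matches.
    return {"title": node.text if node is not None else None}


if __name__ == "__main__":
    print(extract_title(SAMPLE))
```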


================================================
FILE: 01_03/ietf_scraper/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = ietf_scraper.settings

[deploy]
#url = http://localhost:6800/
project = ietf_scraper


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/__init__.py
================================================


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IetfScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class IetfScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this method to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class IetfScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this method to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class IetfScraperPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for ietf_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ietf_scraper'

SPIDER_MODULES = ['ietf_scraper.spiders']
NEWSPIDER_MODULE = 'ietf_scraper.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ietf_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ietf_scraper.middlewares.IetfScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ietf_scraper.middlewares.IetfScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'ietf_scraper.pipelines.IetfScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 01_04_b/ietf_scraper/ietf_scraper/spiders/ietf.py
================================================
# -*- coding: utf-8 -*-
import scrapy


class IetfSpider(scrapy.Spider):
    name = 'ietf'
    allowed_domains = ['pythonscraping.com']
    start_urls = ['http://pythonscraping.com/linkedin/ietf.html']

    def parse(self, response):
        # CSS alternative to the XPath below:
        # title = response.css('span.title::text').get()
        title = response.xpath('//span[@class="title"]/text()').get()
        return {"title": title}
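
# The CSS and XPath selectors in parse() agree on this page, but they are not
# identical in general: span.title matches any span whose class list contains
# "title", while @class="title" requires the attribute to equal "title"
# exactly. A minimal offline sketch of that difference, using the standard
# library's ElementTree instead of Scrapy's selectors. The SAMPLE markup and
# both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with one single-class and one multi-class span.
SAMPLE = (
    '<div>'
    '<span class="title">Plain</span>'
    '<span class="title featured">Featured</span>'
    '</div>'
)


def exact_class_match(root):
    # Mirrors //span[@class="title"]: the attribute must equal "title".
    return [el.text for el in root.findall(".//span[@class='title']")]


def class_token_match(root):
    # Mirrors span.title: "title" only has to be one of the class tokens.
    return [
        el.text
        for el in root.iter("span")
        if "title" in el.get("class", "").split()
    ]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(exact_class_match(root))
    print(class_token_match(root))
```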


================================================
FILE: 01_04_b/ietf_scraper/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = ietf_scraper.settings

[deploy]
#url = http://localhost:6800/
project = ietf_scraper


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/__init__.py
================================================


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IetfScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class IetfScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this method to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class IetfScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this method to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
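

# A minimal sketch of how process_spider_output() can filter what a spider
# yields, assuming the items are plain dicts as in this project's spider.
# The class name RequireTitleSpiderMiddleware and the 'title' check are
# hypothetical; to take effect, such a class would also need an entry in the
# SPIDER_MIDDLEWARES setting.

```python
class RequireTitleSpiderMiddleware:
    """Hypothetical spider middleware: drops dict items that have no 'title'."""

    def process_spider_output(self, response, result, spider):
        # 'result' is the iterable produced by the spider callback.
        for obj in result:
            # Only dict items are checked; Requests and any other
            # objects pass through unchanged.
            if isinstance(obj, dict) and not obj.get('title'):
                continue
            yield obj


if __name__ == '__main__':
    mw = RequireTitleSpiderMiddleware()
    scraped = [{'title': 'RFC'}, {'title': None}, {'number': '1'}]
    # response and spider are unused by this sketch, so None stands in for them.
    print(list(mw.process_spider_output(None, scraped, None)))
```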


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class IetfScraperPipeline:
    def process_item(self, item, spider):
        return item
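

# The generated pipeline above returns every item unchanged. As a hedged
# sketch of what process_item() could do instead, the hypothetical
# StripWhitespacePipeline below trims string fields. It assumes items are
# dict-like, and it would only run once listed in ITEM_PIPELINES.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline: trims surrounding whitespace from string fields."""

    def process_item(self, item, spider):
        # Copy the pairs first so the item is not mutated while iterating.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        # Non-string values (None, lists such as 'headings') are left as-is.
        return item


if __name__ == '__main__':
    pipeline = StripWhitespacePipeline()
    # spider is unused by this sketch, so None stands in for it.
    print(pipeline.process_item({'title': '  Example title \n', 'date': None}, None))
```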


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for ietf_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ietf_scraper'

SPIDER_MODULES = ['ietf_scraper.spiders']
NEWSPIDER_MODULE = 'ietf_scraper.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ietf_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ietf_scraper.middlewares.IetfScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ietf_scraper.middlewares.IetfScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'ietf_scraper.pipelines.IetfScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 01_04_e/ietf_scraper/ietf_scraper/spiders/ietf.py
================================================
# -*- coding: utf-8 -*-
import scrapy
import w3lib.html

class IetfSpider(scrapy.Spider):
    name = 'ietf'
    allowed_domains = ['pythonscraping.com']
    start_urls = ['http://pythonscraping.com/linkedin/ietf.html']

    def parse(self, response):
        return {
            'number': response.xpath('//span[@class="rfc-no"]/text()').get(),
            'title': response.xpath('//meta[@name="DC.Title"]/@content').get(),
            # 'title': response.xpath('//span[@class="title"]/text()').get(),
            'date': response.xpath('//span[@class="date"]/text()').get(),
            # 'date': response.xpath('//meta[@name="DC.Date.Issued"]/@content').get(),
            'description': response.xpath('//meta[@name="DC.Description.Abstract"]/@content').get(),
            'author': response.xpath('//meta[@name="DC.Creator"]/@content').get(),
            # 'author': response.xpath('//span[@class="author-name"]/text()').get(),
            'company': response.xpath('//span[@class="author-company"]/text()').get(),
            'address': response.xpath('//span[@class="address"]/text()').get(),
            'text': w3lib.html.remove_tags(response.xpath('//div[@class="text"]').get()),
            'headings': response.xpath('//span[@class="subheading"]/text()').getall()
        }


================================================
FILE: 01_04_e/ietf_scraper/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = ietf_scraper.settings

[deploy]
#url = http://localhost:6800/
project = ietf_scraper


================================================
FILE: 02_01/article_scraper/article_scraper/__init__.py
================================================


================================================
FILE: 02_01/article_scraper/article_scraper/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


================================================
FILE: 02_01/article_scraper/article_scraper/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define your spider and downloader middlewares here
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_01/article_scraper/article_scraper/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ArticleScraperPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: 02_01/article_scraper/article_scraper/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_scraper'

SPIDER_MODULES = ['article_scraper.spiders']
NEWSPIDER_MODULE = 'article_scraper.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_scraper.middlewares.ArticleScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_scraper.middlewares.ArticleScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'article_scraper.pipelines.ArticleScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
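

# This project's spider follows every matching link (follow=True), so an
# unthrottled run can issue a very large number of requests. A hedged sketch
# of settings that might keep a test crawl small and polite; the setting
# names are Scrapy's own, but every value here is an illustrative assumption
# rather than a recommendation from the course.

```python
# Wait between requests to the same site (seconds; illustrative value).
DOWNLOAD_DELAY = 3

# Let the AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# Stop the crawl after this many responses (CloseSpider extension).
CLOSESPIDER_PAGECOUNT = 50
```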


================================================
FILE: 02_01/article_scraper/article_scraper/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_01/article_scraper/article_scraper/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']
    rules = [Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)]

    def parse_info(self, response):
        return {
            'title': response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get(),
            'url': response.url,
            'last_edited': response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        }
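

# The allow pattern in the rule above uses a negative lookahead to reject
# any link with a colon after "wiki/", which excludes namespaced pages such
# as Talk: or File:. A minimal sketch of that behaviour using only the
# standard library, assuming the pattern is applied as a regex search over
# the absolute URL; the helper is_article_link and the sample URLs are
# hypothetical.

```python
import re

# Same pattern as the LinkExtractor's allow argument.
ARTICLE_LINK = re.compile(r'wiki/((?!:).)*$')


def is_article_link(url):
    """Return True if the URL has 'wiki/' followed by no colon up to the end."""
    return ARTICLE_LINK.search(url) is not None


if __name__ == '__main__':
    samples = [
        'https://en.wikipedia.org/wiki/Kevin_Bacon',
        'https://en.wikipedia.org/wiki/Talk:Kevin_Bacon',
        'https://en.wikipedia.org/wiki/File:Example.jpg',
        'https://en.wikipedia.org/w/index.php?title=Kevin_Bacon',
    ]
    for url in samples:
        print(is_article_link(url), url)
```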


================================================
FILE: 02_01/article_scraper/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_scraper.settings

[deploy]
#url = http://localhost:6800/
project = article_scraper


================================================
FILE: 02_02_b/article_crawler/article_crawler/__init__.py
================================================


================================================
FILE: 02_02_b/article_crawler/article_crawler/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleCrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


================================================
FILE: 02_02_b/article_crawler/article_crawler/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define your spider and downloader middlewares here
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleCrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_02_b/article_crawler/article_crawler/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ArticleCrawlerPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: 02_02_b/article_crawler/article_crawler/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_crawler'

SPIDER_MODULES = ['article_crawler.spiders']
NEWSPIDER_MODULE = 'article_crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'article_crawler.pipelines.ArticleCrawlerPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_02_b/article_crawler/article_crawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_02_b/article_crawler/article_crawler/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    rules = [
        Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)
    ]

    def parse_info(self, response):
        return {
            # Fall back to the italicised heading; .get() returns the text rather than a SelectorList.
            "title": response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get(),
            "url": response.url,
            "last_edited": response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        }


================================================
FILE: 02_02_b/article_crawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_crawler.settings

[deploy]
#url = http://localhost:6800/
project = article_crawler


================================================
FILE: 02_02_e/article_crawler/article_crawler/__init__.py
================================================


================================================
FILE: 02_02_e/article_crawler/article_crawler/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lastUpdated = scrapy.Field()



================================================
FILE: 02_02_e/article_crawler/article_crawler/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define your spider and downloader middlewares here
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleCrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_02_e/article_crawler/article_crawler/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ArticleCrawlerPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: 02_02_e/article_crawler/article_crawler/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_crawler'

SPIDER_MODULES = ['article_crawler.spiders']
NEWSPIDER_MODULE = 'article_crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'article_crawler.pipelines.ArticleCrawlerPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_02_e/article_crawler/article_crawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_02_e/article_crawler/article_crawler/spiders/articles.csv
================================================
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 26 May 2020, at 18:28",Obie Award,https://en.wikipedia.org/wiki/Obie_Award
" This page was last edited on 5 October 2020, at 13:16",Richard Dean Anderson,https://en.wikipedia.org/wiki/Richard_Dean_Anderson
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO


================================================
FILE: 02_02_e/article_crawler/article_crawler/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from article_crawler.items import Article

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    rules = [
        Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)
    ]

    def parse_info(self, response):
        article = Article()
        # Fall back to an italicized heading; .get() returns the text, not a SelectorList
        article['title'] = response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get()
        article['url'] = response.url

        article['lastUpdated'] = response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        return article


================================================
FILE: 02_02_e/article_crawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_crawler.settings

[deploy]
#url = http://localhost:6800/
project = article_crawler


================================================
FILE: 02_03_b/article_crawler/article_crawler/__init__.py
================================================


================================================
FILE: 02_03_b/article_crawler/article_crawler/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lastUpdated = scrapy.Field()



================================================
FILE: 02_03_b/article_crawler/article_crawler/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleCrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_03_b/article_crawler/article_crawler/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ArticleCrawlerPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: 02_03_b/article_crawler/article_crawler/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_crawler'

SPIDER_MODULES = ['article_crawler.spiders']
NEWSPIDER_MODULE = 'article_crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'article_crawler.pipelines.ArticleCrawlerPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_03_b/article_crawler/article_crawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_03_b/article_crawler/article_crawler/spiders/articles.csv
================================================
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 26 May 2020, at 18:28",Obie Award,https://en.wikipedia.org/wiki/Obie_Award
" This page was last edited on 5 October 2020, at 13:16",Richard Dean Anderson,https://en.wikipedia.org/wiki/Richard_Dean_Anderson
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO


================================================
FILE: 02_03_b/article_crawler/article_crawler/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from article_crawler.items import Article

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    rules = [
        Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)
    ]

    def parse_info(self, response):
        article = Article()
        # Fall back to an italicized heading; .get() returns the text, not a SelectorList
        article['title'] = response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get()
        article['url'] = response.url

        article['lastUpdated'] = response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        return article


================================================
FILE: 02_03_b/article_crawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_crawler.settings

[deploy]
#url = http://localhost:6800/
project = article_crawler


================================================
FILE: 02_03_e/article_crawler/article_crawler/__init__.py
================================================


================================================
FILE: 02_03_e/article_crawler/article_crawler/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lastUpdated = scrapy.Field()



================================================
FILE: 02_03_e/article_crawler/article_crawler/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleCrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_03_e/article_crawler/article_crawler/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ArticleCrawlerPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: 02_03_e/article_crawler/article_crawler/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_crawler'

CLOSESPIDER_PAGECOUNT = 10

FEED_URI = 'articles.json'
FEED_FORMAT = 'json'
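# Assumption, not part of the original course files: on Scrapy 2.1 or later,
# the same export can be written with the FEEDS setting, which supersedes
# FEED_URI and FEED_FORMAT. A commented-out sketch of the equivalent:
#FEEDS = {
#    'articles.json': {'format': 'json'},
#}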

SPIDER_MODULES = ['article_crawler.spiders']
NEWSPIDER_MODULE = 'article_crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'article_crawler.pipelines.ArticleCrawlerPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_03_e/article_crawler/article_crawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_03_e/article_crawler/article_crawler/spiders/articles.csv
================================================
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 26 May 2020, at 18:28",Obie Award,https://en.wikipedia.org/wiki/Obie_Award
" This page was last edited on 5 October 2020, at 13:16",Richard Dean Anderson,https://en.wikipedia.org/wiki/Richard_Dean_Anderson
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 25 September 2020, at 05:22","[<Selector xpath='//h1/i/text()' data=""She's Having a Baby"">]",https://en.wikipedia.org/wiki/She%27s_Having_a_Baby
" This page was last edited on 2 July 2020, at 13:53",SNAC,https://en.wikipedia.org/wiki/SNAC
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 20 September 2020, at 07:23",[<Selector xpath='//h1/i/text()' data='Los Angeles Daily News'>],https://en.wikipedia.org/wiki/Los_Angeles_Daily_News
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
" This page was last edited on 22 July 2020, at 18:39", (2020 TV series),https://en.wikipedia.org/wiki/Ana_(2020_TV_series)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 3 October 2020, at 11:46",WorldCat,https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
" This page was last edited on 6 June 2020, at 20:53",Bruce Gilbert,https://en.wikipedia.org/wiki/Bruce_Gilbert
" This page was last edited on 23 June 2020, at 19:06", (TV series),https://en.wikipedia.org/wiki/The_Remix_(TV_series)
" This page was last edited on 6 October 2020, at 13:04",[<Selector xpath='//h1/i/text()' data='The New York Times'>],https://en.wikipedia.org/wiki/The_New_York_Times
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 11 May 2020, at 14:47",National Library of Latvia,https://en.wikipedia.org/wiki/National_Library_of_Latvia
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 5 October 2020, at 15:12",Judy Garland,https://en.wikipedia.org/wiki/Judy_Garland
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 6 October 2020, at 03:55",IMDb,https://en.wikipedia.org/wiki/IMDb
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 3 October 2020, at 11:46",WorldCat,https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)
" This page was last edited on 30 July 2020, at 18:19",Virtual International Authority File,https://en.wikipedia.org/wiki/VIAF_(identifier)
" This page was last edited on 18 September 2020, at 03:04",Trove,https://en.wikipedia.org/wiki/Trove
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO


================================================
FILE: 02_03_e/article_crawler/article_crawler/spiders/articles.json
================================================
[
{"title": "Kevin Bacon", "url": "https://en.wikipedia.org/wiki/Kevin_Bacon", "lastUpdated": " This page was last edited on 19 September 2020, at 00:35"},
{"title": "Fox Broadcasting Company", "url": "https://en.wikipedia.org/wiki/Fox_Broadcasting_Company", "lastUpdated": " This page was last edited on 6 October 2020, at 15:27"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Patriots_Day_(film)", "lastUpdated": " This page was last edited on 23 September 2020, at 20:01"},
{"title": " (TV series)", "url": "https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)", "lastUpdated": " This page was last edited on 18 August 2020, at 20:30"},
{"title": "Screen Actors Guild Awards", "url": "https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award", "lastUpdated": " This page was last edited on 21 July 2020, at 00:07"},
,
,
{"title": "Primetime Emmy Award", "url": "https://en.wikipedia.org/wiki/Primetime_Emmy_Award", "lastUpdated": " This page was last edited on 22 September 2020, at 10:27"},
{"title": "Golden Globe Awards", "url": "https://en.wikipedia.org/wiki/Golden_Globe_Award", "lastUpdated": " This page was last edited on 8 September 2020, at 12:45"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Black_Mass_(film)", "lastUpdated": " This page was last edited on 1 October 2020, at 17:17"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Frost/Nixon_(film)", "lastUpdated": " This page was last edited on 16 August 2020, at 00:08"},
,
,
{"title": "HBO", "url": "https://en.wikipedia.org/wiki/HBO", "lastUpdated": " This page was last edited on 7 October 2020, at 00:10"},
,
{"title": "Circle in the Square Theatre", "url": "https://en.wikipedia.org/wiki/Circle_in_the_Square", "lastUpdated": " This page was last edited on 27 September 2020, at 21:06"},
{"title": "Main Page", "url": "https://en.wikipedia.org/wiki/Main_Page", "lastUpdated": " This page was last edited on 23 July 2020, at 12:44"},
{"title": "WorldCat", "url": "https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)", "lastUpdated": " This page was last edited on 3 October 2020, at 11:46"},
{"title": "Virtual International Authority File", "url": "https://en.wikipedia.org/wiki/VIAF_(identifier)", "lastUpdated": " This page was last edited on 30 July 2020, at 18:19"},
{"title": "Trove", "url": "https://en.wikipedia.org/wiki/Trove", "lastUpdated": " This page was last edited on 18 September 2020, at 03:04"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Wild_Things_(film)", "lastUpdated": " This page was last edited on 29 September 2020, at 08:35"},
{"title": "Syst\u00e8me universitaire de documentation", "url": "https://en.wikipedia.org/wiki/SUDOC_(identifier)", "lastUpdated": " This page was last edited on 19 October 2019, at 13:42"},
{"title": "SNAC", "url": "https://en.wikipedia.org/wiki/SNAC", "lastUpdated": " This page was last edited on 2 July 2020, at 13:53"}
]

================================================
FILE: 02_03_e/article_crawler/article_crawler/spiders/articles.xml
================================================
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_Bacon</url><lastUpdated> This page was last edited on 19 September 2020, at 00:35</lastUpdated></item>
<item><title> (TV series)</title><url>https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)</url><lastUpdated> This page was last edited on 18 August 2020, at 20:30</lastUpdated></item>
<item><title>Primetime Emmy Award</title><url>https://en.wikipedia.org/wiki/Primetime_Emmy_Award</url><lastUpdated> This page was last edited on 22 September 2020, at 10:27</lastUpdated></item>
<item><title>SixDegrees.org</title><url>https://en.wikipedia.org/wiki/SixDegrees.org</url><lastUpdated> This page was last edited on 20 March 2020, at 11:35</lastUpdated></item>
<item><title>Golden Globe Award for Best Actor – Television Series Musical or Comedy</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy</url><lastUpdated> This page was last edited on 14 August 2020, at 04:30</lastUpdated></item>
<item><title>List of social networking websites</title><url>https://en.wikipedia.org/wiki/Social_networks</url><lastUpdated> This page was last edited on 6 September 2020, at 23:58</lastUpdated></item>
<item><title>Six Degrees of Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon</url><lastUpdated> This page was last edited on 2 October 2020, at 20:10</lastUpdated></item>
<item><title>Screen Actors Guild Awards</title><url>https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award</url><lastUpdated> This page was last edited on 21 July 2020, at 00:07</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Guardian'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Guardian</url><lastUpdated> This page was last edited on 18 September 2020, at 16:08</lastUpdated></item>
<item><title>Hollywood Walk of Fame</title><url>https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame</url><lastUpdated> This page was last edited on 3 October 2020, at 12:56</lastUpdated></item>
<item><title>Academy Awards</title><url>https://en.wikipedia.org/wiki/Academy_Award</url><lastUpdated> This page was last edited on 1 October 2020, at 12:55</lastUpdated></item>
<item><title>Cannes Film Festival</title><url>https://en.wikipedia.org/wiki/Cannes_Film_Festival</url><lastUpdated> This page was last edited on 5 October 2020, at 12:51</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='Taking Chance'&gt;</value></title><url>https://en.wikipedia.org/wiki/Taking_Chance</url><lastUpdated> This page was last edited on 3 September 2020, at 14:05</lastUpdated></item>
<item><title>Alan Rickman</title><url>https://en.wikipedia.org/wiki/Alan_Rickman</url><lastUpdated> This page was last edited on 7 October 2020, at 00:12</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Following'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Following</url><lastUpdated> This page was last edited on 11 September 2020, at 16:17</lastUpdated></item>
<item><title>Main Page</title><url>https://en.wikipedia.org/wiki/Main_Page</url><lastUpdated> This page was last edited on 23 July 2020, at 12:44</lastUpdated></item>
<item><title>Fox Broadcasting Company</title><url>https://en.wikipedia.org/wiki/Fox_Broadcasting_Company</url><lastUpdated> This page was last edited on 6 October 2020, at 15:27</lastUpdated></item>
<item><title>Golden Globe Awards</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award</url><lastUpdated> This page was last edited on 8 September 2020, at 12:45</lastUpdated></item>
<item><title>WorldCat</title><url>https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)</url><lastUpdated> This page was last edited on 3 October 2020, at 11:46</lastUpdated></item>
<item><title>Virtual International Authority File</title><url>https://en.wikipedia.org/wiki/VIAF_(identifier)</url><lastUpdated> This page was last edited on 30 July 2020, at 18:19</lastUpdated></item>
<item><title>HBO</title><url>https://en.wikipedia.org/wiki/HBO</url><lastUpdated> This page was last edited on 7 October 2020, at 00:10</lastUpdated></item>
<item><title>Trove</title><url>https://en.wikipedia.org/wiki/Trove</url><lastUpdated> This page was last edited on 18 September 2020, at 03:04</lastUpdated></item>
<item><title>Système universitaire de documentation</title><url>https://en.wikipedia.org/wiki/SUDOC_(identifier)</url><lastUpdated> This page was last edited on 19 October 2019, at 13:42</lastUpdated></item>
</items>

================================================
FILE: 02_03_e/article_crawler/article_crawler/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from article_crawler.items import Article

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    rules = [
        Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)
    ]

    custom_settings = {
        'FEED_URI': 'articles.xml',
        'FEED_FORMAT': 'xml'
    }

    def parse_info(self, response):
        article = Article()
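        # Italicized titles sit inside <h1><i>...</i></h1>; call .get() on the
        # fallback as well so the field holds a string, not a SelectorList.
        article['title'] = response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get()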
        article['url'] = response.url

        article['lastUpdated'] = response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        return article


================================================
FILE: 02_03_e/article_crawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_crawler.settings

[deploy]
#url = http://localhost:6800/
project = article_crawler


================================================
FILE: 02_04_b/article_crawler/article_crawler/__init__.py
================================================


================================================
FILE: 02_04_b/article_crawler/article_crawler/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lastUpdated = scrapy.Field()



================================================
FILE: 02_04_b/article_crawler/article_crawler/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleCrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this spider middleware.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create this downloader middleware.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_04_b/article_crawler/article_crawler/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ArticleCrawlerPipeline:
    def process_item(self, item, spider):
        return item
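

# A minimal sketch of what a non-trivial pipeline could look like, assuming the
# lastUpdated values keep the shape seen in the exported feeds
# (" This page was last edited on 19 September 2020, at 00:35").
# LastUpdatedCleanupPipeline and PREFIX are hypothetical names, not part of the
# course code; to try the class it would also have to be listed in the
# ITEM_PIPELINES setting.

```python
from datetime import datetime

# Assumed fixed prefix of the scraped footer text (taken from the sample output).
PREFIX = 'This page was last edited on '


class LastUpdatedCleanupPipeline:
    """Hypothetical pipeline that normalizes the scraped lastUpdated text."""

    def process_item(self, item, spider):
        # Works for dicts and scrapy.Item alike; both support .get().
        raw = (item.get('lastUpdated') or '').strip()
        if raw.startswith(PREFIX):
            raw = raw[len(PREFIX):]
        try:
            # %B matches English month names under the default C locale.
            parsed = datetime.strptime(raw, '%d %B %Y, at %H:%M')
            item['lastUpdated'] = parsed.isoformat()
        except ValueError:
            # Unexpected format: leave the original value untouched.
            pass
        return item
```

# With the sample value above, the field would become an ISO 8601 string,
# which sorts chronologically and is easier to compare than the raw sentence.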


================================================
FILE: 02_04_b/article_crawler/article_crawler/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_crawler'

CLOSESPIDER_PAGECOUNT = 10

FEED_URI = 'articles.json'
FEED_FORMAT = 'json'
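
# Note: Scrapy 2.1 and later deprecate the FEED_URI / FEED_FORMAT pair in
# favor of a single FEEDS dictionary. The fragment below is a sketch of the
# equivalent setting, assuming such a Scrapy version is installed; it is meant
# as a replacement for the two settings above, not an addition to them.

```python
# Maps each output URI to its feed options; 'format' selects the exporter.
FEEDS = {
    'articles.json': {'format': 'json'},
}
```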

SPIDER_MODULES = ['article_crawler.spiders']
NEWSPIDER_MODULE = 'article_crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'article_crawler.pipelines.ArticleCrawlerPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_04_b/article_crawler/article_crawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_04_b/article_crawler/article_crawler/spiders/articles.csv
================================================
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 26 May 2020, at 18:28",Obie Award,https://en.wikipedia.org/wiki/Obie_Award
" This page was last edited on 5 October 2020, at 13:16",Richard Dean Anderson,https://en.wikipedia.org/wiki/Richard_Dean_Anderson
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 25 September 2020, at 05:22","[<Selector xpath='//h1/i/text()' data=""She's Having a Baby"">]",https://en.wikipedia.org/wiki/She%27s_Having_a_Baby
" This page was last edited on 2 July 2020, at 13:53",SNAC,https://en.wikipedia.org/wiki/SNAC
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 20 September 2020, at 07:23",[<Selector xpath='//h1/i/text()' data='Los Angeles Daily News'>],https://en.wikipedia.org/wiki/Los_Angeles_Daily_News
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
" This page was last edited on 22 July 2020, at 18:39", (2020 TV series),https://en.wikipedia.org/wiki/Ana_(2020_TV_series)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 3 October 2020, at 11:46",WorldCat,https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
" This page was last edited on 6 June 2020, at 20:53",Bruce Gilbert,https://en.wikipedia.org/wiki/Bruce_Gilbert
" This page was last edited on 23 June 2020, at 19:06", (TV series),https://en.wikipedia.org/wiki/The_Remix_(TV_series)
" This page was last edited on 6 October 2020, at 13:04",[<Selector xpath='//h1/i/text()' data='The New York Times'>],https://en.wikipedia.org/wiki/The_New_York_Times
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 11 May 2020, at 14:47",National Library of Latvia,https://en.wikipedia.org/wiki/National_Library_of_Latvia
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 5 October 2020, at 15:12",Judy Garland,https://en.wikipedia.org/wiki/Judy_Garland
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 6 October 2020, at 03:55",IMDb,https://en.wikipedia.org/wiki/IMDb
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 3 October 2020, at 11:46",WorldCat,https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)
" This page was last edited on 30 July 2020, at 18:19",Virtual International Authority File,https://en.wikipedia.org/wiki/VIAF_(identifier)
" This page was last edited on 18 September 2020, at 03:04",Trove,https://en.wikipedia.org/wiki/Trove
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO


================================================
FILE: 02_04_b/article_crawler/article_crawler/spiders/articles.json
================================================
[
{"title": "Kevin Bacon", "url": "https://en.wikipedia.org/wiki/Kevin_Bacon", "lastUpdated": " This page was last edited on 19 September 2020, at 00:35"},
{"title": "Fox Broadcasting Company", "url": "https://en.wikipedia.org/wiki/Fox_Broadcasting_Company", "lastUpdated": " This page was last edited on 6 October 2020, at 15:27"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Patriots_Day_(film)", "lastUpdated": " This page was last edited on 23 September 2020, at 20:01"},
{"title": " (TV series)", "url": "https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)", "lastUpdated": " This page was last edited on 18 August 2020, at 20:30"},
{"title": "Screen Actors Guild Awards", "url": "https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award", "lastUpdated": " This page was last edited on 21 July 2020, at 00:07"},
{"title": "Primetime Emmy Award", "url": "https://en.wikipedia.org/wiki/Primetime_Emmy_Award", "lastUpdated": " This page was last edited on 22 September 2020, at 10:27"},
{"title": "Golden Globe Awards", "url": "https://en.wikipedia.org/wiki/Golden_Globe_Award", "lastUpdated": " This page was last edited on 8 September 2020, at 12:45"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Black_Mass_(film)", "lastUpdated": " This page was last edited on 1 October 2020, at 17:17"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Frost/Nixon_(film)", "lastUpdated": " This page was last edited on 16 August 2020, at 00:08"},
{"title": "HBO", "url": "https://en.wikipedia.org/wiki/HBO", "lastUpdated": " This page was last edited on 7 October 2020, at 00:10"},
{"title": "Circle in the Square Theatre", "url": "https://en.wikipedia.org/wiki/Circle_in_the_Square", "lastUpdated": " This page was last edited on 27 September 2020, at 21:06"},
{"title": "Main Page", "url": "https://en.wikipedia.org/wiki/Main_Page", "lastUpdated": " This page was last edited on 23 July 2020, at 12:44"},
{"title": "WorldCat", "url": "https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)", "lastUpdated": " This page was last edited on 3 October 2020, at 11:46"},
{"title": "Virtual International Authority File", "url": "https://en.wikipedia.org/wiki/VIAF_(identifier)", "lastUpdated": " This page was last edited on 30 July 2020, at 18:19"},
{"title": "Trove", "url": "https://en.wikipedia.org/wiki/Trove", "lastUpdated": " This page was last edited on 18 September 2020, at 03:04"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Wild_Things_(film)", "lastUpdated": " This page was last edited on 29 September 2020, at 08:35"},
{"title": "Syst\u00e8me universitaire de documentation", "url": "https://en.wikipedia.org/wiki/SUDOC_(identifier)", "lastUpdated": " This page was last edited on 19 October 2019, at 13:42"},
{"title": "SNAC", "url": "https://en.wikipedia.org/wiki/SNAC", "lastUpdated": " This page was last edited on 2 July 2020, at 13:53"}
]

================================================
FILE: 02_04_b/article_crawler/article_crawler/spiders/articles.xml
================================================
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_Bacon</url><lastUpdated> This page was last edited on 19 September 2020, at 00:35</lastUpdated></item>
<item><title> (TV series)</title><url>https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)</url><lastUpdated> This page was last edited on 18 August 2020, at 20:30</lastUpdated></item>
<item><title>Primetime Emmy Award</title><url>https://en.wikipedia.org/wiki/Primetime_Emmy_Award</url><lastUpdated> This page was last edited on 22 September 2020, at 10:27</lastUpdated></item>
<item><title>SixDegrees.org</title><url>https://en.wikipedia.org/wiki/SixDegrees.org</url><lastUpdated> This page was last edited on 20 March 2020, at 11:35</lastUpdated></item>
<item><title>Golden Globe Award for Best Actor – Television Series Musical or Comedy</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy</url><lastUpdated> This page was last edited on 14 August 2020, at 04:30</lastUpdated></item>
<item><title>List of social networking websites</title><url>https://en.wikipedia.org/wiki/Social_networks</url><lastUpdated> This page was last edited on 6 September 2020, at 23:58</lastUpdated></item>
<item><title>Six Degrees of Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon</url><lastUpdated> This page was last edited on 2 October 2020, at 20:10</lastUpdated></item>
<item><title>Screen Actors Guild Awards</title><url>https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award</url><lastUpdated> This page was last edited on 21 July 2020, at 00:07</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Guardian'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Guardian</url><lastUpdated> This page was last edited on 18 September 2020, at 16:08</lastUpdated></item>
<item><title>Hollywood Walk of Fame</title><url>https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame</url><lastUpdated> This page was last edited on 3 October 2020, at 12:56</lastUpdated></item>
<item><title>Academy Awards</title><url>https://en.wikipedia.org/wiki/Academy_Award</url><lastUpdated> This page was last edited on 1 October 2020, at 12:55</lastUpdated></item>
<item><title>Cannes Film Festival</title><url>https://en.wikipedia.org/wiki/Cannes_Film_Festival</url><lastUpdated> This page was last edited on 5 October 2020, at 12:51</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='Taking Chance'&gt;</value></title><url>https://en.wikipedia.org/wiki/Taking_Chance</url><lastUpdated> This page was last edited on 3 September 2020, at 14:05</lastUpdated></item>
<item><title>Alan Rickman</title><url>https://en.wikipedia.org/wiki/Alan_Rickman</url><lastUpdated> This page was last edited on 7 October 2020, at 00:12</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Following'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Following</url><lastUpdated> This page was last edited on 11 September 2020, at 16:17</lastUpdated></item>
<item><title>Main Page</title><url>https://en.wikipedia.org/wiki/Main_Page</url><lastUpdated> This page was last edited on 23 July 2020, at 12:44</lastUpdated></item>
<item><title>Fox Broadcasting Company</title><url>https://en.wikipedia.org/wiki/Fox_Broadcasting_Company</url><lastUpdated> This page was last edited on 6 October 2020, at 15:27</lastUpdated></item>
<item><title>Golden Globe Awards</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award</url><lastUpdated> This page was last edited on 8 September 2020, at 12:45</lastUpdated></item>
<item><title>WorldCat</title><url>https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)</url><lastUpdated> This page was last edited on 3 October 2020, at 11:46</lastUpdated></item>
<item><title>Virtual International Authority File</title><url>https://en.wikipedia.org/wiki/VIAF_(identifier)</url><lastUpdated> This page was last edited on 30 July 2020, at 18:19</lastUpdated></item>
<item><title>HBO</title><url>https://en.wikipedia.org/wiki/HBO</url><lastUpdated> This page was last edited on 7 October 2020, at 00:10</lastUpdated></item>
<item><title>Trove</title><url>https://en.wikipedia.org/wiki/Trove</url><lastUpdated> This page was last edited on 18 September 2020, at 03:04</lastUpdated></item>
<item><title>Système universitaire de documentation</title><url>https://en.wikipedia.org/wiki/SUDOC_(identifier)</url><lastUpdated> This page was last edited on 19 October 2019, at 13:42</lastUpdated></item>
</items>

================================================
FILE: 02_04_b/article_crawler/article_crawler/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from article_crawler.items import Article

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    rules = [
        Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)
    ]

    custom_settings={
        'FEED_URI': 'articles.xml',
        'FEED_FORMAT': 'xml'
    }

    def parse_info(self, response):
        article = Article()
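        # Some titles are wrapped in <i> (e.g. film or newspaper names), so fall
        # back to the italic text node. Both branches call .get() so the field
        # holds a string rather than a SelectorList.
        article['title'] = response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get()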
        article['url'] = response.url

        article['lastUpdated'] = response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        return article
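

# Illustrative sketch (not part of the course code): how the `allow` pattern in
# the rule above separates article links from namespaced pages such as Talk: or
# Category:. It assumes the pattern is applied to the full URL with search
# semantics; `ARTICLE_PATTERN`, `is_article_link` and the sample URLs are
# hypothetical names chosen only for this demonstration.

```python
import re

# Same regular expression as the LinkExtractor `allow` argument above.
ARTICLE_PATTERN = re.compile(r'wiki/((?!:).)*$')


def is_article_link(url):
    """Return True when no colon appears after 'wiki/' in the URL."""
    return ARTICLE_PATTERN.search(url) is not None


# An ordinary article URL matches; a namespaced page does not.
print(is_article_link('https://en.wikipedia.org/wiki/Kevin_Bacon'))
print(is_article_link('https://en.wikipedia.org/wiki/Talk:Kevin_Bacon'))
```

# The negative lookahead rejects any character position that starts with a
# colon, so the colon in "https://" is harmless because it comes before "wiki/".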


================================================
FILE: 02_04_b/article_crawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_crawler.settings

[deploy]
#url = http://localhost:6800/
project = article_crawler


================================================
FILE: 02_04_e/article_crawler/article_crawler/__init__.py
================================================


================================================
FILE: 02_04_e/article_crawler/article_crawler/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lastUpdated = scrapy.Field()



================================================
FILE: 02_04_e/article_crawler/article_crawler/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ArticleCrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArticleCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_04_e/article_crawler/article_crawler/pipelines.py
================================================
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem
from datetime import datetime

class CheckItemPipeline:
    def process_item(self, article, spider):
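        # Use .get() so an item that never set a field is dropped instead of
        # raising KeyError; the field name is 'lastUpdated', as in items.py.
        if not article.get('lastUpdated') or not article.get('url') or not article.get('title'):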
            raise DropItem('Missing something!')
        return article


class CleanDatePipeline:
    def process_item(self, article, spider):
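        # str.replace() returns a new string, so the result must be assigned
        # back before parsing; otherwise strptime sees the unmodified text.
        article['lastUpdated'] = article['lastUpdated'].replace('This page was last edited on', '').strip()
        article['lastUpdated'] = datetime.strptime(article['lastUpdated'], '%d %B %Y, at %H:%M')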
        return article
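

# Illustrative sketch (not part of the course code): the date-cleaning step of
# CleanDatePipeline as a standalone function that can be tried without Scrapy.
# `PREFIX` and `parse_last_updated` are hypothetical names; the sample string is
# copied from the crawled output in this repository. It assumes the footer text
# always has the form seen in that output and that month names are in English.

```python
from datetime import datetime

PREFIX = 'This page was last edited on'


def parse_last_updated(raw):
    """Strip the fixed footer prefix, then parse the remaining date text."""
    cleaned = raw.replace(PREFIX, '').strip()
    # Example of cleaned text: '19 September 2020, at 00:35'
    return datetime.strptime(cleaned, '%d %B %Y, at %H:%M')


sample = ' This page was last edited on 19 September 2020, at 00:35'
print(parse_last_updated(sample).isoformat())
```

# Footer text in any other shape makes strptime raise ValueError, so a real
# pipeline may want to catch that and drop or flag the item.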


================================================
FILE: 02_04_e/article_crawler/article_crawler/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for article_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article_crawler'

CLOSESPIDER_PAGECOUNT=10

FEED_URI='articles.json'
FEED_FORMAT='json'

SPIDER_MODULES = ['article_crawler.spiders']
NEWSPIDER_MODULE = 'article_crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article_crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'article_crawler.middlewares.ArticleCrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'article_crawler.pipelines.CheckItemPipeline': 100,
    'article_crawler.pipelines.CleanDatePipeline': 200,

}
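
# Illustrative sketch (not part of the course code): Scrapy passes each item
# through the enabled pipelines from the lowest number to the highest, so the
# check runs before the date cleaning. `EXAMPLE_PIPELINES` and `pipeline_order`
# are hypothetical names; the helper only mimics that ordering for inspection
# and is not a Scrapy API.

```python
EXAMPLE_PIPELINES = {
    'article_crawler.pipelines.CleanDatePipeline': 200,
    'article_crawler.pipelines.CheckItemPipeline': 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_order(EXAMPLE_PIPELINES):
    print(path)
```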

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_04_e/article_crawler/article_crawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_04_e/article_crawler/article_crawler/spiders/articles.csv
================================================
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 26 May 2020, at 18:28",Obie Award,https://en.wikipedia.org/wiki/Obie_Award
" This page was last edited on 5 October 2020, at 13:16",Richard Dean Anderson,https://en.wikipedia.org/wiki/Richard_Dean_Anderson
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 25 September 2020, at 05:22","[<Selector xpath='//h1/i/text()' data=""She's Having a Baby"">]",https://en.wikipedia.org/wiki/She%27s_Having_a_Baby
" This page was last edited on 2 July 2020, at 13:53",SNAC,https://en.wikipedia.org/wiki/SNAC
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 20 September 2020, at 07:23",[<Selector xpath='//h1/i/text()' data='Los Angeles Daily News'>],https://en.wikipedia.org/wiki/Los_Angeles_Daily_News
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
" This page was last edited on 22 July 2020, at 18:39", (2020 TV series),https://en.wikipedia.org/wiki/Ana_(2020_TV_series)
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 3 October 2020, at 11:46",WorldCat,https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
" This page was last edited on 6 June 2020, at 20:53",Bruce Gilbert,https://en.wikipedia.org/wiki/Bruce_Gilbert
" This page was last edited on 23 June 2020, at 19:06", (TV series),https://en.wikipedia.org/wiki/The_Remix_(TV_series)
" This page was last edited on 6 October 2020, at 13:04",[<Selector xpath='//h1/i/text()' data='The New York Times'>],https://en.wikipedia.org/wiki/The_New_York_Times
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 1 October 2020, at 17:17", (film),https://en.wikipedia.org/wiki/Black_Mass_(film)
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 11 May 2020, at 14:47",National Library of Latvia,https://en.wikipedia.org/wiki/National_Library_of_Latvia
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 23 September 2020, at 20:01", (film),https://en.wikipedia.org/wiki/Patriots_Day_(film)
" This page was last edited on 5 October 2020, at 15:12",Judy Garland,https://en.wikipedia.org/wiki/Judy_Garland
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO
lastUpdated,title,url
" This page was last edited on 19 September 2020, at 00:35",Kevin Bacon,https://en.wikipedia.org/wiki/Kevin_Bacon
" This page was last edited on 18 August 2020, at 20:30", (TV series),https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)
" This page was last edited on 22 September 2020, at 10:27",Primetime Emmy Award,https://en.wikipedia.org/wiki/Primetime_Emmy_Award
" This page was last edited on 20 March 2020, at 11:35",SixDegrees.org,https://en.wikipedia.org/wiki/SixDegrees.org
" This page was last edited on 2 October 2020, at 20:10",Six Degrees of Kevin Bacon,https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
" This page was last edited on 1 October 2020, at 12:55",Academy Awards,https://en.wikipedia.org/wiki/Academy_Award
" This page was last edited on 18 September 2020, at 16:08",[<Selector xpath='//h1/i/text()' data='The Guardian'>],https://en.wikipedia.org/wiki/The_Guardian
" This page was last edited on 3 October 2020, at 12:56",Hollywood Walk of Fame,https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame
" This page was last edited on 14 August 2020, at 04:30",Golden Globe Award for Best Actor – Television Series Musical or Comedy,https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
" This page was last edited on 3 September 2020, at 14:05",[<Selector xpath='//h1/i/text()' data='Taking Chance'>],https://en.wikipedia.org/wiki/Taking_Chance
" This page was last edited on 6 September 2020, at 23:58",List of social networking websites,https://en.wikipedia.org/wiki/Social_networks
" This page was last edited on 21 July 2020, at 00:07",Screen Actors Guild Awards,https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award
" This page was last edited on 8 September 2020, at 12:45",Golden Globe Awards,https://en.wikipedia.org/wiki/Golden_Globe_Award
" This page was last edited on 11 September 2020, at 16:17",[<Selector xpath='//h1/i/text()' data='The Following'>],https://en.wikipedia.org/wiki/The_Following
" This page was last edited on 6 October 2020, at 03:55",IMDb,https://en.wikipedia.org/wiki/IMDb
" This page was last edited on 23 July 2020, at 12:44",Main Page,https://en.wikipedia.org/wiki/Main_Page
" This page was last edited on 3 October 2020, at 11:46",WorldCat,https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)
" This page was last edited on 30 July 2020, at 18:19",Virtual International Authority File,https://en.wikipedia.org/wiki/VIAF_(identifier)
" This page was last edited on 18 September 2020, at 03:04",Trove,https://en.wikipedia.org/wiki/Trove
" This page was last edited on 6 October 2020, at 15:27",Fox Broadcasting Company,https://en.wikipedia.org/wiki/Fox_Broadcasting_Company
" This page was last edited on 7 October 2020, at 00:10",HBO,https://en.wikipedia.org/wiki/HBO


================================================
FILE: 02_04_e/article_crawler/article_crawler/spiders/articles.json
================================================
[
{"title": "Kevin Bacon", "url": "https://en.wikipedia.org/wiki/Kevin_Bacon", "lastUpdated": " This page was last edited on 19 September 2020, at 00:35"},
{"title": "Fox Broadcasting Company", "url": "https://en.wikipedia.org/wiki/Fox_Broadcasting_Company", "lastUpdated": " This page was last edited on 6 October 2020, at 15:27"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Patriots_Day_(film)", "lastUpdated": " This page was last edited on 23 September 2020, at 20:01"},
{"title": " (TV series)", "url": "https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)", "lastUpdated": " This page was last edited on 18 August 2020, at 20:30"},
{"title": "Screen Actors Guild Awards", "url": "https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award", "lastUpdated": " This page was last edited on 21 July 2020, at 00:07"},
{"title": "Primetime Emmy Award", "url": "https://en.wikipedia.org/wiki/Primetime_Emmy_Award", "lastUpdated": " This page was last edited on 22 September 2020, at 10:27"},
{"title": "Golden Globe Awards", "url": "https://en.wikipedia.org/wiki/Golden_Globe_Award", "lastUpdated": " This page was last edited on 8 September 2020, at 12:45"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Black_Mass_(film)", "lastUpdated": " This page was last edited on 1 October 2020, at 17:17"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Frost/Nixon_(film)", "lastUpdated": " This page was last edited on 16 August 2020, at 00:08"},
{"title": "HBO", "url": "https://en.wikipedia.org/wiki/HBO", "lastUpdated": " This page was last edited on 7 October 2020, at 00:10"},
{"title": "Circle in the Square Theatre", "url": "https://en.wikipedia.org/wiki/Circle_in_the_Square", "lastUpdated": " This page was last edited on 27 September 2020, at 21:06"},
{"title": "Main Page", "url": "https://en.wikipedia.org/wiki/Main_Page", "lastUpdated": " This page was last edited on 23 July 2020, at 12:44"},
{"title": "WorldCat", "url": "https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)", "lastUpdated": " This page was last edited on 3 October 2020, at 11:46"},
{"title": "Virtual International Authority File", "url": "https://en.wikipedia.org/wiki/VIAF_(identifier)", "lastUpdated": " This page was last edited on 30 July 2020, at 18:19"},
{"title": "Trove", "url": "https://en.wikipedia.org/wiki/Trove", "lastUpdated": " This page was last edited on 18 September 2020, at 03:04"},
{"title": " (film)", "url": "https://en.wikipedia.org/wiki/Wild_Things_(film)", "lastUpdated": " This page was last edited on 29 September 2020, at 08:35"},
{"title": "Syst\u00e8me universitaire de documentation", "url": "https://en.wikipedia.org/wiki/SUDOC_(identifier)", "lastUpdated": " This page was last edited on 19 October 2019, at 13:42"},
{"title": "SNAC", "url": "https://en.wikipedia.org/wiki/SNAC", "lastUpdated": " This page was last edited on 2 July 2020, at 13:53"}
]

================================================
FILE: 02_04_e/article_crawler/article_crawler/spiders/articles.xml
================================================
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_Bacon</url><lastUpdated> This page was last edited on 19 September 2020, at 00:35</lastUpdated></item>
<item><title> (TV series)</title><url>https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)</url><lastUpdated> This page was last edited on 18 August 2020, at 20:30</lastUpdated></item>
<item><title>Primetime Emmy Award</title><url>https://en.wikipedia.org/wiki/Primetime_Emmy_Award</url><lastUpdated> This page was last edited on 22 September 2020, at 10:27</lastUpdated></item>
<item><title>SixDegrees.org</title><url>https://en.wikipedia.org/wiki/SixDegrees.org</url><lastUpdated> This page was last edited on 20 March 2020, at 11:35</lastUpdated></item>
<item><title>Golden Globe Award for Best Actor – Television Series Musical or Comedy</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy</url><lastUpdated> This page was last edited on 14 August 2020, at 04:30</lastUpdated></item>
<item><title>List of social networking websites</title><url>https://en.wikipedia.org/wiki/Social_networks</url><lastUpdated> This page was last edited on 6 September 2020, at 23:58</lastUpdated></item>
<item><title>Six Degrees of Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon</url><lastUpdated> This page was last edited on 2 October 2020, at 20:10</lastUpdated></item>
<item><title>Screen Actors Guild Awards</title><url>https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award</url><lastUpdated> This page was last edited on 21 July 2020, at 00:07</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Guardian'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Guardian</url><lastUpdated> This page was last edited on 18 September 2020, at 16:08</lastUpdated></item>
<item><title>Hollywood Walk of Fame</title><url>https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame</url><lastUpdated> This page was last edited on 3 October 2020, at 12:56</lastUpdated></item>
<item><title>Academy Awards</title><url>https://en.wikipedia.org/wiki/Academy_Award</url><lastUpdated> This page was last edited on 1 October 2020, at 12:55</lastUpdated></item>
<item><title>Cannes Film Festival</title><url>https://en.wikipedia.org/wiki/Cannes_Film_Festival</url><lastUpdated> This page was last edited on 5 October 2020, at 12:51</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='Taking Chance'&gt;</value></title><url>https://en.wikipedia.org/wiki/Taking_Chance</url><lastUpdated> This page was last edited on 3 September 2020, at 14:05</lastUpdated></item>
<item><title>Alan Rickman</title><url>https://en.wikipedia.org/wiki/Alan_Rickman</url><lastUpdated> This page was last edited on 7 October 2020, at 00:12</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Following'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Following</url><lastUpdated> This page was last edited on 11 September 2020, at 16:17</lastUpdated></item>
<item><title>Main Page</title><url>https://en.wikipedia.org/wiki/Main_Page</url><lastUpdated> This page was last edited on 23 July 2020, at 12:44</lastUpdated></item>
<item><title>Fox Broadcasting Company</title><url>https://en.wikipedia.org/wiki/Fox_Broadcasting_Company</url><lastUpdated> This page was last edited on 6 October 2020, at 15:27</lastUpdated></item>
<item><title>Golden Globe Awards</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award</url><lastUpdated> This page was last edited on 8 September 2020, at 12:45</lastUpdated></item>
<item><title>WorldCat</title><url>https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)</url><lastUpdated> This page was last edited on 3 October 2020, at 11:46</lastUpdated></item>
<item><title>Virtual International Authority File</title><url>https://en.wikipedia.org/wiki/VIAF_(identifier)</url><lastUpdated> This page was last edited on 30 July 2020, at 18:19</lastUpdated></item>
<item><title>HBO</title><url>https://en.wikipedia.org/wiki/HBO</url><lastUpdated> This page was last edited on 7 October 2020, at 00:10</lastUpdated></item>
<item><title>Trove</title><url>https://en.wikipedia.org/wiki/Trove</url><lastUpdated> This page was last edited on 18 September 2020, at 03:04</lastUpdated></item>
<item><title>Système universitaire de documentation</title><url>https://en.wikipedia.org/wiki/SUDOC_(identifier)</url><lastUpdated> This page was last edited on 19 October 2019, at 13:42</lastUpdated></item>
</items><?xml version="1.0" encoding="utf-8"?>
<items>
<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_Bacon</url><lastUpdated>2020-09-19 00:35:00</lastUpdated></item>
<item><title> (TV series)</title><url>https://en.wikipedia.org/wiki/I_Love_Dick_(TV_series)</url><lastUpdated>2020-08-18 20:30:00</lastUpdated></item>
<item><title>SixDegrees.org</title><url>https://en.wikipedia.org/wiki/SixDegrees.org</url><lastUpdated>2020-03-20 11:35:00</lastUpdated></item>
<item><title>Golden Globe Award for Best Actor – Television Series Musical or Comedy</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy</url><lastUpdated>2020-08-14 04:30:00</lastUpdated></item>
<item><title>List of social networking websites</title><url>https://en.wikipedia.org/wiki/Social_networks</url><lastUpdated>2020-09-06 23:58:00</lastUpdated></item>
<item><title>Six Degrees of Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon</url><lastUpdated>2020-10-02 20:10:00</lastUpdated></item>
<item><title>Primetime Emmy Award</title><url>https://en.wikipedia.org/wiki/Primetime_Emmy_Award</url><lastUpdated>2020-09-22 10:27:00</lastUpdated></item>
<item><title>Academy Awards</title><url>https://en.wikipedia.org/wiki/Academy_Award</url><lastUpdated>2020-10-01 12:55:00</lastUpdated></item>
<item><title>Hollywood Walk of Fame</title><url>https://en.wikipedia.org/wiki/Hollywood_Walk_of_Fame</url><lastUpdated>2020-10-03 12:56:00</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Guardian'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Guardian</url><lastUpdated>2020-09-18 16:08:00</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='The Following'&gt;</value></title><url>https://en.wikipedia.org/wiki/The_Following</url><lastUpdated>2020-09-11 16:17:00</lastUpdated></item>
<item><title><value>&lt;Selector xpath='//h1/i/text()' data='Taking Chance'&gt;</value></title><url>https://en.wikipedia.org/wiki/Taking_Chance</url><lastUpdated>2020-09-03 14:05:00</lastUpdated></item>
<item><title>Screen Actors Guild Awards</title><url>https://en.wikipedia.org/wiki/Screen_Actors_Guild_Award</url><lastUpdated>2020-07-21 00:07:00</lastUpdated></item>
<item><title>Seattle International Film Festival</title><url>https://en.wikipedia.org/wiki/Seattle_International_Film_Festival</url><lastUpdated>2020-04-26 00:39:00</lastUpdated></item>
<item><title>Golden Globe Awards</title><url>https://en.wikipedia.org/wiki/Golden_Globe_Award</url><lastUpdated>2020-09-08 12:45:00</lastUpdated></item>
<item><title>Main Page</title><url>https://en.wikipedia.org/wiki/Main_Page</url><lastUpdated>2020-07-23 12:44:00</lastUpdated></item>
<item><title>WorldCat</title><url>https://en.wikipedia.org/wiki/WorldCat_Identities_(identifier)</url><lastUpdated>2020-10-03 11:46:00</lastUpdated></item>
<item><title>Fox Broadcasting Company</title><url>https://en.wikipedia.org/wiki/Fox_Broadcasting_Company</url><lastUpdated>2020-10-06 15:27:00</lastUpdated></item>
<item><title>Virtual International Authority File</title><url>https://en.wikipedia.org/wiki/VIAF_(identifier)</url><lastUpdated>2020-07-30 18:19:00</lastUpdated></item>
<item><title>Trove</title><url>https://en.wikipedia.org/wiki/Trove</url><lastUpdated>2020-09-18 03:04:00</lastUpdated></item>
<item><title>Système universitaire de documentation</title><url>https://en.wikipedia.org/wiki/SUDOC_(identifier)</url><lastUpdated>2019-10-19 13:42:00</lastUpdated></item>
<item><title>SNAC</title><url>https://en.wikipedia.org/wiki/SNAC</url><lastUpdated>2020-07-02 13:53:00</lastUpdated></item>
<item><title>HBO</title><url>https://en.wikipedia.org/wiki/HBO</url><lastUpdated>2020-10-07 00:10:00</lastUpdated></item>
</items>

================================================
FILE: 02_04_e/article_crawler/article_crawler/spiders/wikipedia.py
================================================
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from article_crawler.items import Article

class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    rules = [
        Rule(LinkExtractor(allow=r'wiki/((?!:).)*$'), callback='parse_info', follow=True)
    ]

    custom_settings={
        'FEED_URI': 'articles.xml',
        'FEED_FORMAT': 'xml'
    }

    def parse_info(self, response):
        article = Article()
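        # Some titles are wrapped in <i> inside the <h1>; fall back to that text.
        # Both branches call .get() so the field holds a string, not a SelectorList.
        article['title'] = response.xpath('//h1/text()').get() or response.xpath('//h1/i/text()').get()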
        article['url'] = response.url

        article['lastUpdated'] = response.xpath('//li[@id="footer-info-lastmod"]/text()').get()
        return article
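

# Illustrative sketch, not part of the course code: the second feed in
# articles.xml stores lastUpdated as values such as "2020-09-19 00:35:00",
# which suggests the footer text is parsed into a datetime at some point.
# One way that conversion could look is shown below. The name
# parse_last_edited and the exact format string are assumptions, and %B
# relies on English month names being available in the current locale.
from datetime import datetime


def parse_last_edited(text):
    # Drop surrounding whitespace and the fixed sentence prefix.
    cleaned = text.strip().replace('This page was last edited on ', '')
    # What remains looks like "19 September 2020, at 00:35".
    return datetime.strptime(cleaned, '%d %B %Y, at %H:%M')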


================================================
FILE: 02_04_e/article_crawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = article_crawler.settings

[deploy]
#url = http://localhost:6800/
project = article_crawler


================================================
FILE: 02_05/news_scraper/news_scraper/__init__.py
================================================


================================================
FILE: 02_05/news_scraper/news_scraper/items.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NewsArticle(scrapy.Item):
    url = scrapy.Field()
    source = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    date = scrapy.Field()
    author = scrapy.Field()
    text = scrapy.Field()


================================================
FILE: 02_05/news_scraper/news_scraper/middlewares.py
================================================
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class NewsScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class NewsScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


================================================
FILE: 02_05/news_scraper/news_scraper/pipelines.py
================================================
# -*- coding: utf-8 -*-
from datetime import datetime

class NewsScraperPipeline:
    def process_item(self, item, spider):
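        # Scrapy Items expose their fields through item['name'], not attributes.
        # The date arrives as an ISO 8601 string; keep only the part before 'T'.
        item['date'] = datetime.strptime(item['date'].split('T')[0], '%Y-%m-%d')
        item['author'] = item['author'].replace(', CNN', '')
        item['text'] = [text.strip() for text in item['text']]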
        return item
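

# Illustrative sketch, not part of the course code: the same cleaning steps
# applied to a plain dict, which makes them easy to try without running a
# crawl. The function name clean_article and the sample values in the usage
# note are hypothetical; the ISO 8601 shape of the date is an assumption
# about the value the spiders store in the 'date' field.
from datetime import datetime


def clean_article(item):
    cleaned = dict(item)  # work on a copy so the caller's dict is untouched
    # Keep only the calendar date that precedes the 'T' separator.
    cleaned['date'] = datetime.strptime(cleaned['date'].split('T')[0], '%Y-%m-%d')
    cleaned['author'] = cleaned['author'].replace(', CNN', '')
    cleaned['text'] = [text.strip() for text in cleaned['text']]
    return cleaned


# Usage (hypothetical values):
# clean_article({'date': '2020-08-28T10:00:00Z',
#                'author': 'Jane Doe, CNN',
#                'text': [' First paragraph. ']})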


================================================
FILE: 02_05/news_scraper/news_scraper/settings.py
================================================
# -*- coding: utf-8 -*-

# Scrapy settings for news_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'news_scraper'

SPIDER_MODULES = ['news_scraper.spiders']
NEWSPIDER_MODULE = 'news_scraper.spiders'

CLOSESPIDER_PAGECOUNT=10

FEED_URI='news_articles.json'
FEED_FORMAT='json'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'news_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'news_scraper.middlewares.NewsScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'news_scraper.middlewares.NewsScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'news_scraper.pipelines.NewsScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


================================================
FILE: 02_05/news_scraper/news_scraper/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: 02_05/news_scraper/news_scraper/spiders/associated_press.py
================================================
# -*- coding: utf-8 -*-
import json

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news_scraper.items import NewsArticle

class AssociatedPressSpider(CrawlSpider):
    name = 'associated_press'
    allowed_domains = ['apnews.com']
    start_urls = ['http://apnews.com/']
    rules = [Rule(LinkExtractor(allow=r'\/article\/[a-zA-Z\-]+\-[a-zA-Z0-9]{32}'), callback='parse_item', follow=True)]

    def parse_item(self, response):
        article = NewsArticle()
        article['url'] = response.url
        article['source'] = 'Associated Press'

        # Article metadata is embedded as JSON in <script data-rh="true">
        raw_json = response.xpath('//script[@data-rh="true"]/text()').get()
        if raw_json is None:
            # No metadata block on this page, so there is nothing to extract
            return None
        json_data = json.loads(raw_json)
        article['title'] = json_data['headline']
        article['description'] = json_data['description']
        article['date'] = json_data['datePublished']
        # 'author' is a list; keep only the first entry
        article['author'] = json_data['author'][0]
        article['text'] = response.xpath('//div[@class="Article"]/p/text()').getall()
        return article
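
The metadata lookup in parse_item can be made more tolerant of missing fields. The following is a minimal sketch using only the standard library, not the course's own code: the helper name extract_metadata and the sample payload are hypothetical, and the field names are the ones the spider reads (headline, description, datePublished, author).

```python
import json


def extract_metadata(raw_json):
    """Parse an embedded metadata blob; return None if it is missing or malformed."""
    if not raw_json:
        return None
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    authors = data.get('author')
    return {
        'title': data.get('headline'),
        'description': data.get('description'),
        'date': data.get('datePublished'),
        # Keep the first author only when the field is a non-empty list
        'author': authors[0] if isinstance(authors, list) and authors else None,
    }


# Hypothetical payload, made up for illustration (not real AP data)
sample = json.dumps({
    'headline': 'Example headline',
    'description': 'Example description',
    'datePublished': '2020-01-01T00:00:00Z',
    'author': ['A. Writer'],
})
print(extract_metadata(sample))
print(extract_metadata(None))
```

Returning None for a missing or malformed blob lets the caller skip the page instead of raising inside the callback.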


================================================
FILE: 02_05/news_scraper/news_scraper/spiders/cnn.py
================================================
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news_scraper.items import NewsArticle

class CnnSpider(CrawlSpider):
    name = 'cnn'
    allowed_domains = ['cnn.com']
    # Articles on the front page are dynamically loaded
    start_urls = ['https://www.cnn.com/africa']
    # /2020/08/28/weather/rapid-fire-disasters-in-coronavirus-pandemic-weir-wxc/index.html
    rules = [Rule(LinkExtractor(allow=r'\/2020\/[0-9][0-9]\/[0-9][0-9]\/[a-zA-Z\-]+\/[a-zA-Z\-]+\/index\.html'), callback='parse_item', follow=True)]
    
    def parse_item(self, response):
        article = NewsArticle()
        article['url'] = response.url
        article['source'] = 'CNN'
        article['title'] = response.xpath('//h1/text()').get()
        article['description'] = response.xpath('//meta[@name="description"]/@content').get()
        article['date'] = response.xpath('//meta[@itemprop="datePublished"]/@content').get()
        # Default to '' so a missing author tag does not raise AttributeError
        article['author'] = response.xpath('//meta[@itemprop="author"]/@content').get(default='').replace(', CNN', '')
        article['text'] = response.xpath('//section[@data-zone-label="bodyText"]/div[@class="l-container"]//*/text()').getall()
        return article
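
The allow pattern in the rule can be checked outside Scrapy before running a crawl. This is a rough sketch with the standard re module: it assumes the link extractor applies the pattern as a search over the full URL, and it writes the pattern with a literal `\.` before `html`. The helper name is_article_url is hypothetical.

```python
import re

# Same shape as the pattern in the CnnSpider rule: /2020/MM/DD/section/slug/index.html
ARTICLE_URL = re.compile(
    r'/2020/[0-9][0-9]/[0-9][0-9]/[a-zA-Z\-]+/[a-zA-Z\-]+/index\.html'
)


def is_article_url(url):
    """Return True if the URL contains a 2020 CNN article path."""
    return ARTICLE_URL.search(url) is not None


# Path taken from the comment above the rule
print(is_article_url(
    'https://www.cnn.com/2020/08/28/weather/'
    'rapid-fire-disasters-in-coronavirus-pandemic-weir-wxc/index.html'
))
# A section page such as the start URL should not match
print(is_article_url('https://www.cnn.com/africa'))
```

Because the year is hard-coded, articles from any other year are ignored by this pattern.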


================================================
FILE: 02_05/news_scraper/news_scraper/spiders/news_articles.json
================================================
[
{"url": "https://www.cnn.com/2020/09/29/africa/blasphemy-trial-nigeria/index.html", "source": "CNN", "title": "The WhatsApp voice note that led to a death sentence", "description": "A heated conversation in a WhatsApp group has led to a death penalty sentence and a family torn apart in northern Nigeria over allegations of insulting Prophet Mohammed. ", "date": "2020-09-29T09:51:49Z", "author": "Eoin McSweeney and Stephanie Busari", "text": [" (CNN)", "An intense argument recorded and posted in a WhatsApp group has led to a death penalty sentence and a family torn apart over allegations of insulting Prophet Mohammed, according to lawyers for the defendant. ", "Music studio assistant Yahaya Sharif-Aminu was sentenced to death by hanging on August 10 after being convicted of blasphemy by an Islamic court in northern Nigeria. ", "The judgment document states that Sharif-Aminu, 22, was convicted for making \"a blasphemous statement against Prophet Mohammed in a WhatsApp Group,\" which is contrary to the Kano State Sharia Penal Code and is an offence which carries the death sentence. ", "The recording was shared widely, causing mass outrage in the highly conservative, majority Muslim, state, according to various reports. ", "\"Whoever insults, defames or utters words or acts which are capable of bringing into disrespect ... such a person has committed a serious crime which is punishable by death,\" according to a translation of court documents provided to CNN by his lawyers. ", "Read More", "Sharif-Aminu, described by his friend Kabiru Ibrahim, as \"kind, religious and dutiful,\" admitted charges of blasphemy during his trial, but said he had made a mistake. ", "No legal representation", "Under Sharia law, a voluntary confession is binding, according to court papers. 
", "Sharif-Aminu's lawyers, who became involved in the case only after his conviction, say he was not allowed legal representation before or during his trial -- in contravention of Nigerian citizens' constitutional right to legal representation. ", "Outrage as Nigeria sentences 13-year-old boy to 10 years in prison for blasphemy", "According to the lawyers, the Sharia court adjourned his case four times because no lawyer came forth from the Legal Aid Council to represent him, likely because of the sensitivity of the case. The Sharia court is, however, statute-bound to provide legal representation.", "Advocates from the ", "Foundation for Religious Freedom", " (FRF), a not-for-profit aimed at protecting religious freedom in Nigeria, which is representing Sharif-Aminu, told CNN he has also not been permitted access to legal advice to prepare an appeal against his conviction. ", "The FRF says it has lodged an appeal on his behalf in Kano's high court, a common-law court with constitutional powers. ", "\"The state laws he is accused of breaking are in gross conflict with the Nigerian constitution,\" said his counsel, Kola Alapinni. ", "No Muslim will condone it. People hold Prophet Mohammed higher than their parents. ", "Islamic cleric, Bashir Aliyu Umar", "Kano's State Governor, Abdullahi Ganduje told clerics in Kano that he would sign Sharif-Aminu's death warrant as soon as the singer had exhausted the appeals process, local media reports say. ", "\"I assure you that immediately the Supreme Court affirms the judgment, I will sign it without any hesitation,\" Ganduje said, according to ", "Nigeria's Daily Post newspaper", ". CNN contacted a spokesman for Governor Ganduje several times for comment but did not receive a response. ", "Islamic scholar and cleric Bashir Aliyu Umar, who is not connected to the case, but said he had read the transcript of the court proceedings, told CNN, \"No Muslim will condone it. 
People hold Prophet Mohammed higher than their parents, and when things like this happen, it will lead to a breakdown of peace because of mob action and attacks against the accused.\" ", "When news of Sharif-Aminu's alleged crime broke earlier this year, protesters marched to his family home and destroyed it, prompting his father to flee to a neighboring town, his lawyers told CNN. Sharif-Aminu went into hiding, according to Amnesty and his lawyers, but in March he was arrested by the Hisbah Corps, the religious police force that enforces Sharia law in Kano state. ", "'A travesty of justice'", "Human rights organization Amnesty International has described Sharif-Aminu's trial as a \"travesty of justice,\" and called on Kano state authorities to quash his conviction and death sentence. ", "\"There are serious concerns about the fairness of his trial and the framing of the charges against him based on his Whatsapp messages,\" said Amnesty's Nigeria director Osai Ojigho. \"Furthermore, the imposition of the death penalty following an unfair trial violates the right to life,\" she added. ", "The United States Commission on International Religious Freedom (USCIRF) has also condemned Sharif-Aminu's death sentence. It said Nigeria's blasphemy laws were inconsistent with universal human rights standards. ", "\"It is unconscionable that Sharif-Aminu is facing a death sentence merely for expressing his beliefs artistically through music,\" said the organization's commissioner, Frederick A. Davie, in a statement. ", "The organization released a ", "follow-up statement", " saying it had adopted Aminu-Sharif as \"a religious prisoner of conscience.\"  ", "Atheism frowned upon ", "Nigeria is Africa's most populous nation and religion permeates every facet of life here, with prayers routinely said in schools and public offices. In addition to blasphemy, atheism is frowned upon by many in the majority Muslim north as well as in parts of the mostly Christian south. 
", "Human rights groups have expressed concern over a crackdown on freedom of speech and expression, particularly when it comes to religion. ", "On April 28 this year, Mubarak Bala, president of the Nigerian humanist association, was ", "arrested in Kaduna", ", another northern state, after allegedly posting a message on his Facebook page claiming that a Nigerian evangelical preacher was better than the Prophet Mohammed.  ", "'use strict';CNN.Videx = CNN.Videx || {};CNN.Videx.mobile = {};CNN.INJECTOR.executeFeature('video').then(function () {CNN.VideoPlayer.handleUnmutePlayer = function handleUnmutePlayer(containerId, dataObj) {'use strict';var playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector = 'unmute_' + containerId,isPlayerMute;dataObj = dataObj || {};if (CNN.VideoPlayer.getLibraryName(containerId) === 'fave') {playerInstance = FAVE.player.getInstance(containerId) || null;} else {playerInstance = containerId && window.cnnVideoManager.getPlayerByContainer(containerId).videoInstance.cvp || null;}isPlayerMute = (typeof dataObj.muted === 'boolean') ? 
dataObj.muted : false;if (CNN.VideoPlayer.playerProperties && CNN.VideoPlayer.playerProperties[containerId]) {playerPropertyObj = CNN.VideoPlayer.playerProperties[containerId];}if (playerPropertyObj.mute && playerPropertyObj.contentPlayed) {if (isPlayerMute === false) {unmuteCTA = jQuery(document.getElementById(unmuteIdSelector));playerInstance.unmute();if (unmuteCTA.length > 0) {unmuteCTA.removeClass('video__unmute--active').addClass('video__unmute--inactive');unmuteCTA.off('click');rememberTime = 0;if (rememberTime < 0) {rememberTime = 360 / 60;}CNN.Utils.storeLocalValue('unmute_africa', 'X', rememberTime);}} else {playerInstance.mute();}}};CNN.VideoPlayer.showFlashSlate = function showFlashSlate(container) {'use strict';var $vidEndSlate;$vidEndSlate = container.parent().find('.js-video__end-slate').eq(0);if ($vidEndSlate.length > 0) {$vidEndSlate.find('.l-container').html('<a href=\"https://get.adobe.com/flashplayer/\" target=\"_blank\"><div class=\"flash-slate\"></div></a>');$vidEndSlate.removeClass('video__end-slate--inactive').addClass('video__end-slate--active');}};CNN.autoPlayVideoExist = (CNN.autoPlayVideoExist === true) ? 
true : false;var configObj = {thumb: 'none',video: 'world/2019/10/08/iran-instagram-star-arrested-blasphemy-sot-mxp-vpx.hln',width: '100%',height: '100%',section: 'domestic',profile: 'expansion',network: 'cnn',markupId: 'body-text_39',theoplayer: {allowNativeFullscreen: true},adsection: 'const-article-inpage',frameWidth: '100%',frameHeight: '100%',posterImageOverride: {\"mini\":{\"width\":220,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-small-169.jpg\",\"height\":124},\"xsmall\":{\"width\":307,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-medium-plus-169.jpg\",\"height\":173},\"small\":{\"width\":460,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-large-169.jpg\",\"height\":259},\"medium\":{\"width\":780,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-exlarge-169.jpg\",\"height\":438},\"large\":{\"width\":1100,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-super-169.jpg\",\"height\":619},\"full16x9\":{\"width\":1600,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-full-169.jpg\",\"height\":900},\"mini1x1\":{\"width\":120,\"type\":\"jpg\",\"uri\":\"//cdn.cnn.com/cnnnext/dam/assets/191007143046-iran-blasphemy-instagram-small-11.jpg\",\"height\":120}}},autoStartVideo = false,isVideoReplayClicked = false,callbackObj,containerEl,currentVideoCollection = [],currentVideoCollectionId = '',isLivePlayer = false,mediaMetadataCallbacks,mobilePinnedView = null,moveToNextTimeout,mutePlayerEnabled = false,nextVideoId = '',nextVideoUrl = '',turnOnFlashMessaging = false,videoPinner,videoEndSlateImpl;if (CNN.autoPlayVideoExist === false) {autoStartVideo = false;if (autoStartVideo === true) {if (turnOnFlashMessaging === true) {autoStartVideo = false;containerEl = 
jQuery(document.getElementById(configObj.markupId));CNN.VideoPlayer.showFlashSlate(containerEl);} else {CNN.autoPlayVideoExist = true;}}}configObj.autostart = CNN.Features.enableAutoplayBlock ? false : autoStartVideo;CNN.VideoPlayer.setPlayerProperties(configObj.markupId, autoStartVideo, isLivePlayer, isVideoReplayClicked, mutePlayerEnabled);CNN.VideoPlayer.setFirstVideoInCollection(currentVideoCollection, configObj.markupId);videoEndSlateImpl = new CNN.VideoEndSlate('body-text_39');function findNextVideo(currentVideoId) {var i,vidObj;if (currentVideoId && jQuery.isArray(currentVideoCollection) && currentVideoCollection.length > 0) {for (i = 0; i < currentVideoCollection.length; i++) {vidObj = currentVideoCollection[i];if (typeof vidObj !== 'undefined' && vidObj.videoId === currentVideoId) {if (i < currentVideoCollection.length - 1) {nextVideoId = currentVideoCollection[i + 1].videoId;nextVideoUrl = currentVideoCollection[i + 1].videoUrl;} else {nextVideoId = currentVideoCollection[0].videoId;nextVideoUrl = currentVideoCollection[0].videoUrl;}break;}}if (!nextVideoUrl) {nextVideoId = currentVideoCollection[0].videoId;nextVideoUrl = currentVideoCollection[0].videoUrl;}currentVideoCollectionId = (window.jsmd && window.jsmd.v && window.jsmd.v.eVar60) || nextVideoUrl.replace(/^.+\\/video\\/playlists\\/(.+)\\//, '$1');} else {nextVideoId = '';nextVideoUrl = '';}}findNextVideo('world/2019/10/08/iran-instagram-star-arrested-blasphemy-sot-mxp-vpx.hln');function navigateToNextVideo(currentVideoId, containerId) {var $endSlate,nextVideoPlayTimeout = 1500;findNextVideo(currentVideoId);if (nextVideoUrl) {moveToNextTimeout = setTimeout(function () {location.href = nextVideoUrl;}, nextVideoPlayTimeout);} else {$endSlate = jQuery(document.getElementById(containerId)).parent().find('.js-video__end-slate').eq(0);if ($endSlate.length > 0) {videoEndSlateImpl.showEndSlateForContainer();if (mobilePinnedView) {mobilePinnedView.disable();}}}}callbackObj = {onPlayerReady: function 
(containerId) {var playerInstance,containerClassId = '#' + containerId;CNN.VideoPlayer.handleInitialExpandableVideoState(containerId);CNN.VideoPlayer.handleAdOnCVPVisibilityChange(containerId, CNN.pageVis.isDocumentVisible());if (CNN.Features.enableMobileWebFloatingPlayer &&Modernizr &&(Modernizr.phone || Modernizr.mobile || Modernizr.tablet) &&CNN.VideoPlayer.getLibraryName(containerId) === 'fave' &&jQuery(containerClassId).parents('.js-pg-rail-tall__head').length > 0 &&CNN.contentModel.pageType === 'article') {playerInstance = FAVE.player.getInstance(containerId);mobilePinnedView = new CNN.MobilePinnedView({element: jQuery(containerClassId),enabled: false,transition: CNN.MobileWebFloatingPlayer.transition,onPin: function () {playerInstance.hideUI();},onUnpin: function () {playerInstance.showUI();},onPlayerClick: functio

SYMBOL INDEX (401 symbols across 91 files)

FILE: 01_03/ietf_scraper/ietf_scraper/items.py
  class IetfScraperItem (line 11) | class IetfScraperItem(scrapy.Item):

FILE: 01_03/ietf_scraper/ietf_scraper/middlewares.py
  class IetfScraperSpiderMiddleware (line 11) | class IetfScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class IetfScraperDownloaderMiddleware (line 59) | class IetfScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 01_03/ietf_scraper/ietf_scraper/pipelines.py
  class IetfScraperPipeline (line 9) | class IetfScraperPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 01_03/ietf_scraper/ietf_scraper/spiders/ietf.py
  class IetfSpider (line 5) | class IetfSpider(scrapy.Spider):
    method parse (line 10) | def parse(self, response):

FILE: 01_04_b/ietf_scraper/ietf_scraper/items.py
  class IetfScraperItem (line 11) | class IetfScraperItem(scrapy.Item):

FILE: 01_04_b/ietf_scraper/ietf_scraper/middlewares.py
  class IetfScraperSpiderMiddleware (line 11) | class IetfScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class IetfScraperDownloaderMiddleware (line 59) | class IetfScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 01_04_b/ietf_scraper/ietf_scraper/pipelines.py
  class IetfScraperPipeline (line 9) | class IetfScraperPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 01_04_b/ietf_scraper/ietf_scraper/spiders/ietf.py
  class IetfSpider (line 5) | class IetfSpider(scrapy.Spider):
    method parse (line 10) | def parse(self, response):

FILE: 01_04_e/ietf_scraper/ietf_scraper/items.py
  class IetfScraperItem (line 11) | class IetfScraperItem(scrapy.Item):

FILE: 01_04_e/ietf_scraper/ietf_scraper/middlewares.py
  class IetfScraperSpiderMiddleware (line 11) | class IetfScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class IetfScraperDownloaderMiddleware (line 59) | class IetfScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 01_04_e/ietf_scraper/ietf_scraper/pipelines.py
  class IetfScraperPipeline (line 9) | class IetfScraperPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 01_04_e/ietf_scraper/ietf_scraper/spiders/ietf.py
  class IetfSpider (line 5) | class IetfSpider(scrapy.Spider):
    method parse (line 10) | def parse(self, response):

FILE: 02_01/article_scraper/article_scraper/items.py
  class ArticleScraperItem (line 11) | class ArticleScraperItem(scrapy.Item):

FILE: 02_01/article_scraper/article_scraper/middlewares.py
  class ArticleScraperSpiderMiddleware (line 11) | class ArticleScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleScraperDownloaderMiddleware (line 59) | class ArticleScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_01/article_scraper/article_scraper/pipelines.py
  class ArticleScraperPipeline (line 9) | class ArticleScraperPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 02_01/article_scraper/article_scraper/spiders/wikipedia.py
  class WikipediaSpider (line 6) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 12) | def parse_info(self, response):

FILE: 02_02_b/article_crawler/article_crawler/items.py
  class ArticleCrawlerItem (line 11) | class ArticleCrawlerItem(scrapy.Item):

FILE: 02_02_b/article_crawler/article_crawler/middlewares.py
  class ArticleCrawlerSpiderMiddleware (line 11) | class ArticleCrawlerSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleCrawlerDownloaderMiddleware (line 59) | class ArticleCrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_02_b/article_crawler/article_crawler/pipelines.py
  class ArticleCrawlerPipeline (line 9) | class ArticleCrawlerPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 02_02_b/article_crawler/article_crawler/spiders/wikipedia.py
  class WikipediaSpider (line 6) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 15) | def parse_info(self, response):

FILE: 02_02_e/article_crawler/article_crawler/items.py
  class Article (line 11) | class Article(scrapy.Item):

FILE: 02_02_e/article_crawler/article_crawler/middlewares.py
  class ArticleCrawlerSpiderMiddleware (line 11) | class ArticleCrawlerSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleCrawlerDownloaderMiddleware (line 59) | class ArticleCrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_02_e/article_crawler/article_crawler/pipelines.py
  class ArticleCrawlerPipeline (line 9) | class ArticleCrawlerPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 02_02_e/article_crawler/article_crawler/spiders/wikipedia.py
  class WikipediaSpider (line 7) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 16) | def parse_info(self, response):

FILE: 02_03_b/article_crawler/article_crawler/items.py
  class Article (line 11) | class Article(scrapy.Item):

FILE: 02_03_b/article_crawler/article_crawler/middlewares.py
  class ArticleCrawlerSpiderMiddleware (line 11) | class ArticleCrawlerSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleCrawlerDownloaderMiddleware (line 59) | class ArticleCrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_03_b/article_crawler/article_crawler/pipelines.py
  class ArticleCrawlerPipeline (line 9) | class ArticleCrawlerPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 02_03_b/article_crawler/article_crawler/spiders/wikipedia.py
  class WikipediaSpider (line 7) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 16) | def parse_info(self, response):

FILE: 02_03_e/article_crawler/article_crawler/items.py
  class Article (line 11) | class Article(scrapy.Item):

FILE: 02_03_e/article_crawler/article_crawler/middlewares.py
  class ArticleCrawlerSpiderMiddleware (line 11) | class ArticleCrawlerSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleCrawlerDownloaderMiddleware (line 59) | class ArticleCrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_03_e/article_crawler/article_crawler/pipelines.py
  class ArticleCrawlerPipeline (line 9) | class ArticleCrawlerPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 02_03_e/article_crawler/article_crawler/spiders/wikipedia.py
  class WikipediaSpider (line 7) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 21) | def parse_info(self, response):

FILE: 02_04_b/article_crawler/article_crawler/items.py
  class Article (line 11) | class Article(scrapy.Item):

FILE: 02_04_b/article_crawler/article_crawler/middlewares.py
  class ArticleCrawlerSpiderMiddleware (line 11) | class ArticleCrawlerSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleCrawlerDownloaderMiddleware (line 59) | class ArticleCrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_04_b/article_crawler/article_crawler/pipelines.py
  class ArticleCrawlerPipeline (line 9) | class ArticleCrawlerPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 02_04_b/article_crawler/article_crawler/spiders/wikipedia.py
  class WikipediaSpider (line 7) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 21) | def parse_info(self, response):

FILE: 02_04_e/article_crawler/article_crawler/items.py
  class Article (line 11) | class Article(scrapy.Item):

FILE: 02_04_e/article_crawler/article_crawler/middlewares.py
  class ArticleCrawlerSpiderMiddleware (line 11) | class ArticleCrawlerSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ArticleCrawlerDownloaderMiddleware (line 59) | class ArticleCrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_04_e/article_crawler/article_crawler/pipelines.py
  class CheckItemPipeline (line 11) | class CheckItemPipeline:
    method process_item (line 12) | def process_item(self, article, spider):
  class CleanDatePipeline (line 18) | class CleanDatePipeline:
    method process_item (line 19) | def process_item(self, article, spider):

FILE: 02_04_e/article_crawler/article_crawler/spiders/wikipedia.py
  class WikipediaSpider (line 7) | class WikipediaSpider(CrawlSpider):
    method parse_info (line 21) | def parse_info(self, response):

FILE: 02_05/news_scraper/news_scraper/items.py
  class NewsArticle (line 11) | class NewsArticle(scrapy.Item):

FILE: 02_05/news_scraper/news_scraper/middlewares.py
  class NewsScraperSpiderMiddleware (line 11) | class NewsScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class NewsScraperDownloaderMiddleware (line 59) | class NewsScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 02_05/news_scraper/news_scraper/pipelines.py
  class NewsScraperPipeline (line 4) | class NewsScraperPipeline:
    method process_item (line 5) | def process_item(self, item, spider):

FILE: 02_05/news_scraper/news_scraper/spiders/associated_press.py
  class AssociatedPressSpider (line 7) | class AssociatedPressSpider(CrawlSpider):
    method parse_item (line 13) | def parse_item(self, response):

FILE: 02_05/news_scraper/news_scraper/spiders/cnn.py
  class CnnSpider (line 6) | class CnnSpider(CrawlSpider):
    method parse_item (line 14) | def parse_item(self, response):

FILE: 02_05/news_scraper/news_scraper/spiders/yahoo.py
  class YahooSpider (line 7) | class YahooSpider(CrawlSpider):
    method parse_item (line 13) | def parse_item(self, response):

FILE: 03_01_b/form/form/items.py
  class FormItem (line 11) | class FormItem(scrapy.Item):

FILE: 03_01_b/form/form/middlewares.py
  class FormSpiderMiddleware (line 11) | class FormSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class FormDownloaderMiddleware (line 59) | class FormDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 03_01_b/form/form/pipelines.py
  class FormPipeline (line 9) | class FormPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 03_01_b/form/form/spiders/get_form.py
  class GetFormSpider (line 4) | class GetFormSpider(scrapy.Spider):
    method parse (line 9) | def parse(self, response):

FILE: 03_01_e/form/form/items.py
  class FormItem (line 11) | class FormItem(scrapy.Item):

FILE: 03_01_e/form/form/middlewares.py
  class FormSpiderMiddleware (line 11) | class FormSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class FormDownloaderMiddleware (line 59) | class FormDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 03_01_e/form/form/pipelines.py
  class FormPipeline (line 9) | class FormPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 03_01_e/form/form/spiders/get_form.py
  function generate_start_urls (line 4) | def generate_start_urls():
  class GetFormSpider (line 11) | class GetFormSpider(scrapy.Spider):
    method parse (line 16) | def parse(self, response):

FILE: 03_01_e/form/form/spiders/post_form.py
  class GetFormSpider (line 5) | class GetFormSpider(scrapy.Spider):
    method start_requests (line 9) | def start_requests(self):
    method parse (line 17) | def parse(self, response):

FILE: 03_03_b/news_scraper/news_scraper/items.py
  class NewsArticle (line 11) | class NewsArticle(scrapy.Item):

FILE: 03_03_b/news_scraper/news_scraper/middlewares.py
  class NewsScraperSpiderMiddleware (line 11) | class NewsScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class NewsScraperDownloaderMiddleware (line 59) | class NewsScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 03_03_b/news_scraper/news_scraper/pipelines.py
  class NewsScraperPipeline (line 4) | class NewsScraperPipeline:
    method process_item (line 5) | def process_item(self, item, spider):

FILE: 03_03_b/news_scraper/news_scraper/spiders/associated_press.py
  class AssociatedPressSpider (line 7) | class AssociatedPressSpider(CrawlSpider):
    method parse_item (line 13) | def parse_item(self, response):

FILE: 03_03_b/news_scraper/news_scraper/spiders/cnn.py
  class CnnSpider (line 6) | class CnnSpider(CrawlSpider):
    method parse_item (line 14) | def parse_item(self, response):

FILE: 03_03_b/news_scraper/news_scraper/spiders/yahoo.py
  class YahooSpider (line 7) | class YahooSpider(CrawlSpider):
    method parse_item (line 13) | def parse_item(self, response):

FILE: 03_03_e/news_scraper/news_scraper/items.py
  class NewsArticle (line 11) | class NewsArticle(scrapy.Item):

FILE: 03_03_e/news_scraper/news_scraper/middlewares.py
  class NewsScraperSpiderMiddleware (line 11) | class NewsScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class NewsScraperDownloaderMiddleware (line 59) | class NewsScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 03_03_e/news_scraper/news_scraper/pipelines.py
  class NewsScraperPipeline (line 4) | class NewsScraperPipeline:
    method process_item (line 5) | def process_item(self, item, spider):

FILE: 03_03_e/news_scraper/news_scraper/spiders/associated_press.py
  class AssociatedPressSpider (line 7) | class AssociatedPressSpider(CrawlSpider):
    method parse_item (line 13) | def parse_item(self, response):

FILE: 03_03_e/news_scraper/news_scraper/spiders/cnn.py
  class CnnSpider (line 6) | class CnnSpider(SitemapSpider):
    method parse (line 11) | def parse(self, response):

FILE: 03_03_e/news_scraper/news_scraper/spiders/yahoo.py
  class YahooSpider (line 7) | class YahooSpider(CrawlSpider):
    method parse_item (line 13) | def parse_item(self, response):

FILE: 03_04/news_scraper/news_scraper/items.py
  class NewsArticle (line 11) | class NewsArticle(scrapy.Item):

FILE: 03_04/news_scraper/news_scraper/middlewares.py
  class NewsScraperSpiderMiddleware (line 11) | class NewsScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class NewsScraperDownloaderMiddleware (line 59) | class NewsScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 03_04/news_scraper/news_scraper/pipelines.py
  class NewsScraperPipeline (line 9) | class NewsScraperPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 03_04/news_scraper/news_scraper/spiders/cnn.py
  function generate_start_urls (line 6) | def generate_start_urls():
  class CnnSpider (line 11) | class CnnSpider(CrawlSpider):
    method parse (line 16) | def parse(self, response):

FILE: 03_05/news_scraper/news_scraper/items.py
  class NewsArticle (line 11) | class NewsArticle(scrapy.Item):

FILE: 03_05/news_scraper/news_scraper/middlewares.py
  class NewsScraperSpiderMiddleware (line 11) | class NewsScraperSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class NewsScraperDownloaderMiddleware (line 59) | class NewsScraperDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 03_05/news_scraper/news_scraper/pipelines.py
  class NewsScraperPipeline (line 9) | class NewsScraperPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 03_05/news_scraper/news_scraper/spiders/cnn.py
  function generate_start_urls (line 6) | def generate_start_urls():
  class CnnSpider (line 11) | class CnnSpider(CrawlSpider):
    method parse (line 16) | def parse(self, response):

FILE: 04_01_b/profiles/profiles/items.py
  class ProfilesItem (line 11) | class ProfilesItem(scrapy.Item):

FILE: 04_01_b/profiles/profiles/middlewares.py
  class ProfilesSpiderMiddleware (line 11) | class ProfilesSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ProfilesDownloaderMiddleware (line 59) | class ProfilesDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 04_01_b/profiles/profiles/pipelines.py
  class ProfilesPipeline (line 9) | class ProfilesPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 04_01_b/profiles/profiles/spiders/pythonscraping.py
  class PythonscrapingSpider (line 5) | class PythonscrapingSpider(scrapy.Spider):
    method parse (line 11) | def parse(self, response):

FILE: 04_01_e/profiles/profiles/items.py
  class ProfilesItem (line 11) | class ProfilesItem(scrapy.Item):

FILE: 04_01_e/profiles/profiles/middlewares.py
  class ProfilesSpiderMiddleware (line 11) | class ProfilesSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class ProfilesDownloaderMiddleware (line 59) | class ProfilesDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 04_01_e/profiles/profiles/pipelines.py
  class ProfilesPipeline (line 9) | class ProfilesPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 04_01_e/profiles/profiles/spiders/pythonscraping.py
  class PythonscrapingSpider (line 5) | class PythonscrapingSpider(scrapy.Spider):
    method make_requests_from_url (line 11) | def make_requests_from_url(self, url):
    method parse (line 17) | def parse(self, response):

FILE: 04_02_b/locations/locations/items.py
  class LocationsItem (line 11) | class LocationsItem(scrapy.Item):

FILE: 04_02_b/locations/locations/middlewares.py
  class LocationsSpiderMiddleware (line 11) | class LocationsSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class LocationsDownloaderMiddleware (line 59) | class LocationsDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 04_02_b/locations/locations/pipelines.py
  class LocationsPipeline (line 9) | class LocationsPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 04_02_b/locations/locations/spiders/dunkin.py
  class DunkinSpider (line 4) | class DunkinSpider(scrapy.Spider):
    method parse (line 9) | def parse(self, response):

FILE: 04_02_e/locations/locations/items.py
  class LocationsItem (line 11) | class LocationsItem(scrapy.Item):

FILE: 04_02_e/locations/locations/middlewares.py
  class LocationsSpiderMiddleware (line 11) | class LocationsSpiderMiddleware:
    method from_crawler (line 17) | def from_crawler(cls, crawler):
    method process_spider_input (line 23) | def process_spider_input(self, response, spider):
    method process_spider_output (line 30) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 38) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class LocationsDownloaderMiddleware (line 59) | class LocationsDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: 04_02_e/locations/locations/pipelines.py
  class LocationsPipeline (line 9) | class LocationsPipeline:
    method process_item (line 10) | def process_item(self, item, spider):

FILE: 04_02_e/locations/locations/spiders/dunkin.py
  function wait (line 6) | def wait(driver):
  class DunkinSpider (line 10) | class DunkinSpider(scrapy.Spider):
    method make_requests_from_url (line 16) | def make_requests_from_url(self, url):
    method parse (line 19) | def parse(self, response):

Condensed preview: 202 files, each showing path, character count, and a content snippet (3,074K chars in the full structured content).

[
  {
    "path": ".github/CODEOWNERS",
    "chars": 120,
    "preview": "# Codeowners for these exercise files:\n# * (asterisk) deotes \"all files and folders\"\n# Example: * @producer @instructor\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE.md",
    "chars": 1032,
    "preview": "<!--\nBEFORE POSTING YOUR ISSUE:\n- These comments won't show up when you submit the issue.\n- Please use the sections belo"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "chars": 139,
    "preview": "<!-- This repository *does not* accept pull requests (PRs). All pull requests will be closed. See CONTRIBUTING.md for fu"
  },
  {
    "path": ".github/workflows/main.yml",
    "chars": 272,
    "preview": "name: Copy To Branches\non:\n  workflow_dispatch:\njobs:\n  copy-to-branches:\n    runs-on: ubuntu-latest\n    steps:\n      - "
  },
  {
    "path": ".gitignore",
    "chars": 42,
    "preview": ".DS_Store\nnode_modules\n.tmp\nnpm-debug.log\n"
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/items.py",
    "chars": 292,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/pipelines.py",
    "chars": 285,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/settings.py",
    "chars": 3145,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for ietf_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "01_03/ietf_scraper/ietf_scraper/spiders/ietf.py",
    "chars": 314,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\n\nclass IetfSpider(scrapy.Spider):\n    name = 'ietf'\n    allowed_domains = ['pytho"
  },
  {
    "path": "01_03/ietf_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/items.py",
    "chars": 292,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/pipelines.py",
    "chars": 285,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/settings.py",
    "chars": 3145,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for ietf_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "01_04_b/ietf_scraper/ietf_scraper/spiders/ietf.py",
    "chars": 392,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\n\nclass IetfSpider(scrapy.Spider):\n    name = 'ietf'\n    allowed_domains = ['pytho"
  },
  {
    "path": "01_04_b/ietf_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/items.py",
    "chars": 292,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/pipelines.py",
    "chars": 285,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/settings.py",
    "chars": 3145,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for ietf_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "01_04_e/ietf_scraper/ietf_scraper/spiders/ietf.py",
    "chars": 1287,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nimport w3lib.html\n\nclass IetfSpider(scrapy.Spider):\n    name = 'ietf'\n    allowed_"
  },
  {
    "path": "01_04_e/ietf_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_01/article_scraper/article_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_01/article_scraper/article_scraper/items.py",
    "chars": 295,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_01/article_scraper/article_scraper/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_01/article_scraper/article_scraper/pipelines.py",
    "chars": 288,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_01/article_scraper/article_scraper/settings.py",
    "chars": 3178,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_scraper project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_01/article_scraper/article_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_01/article_scraper/article_scraper/spiders/wikipedia.py",
    "chars": 679,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_01/article_scraper/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/items.py",
    "chars": 295,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/pipelines.py",
    "chars": 288,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/settings.py",
    "chars": 3178,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_crawler project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_02_b/article_crawler/article_crawler/spiders/wikipedia.py",
    "chars": 688,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_02_b/article_crawler/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/items.py",
    "chars": 284,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/pipelines.py",
    "chars": 288,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/settings.py",
    "chars": 3178,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_crawler project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/spiders/articles.csv",
    "chars": 2588,
    "preview": "lastUpdated,title,url\r\n\" This page was last edited on 19 September 2020, at 00:35\",Kevin Bacon,https://en.wikipedia.org/"
  },
  {
    "path": "02_02_e/article_crawler/article_crawler/spiders/wikipedia.py",
    "chars": 770,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_02_e/article_crawler/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/items.py",
    "chars": 284,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/pipelines.py",
    "chars": 288,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/settings.py",
    "chars": 3178,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_crawler project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/spiders/articles.csv",
    "chars": 2588,
    "preview": "lastUpdated,title,url\r\n\" This page was last edited on 19 September 2020, at 00:35\",Kevin Bacon,https://en.wikipedia.org/"
  },
  {
    "path": "02_03_b/article_crawler/article_crawler/spiders/wikipedia.py",
    "chars": 770,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_03_b/article_crawler/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/items.py",
    "chars": 284,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/pipelines.py",
    "chars": 288,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/settings.py",
    "chars": 3249,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_crawler project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/spiders/articles.csv",
    "chars": 13897,
    "preview": "lastUpdated,title,url\r\n\" This page was last edited on 19 September 2020, at 00:35\",Kevin Bacon,https://en.wikipedia.org/"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/spiders/articles.json",
    "chars": 2923,
    "preview": "[\n{\"title\": \"Kevin Bacon\", \"url\": \"https://en.wikipedia.org/wiki/Kevin_Bacon\", \"lastUpdated\": \" This page was last edite"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/spiders/articles.xml",
    "chars": 4632,
    "preview": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<items>\n<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_"
  },
  {
    "path": "02_03_e/article_crawler/article_crawler/spiders/wikipedia.py",
    "chars": 864,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_03_e/article_crawler/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/items.py",
    "chars": 284,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/pipelines.py",
    "chars": 288,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/settings.py",
    "chars": 3249,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_crawler project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/spiders/articles.csv",
    "chars": 13897,
    "preview": "lastUpdated,title,url\r\n\" This page was last edited on 19 September 2020, at 00:35\",Kevin Bacon,https://en.wikipedia.org/"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/spiders/articles.json",
    "chars": 2923,
    "preview": "[\n{\"title\": \"Kevin Bacon\", \"url\": \"https://en.wikipedia.org/wiki/Kevin_Bacon\", \"lastUpdated\": \" This page was last edite"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/spiders/articles.xml",
    "chars": 4632,
    "preview": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<items>\n<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_"
  },
  {
    "path": "02_04_b/article_crawler/article_crawler/spiders/wikipedia.py",
    "chars": 864,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_04_b/article_crawler/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/items.py",
    "chars": 284,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/middlewares.py",
    "chars": 3595,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/pipelines.py",
    "chars": 767,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/settings.py",
    "chars": 3298,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for article_crawler project\n#\n# For simplicity, this file contains only setti"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/spiders/articles.csv",
    "chars": 13897,
    "preview": "lastUpdated,title,url\r\n\" This page was last edited on 19 September 2020, at 00:35\",Kevin Bacon,https://en.wikipedia.org/"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/spiders/articles.json",
    "chars": 2923,
    "preview": "[\n{\"title\": \"Kevin Bacon\", \"url\": \"https://en.wikipedia.org/wiki/Kevin_Bacon\", \"lastUpdated\": \" This page was last edite"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/spiders/articles.xml",
    "chars": 8458,
    "preview": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<items>\n<item><title>Kevin Bacon</title><url>https://en.wikipedia.org/wiki/Kevin_"
  },
  {
    "path": "02_04_e/article_crawler/article_crawler/spiders/wikipedia.py",
    "chars": 864,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import Lin"
  },
  {
    "path": "02_04_e/article_crawler/scrapy.cfg",
    "chars": 273,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "02_05/news_scraper/news_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "02_05/news_scraper/news_scraper/items.py",
    "chars": 395,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "02_05/news_scraper/news_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "02_05/news_scraper/news_scraper/pipelines.py",
    "chars": 331,
    "preview": "# -*- coding: utf-8 -*-\nfrom datetime import datetime\n\nclass NewsScraperPipeline:\n    def process_item(self, item, spide"
  },
  {
    "path": "02_05/news_scraper/news_scraper/settings.py",
    "chars": 3221,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for news_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "02_05/news_scraper/news_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "02_05/news_scraper/news_scraper/spiders/associated_press.py",
    "chars": 1036,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "02_05/news_scraper/news_scraper/spiders/cnn.py",
    "chars": 1275,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "02_05/news_scraper/news_scraper/spiders/news_articles.json",
    "chars": 848713,
    "preview": "[\n{\"url\": \"https://www.cnn.com/2020/09/29/africa/blasphemy-trial-nigeria/index.html\", \"source\": \"CNN\", \"title\": \"The Wha"
  },
  {
    "path": "02_05/news_scraper/news_scraper/spiders/yahoo.py",
    "chars": 1022,
    "preview": "# -*- coding: utf-8 -*-\nimport json\nfrom news_scraper.items import NewsArticle\nfrom scrapy.spiders import CrawlSpider, R"
  },
  {
    "path": "02_05/news_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "03_01_b/form/form/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "03_01_b/form/form/items.py",
    "chars": 285,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "03_01_b/form/form/middlewares.py",
    "chars": 3575,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "03_01_b/form/form/pipelines.py",
    "chars": 278,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "03_01_b/form/form/settings.py",
    "chars": 3060,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for form project\n#\n# For simplicity, this file contains only settings conside"
  },
  {
    "path": "03_01_b/form/form/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "03_01_b/form/form/spiders/get_form.py",
    "chars": 235,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\nclass GetFormSpider(scrapy.Spider):\n    name = 'get_form'\n    allowed_domains = ["
  },
  {
    "path": "03_01_b/form/scrapy.cfg",
    "chars": 251,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "03_01_e/form/form/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "03_01_e/form/form/items.py",
    "chars": 285,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "03_01_e/form/form/middlewares.py",
    "chars": 3575,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "03_01_e/form/form/pipelines.py",
    "chars": 278,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "03_01_e/form/form/settings.py",
    "chars": 3060,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for form project\n#\n# For simplicity, this file contains only settings conside"
  },
  {
    "path": "03_01_e/form/form/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "03_01_e/form/form/spiders/get_form.py",
    "chars": 627,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\ndef generate_start_urls():\n    names = ['Alice', 'Bob', 'Charles']\n    quests = ["
  },
  {
    "path": "03_01_e/form/form/spiders/post_form.py",
    "chars": 683,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nfrom scrapy.http import FormRequest\n\nclass GetFormSpider(scrapy.Spider):\n    name "
  },
  {
    "path": "03_01_e/form/scrapy.cfg",
    "chars": 251,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/items.py",
    "chars": 395,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/pipelines.py",
    "chars": 331,
    "preview": "# -*- coding: utf-8 -*-\nfrom datetime import datetime\n\nclass NewsScraperPipeline:\n    def process_item(self, item, spide"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/settings.py",
    "chars": 3221,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for news_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/spiders/associated_press.py",
    "chars": 1036,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/spiders/cnn.py",
    "chars": 1275,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/spiders/news_articles.json",
    "chars": 848713,
    "preview": "[\n{\"url\": \"https://www.cnn.com/2020/09/29/africa/blasphemy-trial-nigeria/index.html\", \"source\": \"CNN\", \"title\": \"The Wha"
  },
  {
    "path": "03_03_b/news_scraper/news_scraper/spiders/yahoo.py",
    "chars": 1022,
    "preview": "# -*- coding: utf-8 -*-\nimport json\nfrom news_scraper.items import NewsArticle\nfrom scrapy.spiders import CrawlSpider, R"
  },
  {
    "path": "03_03_b/news_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/items.py",
    "chars": 395,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/pipelines.py",
    "chars": 331,
    "preview": "# -*- coding: utf-8 -*-\nfrom datetime import datetime\n\nclass NewsScraperPipeline:\n    def process_item(self, item, spide"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/settings.py",
    "chars": 3221,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for news_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/spiders/associated_press.py",
    "chars": 1036,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/spiders/cnn.py",
    "chars": 1011,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule, SitemapSpider\nfrom scrapy.linkextractors import Li"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/spiders/news_articles.json",
    "chars": 988312,
    "preview": "[\n{\"url\": \"https://www.cnn.com/2020/09/29/africa/blasphemy-trial-nigeria/index.html\", \"source\": \"CNN\", \"title\": \"The Wha"
  },
  {
    "path": "03_03_e/news_scraper/news_scraper/spiders/yahoo.py",
    "chars": 1022,
    "preview": "# -*- coding: utf-8 -*-\nimport json\nfrom news_scraper.items import NewsArticle\nfrom scrapy.spiders import CrawlSpider, R"
  },
  {
    "path": "03_03_e/news_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "03_04/news_scraper/news_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "03_04/news_scraper/news_scraper/items.py",
    "chars": 395,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "03_04/news_scraper/news_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "03_04/news_scraper/news_scraper/pipelines.py",
    "chars": 363,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "03_04/news_scraper/news_scraper/settings.py",
    "chars": 3146,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for news_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "03_04/news_scraper/news_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "03_04/news_scraper/news_scraper/spiders/cnn.py",
    "chars": 733,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "03_04/news_scraper/news_scraper/spiders/counts.csv",
    "chars": 6603,
    "preview": "count\r\n69\r\n82\r\nurl,count\r\nhttps://www.cnn.com/sitemaps/article-2011-01.xml,82\r\nhttps://www.cnn.com/sitemaps/article-2011"
  },
  {
    "path": "03_04/news_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "03_05/news_scraper/news_scraper/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "03_05/news_scraper/news_scraper/items.py",
    "chars": 395,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "03_05/news_scraper/news_scraper/middlewares.py",
    "chars": 3589,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "03_05/news_scraper/news_scraper/pipelines.py",
    "chars": 363,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "03_05/news_scraper/news_scraper/settings.py",
    "chars": 3146,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for news_scraper project\n#\n# For simplicity, this file contains only settings"
  },
  {
    "path": "03_05/news_scraper/news_scraper/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "03_05/news_scraper/news_scraper/spiders/cnn.py",
    "chars": 733,
    "preview": "# -*- coding: utf-8 -*-\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfro"
  },
  {
    "path": "03_05/news_scraper/news_scraper/spiders/counts.csv",
    "chars": 6603,
    "preview": "count\r\n69\r\n82\r\nurl,count\r\nhttps://www.cnn.com/sitemaps/article-2011-01.xml,82\r\nhttps://www.cnn.com/sitemaps/article-2011"
  },
  {
    "path": "03_05/news_scraper/scrapy.cfg",
    "chars": 267,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "04_01_b/profiles/profiles/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "04_01_b/profiles/profiles/items.py",
    "chars": 289,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "04_01_b/profiles/profiles/middlewares.py",
    "chars": 3583,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "04_01_b/profiles/profiles/pipelines.py",
    "chars": 282,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "04_01_b/profiles/profiles/settings.py",
    "chars": 3104,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for profiles project\n#\n# For simplicity, this file contains only settings con"
  },
  {
    "path": "04_01_b/profiles/profiles/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "04_01_b/profiles/profiles/spiders/pythonscraping.py",
    "chars": 330,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\n\nclass PythonscrapingSpider(scrapy.Spider):\n    name = 'pythonscraping'\n    allow"
  },
  {
    "path": "04_01_b/profiles/scrapy.cfg",
    "chars": 259,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "04_01_e/profiles/profiles/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "04_01_e/profiles/profiles/items.py",
    "chars": 289,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "04_01_e/profiles/profiles/middlewares.py",
    "chars": 3583,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "04_01_e/profiles/profiles/pipelines.py",
    "chars": 282,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "04_01_e/profiles/profiles/settings.py",
    "chars": 3104,
    "preview": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for profiles project\n#\n# For simplicity, this file contains only settings con"
  },
  {
    "path": "04_01_e/profiles/profiles/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "04_01_e/profiles/profiles/spiders/pythonscraping.py",
    "chars": 565,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\n\nclass PythonscrapingSpider(scrapy.Spider):\n    name = 'pythonscraping'\n    allow"
  },
  {
    "path": "04_01_e/profiles/scrapy.cfg",
    "chars": 259,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "04_02_b/locations/locations/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "04_02_b/locations/locations/items.py",
    "chars": 290,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "04_02_b/locations/locations/middlewares.py",
    "chars": 3585,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "04_02_b/locations/locations/pipelines.py",
    "chars": 283,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "04_02_b/locations/locations/settings.py",
    "chars": 3322,
    "preview": "# -*- coding: utf-8 -*-\nimport os\n\nSELENIUM_DRIVER_NAME = 'chrome'\nSELENIUM_DRIVER_EXECUTABLE_PATH = '/Users/rspecht/chr"
  },
  {
    "path": "04_02_b/locations/locations/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "04_02_b/locations/locations/spiders/dunkin.py",
    "chars": 356,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\n\nclass DunkinSpider(scrapy.Spider):\n    name = 'dunkin'\n    allowed_domains = ['du"
  },
  {
    "path": "04_02_b/locations/scrapy.cfg",
    "chars": 261,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "04_02_e/locations/locations/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "04_02_e/locations/locations/items.py",
    "chars": 290,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# https://docs.scrapy"
  },
  {
    "path": "04_02_e/locations/locations/middlewares.py",
    "chars": 3585,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://docs.sc"
  },
  {
    "path": "04_02_e/locations/locations/pipelines.py",
    "chars": 283,
    "preview": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES s"
  },
  {
    "path": "04_02_e/locations/locations/settings.py",
    "chars": 3313,
    "preview": "# -*- coding: utf-8 -*-\nSELENIUM_DRIVER_NAME = 'chrome'\nSELENIUM_DRIVER_EXECUTABLE_PATH = '/Users/rmitchell/chromedriver"
  },
  {
    "path": "04_02_e/locations/locations/spiders/__init__.py",
    "chars": 161,
    "preview": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on "
  },
  {
    "path": "04_02_e/locations/locations/spiders/dunkin.py",
    "chars": 581,
    "preview": "# -*- coding: utf-8 -*-\nimport scrapy\nimport time\nfrom scrapy_selenium import SeleniumRequest\n\ndef wait(driver):\n    tim"
  },
  {
    "path": "04_02_e/locations/scrapy.cfg",
    "chars": 261,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 635,
    "preview": "\nContribution Agreement\n======================\n\nThis repository does not accept pull requests (PRs). All pull requests w"
  },
  {
    "path": "LICENSE",
    "chars": 6626,
    "preview": "LinkedIn Learning Exercise Files License Agreement\n==================================================\n\nThis License Agre"
  },
  {
    "path": "NOTICE",
    "chars": 2421,
    "preview": "Copyright 2020 LinkedIn Corporation\nAll Rights Reserved.\n\nLicensed under the LinkedIn Learning Exercise File License (th"
  },
  {
    "path": "README.md",
    "chars": 3180,
    "preview": "# Web Scraping with Python\nThis is the repository for the LinkedIn Learning course Web Scraping with Python. The full co"
  }
]

// ... and 2 more files

About this extraction

This page contains the full source code of the LinkedInLearning/web-scraping-with-python-2848331 GitHub repository, extracted and formatted as plain text. The extraction includes 202 files (32.9 MB), approximately 753.8k tokens, and a symbol index with 401 extracted functions, classes, methods, constants, and types.
