Repository: alirezamika/autoscraper
Branch: master
Commit: eb72f5dc6f7f
Files: 20
Total size: 47.7 KB

Directory structure:
gitextract_i4lmlmqj/
├── .github/
│   ├── FUNDING.yml
│   └── workflows/
│       ├── python-publish.yml
│       ├── stale-issues.yml
│       └── tests.yml
├── .gitignore
├── LICENSE
├── README.md
├── autoscraper/
│   ├── __init__.py
│   ├── auto_scraper.py
│   └── utils.py
├── setup.py
└── tests/
    ├── __init__.py
    ├── conftest.py
    ├── integration/
    │   ├── __init__.py
    │   ├── test_complex_features.py
    │   └── test_real_world.py
    └── unit/
        ├── __init__.py
        ├── test_additional_features.py
        ├── test_build.py
        └── test_features.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms

github: [alirezamika] # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
otechie: # Replace with a single Otechie username
custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']


================================================
FILE: .github/workflows/python-publish.yml
================================================
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine pytest
        pip install .
    - name: Run tests
      run: |
        pytest -q
    - name: Build and publish
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*


================================================
FILE: .github/workflows/stale-issues.yml
================================================
name: Close inactive issues
on:
  schedule:
    - cron: "30 1 * * *"

jobs:
  close-issues:
    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write
    steps:
      - uses: actions/stale@v5
        with:
          days-before-issue-stale: 30
          days-before-issue-close: 14
          stale-issue-label: "stale"
          stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
          close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
          days-before-pr-stale: 30
          days-before-pr-close: 14
          repo-token: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .github/workflows/tests.yml
================================================
name: Run Tests

on:
  push:
  release:
    types: [created]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest
        pip install .
    - name: Run tests
      run: pytest -q


================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
.idea/
.vscode/

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Alireza Mika

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python

![img](https://user-images.githubusercontent.com/17881612/91968083-5ee92080-ed29-11ea-82ec-d99ec85367a5.png)

AutoScraper makes web scraping easy by learning the scraping rules for you.
It takes a URL or the HTML content of a web page, plus a list of sample data that you want to scrape from that page. **This data can be text, a URL, or any HTML tag value on that page.** It learns the scraping rules and returns the similar elements. You can then use the learned object with new URLs to get similar content, or the exact same element, from those pages.


## Installation

It's compatible with Python 3.

- Install the latest version from the git repository using pip:
```bash
$ pip install git+https://github.com/alirezamika/autoscraper.git
```

- Install from PyPI:
```bash
$ pip install autoscraper
```

- Install from source:
```bash
$ python setup.py install
```

## How to use

### Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

```python
from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
```

Here's the output:
```python
[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]
```
Now you can use the `scraper` object to get the related topics of any Stack Overflow page:
```python
scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')
```
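
The comment in the first example notes that URLs can go in `wanted_list` too. Judging from `_child_has_text` in `autoscraper/auto_scraper.py`, `href` and `src` values are resolved against the page URL with `urljoin` before they are compared, so an absolute URL can match a relative link. The sketch below reproduces only that resolution step with the standard library. The helper name `matches_wanted_url` and the second sample link are hypothetical, and the regex and fuzzy matching that the library also supports are left out.

```python
from urllib.parse import urljoin


def matches_wanted_url(page_url, attr_value, wanted):
    """Return True if a raw or resolved href/src value equals the wanted URL.

    Hypothetical helper: a simplified stand-in for the exact-match case only.
    """
    value = attr_value.strip()
    if value == wanted:
        return True
    # Resolve relative links against the page URL before comparing.
    return urljoin(page_url, value) == wanted


page_url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"
wanted = "https://stackoverflow.com/questions/606191/convert-bytes-to-a-string"

# A relative link that resolves to the wanted URL matches.
print(matches_wanted_url(page_url, "/questions/606191/convert-bytes-to-a-string", wanted))
# A relative link that resolves elsewhere does not.
print(matches_wanted_url(page_url, "/questions/100003/some-other-question", wanted))
```

In practice this means you can usually copy the full link from your browser into `wanted_list`, even when the page source uses a relative path.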

### Getting exact result

Say we want to scrape live stock prices from Yahoo Finance:

```python
from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)
```
Note that you should update the `wanted_list` if you want to copy this code, as the content of the page dynamically changes.

You can also pass any custom `requests` module parameter. For example, you may want to use proxies or custom headers:

```python
proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
```
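
Custom headers go through `request_args` as well. Reading `_fetch_html` in `autoscraper/auto_scraper.py`, the merge order appears to be: the default headers, then a `Host` header derived from the URL, then any user-supplied `headers`, which win on conflict. The following stand-alone sketch mirrors that order under those assumptions. `merged_headers` and `DEFAULT_HEADERS` are hypothetical names, and the default User-Agent here is a placeholder rather than the string the library ships.

```python
from urllib.parse import urlparse

# Placeholder default; the library defines its own browser-like User-Agent.
DEFAULT_HEADERS = {"User-Agent": "example-agent/1.0"}


def merged_headers(url, request_args=None):
    """Hypothetical helper: defaults first, then Host, then user headers."""
    request_args = dict(request_args or {})  # copy so the caller's dict is untouched
    headers = dict(DEFAULT_HEADERS)
    if url:
        headers["Host"] = urlparse(url).netloc
    # User-supplied headers override the defaults on conflict.
    headers.update(request_args.pop("headers", {}))
    # Whatever is left would be forwarded to requests.get as keyword arguments.
    return headers, request_args


headers, remaining = merged_headers(
    "https://finance.yahoo.com/quote/AAPL/",
    {"headers": {"User-Agent": "my-bot/0.1"}, "timeout": 10},
)
print(headers)
print(remaining)
```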

Now we can get the price of any symbol:

```python
scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')
```

**You may want to get other info as well.** For example, if you want to get the market cap too, you can just append it to the wanted list. The `get_result_exact` method retrieves the data in the exact same order as the wanted list.

**Another example:** Say we want to scrape the about text, the number of stars, and the link to the issues of GitHub repo pages:

```python
from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)
```

Simple, right?


### Saving the model

We can now save the built model to use it later. To save:

```python
# Give it a file path
scraper.save('yahoo-finance')
```

And to load:

```python
scraper.load('yahoo-finance')
```
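
The saved file is plain JSON. Based on `save` and `load` in `autoscraper/auto_scraper.py`, the learned rules are stored under a `stack_list` key, and `load` also accepts an older layout in which the file holds the bare list. The sketch below imitates that reading logic with the standard library only. `load_stack_list` is a hypothetical helper, and the rule in `rules` is made up for illustration, since real rules are produced by `build()`.

```python
import json
import os
import tempfile


def load_stack_list(file_path):
    """Hypothetical stand-alone reader for a saved model file."""
    with open(file_path, "r") as f:
        data = json.load(f)
    # Older files stored the bare list; newer ones wrap it in a dict.
    if isinstance(data, list):
        return data
    return data["stack_list"]


# Made-up rule for illustration only.
rules = [{"stack_id": "rule_example", "alias": "", "wanted_attr": None}]

with tempfile.TemporaryDirectory() as tmp:
    new_style = os.path.join(tmp, "new.json")
    old_style = os.path.join(tmp, "old.json")
    with open(new_style, "w") as f:
        json.dump({"stack_list": rules}, f)
    with open(old_style, "w") as f:
        json.dump(rules, f)

    loaded_new = load_stack_list(new_style)
    loaded_old = load_stack_list(old_style)

# Both layouts yield the same list of rules.
print(loaded_new == loaded_old == rules)
```

Because the format is JSON, a saved model can be inspected or versioned like any other text file.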

## Tutorials

- See [this gist](https://gist.github.com/alirezamika/72083221891eecd991bbc0a2a2467673) for more advanced usage.
- [AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes](https://medium.com/better-programming/autoscraper-and-flask-create-an-api-from-any-website-in-less-than-5-minutes-3f0f176fc4a3)

## Issues
Feel free to open an issue if you have any problem using the module.


## Support the project

<a href="https://www.buymeacoffee.com/alirezam" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-black.png" alt="Buy Me A Coffee" height="45" width="163" ></a>


#### Happy Coding  ♥️


================================================
FILE: autoscraper/__init__.py
================================================
from autoscraper.auto_scraper import AutoScraper


================================================
FILE: autoscraper/auto_scraper.py
================================================
import hashlib
import json
from collections import defaultdict
from html import unescape
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

from autoscraper.utils import (
    FuzzyText,
    ResultItem,
    get_non_rec_text,
    normalize,
    text_match,
    unique_hashable,
    unique_stack_list,
)


class AutoScraper(object):
    """
    AutoScraper : A Smart, Automatic, Fast and Lightweight Web Scraper for Python.
    AutoScraper automatically learns a set of rules required to extract the needed content
        from a web page. So the programmer doesn't need to explicitly construct the rules.

    Attributes
    ----------
    stack_list: list
        List of rules learned by AutoScraper

    Methods
    -------
    build() - Learns a set of rules represented as stack_list based on the wanted_list,
        which can be reused for scraping similar elements from other web pages in the future.
    get_result_similar() - Gets similar results based on the previously learned rules.
    get_result_exact() - Gets exact results based on the previously learned rules.
    get_result() - Gets similar and exact results based on the previously learned rules.
    save() - Serializes the stack_list as JSON and saves it to disk.
    load() - De-serializes the JSON representation of the stack_list and loads it back.
    remove_rules() - Removes one or more learned rule[s] from the stack_list.
    keep_rules() - Keeps only the specified learned rules in the stack_list and removes the others.
    """

    request_headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 \
            (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
    }

    def __init__(self, stack_list=None):
        self.stack_list = stack_list or []

    def save(self, file_path):
        """
        Serializes the stack_list as JSON and saves it to the disk.

        Parameters
        ----------
        file_path: str
            Path of the JSON output

        Returns
        -------
        None
        """

        data = dict(stack_list=self.stack_list)
        with open(file_path, "w") as f:
            json.dump(data, f)

    def load(self, file_path):
        """
        De-serializes the JSON representation of the stack_list and loads it back.

        Parameters
        ----------
        file_path: str
            Path of the JSON file to load stack_list from.

        Returns
        -------
        None
        """

        with open(file_path, "r") as f:
            data = json.load(f)

        # for backward compatibility
        if isinstance(data, list):
            self.stack_list = data
            return

        self.stack_list = data["stack_list"]

    @classmethod
    def _fetch_html(cls, url, request_args=None):
        # Copy so that popping "headers" below does not mutate the caller's dict.
        request_args = dict(request_args or {})
        headers = dict(cls.request_headers)
        if url:
            headers["Host"] = urlparse(url).netloc

        user_headers = request_args.pop("headers", {})
        headers.update(user_headers)
        res = requests.get(url, headers=headers, **request_args)
        if res.encoding == "ISO-8859-1" and "ISO-8859-1" not in res.headers.get(
            "Content-Type", ""
        ):
            res.encoding = res.apparent_encoding
        html = res.text
        return html

    @classmethod
    def _get_soup(cls, url=None, html=None, request_args=None):
        if html:
            html = normalize(unescape(html))
            return BeautifulSoup(html, "lxml")

        html = cls._fetch_html(url, request_args)
        html = normalize(unescape(html))

        return BeautifulSoup(html, "lxml")

    @staticmethod
    def _get_valid_attrs(item):
        key_attrs = {"class", "style"}
        attrs = {
            k: v if v != [] else "" for k, v in item.attrs.items() if k in key_attrs
        }

        for attr in key_attrs:
            if attr not in attrs:
                attrs[attr] = ""
        return attrs

    @staticmethod
    def _child_has_text(child, text, url, text_fuzz_ratio):
        child_text = child.getText().strip()

        if text_match(text, child_text, text_fuzz_ratio):
            parent_text = child.parent.getText().strip()
            if child_text == parent_text and child.parent.parent:
                return False

            child.wanted_attr = None
            return True

        if text_match(text, get_non_rec_text(child), text_fuzz_ratio):
            child.is_non_rec_text = True
            child.wanted_attr = None
            return True

        for key, value in child.attrs.items():
            if not isinstance(value, str):
                continue

            value = value.strip()
            if text_match(text, value, text_fuzz_ratio):
                child.wanted_attr = key
                return True

            if key in {"href", "src"}:
                full_url = urljoin(url, value)
                if text_match(text, full_url, text_fuzz_ratio):
                    child.wanted_attr = key
                    child.is_full_url = True
                    return True

        return False

    def _get_children(self, soup, text, url, text_fuzz_ratio):
        children = reversed(soup.findChildren())
        children = [
            x for x in children if self._child_has_text(x, text, url, text_fuzz_ratio)
        ]
        return children

    def build(
        self,
        url=None,
        wanted_list=None,
        wanted_dict=None,
        html=None,
        request_args=None,
        update=False,
        text_fuzz_ratio=1.0,
    ):
        """
        Automatically constructs a set of rules to scrape the specified target[s] from a web page.
            The rules are represented as stack_list.

        Parameters:
        ----------
        url: str, optional
            URL of the target web page. You should either pass url or html or both.

        wanted_list: list of strings or compiled regular expressions, optional
            A list of needed contents to be scraped.
                AutoScraper learns a set of rules to scrape these targets. If specified,
                wanted_dict will be ignored.

        wanted_dict: dict, optional
            A dict of needed contents to be scraped. Keys are aliases and values are list of target texts
                or compiled regular expressions.
                AutoScraper learns a set of rules to scrape these targets and sets its aliases.

        html: str, optional
            An HTML string can also be passed instead of URL.
                You should either pass url or html or both.

        request_args: dict, optional
            A dictionary used to specify a set of additional request parameters used by requests
                module. You can specify proxy URLs, custom headers etc.

        update: bool, optional, defaults to False
            If True, new learned rules will be added to the previous ones.
            If False, all previously learned rules will be removed.

        text_fuzz_ratio: float in range [0, 1], optional, defaults to 1.0
            The fuzziness ratio threshold for matching the wanted contents.

        Returns:
        --------
        List of similar results
        """

        if not wanted_list and not (wanted_dict and any(wanted_dict.values())):
            raise ValueError("No targets were supplied")

        soup = self._get_soup(url=url, html=html, request_args=request_args)

        result_list = []

        if update is False:
            self.stack_list = []

        if wanted_list:
            wanted_dict = {"": wanted_list}

        wanted_list = []

        for alias, wanted_items in wanted_dict.items():
            wanted_items = [normalize(w) for w in wanted_items]
            wanted_list += wanted_items

            for wanted in wanted_items:
                children = self._get_children(soup, wanted, url, text_fuzz_ratio)

                for child in children:
                    result, stack = self._get_result_for_child(child, soup, url)
                    stack["alias"] = alias
                    result_list += result
                    self.stack_list.append(stack)

        result_list = [item.text for item in result_list]
        result_list = unique_hashable(result_list)

        self.stack_list = unique_stack_list(self.stack_list)
        return result_list

    @classmethod
    def _build_stack(cls, child, url):
        content = [(child.name, cls._get_valid_attrs(child))]

        parent = child
        while True:
            grand_parent = parent.findParent()
            if not grand_parent:
                break

            children = grand_parent.findAll(
                parent.name, cls._get_valid_attrs(parent), recursive=False
            )
            for i, c in enumerate(children):
                if c == parent:
                    content.insert(
                        0, (grand_parent.name, cls._get_valid_attrs(grand_parent), i)
                    )
                    break

            if not grand_parent.parent:
                break

            parent = grand_parent

        wanted_attr = getattr(child, "wanted_attr", None)
        is_full_url = getattr(child, "is_full_url", False)
        is_non_rec_text = getattr(child, "is_non_rec_text", False)
        stack = dict(
            content=content,
            wanted_attr=wanted_attr,
            is_full_url=is_full_url,
            is_non_rec_text=is_non_rec_text,
        )
        stack["url"] = url if is_full_url else ""
        stack["hash"] = hashlib.sha256(str(stack).encode("utf-8")).hexdigest()
        stack["stack_id"] = "rule_" + stack["hash"][:8]
        return stack

    def _get_result_for_child(self, child, soup, url):
        stack = self._build_stack(child, url)
        result = self._get_result_with_stack(stack, soup, url, 1.0)
        return result, stack

    @staticmethod
    def _fetch_result_from_child(child, wanted_attr, is_full_url, url, is_non_rec_text):
        if wanted_attr is None:
            if is_non_rec_text:
                return get_non_rec_text(child)
            return child.getText().strip()

        if wanted_attr not in child.attrs:
            return None

        if is_full_url:
            return urljoin(url, child.attrs[wanted_attr])

        return child.attrs[wanted_attr]

    @staticmethod
    def _get_fuzzy_attrs(attrs, attr_fuzz_ratio):
        attrs = dict(attrs)
        for key, val in attrs.items():
            if isinstance(val, str) and val:
                val = FuzzyText(val, attr_fuzz_ratio)
            elif isinstance(val, (list, tuple)):
                val = [FuzzyText(x, attr_fuzz_ratio) if x else x for x in val]
            attrs[key] = val
        return attrs

    def _get_result_with_stack(self, stack, soup, url, attr_fuzz_ratio, **kwargs):
        parents = [soup]
        stack_content = stack["content"]
        contain_sibling_leaves = kwargs.get("contain_sibling_leaves", False)
        for index, item in enumerate(stack_content):
            children = []
            if item[0] == "[document]":
                continue
            for parent in parents:

                attrs = item[1]
                if attr_fuzz_ratio < 1.0:
                    attrs = self._get_fuzzy_attrs(attrs, attr_fuzz_ratio)

                found = parent.findAll(item[0], attrs, recursive=False)
                if not found:
                    continue

                if not contain_sibling_leaves and index == len(stack_content) - 1:
                    idx = min(len(found) - 1, stack_content[index - 1][2])
                    found = [found[idx]]

                children += found

            parents = children

        wanted_attr = stack["wanted_attr"]
        is_full_url = stack["is_full_url"]
        is_non_rec_text = stack.get("is_non_rec_text", False)
        result = [
            ResultItem(
                self._fetch_result_from_child(
                    i, wanted_attr, is_full_url, url, is_non_rec_text
                ),
                getattr(i, "child_index", 0),
            )
            for i in parents
        ]
        if not kwargs.get("keep_blank", False):
            result = [x for x in result if x.text]
        return result

    def _get_result_with_stack_index_based(
        self, stack, soup, url, attr_fuzz_ratio, **kwargs
    ):
        p = soup.findChildren(recursive=False)[0]
        stack_content = stack["content"]
        for index, item in enumerate(stack_content[:-1]):
            if item[0] == "[document]":
                continue
            content = stack_content[index + 1]
            attrs = content[1]
            if attr_fuzz_ratio < 1.0:
                attrs = self._get_fuzzy_attrs(attrs, attr_fuzz_ratio)
            p = p.findAll(content[0], attrs, recursive=False)
            if not p:
                return []
            idx = min(len(p) - 1, item[2])
            p = p[idx]

        result = [
            ResultItem(
                self._fetch_result_from_child(
                    p,
                    stack["wanted_attr"],
                    stack["is_full_url"],
                    url,
                    stack["is_non_rec_text"],
                ),
                getattr(p, "child_index", 0),
            )
        ]
        if not kwargs.get("keep_blank", False):
            result = [x for x in result if x.text]
        return result

    def _get_result_by_func(
        self,
        func,
        url,
        html,
        soup,
        request_args,
        grouped,
        group_by_alias,
        unique,
        attr_fuzz_ratio,
        **kwargs
    ):
        if not soup:
            soup = self._get_soup(url=url, html=html, request_args=request_args)

        keep_order = kwargs.get("keep_order", False)

        if group_by_alias or (keep_order and not grouped):
            for index, child in enumerate(soup.findChildren()):
                setattr(child, "child_index", index)

        result_list = []
        grouped_result = defaultdict(list)
        for stack in self.stack_list:
            if not url:
                url = stack.get("url", "")

            result = func(stack, soup, url, attr_fuzz_ratio, **kwargs)

            if not grouped and not group_by_alias:
                result_list += result
                continue

            group_id = stack.get("alias", "") if group_by_alias else stack["stack_id"]
            grouped_result[group_id] += result

        return self._clean_result(
            result_list, grouped_result, grouped, group_by_alias, unique, keep_order
        )

    @staticmethod
    def _clean_result(
        result_list, grouped_result, grouped, grouped_by_alias, unique, keep_order
    ):
        if not grouped and not grouped_by_alias:
            if unique is None:
                unique = True
            if keep_order:
                result_list = sorted(result_list, key=lambda x: x.index)
            result = [x.text for x in result_list]
            if unique:
                result = unique_hashable(result)
            return result

        for k, val in grouped_result.items():
            if grouped_by_alias:
                val = sorted(val, key=lambda x: x.index)
            val = [x.text for x in val]
            if unique:
                val = unique_hashable(val)
            grouped_result[k] = val

        return dict(grouped_result)

    def get_result_similar(
        self,
        url=None,
        html=None,
        soup=None,
        request_args=None,
        grouped=False,
        group_by_alias=False,
        unique=None,
        attr_fuzz_ratio=1.0,
        keep_blank=False,
        keep_order=False,
        contain_sibling_leaves=False,
    ):
        """
        Gets similar results based on the previously learned rules.

        Parameters:
        ----------
        url: str, optional
            URL of the target web page. You should either pass url or html or both.

        html: str, optional
            An HTML string can also be passed instead of URL.
                You should either pass url or html or both.

        request_args: dict, optional
            A dictionary used to specify a set of additional request parameters used by requests
                module. You can specify proxy URLs, custom headers etc.

        grouped: bool, optional, defaults to False
            If set to True, the result will be a dictionary with the rule_ids as keys
                and a list of scraped data per rule as values.

        group_by_alias: bool, optional, defaults to False
            If set to True, the result will be a dictionary with the rule alias as keys
                and a list of scraped data per alias as values.

        unique: bool, optional, defaults to True for non grouped results and
                False for grouped results.
            If set to True, will remove duplicates from returned result list.

        attr_fuzz_ratio: float in range [0, 1], optional, defaults to 1.0
            The fuzziness ratio threshold for matching html tag attributes.

        keep_blank: bool, optional, defaults to False
            If set to True, missing values will be returned as empty strings.

        keep_order: bool, optional, defaults to False
            If set to True, the results will be ordered as they are present on the web page.

        contain_sibling_leaves: bool, optional, defaults to False
            If set to True, the results will also contain the sibling leaves of the wanted elements.

        Returns:
        --------
        List of similar results scraped from the web page.
        Dictionary if grouped=True or group_by_alias=True.
        """

        func = self._get_result_with_stack
        return self._get_result_by_func(
            func,
            url,
            html,
            soup,
            request_args,
            grouped,
            group_by_alias,
            unique,
            attr_fuzz_ratio,
            keep_blank=keep_blank,
            keep_order=keep_order,
            contain_sibling_leaves=contain_sibling_leaves,
        )

    def get_result_exact(
        self,
        url=None,
        html=None,
        soup=None,
        request_args=None,
        grouped=False,
        group_by_alias=False,
        unique=None,
        attr_fuzz_ratio=1.0,
        keep_blank=False,
    ):
        """
        Gets exact results based on the previously learned rules.

        Parameters:
        ----------
        url: str, optional
            URL of the target web page. You should either pass url or html or both.

        html: str, optional
            An HTML string can also be passed instead of URL.
                You should either pass url or html or both.

        request_args: dict, optional
            A dictionary used to specify a set of additional request parameters used by requests
                module. You can specify proxy URLs, custom headers etc.

        grouped: bool, optional, defaults to False
            If set to True, the result will be a dictionary with the rule_ids as keys
                and a list of scraped data per rule as values.

        group_by_alias: bool, optional, defaults to False
            If set to True, the result will be a dictionary with the rule alias as keys
                and a list of scraped data per alias as values.

        unique: bool, optional, defaults to True for non grouped results and
                False for grouped results.
            If set to True, will remove duplicates from returned result list.

        attr_fuzz_ratio: float in range [0, 1], optional, defaults to 1.0
            The fuzziness ratio threshold for matching html tag attributes.

        keep_blank: bool, optional, defaults to False
            If set to True, missing values will be returned as empty strings.

        Returns:
        --------
        List of exact results scraped from the web page.
        Dictionary if grouped=True or group_by_alias=True.
        """

        func = self._get_result_with_stack_index_based
        return self._get_result_by_func(
            func,
            url,
            html,
            soup,
            request_args,
            grouped,
            group_by_alias,
            unique,
            attr_fuzz_ratio,
            keep_blank=keep_blank,
        )

    def get_result(
        self,
        url=None,
        html=None,
        request_args=None,
        grouped=False,
        group_by_alias=False,
        unique=None,
        attr_fuzz_ratio=1.0,
    ):
        """
        Gets similar and exact results based on the previously learned rules.

        Parameters:
        ----------
        url: str, optional
            URL of the target web page. You should either pass url or html or both.

        html: str, optional
            An HTML string can also be passed instead of URL.
                You should either pass url or html or both.

        request_args: dict, optional
            A dictionary of additional request parameters for the requests
                module, such as proxy URLs or custom headers.

        grouped: bool, optional, defaults to False
            If set to True, each result will be a dictionary with the rule_ids as keys
                and a list of scraped data per rule as values.

        group_by_alias: bool, optional, defaults to False
            If set to True, the result will be a dictionary with the rule alias as keys
                and a list of scraped data per alias as values.

        unique: bool, optional, defaults to True for non-grouped results and
                False for grouped results.
            If set to True, duplicates are removed from the returned result list.

        attr_fuzz_ratio: float in range [0, 1], optional, defaults to 1.0
            The fuzziness ratio threshold for matching HTML tag attributes.

        Returns:
        --------
        Pair of (similar, exact) results.
        See get_result_similar and get_result_exact methods.
        """

        soup = self._get_soup(url=url, html=html, request_args=request_args)
        args = dict(
            url=url,
            soup=soup,
            grouped=grouped,
            group_by_alias=group_by_alias,
            unique=unique,
            attr_fuzz_ratio=attr_fuzz_ratio,
        )
        similar = self.get_result_similar(**args)
        exact = self.get_result_exact(**args)
        return similar, exact

    def remove_rules(self, rules):
        """
        Removes the specified learned rules from stack_list.

        Parameters:
        ----------
        rules : list
            A list of rule ids (stack_id values) to be removed.

        Returns:
        --------
        None
        """

        self.stack_list = [x for x in self.stack_list if x["stack_id"] not in rules]

    def keep_rules(self, rules):
        """
        Keeps only the specified rules and removes all others.

        Parameters:
        ----------
        rules : list
            A list of rule ids (stack_id values) to keep in stack_list; the rest are removed.

        Returns:
        --------
        None
        """

        self.stack_list = [x for x in self.stack_list if x["stack_id"] in rules]

    def set_rule_aliases(self, rule_aliases):
        """
        Sets the specified alias for each rule.

        Parameters:
        ----------
        rule_aliases : dict
            A dictionary mapping each rule_id to its alias. An unknown rule_id raises KeyError.

        Returns:
        --------
        None
        """

        id_to_stack = {stack["stack_id"]: stack for stack in self.stack_list}
        for rule_id, alias in rule_aliases.items():
            id_to_stack[rule_id]["alias"] = alias

    def generate_python_code(self):
        # deprecated
        print("This function is deprecated. Please use save() and load() instead.")
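
Illustrative note (not part of autoscraper/auto_scraper.py): the three rule-management methods above all operate on a list of rule dictionaries keyed by "stack_id". The following is a minimal standalone sketch of that filtering logic, assuming each rule is a plain dict. The rule ids ("rule_a" and so on) and the free functions are made up for this example; in the library these are methods on AutoScraper that reassign self.stack_list.

```python
# Hypothetical stand-in for the rule list kept on an AutoScraper instance.
# Each rule is a dict with at least a "stack_id"; the ids below are invented.
stack_list = [
    {"stack_id": "rule_a", "alias": ""},
    {"stack_id": "rule_b", "alias": ""},
    {"stack_id": "rule_c", "alias": ""},
]


def remove_rules(stack_list, rules):
    """Return the rules whose stack_id is NOT in `rules`."""
    return [x for x in stack_list if x["stack_id"] not in rules]


def keep_rules(stack_list, rules):
    """Return only the rules whose stack_id IS in `rules`."""
    return [x for x in stack_list if x["stack_id"] in rules]


def set_rule_aliases(stack_list, rule_aliases):
    """Attach aliases in place; an unknown rule id raises KeyError."""
    id_to_stack = {stack["stack_id"]: stack for stack in stack_list}
    for rule_id, alias in rule_aliases.items():
        id_to_stack[rule_id]["alias"] = alias


after_remove = remove_rules(stack_list, ["rule_a"])
after_keep = keep_rules(stack_list, ["rule_b"])
set_rule_aliases(stack_list, {"rule_c": "price"})
```

Because removal and retention are membership tests on ids, passing an id that does not exist is silently ignored by remove_rules and keep_rules, while set_rule_aliases fails loudly on it.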


================================================
FILE: autoscraper/utils.py
================================================
from collections import OrderedDict

import unicodedata

from difflib import SequenceMatcher


def unique_stack_list(stack_list):
    seen = set()
    unique_list = []
    for stack in stack_list:
        stack_hash = stack['hash']
        if stack_hash in seen:
            continue
        unique_list.append(stack)
        seen.add(stack_hash)
    return unique_list


def unique_hashable(hashable_items):
    """Removes duplicates from the list while preserving the original order."""
    return list(OrderedDict.fromkeys(hashable_items))


def get_non_rec_text(element):
    return ''.join(element.find_all(text=True, recursive=False)).strip()


def normalize(item):
    if not isinstance(item, str):
        return item
    return unicodedata.normalize("NFKD", item.strip())


def text_match(t1, t2, ratio_limit):
    if hasattr(t1, 'fullmatch'):
        return bool(t1.fullmatch(t2))
    if ratio_limit >= 1:
        return t1 == t2
    return SequenceMatcher(None, t1, t2).ratio() >= ratio_limit


class ResultItem():
    def __init__(self, text, index):
        self.text = text
        self.index = index

    def __str__(self):
        return self.text


class FuzzyText(object):
    def __init__(self, text, ratio_limit):
        self.text = text
        self.ratio_limit = ratio_limit
        self.match = None

    def search(self, text):
        return SequenceMatcher(None, self.text, text).ratio() >= self.ratio_limit
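
Illustrative note (not part of autoscraper/utils.py): the helpers above rely on three standard-library behaviours that are easy to misread, namely the SequenceMatcher ratio, order-preserving de-duplication, and NFKD normalization. The sketch below repeats text_match so that it runs on its own; the sample strings are chosen for illustration (some are borrowed from the tests in this repository).

```python
import re
import unicodedata
from collections import OrderedDict
from difflib import SequenceMatcher


def text_match(t1, t2, ratio_limit):
    # Same logic as the helper above, repeated so this snippet is standalone.
    if hasattr(t1, "fullmatch"):  # t1 is a compiled regex pattern
        return bool(t1.fullmatch(t2))
    if ratio_limit >= 1:  # no fuzziness requested: plain equality
        return t1 == t2
    return SequenceMatcher(None, t1, t2).ratio() >= ratio_limit


# ratio() is 2 * matched_chars / total_chars. The two strings share the
# 8-character block "btn-prim", so the score is 2 * 8 / (11 + 9) = 0.8.
score = SequenceMatcher(None, "btn-primary", "btn-prime").ratio()

exact_only = text_match("btn-primary", "btn-prime", 1.0)
fuzzy = text_match("btn-primary", "btn-prime", 0.8)
by_regex = text_match(re.compile("Ban.*"), "Banana", 1.0)

# Order-preserving de-duplication, the same idiom unique_hashable uses.
deduped = list(OrderedDict.fromkeys(["Banana", "Apple", "Banana", "Orange"]))

# NFKD folds compatibility characters: a non-breaking space (U+00A0)
# becomes an ordinary space, so scraped text compares equal to typed text.
normalized = unicodedata.normalize("NFKD", "US\u00a0$349.99".strip())
```

Note that a regex pattern bypasses ratio_limit entirely, and that the threshold comparison is inclusive, so a score of exactly 0.8 passes a limit of 0.8.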


================================================
FILE: setup.py
================================================
from os import path

from setuptools import find_packages, setup

here = path.abspath(path.dirname(__file__))

with open(path.join(here, "README.md"), encoding="utf-8") as f:
    long_description = f.read()

setup(
    name="autoscraper",
    version="1.1.14",
    description="A Smart, Automatic, Fast and Lightweight Web Scraper for Python",
    long_description_content_type="text/markdown",
    long_description=long_description,
    url="https://github.com/alirezamika/autoscraper",
    author="Alireza Mika",
    author_email="alirezamika@gmail.com",
    license="MIT",
    classifiers=[
        "Development Status :: 4 - Beta",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
    ],
    keywords="scraping - scraper",
    packages=find_packages(exclude=["contrib", "docs", "tests"]),
    python_requires=">=3.6",
    install_requires=["requests", "beautifulsoup4", "lxml"],
)


================================================
FILE: tests/__init__.py
================================================


================================================
FILE: tests/conftest.py
================================================
import sys
from types import ModuleType
from html.parser import HTMLParser

class _Node:
    def __init__(self, name, attrs, parent=None):
        self.name = name
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""

    def append_child(self, child):
        self.children.append(child)
        child.parent = self

    def getText(self):
        return self.text + "".join(c.getText() for c in self.children)

    def findChildren(self, recursive=True):
        result = []
        for child in self.children:
            result.append(child)
            if recursive:
                result.extend(child.findChildren(recursive))
        return result

    def findParent(self):
        return self.parent

    def _attr_match(self, child, attrs):
        from autoscraper.utils import FuzzyText

        for key, val in (attrs or {}).items():
            actual = child.attrs.get(key, "")
            if isinstance(actual, list):
                actual = " ".join(actual)

            if isinstance(val, FuzzyText):
                if not val.search(actual):
                    return False
            elif actual != val:
                return False
        return True

    def findAll(self, name=None, attrs=None, recursive=True):
        result = []
        for child in self.children:
            if (name is None or child.name == name) and self._attr_match(child, attrs):
                result.append(child)
            if recursive:
                result.extend(child.findAll(name, attrs, recursive))
        return result

    def find_all(self, name=None, attrs=None, text=None, recursive=True):
        if text:
            res = []
            if self.text.strip():
                res.append(self.text)
            for child in self.children:
                if recursive:
                    res.extend(child.find_all(text=True, recursive=True))
                elif child.text.strip():
                    res.append(child.text)
            return res
        return self.findAll(name, attrs, recursive)

class _Parser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = _Node("[document]", {})
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = _Node(tag, attrs)
        self.current.append_child(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data

class BeautifulSoup(_Node):
    def __init__(self, html, parser):
        p = _Parser()
        p.feed(html)
        super().__init__(p.root.name, p.root.attrs)
        self.children = p.root.children
        for c in self.children:
            c.parent = self

bs4_mod = ModuleType("bs4")
bs4_mod.BeautifulSoup = BeautifulSoup
sys.modules.setdefault("bs4", bs4_mod)

class _Response:
    def __init__(self, text=""):
        self.encoding = "utf-8"
        self.headers = {"Content-Type": "text/html"}
        self.text = text

requests_mod = ModuleType("requests")
requests_mod.get = lambda url, headers=None, **kw: _Response()
sys.modules.setdefault("requests", requests_mod)
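
Illustrative note (not part of tests/conftest.py): the file above replaces bs4 and requests with lightweight stand-ins by registering module objects in sys.modules before the code under test imports them. A minimal sketch of that technique follows; the module name "fake_http" and its get function are invented for this example.

```python
import sys
from types import ModuleType

# Build an empty module object; "fake_http" is a made-up name for this sketch.
stub = ModuleType("fake_http")
stub.get = lambda url, **kwargs: {"url": url, "text": ""}

# setdefault registers the stub only when no module of that name is loaded
# yet, so a real module that was already imported would take precedence.
sys.modules.setdefault("fake_http", stub)

# The import system checks sys.modules first, so this resolves to the stub
# without searching the file system.
import fake_http

response = fake_http.get("https://example.com")
```

The use of setdefault rather than plain assignment is a design choice worth noticing: if the real bs4 or requests has already been imported in the test process, the stubs in conftest.py are not installed and the tests run against the real libraries.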


================================================
FILE: tests/integration/__init__.py
================================================


================================================
FILE: tests/integration/test_complex_features.py
================================================
import pytest
import re
from autoscraper import AutoScraper

HTML_COMPLEX = """
<div id="main">
  <ul class="fruits">
    <li class="item"><span class="name">Banana</span><a href="/banana" class="link">More</a></li>
    <li class="item"><span class="name">Apple</span><a href="/apple" class="link">More</a></li>
    <li class="item"><span class="name">Orange</span><a href="/orange" class="link">More</a></li>
    <li class="item"><span class="name">Banana</span></li>
  </ul>
  <p class="info">Fresh fruits</p>
  <a class="external" href="/shop">Shop Now</a>
</div>
"""


def test_extract_relative_link():
    scraper = AutoScraper()
    url = "https://example.com/index.html"
    result = scraper.build(url=url, html=HTML_COMPLEX, wanted_list=["https://example.com/apple"])
    assert "https://example.com/apple" in result
    similar = scraper.get_result_similar(
        url=url, html=HTML_COMPLEX, contain_sibling_leaves=True, unique=True
    )
    assert set(similar) == {
        "https://example.com/banana",
        "https://example.com/apple",
        "https://example.com/orange",
    }
    exact = scraper.get_result_exact(url=url, html=HTML_COMPLEX)
    assert exact == ["https://example.com/apple"]


def test_build_with_regex():
    scraper = AutoScraper()
    scraper.build(html=HTML_COMPLEX, wanted_list=[re.compile("Ban.*")])
    result = scraper.get_result_exact(html=HTML_COMPLEX)
    assert "Banana" in result[0]


def test_update_appends_rules():
    scraper = AutoScraper()
    scraper.build(html=HTML_COMPLEX, wanted_list=["Banana"])
    count = len(scraper.stack_list)
    scraper.build(html=HTML_COMPLEX, wanted_list=["Apple"], update=True)
    assert len(scraper.stack_list) == count + 1


def test_remove_rules():
    scraper = AutoScraper()
    scraper.build(html=HTML_COMPLEX, wanted_list=["Banana"])
    scraper.build(html=HTML_COMPLEX, wanted_list=["Apple"], update=True)
    rule_ids = [s["stack_id"] for s in scraper.stack_list]
    to_remove = rule_ids[0]
    scraper.remove_rules([to_remove])
    remaining = [s["stack_id"] for s in scraper.stack_list]
    assert to_remove not in remaining
    assert len(remaining) == len(rule_ids) - 1


def test_keep_blank_returns_empty():
    scraper = AutoScraper()
    scraper.build(html=HTML_COMPLEX, wanted_list=["/shop"])
    html_blank = HTML_COMPLEX.replace('href="/shop"', 'href=""')
    result = scraper.get_result_exact(html=html_blank, keep_blank=True)
    assert result == [""]


def test_attr_fuzz_ratio():
    html_base = '<div><a class="btn-primary" href="/item">Buy</a></div>'
    html_variant = '<div><a class="btn-prime" href="/item">Buy</a></div>'
    scraper = AutoScraper()
    scraper.build(html=html_base, wanted_list=["Buy"])
    res = scraper.get_result_exact(html=html_variant, attr_fuzz_ratio=0.8)
    assert res == ["Buy"]
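
Illustrative note (not part of tests/integration/test_complex_features.py): test_extract_relative_link expects an href of "/apple" on a page at https://example.com/index.html to come back as "https://example.com/apple". The sketch below shows how such resolution works with the standard library, assuming the scraper resolves hrefs against the page URL in the usual way; it does not call autoscraper itself.

```python
from urllib.parse import urljoin

page_url = "https://example.com/index.html"
hrefs = ["/banana", "/apple", "/orange"]

# A leading slash makes the href root-relative: the path of page_url is
# replaced while the scheme and host are kept.
absolute = [urljoin(page_url, href) for href in hrefs]

# Without a leading slash the href is resolved against the page's directory.
relative_to_page = urljoin(page_url, "details.html")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "https://other.example/item")
```

This is why the test passes both url and html to build: without a base URL there is nothing to resolve the relative href against.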


================================================
FILE: tests/integration/test_real_world.py
================================================
import re
from autoscraper import AutoScraper

HTML_PAGE_1 = """
<div id='product'>
  <h1 class='title'>Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black</h1>
  <span class='price'>US $349.99</span>
  <div class='rating'><span class='value'>4.8</span></div>
  <div class='note'>See details</div>
</div>
"""

HTML_PAGE_2 = """
<div id='product'>
  <h1 class='title'>Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti</h1>
  <span class='price'>US $1,229.49</span>
  <div class='rating'><span class='value'>5.0</span></div>
  <div class='note'>See details</div>
</div>
"""

HTML_WALMART_1 = "<div class='price'>$8.95</div>"
HTML_WALMART_2 = "<div class='price'>$7.00</div>"
HTML_ETSY_1 = "<span class='amount'>$12.50+</span>"
HTML_ETSY_2 = "<span class='amount'>$60.00</span>"


def test_grouping_and_rule_removal():
    scraper = AutoScraper()
    wanted = [
        "Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black",
        "US $349.99",
        "4.8",
        "See details",
    ]
    scraper.build(html=HTML_PAGE_1, wanted_list=wanted)
    grouped = scraper.get_result_exact(html=HTML_PAGE_2, grouped=True)
    unwanted = [r for r, v in grouped.items() if v == ["See details"]]
    scraper.remove_rules(unwanted)
    result = scraper.get_result_exact(html=HTML_PAGE_2)
    assert result == [
        "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
        "US $1,229.49",
        "5.0",
    ]


def test_incremental_learning_multiple_sites():
    scraper = AutoScraper()
    data = [
        (HTML_PAGE_1, ["US $349.99"]),
        (HTML_WALMART_1, ["$8.95"]),
        (HTML_ETSY_1, ["$12.50+"]),
    ]
    for html, wanted in data:
        scraper.build(html=html, wanted_list=wanted, update=True)
    assert "US $1,229.49" in scraper.get_result_exact(html=HTML_PAGE_2)
    assert "$7.00" in scraper.get_result_exact(html=HTML_WALMART_2)
    assert "$60.00" in scraper.get_result_exact(html=HTML_ETSY_2)


def test_attr_fuzz_ratio_realistic():
    base = "<div><a class='btn-primary-action' href='/buy'>Buy</a></div>"
    variant = "<div><a class='btn-prim-action' href='/buy'>Buy</a></div>"
    scraper = AutoScraper()
    scraper.build(html=base, wanted_list=["Buy"])
    assert scraper.get_result_exact(html=variant, attr_fuzz_ratio=0.8) == ["Buy"]


def test_regex_name_extraction():
    scraper = AutoScraper()
    scraper.build(html=HTML_PAGE_1, wanted_list=[re.compile(r".*PlayStation.*Console.*")])
    result = scraper.get_result_exact(html=HTML_PAGE_1)
    assert any("PlayStation" in r for r in result)


def test_keep_blank_for_missing_rating():
    scraper = AutoScraper()
    scraper.build(html=HTML_PAGE_1, wanted_list=["4.8"])
    html_no_rating = HTML_PAGE_2.replace("5.0", "")
    res = scraper.get_result_exact(html=html_no_rating, keep_blank=True)
    assert res == [""]



================================================
FILE: tests/unit/__init__.py
================================================


================================================
FILE: tests/unit/test_additional_features.py
================================================
from autoscraper import AutoScraper

HTML = "<ul><li>Banana</li><li>Apple</li><li>Orange</li></ul>"
HTML_DUP = "<ul><li>Banana</li><li>Banana</li></ul>"


def test_text_fuzz_ratio_partial():
    scraper = AutoScraper()
    scraper.build(html="<ul><li>Banana</li></ul>", wanted_list=["Banan"], text_fuzz_ratio=0.8)
    assert scraper.get_result_exact(html="<ul><li>Banana</li></ul>") == ["Banana"]


def test_set_rule_aliases():
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_list=["Banana"])
    rule_id = scraper.stack_list[0]["stack_id"]
    scraper.set_rule_aliases({rule_id: "fruit"})
    result = scraper.get_result_similar(html=HTML, group_by_alias=True, contain_sibling_leaves=True)
    assert result == {"fruit": ["Banana", "Apple", "Orange"]}


def test_grouped_results_by_rule():
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_list=["Banana"])
    rule_id = scraper.stack_list[0]["stack_id"]
    result = scraper.get_result_similar(html=HTML, grouped=True, contain_sibling_leaves=True)
    assert result == {rule_id: ["Banana", "Apple", "Orange"]}


def test_similar_unique_false():
    scraper = AutoScraper()
    scraper.build(html=HTML_DUP, wanted_list=["Banana"])
    result = scraper.get_result_similar(html=HTML_DUP, unique=False)
    assert result == ["Banana", "Banana"]


def test_similar_keep_order():
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_list=["Banana"])
    result = scraper.get_result_similar(html=HTML, contain_sibling_leaves=True, keep_order=True)
    assert result == ["Banana", "Apple", "Orange"]


================================================
FILE: tests/unit/test_build.py
================================================
import pytest
from autoscraper import AutoScraper

HTML = "<ul><li>Banana</li><li>Apple</li><li>Orange</li></ul>"


def test_build_requires_targets():
    scraper = AutoScraper()
    with pytest.raises(ValueError):
        scraper.build(html=HTML)


def test_build_and_get_result_similar():
    scraper = AutoScraper()
    result = scraper.build(html=HTML, wanted_list=["Banana"])
    assert result == ["Banana"]
    similar = scraper.get_result_similar(html=HTML, contain_sibling_leaves=True)
    assert similar == ["Banana", "Apple", "Orange"]


================================================
FILE: tests/unit/test_features.py
================================================
import pytest

from autoscraper import AutoScraper

HTML = "<ul><li>Banana</li><li>Apple</li><li>Orange</li></ul>"
HTML_COMPLEX_ORDER = """
<div class='products'>
  <h2>Banana</h2>
  <p class='price'>$1</p>
  <h2>Apple</h2>
  <p class='price'>$2</p>
</div>
"""


def test_get_result_exact_order():
    scraper = AutoScraper()
    scraper.build(html=HTML_COMPLEX_ORDER, wanted_list=["Banana", "$2"])
    assert scraper.get_result_exact(html=HTML_COMPLEX_ORDER) == ["Banana", "$2"]


def test_group_by_alias():
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_dict={"fruit": ["Banana"]})
    similar = scraper.get_result_similar(
        html=HTML, group_by_alias=True, contain_sibling_leaves=True, unique=True
    )
    assert similar == {"fruit": ["Banana", "Apple", "Orange"]}


def test_save_and_load(tmp_path):
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_list=["Banana"])
    file_path = tmp_path / "model.json"
    scraper.save(file_path)
    new_scraper = AutoScraper()
    new_scraper.load(file_path)
    assert new_scraper.get_result_exact(html=HTML) == scraper.get_result_exact(html=HTML)


def test_keep_rules():
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_list=["Banana"])
    first_rule = scraper.stack_list[0]["stack_id"]
    scraper.build(html=HTML, wanted_list=["Apple"], update=True)
    second_rule = scraper.stack_list[1]["stack_id"]
    scraper.keep_rules([second_rule])
    assert len(scraper.stack_list) == 1
    assert scraper.stack_list[0]["stack_id"] == second_rule


def test_get_result_combined():
    scraper = AutoScraper()
    scraper.build(html=HTML, wanted_list=["Banana"])
    similar, exact = scraper.get_result(html=HTML)
    assert exact == ["Banana"]
    assert similar == ["Banana"]


SYMBOL INDEX (77 symbols across 8 files)

FILE: autoscraper/auto_scraper.py
  class AutoScraper (line 21) | class AutoScraper(object):
    method __init__ (line 50) | def __init__(self, stack_list=None):
    method save (line 53) | def save(self, file_path):
    method load (line 71) | def load(self, file_path):
    method _fetch_html (line 96) | def _fetch_html(cls, url, request_args=None):
    method _get_soup (line 113) | def _get_soup(cls, url=None, html=None, request_args=None):
    method _get_valid_attrs (line 124) | def _get_valid_attrs(item):
    method _child_has_text (line 136) | def _child_has_text(child, text, url, text_fuzz_ratio):
    method _get_children (line 170) | def _get_children(self, soup, text, url, text_fuzz_ratio):
    method build (line 177) | def build(
    method _build_stack (line 261) | def _build_stack(cls, child, url):
    method _get_result_for_child (line 299) | def _get_result_for_child(self, child, soup, url):
    method _fetch_result_from_child (line 305) | def _fetch_result_from_child(child, wanted_attr, is_full_url, url, is_...
    method _get_fuzzy_attrs (line 320) | def _get_fuzzy_attrs(attrs, attr_fuzz_ratio):
    method _get_result_with_stack (line 330) | def _get_result_with_stack(self, stack, soup, url, attr_fuzz_ratio, **...
    method _get_result_with_stack_index_based (line 372) | def _get_result_with_stack_index_based(
    method _get_result_by_func (line 406) | def _get_result_by_func(
    method _clean_result (line 448) | def _clean_result(
    method get_result_similar (line 471) | def get_result_similar(
    method get_result_exact (line 547) | def get_result_exact(
    method get_result (line 613) | def get_result(
    method remove_rules (line 673) | def remove_rules(self, rules):
    method keep_rules (line 689) | def keep_rules(self, rules):
    method set_rule_aliases (line 705) | def set_rule_aliases(self, rule_aliases):
    method generate_python_code (line 723) | def generate_python_code(self):

FILE: autoscraper/utils.py
  function unique_stack_list (line 8) | def unique_stack_list(stack_list):
  function unique_hashable (line 20) | def unique_hashable(hashable_items):
  function get_non_rec_text (line 25) | def get_non_rec_text(element):
  function normalize (line 29) | def normalize(item):
  function text_match (line 35) | def text_match(t1, t2, ratio_limit):
  class ResultItem (line 43) | class ResultItem():
    method __init__ (line 44) | def __init__(self, text, index):
    method __str__ (line 48) | def __str__(self):
  class FuzzyText (line 52) | class FuzzyText(object):
    method __init__ (line 53) | def __init__(self, text, ratio_limit):
    method search (line 58) | def search(self, text):

FILE: tests/conftest.py
  class _Node (line 5) | class _Node:
    method __init__ (line 6) | def __init__(self, name, attrs, parent=None):
    method append_child (line 13) | def append_child(self, child):
    method getText (line 17) | def getText(self):
    method findChildren (line 20) | def findChildren(self, recursive=True):
    method findParent (line 28) | def findParent(self):
    method _attr_match (line 31) | def _attr_match(self, child, attrs):
    method findAll (line 46) | def findAll(self, name=None, attrs=None, recursive=True):
    method find_all (line 55) | def find_all(self, name=None, attrs=None, text=None, recursive=True):
  class _Parser (line 68) | class _Parser(HTMLParser):
    method __init__ (line 69) | def __init__(self):
    method handle_starttag (line 74) | def handle_starttag(self, tag, attrs):
    method handle_endtag (line 79) | def handle_endtag(self, tag):
    method handle_data (line 83) | def handle_data(self, data):
  class BeautifulSoup (line 86) | class BeautifulSoup(_Node):
    method __init__ (line 87) | def __init__(self, html, parser):
  class _Response (line 99) | class _Response:
    method __init__ (line 100) | def __init__(self, text=""):

FILE: tests/integration/test_complex_features.py
  function test_extract_relative_link (line 19) | def test_extract_relative_link():
  function test_build_with_regex (line 36) | def test_build_with_regex():
  function test_update_appends_rules (line 43) | def test_update_appends_rules():
  function test_remove_rules (line 51) | def test_remove_rules():
  function test_keep_blank_returns_empty (line 63) | def test_keep_blank_returns_empty():
  function test_attr_fuzz_ratio (line 71) | def test_attr_fuzz_ratio():

FILE: tests/integration/test_real_world.py
  function test_grouping_and_rule_removal (line 28) | def test_grouping_and_rule_removal():
  function test_incremental_learning_multiple_sites (line 48) | def test_incremental_learning_multiple_sites():
  function test_attr_fuzz_ratio_realistic (line 62) | def test_attr_fuzz_ratio_realistic():
  function test_regex_name_extraction (line 70) | def test_regex_name_extraction():
  function test_keep_blank_for_missing_rating (line 77) | def test_keep_blank_for_missing_rating():

FILE: tests/unit/test_additional_features.py
  function test_text_fuzz_ratio_partial (line 7) | def test_text_fuzz_ratio_partial():
  function test_set_rule_aliases (line 13) | def test_set_rule_aliases():
  function test_grouped_results_by_rule (line 22) | def test_grouped_results_by_rule():
  function test_similar_unique_false (line 30) | def test_similar_unique_false():
  function test_similar_keep_order (line 37) | def test_similar_keep_order():

FILE: tests/unit/test_build.py
  function test_build_requires_targets (line 7) | def test_build_requires_targets():
  function test_build_and_get_result_similar (line 13) | def test_build_and_get_result_similar():

FILE: tests/unit/test_features.py
  function test_get_result_exact_order (line 16) | def test_get_result_exact_order():
  function test_group_by_alias (line 22) | def test_group_by_alias():
  function test_save_and_load (line 31) | def test_save_and_load(tmp_path):
  function test_keep_rules (line 41) | def test_keep_rules():
  function test_get_result_combined (line 52) | def test_get_result_combined():


