Repository: cullenwatson/StaffSpy
Branch: main
Commit: 0a8a8d73a5db
Files: 28
Total size: 107.0 KB

Directory structure:
gitextract_26g2vb8c/

├── .github/
│   └── workflows/
│       └── publish-to-pypi.yml
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── examples/
│   ├── daily_auto_connect.py
│   ├── upload_staff_to_clay.py
│   └── x_corp_staff.py
├── pyproject.toml
└── staffspy/
    ├── __init__.py
    ├── linkedin/
    │   ├── certifications.py
    │   ├── comments.py
    │   ├── contact_info.py
    │   ├── employee.py
    │   ├── employee_bio.py
    │   ├── experiences.py
    │   ├── languages.py
    │   ├── linkedin.py
    │   ├── schools.py
    │   └── skills.py
    ├── solvers/
    │   ├── capsolver.py
    │   ├── solver.py
    │   ├── solver_type.py
    │   └── two_captcha.py
    └── utils/
        ├── driver_type.py
        ├── exceptions.py
        ├── models.py
        └── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/publish-to-pypi.yml
================================================
name: Publish Python 🐍 distributions 📦 to PyPI
on: push

jobs:
  build-n-publish:
    name: Build and publish Python 🐍 distributions 📦 to PyPI
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: "3.10"

    - name: Install poetry
      run: >-
        python3 -m
        pip install
        poetry
        --user

    - name: Build distribution 📦
      run: >-
        python3 -m
        poetry
        build

    - name: Publish distribution 📦 to PyPI
      if: startsWith(github.ref, 'refs/tags')
      uses: pypa/gh-action-pypi-publish@release/v1
      with:
        password: ${{ secrets.PYPI_API_TOKEN }}

================================================
FILE: .gitignore
================================================
/venv/
/.idea
**/__pycache__/
**/.pytest_cache/
/.ipynb_checkpoints/
**/output/
**/.DS_Store
*.pyc
.env
dist

================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/psf/black
  rev: 24.2.0
  hooks:
  - id: black
    language_version: python
    args: [--line-length=88, --quiet]

================================================
FILE: LICENSE
================================================
            DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                    Version 2, December 2004

 Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

 Everyone is permitted to copy and distribute verbatim or modified
 copies of this license document, and changing it is allowed as long
 as the name is changed.

            DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

  0. You just DO WHAT THE FUCK YOU WANT TO.


================================================
FILE: README.md
================================================
<img width="640" alt="3FAD4652-488F-4F6F-A744-4C2AA5855E92" src="https://github.com/user-attachments/assets/73b701ff-2db8-4d72-9ad3-42b7e1db537f">

**StaffSpy** is a staff fetcher library for LinkedIn.

## Features

- Fetches staff from a company on **LinkedIn**
- Obtains skills, experiences, certifications & more
- Fetches individual users and comments on posts
- Export all your connections with their contact info
- Aggregates the employees in a Pandas DataFrame

### Installation

```
pip install -U "staffspy[browser]"
```

Or install the latest code directly from this repo:

```
pip install "git+https://github.com/cullenwatson/StaffSpy.git#egg=staffspy[browser]"
```

_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_

### Usage

```python
from staffspy import LinkedInAccount, SolverType, DriverType, BrowserType

account = LinkedInAccount(
    # driver_type=DriverType( # if issues with webdriver, specify its exact location, download link in the FAQ
    #     browser_type=BrowserType.CHROME,
    #     executable_path="/Users/pc/chromedriver-mac-arm64/chromedriver"
    # ),
    session_file="session.pkl", # save login cookies to only log in once (lasts a week or so)
    log_level=1, # 0 for no logs
)

# search by company
staff = account.scrape_staff(
    company_name="openai",
    search_term="software engineer",
    location="london",
    extra_profile_data=True, # fetch all past experiences, schools, & skills
    max_results=50, # can go up to 1000
    # block=True # if you want to block the user after scraping, to exclude from future search results
    # connect=True # if you want to connect with the users until you hit your limit
)
# or fetch by user ids
users = account.scrape_users(
    user_ids=['williamhgates', 'rbranson', 'jeffweiner08'],
    # connect=True,
    # block=True
)

# fetch all comments on two of Bill Gates' posts 
comments = account.scrape_comments(
    ['7252421958540091394','7253083989547048961']
)

# fetch company details
companies = account.scrape_companies(
    company_names=['openai', 'microsoft']
)

# fetch connections (also gets their contact info if available)
connections = account.scrape_connections(
    extra_profile_data=True,
    max_results=50
)

# export any of the results to csv
staff.to_csv("staff.csv", index=False)
```

#### Browser login

If you would rather use a browser to log in, install the browser add-on for StaffSpy.

`pip install "staffspy[browser]"`

If you do not pass the `username` & `password` params, then a browser will open to sign in to LinkedIn on the first sign-in. Press enter after signing in to begin scraping.

### Output

| profile_id       | name           | first_name | last_name | location                        | age | position                        | followers | connections | company | past_company1 | past_company2 | school1                             | school2                    | skill1   | skill2     | skill3     | is_connection | premium | creator | potential_email                                  | profile_link                                 | profile_photo                                                                                                                                                               |
| ---------------- | -------------- | ---------- | --------- | ------------------------------- | --- | ------------------------------- | --------- | ----------- | ------- | ------------- | ------------- | ---------------------------------- | ------------------------- | -------- | ---------- | ---------- | ------------- | ------- | ------- | ------------------------------------------------ | -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| javiersierra2102 | Javier Sierra  | Javier     | Sierra    | London, England, United Kingdom | 39  | Software Engineer               | 735       | 725         | OpenAI  | Meta          | Oculus VR     | Hult International Business School | Universidad Simón Bolívar | Java     | JavaScript | C++        | FALSE         | FALSE   | FALSE   | javier.sierra@openai.com, jsierra@openai.com     | https://www.linkedin.com/in/javiersierra2102 | https://media.licdn.com/dms/image/C4D03AQHEyUg1kGT08Q/profile-displayphoto-shrink_800_800/0/1516504680512?e=1727913600&v=beta&t=3enCmNDBtJ7LxfbW6j1hDD8qNtHjO2jb2XTONECxUXw |
| dougli           | Douglas Li     | Douglas    | Li        | London, England, United Kingdom | 37  | @ OpenAI UK, previously at Meta | 583       | 401         | OpenAI  | Shift Lab     | Facebook      | Washington University in St. Louis |                           | Java     | Python     | JavaScript | FALSE         | TRUE    | FALSE   | douglas.li@openai.com, dli@openai.com            | https://www.linkedin.com/in/dougli           | https://media.licdn.com/dms/image/D4E03AQETmRyb3_GB8A/profile-displayphoto-shrink_800_800/0/1687996628597?e=1727913600&v=beta&t=HRYGJ4RxsTMcPF1YcSikXlbz99hx353csho3PWT6fOQ |
| nkartashov       | Nick Kartashov | Nick       | Kartashov | London, England, United Kingdom | 33  | Software Engineer               | 2186      | 2182        | OpenAI  | Google        | DeepMind      | St. Petersburg Academic University | Bioinformatics Institute  | Teamwork | Java       | Haskell    | FALSE         | FALSE   | FALSE   | nick.kartashov@openai.com, nkartashov@openai.com | https://www.linkedin.com/in/nkartashov       | https://media.licdn.com/dms/image/D4E03AQEjOKxC5UgwWw/profile-displayphoto-shrink_800_800/0/1680706122689?e=1727913600&v=beta&t=m-JnG9nm0zxp1Z7njnInwbCoXyqa3AN-vJZntLfbzQ4 |


### Parameters for `LinkedInAccount()`

```plaintext
Optional
├── session_file (str):
|    file path to save session cookies, so only one manual login is needed.
|    multiple profiles can be used this way, one session file each
|
| For automated login
├── username (str):
|    linkedin account email
│
├── password (str):
|    linkedin account password
|
├── driver_type (DriverType):
|    signs in with the given BrowserType (Chrome, Firefox) and executable_path
|
├── solver_service (SolverType):
|    solves the captcha using the desired service - either CapSolver, or 2Captcha (worse of the two)
|
├── solver_api_key (str):
|    api key for the solver provider
│
├── log_level (int):
|    Controls the verbosity of the runtime printouts
|    (0 prints only errors, 1 is info, 2 is all logs. Default is 0.)
```
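
The parameters above can also be assembled from the environment so that credentials stay out of the script. This is a minimal sketch, not part of StaffSpy: the helper `account_kwargs_from_env` and the `STAFFSPY_*` variable names are assumptions made for illustration. Only `session_file`, `username`, `password`, and `log_level` come from the parameter list.

```python
import os


def account_kwargs_from_env(env=None):
    """Build keyword arguments for LinkedInAccount from environment variables.

    The STAFFSPY_* variable names are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {
        "session_file": env.get("STAFFSPY_SESSION_FILE", "session.pkl"),
        "log_level": int(env.get("STAFFSPY_LOG_LEVEL", "0")),
    }
    # Pass credentials only when both are present; otherwise omit them so the
    # browser login described in the "Browser login" section is used instead.
    username = env.get("STAFFSPY_USERNAME")
    password = env.get("STAFFSPY_PASSWORD")
    if username and password:
        kwargs["username"] = username
        kwargs["password"] = password
    return kwargs


# Usage (requires staffspy to be installed):
# from staffspy import LinkedInAccount
# account = LinkedInAccount(**account_kwargs_from_env())
```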

### Parameters for `scrape_staff()`

```plaintext
Optional
├── company_name (str):
|    company identifier on LinkedIn; if no company with that id exists, StaffSpy searches for it by name
|    e.g. openai from https://www.linkedin.com/company/openai
|
├── search_term (str):
|    staff title to search for
|    e.g. software engineer
|
├── location (str):
|    location where the staff reside
|    e.g. london
│
├── extra_profile_data (bool)
|    fetches educations, experiences, skills, certifications (Default false)
│
├── max_results (int):
|    number of staff to fetch; default and max is 1000 per search, a limit imposed by LinkedIn
|
├── block (bool):
|    whether to block the user after scraping
|
├── connect (bool):
|    whether to connect with the user after scraping
```

### Parameters for `scrape_users()`

```plaintext
├── user_ids (list):
|    user ids to scrape from
|     e.g. dougmcmillon from https://www.linkedin.com/in/dougmcmillon
|
├── block (bool):
|    whether to block the user after scraping
|
├── connect (bool):
|    whether to connect with the user after scraping
```
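
When starting from profile URLs rather than bare ids, the id is the path segment after `/in/`. The helper below is a hypothetical convenience, not a StaffSpy function; the same idea works for company ids with the `company` prefix.

```python
from urllib.parse import urlparse


def linkedin_slug(url, prefix="in"):
    """Return the path segment that follows /<prefix>/ in a LinkedIn URL.

    Returns None when the URL does not have that shape. This is an
    illustrative helper and is not part of StaffSpy.
    """
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] == prefix:
        return parts[1]
    return None


user_ids = [
    linkedin_slug(url)
    for url in ["https://www.linkedin.com/in/dougmcmillon"]
]
# user_ids can then be passed to account.scrape_users(user_ids=user_ids)
```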


### Parameters for `scrape_comments()`

```plaintext
├── post_ids (list):
|    post ids to scrape from
|     e.g. 7252381444906364929 from https://www.linkedin.com/posts/williamhgates_technology-transformtheeveryday-activity-7252381444906364929-Bkls
```
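
The post id is the run of digits that follows `activity-` in the post URL. A small sketch for pulling it out, assuming URLs shaped like the example above; `post_id_from_url` is a hypothetical helper, and the `urn:li:activity:<id>` form is accepted as an extra convenience.

```python
import re

# Matches "activity-<digits>" (post URLs) and "activity:<digits>" (URNs).
_ACTIVITY_ID = re.compile(r"activity[-:](\d+)")


def post_id_from_url(url):
    """Return the numeric post id embedded in a LinkedIn post URL, or None."""
    match = _ACTIVITY_ID.search(url)
    return match.group(1) if match else None


url = (
    "https://www.linkedin.com/posts/"
    "williamhgates_technology-transformtheeveryday-activity-7252381444906364929-Bkls"
)
post_ids = [post_id_from_url(url)]
# post_ids can then be passed to account.scrape_comments(post_ids)
```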


### Parameters for `scrape_companies()`

```plaintext
├── company_names (list):
|    list of company names to scrape details from
|     e.g. ['openai', 'microsoft', 'google']
```


### Parameters for `scrape_connections()`

```plaintext
├── max_results (int):
|    maximum number of connections to fetch
|
├── extra_profile_data (bool):
|    fetches educations, experiences, skills, certifications & contact info for each connection (Default false)
```

### LinkedIn notes

    - only 1000 max results per search
    - extra_profile_data adds extra requests per profile, so runtime grows linearly with the number of results
    - if rate limited, the program will stop scraping
    - if using non-browser sign in, turn off 2fa
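
The rate-limit note can be pictured as a latch: once a 429 is seen, every later call is refused until a new account object is created. The sketch below only illustrates that pattern. `CooldownGuard` is hypothetical, and `TooManyRequests` here is a local stand-in for the library's exception of the same name, not an import from StaffSpy.

```python
class TooManyRequests(Exception):
    """Local stand-in for the library's exception of the same name."""


class CooldownGuard:
    """Refuse further work after the first rate-limit error (illustrative)."""

    def __init__(self):
        self.on_block = False

    def run(self, task, *args, **kwargs):
        # Once blocked, skip the task entirely instead of hitting the API again.
        if self.on_block:
            return None
        try:
            return task(*args, **kwargs)
        except TooManyRequests:
            self.on_block = True
            return None


def flaky_fetch(page):
    """Fake task for the demo: the second page triggers a rate limit."""
    if page == 2:
        raise TooManyRequests("429 Too Many Requests")
    return f"page {page}"


guard = CooldownGuard()
results = [guard.run(flaky_fetch, page) for page in (1, 2, 3)]
```

In this demo the third call never runs, because the guard latched on the second one.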

---

## Frequently Asked Questions

---

**Q: Can I get my account banned?**  
**A:** It is a possibility, although there are no recorded incidents. Let me know if you are the first. To protect you, the code refuses to keep scraping once LinkedIn rate-limits your account (HTTP 429).

---

**Q: Scraped 999 staff members, with 869 hidden LinkedIn Members?**  
**A:** It means LinkedIn considers your account low-trust. It is unclear exactly how accounts are classified, but an unverified email, a new account, few connections, and other factors all play a part.
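
Hidden profiles come back with the name `LinkedIn Member`, so the share of hidden results is easy to measure. A minimal sketch; `hidden_share` is a hypothetical helper that accepts any iterable of names, for example the `name` column of the returned DataFrame.

```python
def hidden_share(names):
    """Return the fraction of names equal to the 'LinkedIn Member' placeholder."""
    names = list(names)
    if not names:
        return 0.0
    hidden = sum(1 for name in names if name == "LinkedIn Member")
    return hidden / len(names)


# The numbers from the question: 869 hidden out of 999 scraped.
sample = ["LinkedIn Member"] * 869 + ["Visible Person"] * 130
print(f"{hidden_share(sample):.0%} of results are hidden")
```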

---

**Q: How do I get around the 1000-result search limit?**  
**A:** Check the examples folder. You can block each user after scraping them, so they drop out of future searches, and try many different locations and search terms to maximize results.
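
The strategy boils down to running many narrower searches and merging the results. The sketch below shows only the planning and de-duplication steps, loosely following `examples/x_corp_staff.py`. `build_search_plan` and `dedupe_by_id` are hypothetical helpers, and rows are plain dicts here rather than DataFrame rows.

```python
def build_search_plan(terms, locations, generic_runs=1):
    """Return keyword-argument dicts for a series of scrape_staff() calls.

    Generic searches come first, then one search per term, then one per
    location (terms and locations are not combined).
    """
    plan = [{"search_term": None, "location": None} for _ in range(generic_runs)]
    plan += [{"search_term": term, "location": None} for term in terms]
    plan += [{"search_term": None, "location": loc} for loc in locations]
    return plan


def dedupe_by_id(rows):
    """Keep the first row seen for each 'id', preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        unique.append(row)
    return unique


plan = build_search_plan(["sales", "engineer"], ["London"], generic_runs=2)
# Each entry would be used roughly as:
# account.scrape_staff(company_name="x-corp", max_results=1000, block=True, **entry)
```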

---

**Q: Exception: driver not found for selenium?**  
**A:** You need chromedriver installed (not Chrome itself): https://googlechromelabs.github.io/chrome-for-testing/#stable

---

**Q: Encountering issues with your queries?**  
**A:** If problems persist, [submit an issue](https://github.com/cullenwatson/StaffSpy/issues).


### Staff Schema

```plaintext
Staff
├── Personal Information
│   ├── search_term
│   ├── id
│   ├── name
│   ├── first_name
│   ├── last_name
│   ├── location
│   └── bio
│
├── Professional Details
│   ├── position
│   ├── profile_id
│   ├── profile_link
│   ├── potential_emails
│   └── estimated_age
│
├── Social Connectivity
│   ├── followers
│   ├── connections
│   └── mutuals_count
│
├── Status
│   ├── influencer
│   ├── creator
│   ├── premium
│   ├── open_to_work
│   ├── is_hiring
│   └── is_connection
│
├── Visuals
│   ├── profile_photo
│   └── banner_photo
│
├── Skills
│   ├── name
│   └── endorsements
│
├── Experiences
│   ├── from_date
│   ├── to_date
│   ├── duration
│   ├── title
│   ├── company
│   ├── location
│   └── emp_type
│
├── Certifications
│   ├── title
│   ├── issuer
│   ├── date_issued
│   ├── cert_id
│   └── cert_link
│
├── Educational Background
|   ├── years
|   ├── school
|   └── degree
│
└── Connection Info (only when a connection and enabled on their profile)
    ├── email_address
    ├── address
    ├── birthday
    ├── websites
    ├── phone_numbers
    └── created_at
```
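
The `potential_emails` values in the sample output follow two visible patterns: `first.last@domain` and first initial plus last name. The sketch below reproduces only those two patterns as inferred from the table. It is an assumption about the format and not the library's actual logic, and `potential_emails` is a hypothetical function name.

```python
def potential_emails(first_name, last_name, domain):
    """Guess email addresses from a name and a company domain.

    Covers only the two patterns visible in the sample output; real
    addresses may follow other conventions.
    """
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    if not (first and last and domain):
        return []
    return [
        f"{first}.{last}@{domain}",   # e.g. javier.sierra@openai.com
        f"{first[0]}{last}@{domain}", # e.g. jsierra@openai.com
    ]


guesses = potential_emails("Javier", "Sierra", "openai.com")
```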


================================================
FILE: examples/daily_auto_connect.py
================================================
""" Script to connect with 10 software engineers daily from random tech companies """

from staffspy import LinkedInAccount, DriverType, BrowserType
import random
import time
from datetime import datetime
import schedule

# List of tech companies to randomly choose from
TECH_COMPANIES = [
    "microsoft",
    "google",
    "apple",
    "meta",
    "amazon",
    "netflix",
    "salesforce",
    "adobe",
    "intel",
    "nvidia",
    "oracle",
    "ibm",
    "vmware",
    "twitter",
    "linkedin",
    "airbnb",
    "uber",
    "stripe",
    "snowflake",
    "databricks",
]


def connect_with_staff():
    print(f"Starting connection run at {datetime.now()}")

    # Initialize LinkedIn account
    account = LinkedInAccount(session_file="session.pkl", log_level=1)

    # Choose a random company
    company = random.choice(TECH_COMPANIES)
    print(f"Selected company: {company}")

    # Connect with 10 users
    account.scrape_staff(
        company_name=company,
        search_term="software engineer",
        max_results=10,
        extra_profile_data=True,
        connect=True,
    )


if __name__ == "__main__":
    # Schedule to run once a day at 10 AM
    schedule.every().day.at("10:00").do(connect_with_staff)

    # Run immediately on script start
    connect_with_staff()

    # Keep the script running
    while True:
        schedule.run_pending()
        time.sleep(60)


================================================
FILE: examples/upload_staff_to_clay.py
================================================
"""
Uploads staff to the Clay platform to then further enrich the staff (e.g. waterfall strategy to find their verified emails)
"""

from staffspy import LinkedInAccount
from staffspy.utils.utils import upload_to_clay

session_file = "session.pkl"
account = LinkedInAccount(session_file=session_file, log_level=2)

connections = account.scrape_connections(extra_profile_data=True, max_results=3)

clay_webhook_url = (
    "https://api.clay.com/v3/sources/webhook/pull-in-data-from-a-webhook-XXXXXXXXXXXXXX"
)
upload_to_clay(webhook_url=clay_webhook_url, data=connections)


================================================
FILE: examples/x_corp_staff.py
================================================
"""
CASE STUDY: X CORP EMPLOYEES
RESULT: We retrieved 1087 profiles. Not as good as expected, but still a good result for a company that has 2800 employees.
final csv - https://drive.google.com/file/d/1aC-GF4RXf9wzGrpxQyGPBxlnLo2X5vm4

Strategies to get around LinkedIn 1000 result limit:
1) It blocks each user after scraping so they do not appear in future searches.
2) It tries various searches with department and location to get more results.

Lastly, it saves the results in CSV files and then combines them into one DataFrame at the end to view the results.
"""

import os
from datetime import datetime
import pandas as pd
import glob


from staffspy import LinkedInAccount

session_file = "session.pkl"
account = LinkedInAccount(session_file=session_file, log_level=2)


departments = [
    # Leadership
    "CEO",
    "CFO",
    "CTO",
    "COO",
    "executive",
    "director",
    "vice president",
    "head",
    "lead",
    # Engineering/Tech
    "software",
    "developer",
    "engineer",
    "architect",
    "devops",
    "QA",
    "data",
    "IT",
    "security",
    # Business/Operations
    "sales",
    "account",
    "business development",
    "operations",
    "project manager",
    "product manager",
    # Support Functions
    "HR",
    "recruiter",
    "marketing",
    "finance",
    "legal",
    "accounting",
    "admin",
    "support",
    # Customer-Facing
    "customer success",
    "account manager",
    "sales representative",
    "customer support",
    # Specialists
    "analyst",
    "consultant",
    "coordinator",
    "specialist",
]
locations = [
    "San Francisco",
    "New York",
    "Los Angeles",
    "Seattle",
    "Miami",
    "Boston",
    "Austin",
    "Chicago",
    "Toronto",
    "London",
    "Singapore",
    "Tokyo",
    "Dublin",
]


def save_results(users: pd.DataFrame):
    output_dir = f"output/{company_name}"
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_path = f"{output_dir}/users_{timestamp}.csv"
    users.to_csv(output_path, index=False)


def scrape_and_save(term=None, location=None):
    users = account.scrape_staff(
        company_name=company_name,
        search_term=term,
        location=location,
        extra_profile_data=True,
        max_results=1000,
        block=True,
    )
    if not users.empty:
        save_results(users)


company_name = "x-corp"

# generic search
for _ in range(5):
    scrape_and_save()

# Search by departments
for department in departments:
    scrape_and_save(term=department)

# Search by locations
for location in locations:
    scrape_and_save(location=location)

# load all csvs into one df
files = glob.glob(f"output/{company_name}/*.csv")
if not files:
    # pd.concat raises ValueError on an empty list, so stop early instead
    raise SystemExit("No result files found; nothing to combine.")
dfs = [pd.read_csv(f) for f in files]
combined_df = pd.concat(dfs, ignore_index=True)

# Filter out hidden profiles
filtered_df = combined_df[combined_df["urn"] != "headless"]
filtered_df = filtered_df[filtered_df["current_company"] == "X"]
filtered_df = filtered_df.drop_duplicates(subset="id")

filtered_urns = len(set(filtered_df["urn"]))
print(f"Total unique profiles: {filtered_urns}")
filtered_df.to_csv(
    f"output/{company_name}/final_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv",
    index=False,
)


================================================
FILE: pyproject.toml
================================================
[tool.poetry]
name = "staffspy"
version = "0.2.25"
description = "Staff scraper library for LinkedIn"
authors = ["Cullen Watson <cullen@cullenwatson.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
pydantic = "^2.7.2"
pandas = "^2.2.2"
requests = "^2.32.3"
tldextract = "^5.1.2"
selenium = { version = "^4.3.0", optional = true }
tenacity = "^8.5.0"
python-dateutil = "^2.9.0.post0"
beautifulsoup4 = "^4.12.3"
2captcha-python = "^1.2.8"

[tool.poetry.extras]
browser = ["selenium"]

[tool.poetry.group.dev.dependencies]
pre-commit = "^3.7.1"
black = "^24.4.2"
jupyter = "^1.0.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"


================================================
FILE: staffspy/__init__.py
================================================
import json
import pandas as pd

from staffspy.linkedin.comments import CommentFetcher
from staffspy.linkedin.linkedin import LinkedInScraper
from staffspy.utils.models import Staff
from staffspy.solvers.capsolver import CapSolver
from staffspy.solvers.solver_type import SolverType
from staffspy.solvers.two_captcha import TwoCaptchaSolver
from staffspy.utils.utils import (
    set_logger_level,
    logger,
    Login,
    parse_company_data,
    extract_emails_from_text,
    clean_df,
)
from staffspy.utils.driver_type import DriverType, BrowserType

__all__ = [
    "LinkedInAccount",
    "SolverType",
    "DriverType",
    "BrowserType",
]


class LinkedInAccount:
    """LinkedIn account that stores session cookies and exposes the public scraping methods"""

    solver_map = {
        SolverType.CAPSOLVER: CapSolver,
        SolverType.TWO_CAPTCHA: TwoCaptchaSolver,
    }

    def __init__(
        self,
        session_file: str = None,
        username: str = None,
        password: str = None,
        log_level: int = 0,
        solver_api_key: str = None,
        solver_service: SolverType = SolverType.CAPSOLVER,
        driver_type: DriverType = None,
    ):
        self.session_file = session_file
        self.username = username
        self.password = password
        self.log_level = log_level
        self.solver = self.solver_map[solver_service](solver_api_key)
        self.driver_type = driver_type
        self.session = None
        self.linkedin_scraper = None
        self.on_block = False
        self.login()

    def login(self):
        set_logger_level(self.log_level)
        login = Login(
            self.username,
            self.password,
            self.solver,
            self.session_file,
            self.driver_type,
        )
        self.session = login.load_session()

    def scrape_staff(
        self,
        company_name: str = None,
        search_term: str = None,
        location: str = None,
        extra_profile_data: bool = False,
        max_results: int = 1000,
        block: bool = False,
        connect: bool = False,
    ):
        """Main function entry point to scrape LinkedIn staff"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        li_scraper = LinkedInScraper(self.session)
        staff = li_scraper.scrape_staff(
            company_name=company_name,
            extra_profile_data=extra_profile_data,
            search_term=search_term,
            location=location,
            max_results=max_results,
            block=block,
            connect=connect,
        )
        if li_scraper.on_block:
            self.on_block = True
        staff_dicts = [member.to_dict() for member in staff]
        staff_df = pd.DataFrame(staff_dicts)
        if staff_df.empty:
            return staff_df

        staff_df = clean_df(staff_df)
        linkedin_member_df = staff_df[staff_df["name"] == "LinkedIn Member"]
        non_linkedin_member_df = staff_df[staff_df["name"] != "LinkedIn Member"]
        staff_df = pd.concat([non_linkedin_member_df, linkedin_member_df])
        logger.info(
            f"3) Staff from {company_name}: {len(staff_df)} total, {len(linkedin_member_df)} hidden, {len(staff_df) - len(linkedin_member_df)} visible"
        )
        return staff_df.reset_index(drop=True)

    def scrape_users(
        self, user_ids: list[str], block: bool = False, connect: bool = False
    ) -> pd.DataFrame | None:
        """Scrape users from Linkedin by user IDs"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )

        li_scraper = LinkedInScraper(self.session)
        li_scraper.num_staff = len(user_ids)
        users = [
            Staff(
                id="",
                search_term="manual",
                profile_id=user_id,
                profile_link=f"https://www.linkedin.com/in/{user_id}",
            )
            for user_id in user_ids
        ]

        for i, user in enumerate(users, start=1):
            user.id, user.urn = li_scraper.fetch_user_profile_data_from_public_id(
                user.profile_id, "user_id"
            )
            if user.id:
                li_scraper.fetch_all_info_for_employee(user, i)
                if block:
                    li_scraper.block_user(user)
                elif connect:
                    li_scraper.connect_user(user)

        users_dicts = [user.to_dict() for user in users if user.id]
        users_df = pd.DataFrame(users_dicts)

        if users_df.empty:
            return users_df
        linkedin_member_df = users_df[users_df["name"] == "LinkedIn Member"]
        non_linkedin_member_df = users_df[users_df["name"] != "LinkedIn Member"]
        users_df = pd.concat([non_linkedin_member_df, linkedin_member_df])
        logger.info(f"Scraped {len(users_df)} users")
        return users_df

    def scrape_comments(self, post_ids: list[str]) -> pd.DataFrame:
        """Scrape comments from Linkedin by post IDs"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )

        comment_fetcher = CommentFetcher(self.session)
        all_comments = []
        for post_id in post_ids:
            comments = comment_fetcher.fetch_comments(post_id)
            # fetch_comments returns False when a request fails; skip those posts
            if comments:
                all_comments.extend(comments)

        comment_dict = [comment.to_dict() for comment in all_comments]
        comment_df = pd.DataFrame(comment_dict)

        if not comment_df.empty:
            comment_df["emails"] = comment_df["text"].apply(extract_emails_from_text)
            comment_df = comment_df.sort_values(by="created_at", ascending=False)

        return comment_df

    def scrape_companies(
        self,
        company_names: list[str] = None,
    ) -> pd.DataFrame:
        """Scrape company details from Linkedin"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )

        if not company_names:
            raise ValueError("company_names list cannot be empty")

        li_scraper = LinkedInScraper(self.session)
        company_dfs = []

        for company_name in company_names:
            try:
                company_res = li_scraper.fetch_or_search_company(company_name)
                try:
                    company_data = company_res.json()
                except json.decoder.JSONDecodeError:
                    logger.error(f"Failed to fetch company data for {company_name}")
                    continue

                company_df = parse_company_data(company_data, search_term=company_name)
                company_dfs.append(company_df)

            except Exception as e:
                logger.error(f"Failed to process company {company_name}: {str(e)}")
                continue

        if not company_dfs:
            return pd.DataFrame()

        return pd.concat(company_dfs, ignore_index=True)

    def scrape_connections(
        self,
        max_results: int = 10**8,
        extra_profile_data: bool = False,
    ) -> pd.DataFrame:
        """Scrape connections from Linkedin"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        li_scraper = LinkedInScraper(self.session)

        connections = li_scraper.scrape_connections(
            max_results=max_results,
            extra_profile_data=extra_profile_data,
        )
        connections_df = pd.DataFrame()
        if connections:
            staff_dicts = [staff.to_dict() for staff in connections]
            connections_df = pd.DataFrame(staff_dicts)
            connections_df = clean_df(connections_df)

        return connections_df


================================================
FILE: staffspy/linkedin/certifications.py
================================================
import json
import logging

from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Certification

logger = logging.getLogger(__name__)


class CertificationFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:certifications,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"

    def fetch_certifications(self, staff):
        ep = self.endpoint.format(employee_id=staff.id)
        res = self.session.get(ep)
        logger.debug(f"certs, status code - {res.status_code}")
        if res.status_code == 429:
            raise TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text[:200])
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            return False

        try:
            elems = res_json["data"]["identityDashProfileComponentsBySectionType"][
                "elements"
            ]
        except (KeyError, IndexError, TypeError):
            logger.debug(res_json)
            return False

        if elems:
            cert_elems = elems[0]["components"]["pagedListComponent"]["components"][
                "elements"
            ]
            staff.certifications = self.parse_certifications(cert_elems)
        return True

    def parse_certifications(self, sections):
        certs = []
        for section in sections:
            elem = section["components"]["entityComponent"]
            if not elem:
                break
            title = elem["titleV2"]["text"]["text"]
            issuer = elem["subtitle"]["text"] if elem["subtitle"] else None
            date_issued = (
                elem["caption"]["text"].replace("Issued ", "")
                if elem["caption"]
                else None
            )
            cert_id = (
                elem["metadata"]["text"].replace("Credential ID ", "")
                if elem["metadata"]
                else None
            )
            try:
                subcomp = elem["subComponents"]["components"][0]
                cert_link = subcomp["components"]["actionComponent"]["action"][
                    "navigationAction"
                ]["actionTarget"]
            except (KeyError, IndexError, TypeError):
                cert_link = None
            cert = Certification(
                title=title,
                issuer=issuer,
                date_issued=date_issued,
                cert_link=cert_link,
                cert_id=cert_id,
            )
            certs.append(cert)

        return certs


================================================
FILE: staffspy/linkedin/comments.py
================================================
import json
import re
from datetime import datetime as dt

from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Comment

from staffspy.utils.utils import logger


class CommentFetcher:

    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSocialDashComments.8cb29aedde780600a7ad17fc7ebb8277&queryName=SocialDashCommentsBySocialDetail&variables=(origins:List(),count:100,socialDetailUrn:urn%3Ali%3Afsd_socialDetail%3A%28urn%3Ali%3Aactivity%3A{post_id}%2Curn%3Ali%3Aactivity%3A7254884361622208512%2Curn%3Ali%3AhighlightedReply%3A-%29,sortOrder:REVERSE_CHRONOLOGICAL,start:{start})"
        self.post_id = None
        self.num_comments = 100

    def fetch_comments(self, post_id: str):
        all_comments = []
        self.post_id = post_id

        for i in range(0, 200_000, self.num_comments):
            logger.info(f"Fetching comments for post {post_id}, start {i}")

            ep = self.endpoint.format(post_id=post_id, start=i)
            res = self.session.get(ep)
            logger.debug(f"comments info, status code - {res.status_code}")

            if res.status_code == 429:
                raise TooManyRequests("429 Too Many Requests")
            if not res.ok:
                logger.debug(res.text[:200])
                return False
            try:
                comments_json = res.json()
            except json.decoder.JSONDecodeError:
                logger.debug(res.text[:200])
                return False

            comments, num_results = self.parse_comments(comments_json)
            all_comments.extend(comments)
            if not num_results:
                break

        return all_comments

    def parse_comments(self, comments_json: dict):
        """Parse comments from a post's comments API response."""
        comments = []
        for element in (
            results := comments_json.get("data", {})
            .get("socialDashCommentsBySocialDetail", {})
            .get("elements", [])
        ):
            internal_profile_id = (commenter := element["commenter"])[
                "commenterProfileId"
            ]
            name = commenter["title"]["text"]
            linkedin_id_match = re.search("/in/(.+)", commenter["navigationUrl"])
            linkedin_id = linkedin_id_match.group(1) if linkedin_id_match else None

            commentary = element.get("commentary", {}).get("text", "")
            comment_id = element["urn"].split(",")[-1].rstrip(")")
            num_likes = element["socialDetail"]["totalSocialActivityCounts"]["numLikes"]
            comment = Comment(
                post_id=self.post_id,
                comment_id=comment_id,
                internal_profile_id=internal_profile_id,
                public_profile_id=linkedin_id,
                name=name,
                text=commentary,
                num_likes=num_likes,
                created_at=dt.utcfromtimestamp(element["createdAt"] / 1000),
            )
            comments.append(comment)

        return comments, len(results)


================================================
FILE: staffspy/linkedin/contact_info.py
================================================
from calendar import month_name
from datetime import datetime
import json
import requests
import logging

import pytz

from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import ContactInfo, Staff

logger = logging.getLogger(__name__)


class ContactInfoFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfiles.13618f886ce95bf503079f49245fbd6f&queryName=ProfilesByMemberIdentity&variables=(memberIdentity:{employee_id},count:1)"

    def fetch_contact_info(self, base_staff):
        ep = self.endpoint.format(employee_id=base_staff.id)
        try:
            res = self.session.get(ep)
        except requests.exceptions.TooManyRedirects as e:
            logger.error("Too many redirects encountered: %s", e)
            return None
        logger.debug(f"contact info, status code - {res.status_code}")
        if res.status_code == 429:
            return TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text)
            return False
        try:
            employee_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text)
            return False

        self.parse_emp_contact_info(base_staff, employee_json)
        return True

    def parse_emp_contact_info(self, emp: Staff, emp_dict: dict):
        """Parse the contact info from the employee profile response."""
        contact_info = ContactInfo()
        emp_dict = emp_dict["data"]["identityDashProfilesByMemberIdentity"]["elements"][
            0
        ]
        try:
            contact_info.email_address = emp_dict["emailAddress"]["emailAddress"]
        except (KeyError, IndexError, TypeError):
            pass

        try:
            contact_info.address = emp_dict["address"]
        except (KeyError, IndexError, TypeError):
            pass

        try:
            month = month_name[emp_dict["birthDateOn"]["month"]]
            day = emp_dict["birthDateOn"]["day"]
            contact_info.birthday = f"{month} {day}"
        except (KeyError, IndexError, TypeError):
            pass

        try:
            contact_info.websites = [x["url"] for x in emp_dict["websites"]]
        except (KeyError, IndexError, TypeError):
            pass

        try:
            contact_info.phone_numbers = [
                x["phoneNumber"]["number"] for x in emp_dict["phoneNumbers"]
            ]
        except (KeyError, IndexError, TypeError):
            pass

        try:
            created_at = emp_dict["memberRelationship"][
                "memberRelationshipDataResolutionResult"
            ]["connection"]["createdAt"]
            timezone = pytz.timezone("UTC")
            dt = datetime.fromtimestamp(created_at / 1000, tz=timezone)
            contact_info.created_at = dt.strftime("%Y-%m-%d %H:%M:%S %Z")
        except (KeyError, IndexError, TypeError):
            pass
        emp.contact_info = contact_info


================================================
FILE: staffspy/linkedin/employee.py
================================================
import json
import logging
import re

import staffspy.utils.utils as utils
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Staff

logger = logging.getLogger(__name__)


class EmployeeFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/voyagerIdentityDashProfiles?count=1&decorationId=com.linkedin.voyager.dash.deco.identity.profile.TopCardComplete-138&memberIdentity={employee_id}&q=memberIdentity"

        self.domain = None

    def fetch_employee(self, base_staff, domain):
        self.domain = domain
        ep = self.endpoint.format(employee_id=base_staff.id)
        res = self.session.get(ep)
        logger.debug(f"basic info, status code - {res.status_code}")
        if res.status_code == 429:
            return TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text[:200])
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            return False

        try:
            employee_json = res_json["elements"][0]
        except (KeyError, IndexError, TypeError):
            logger.debug(res_json)
            return False

        self.parse_emp(base_staff, employee_json)
        return True

    def parse_emp(self, emp: Staff, emp_dict: dict):
        """Parse the employee data from the employee profile."""

        def get_photo_url(emp_dict: dict, key: str):
            try:
                photo_data = emp_dict[key]["displayImageReference"]["vectorImage"]
                photo_base_url = photo_data["rootUrl"]
                photo_ext_url = photo_data["artifacts"][-1][
                    "fileIdentifyingUrlPathSegment"
                ]
                return f"{photo_base_url}{photo_ext_url}"
            except (KeyError, TypeError, IndexError, ValueError):
                return None

        emp.profile_photo = get_photo_url(emp_dict, "profilePicture")
        emp.banner_photo = get_photo_url(emp_dict, "backgroundPicture")
        emp.profile_id = emp_dict["publicIdentifier"]
        try:
            emp.headline = emp_dict.get("headline")
            if not emp.headline:
                emp.headline = emp_dict["memberRelationship"]["memberRelationshipData"][
                    "noInvitation"
                ]["targetInviteeResolutionResult"]["headline"]
        except (KeyError, IndexError, TypeError):
            pass
        union_type = next(
            iter(emp_dict["memberRelationship"]["memberRelationshipUnion"])
        )
        emp.is_connection = "no"
        if union_type == "connection":
            emp.is_connection = "yes"
        elif union_type == "noConnection":
            invitation = (
                emp_dict["memberRelationship"]["memberRelationshipUnion"][
                    "noConnection"
                ]
                .get("invitationUnion", {})
                .get("invitation", {})
            )
            if invitation and invitation.get("invitationState") == "PENDING":
                emp.is_connection = "pending"

        # profilePicture may be missing or null, so fall back to an empty dict
        frame_type = (emp_dict.get("profilePicture") or {}).get("frameType")
        emp.open_to_work = frame_type == "OPEN_TO_WORK"
        emp.is_hiring = frame_type == "HIRING"

        emp.first_name = emp_dict["firstName"]
        emp.last_name = emp_dict["lastName"].split(",")[0]
        if not emp.name:
            name = filter(None, [emp.first_name, emp.last_name])
            emp.name = " ".join(name)
        emp.potential_emails = (
            utils.create_emails(emp.first_name, emp.last_name, self.domain)
            if self.domain
            else None
        )

        emp.followers = emp_dict.get("followingState", {}).get("followerCount")
        emp.connections = emp_dict["connections"]["paging"]["total"]
        emp.location = (
            emp_dict.get("geoLocation", {}).get("geo", {}).get("defaultLocalizedName")
        )

        # Handle empty elements case for company
        top_positions = emp_dict.get("profileTopPosition", {}).get("elements", [])
        if top_positions:
            emp.company = top_positions[0].get("companyName", None)
        else:
            emp.company = None

        edu_cards = emp_dict.get("profileTopEducation", {}).get("elements", [])
        if edu_cards:
            emp.school = edu_cards[0].get(
                "schoolName", edu_cards[0].get("school", {}).get("name")
            )
        emp.influencer = emp_dict.get("influencer", False)
        emp.creator = emp_dict.get("creator", False)
        emp.premium = emp_dict.get("premium", False)
        emp.mutual_connections = 0

        try:
            profile_insight = emp_dict.get("profileInsight", {}).get("elements", [])
            if profile_insight:
                mutual_connections_str = profile_insight[0]["text"]["text"]
                match = re.search(r"\d+", mutual_connections_str)
                if match:
                    # A count of "N other(s)" is assumed to follow two named
                    # connections, hence N + 2
                    emp.mutual_connections = int(match.group()) + 2
                else:
                    # No number: either two named connections ("A and B") or one
                    emp.mutual_connections = (
                        2 if " and " in mutual_connections_str else 1
                    )
        except (KeyError, TypeError, IndexError, ValueError):
            pass


================================================
FILE: staffspy/linkedin/employee_bio.py
================================================
import json
import logging

from staffspy.utils.exceptions import TooManyRequests

logger = logging.getLogger(__name__)


class EmployeeBioFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileCards.9ad2590cb61a073ad514922fa752f566&queryName=ProfileTabInitialCards&variables=(count:50,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id})"

    def fetch_employee_bio(self, base_staff):
        ep = self.endpoint.format(employee_id=base_staff.id)
        res = self.session.get(ep)
        logger.debug(f"bio info, status code - {res.status_code}")
        if res.status_code == 429:
            return TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text)
            return False
        try:
            data = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text)
            return False

        try:
            base_staff.bio = data["data"]["identityDashProfileCardsByInitialCards"][
                "elements"
            ][3]["topComponents"][1]["components"]["textComponent"]["text"]["text"]
        except (KeyError, IndexError, TypeError):
            return False

        return True


================================================
FILE: staffspy/linkedin/experiences.py
================================================
import json
import logging

import staffspy.utils.utils as utils
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Experience

logger = logging.getLogger(__name__)


class ExperiencesFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:experience,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"

    def fetch_experiences(self, staff):
        ep = self.endpoint.format(employee_id=staff.id)
        res = self.session.get(ep)
        logger.debug(f"exps, status code - {res.status_code}")
        if res.reason == "INKApi Error":
            raise Exception(
                "Delete session file and log in again",
                res.status_code,
                res.text[:200],
                res.reason,
            )
        elif res.status_code == 429:
            return TooManyRequests("429 Too Many Requests")
        elif not res.ok:
            logger.debug(res.text[:200])
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            return False

        try:
            exp_elements = res_json["data"][
                "identityDashProfileComponentsBySectionType"
            ]["elements"][0]["components"]["pagedListComponent"]["components"][
                "elements"
            ]
        except (KeyError, IndexError, TypeError):
            logger.debug(res_json)
            return False

        staff.experiences = self.parse_experiences(exp_elements)
        return True

    def parse_experiences(self, elements):
        exps = []
        for elem in elements:
            try:
                components = elem.get("components")
                if components is None:
                    continue

                entity = components.get("entityComponent")
                if entity is None:
                    continue

                sub_components = entity.get("subComponents")
                if (
                    sub_components is None
                    or len(sub_components.get("components", [])) == 0
                    or sub_components["components"][0].get("components") is None
                    or sub_components["components"][0]["components"].get(
                        "pagedListComponent"
                    )
                    is None
                ):

                    emp_type = start_date = end_date = None

                    caption = entity.get("caption")
                    duration = caption.get("text") if caption else None
                    if duration:
                        start_date, end_date = utils.parse_dates(duration)
                        from_date, to_date = utils.parse_duration(duration)
                        if from_date:
                            duration_parts = duration.split(" · ")
                            if len(duration_parts) > 1:
                                duration = duration_parts[1]

                    subtitle = entity.get("subtitle")
                    company = subtitle.get("text") if subtitle else None

                    titleV2 = entity.get("titleV2")
                    title_text = titleV2.get("text") if titleV2 else None
                    title = title_text.get("text") if title_text else None

                    metadata = entity.get("metadata")
                    location = metadata.get("text") if metadata else None

                    if company:
                        parts = company.split(" · ")
                        if len(parts) > 1:
                            company = parts[0]
                            emp_type = parts[-1].lower()

                    exp = Experience(
                        duration=duration,
                        title=title,
                        company=company,
                        emp_type=emp_type,
                        start_date=start_date,
                        end_date=end_date,
                        location=location,
                    )
                    exps.append(exp)
                else:
                    multi_exps = self.parse_multi_exp(entity)
                    exps += multi_exps

            except Exception as e:
                logger.exception(e)

        return exps

    def parse_multi_exp(self, entity):
        exps = []
        company = entity["titleV2"]["text"]["text"]
        elements = entity["subComponents"]["components"][0]["components"][
            "pagedListComponent"
        ]["components"]["elements"]
        for elem in elements:
            entity = elem["components"]["entityComponent"]
            duration = entity["caption"]["text"]
            title = entity["titleV2"]["text"]["text"]
            emp_type = (
                entity["subtitle"]["text"].lower() if entity["subtitle"] else None
            )
            location = entity["metadata"]["text"] if entity["metadata"] else None
            start_date, end_date = utils.parse_dates(duration)
            from_date, to_date = utils.parse_duration(duration)
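            if from_date:
                # Guard against captions without the " · " separator
                duration_parts = duration.split(" · ")
                if len(duration_parts) > 1:
                    duration = duration_parts[1]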
            exp = Experience(
                duration=duration,
                title=title,
                company=company,
                emp_type=emp_type,
                start_date=start_date,
                end_date=end_date,
                location=location,
            )
            exps.append(exp)
        return exps


================================================
FILE: staffspy/linkedin/languages.py
================================================
import json
import logging

from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Skill, Staff

logger = logging.getLogger(__name__)


class LanguagesFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.9117695ef207012719e3e0681c667e14&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:languages,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"

    def fetch_languages(self, staff: Staff):
        ep = self.endpoint.format(employee_id=staff.id)
        res = self.session.get(ep)
        logger.debug(f"languages, status code - {res.status_code}")
        if res.status_code == 429:
            return TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text)
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text)
            return False

        if res_json.get("errors"):
            return False
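        try:
            staff.languages = self.parse_languages(res_json)
        except (KeyError, IndexError, TypeError):
            # Unexpected or empty response shape, e.g. no language section
            logger.debug(res_json)
            return False
        return True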

    def parse_languages(self, language_json: dict) -> list[str]:
        languages = []
        elements = language_json["data"]["identityDashProfileComponentsBySectionType"][
            "elements"
        ][0]["components"]["pagedListComponent"]["components"]["elements"]

        for element in elements:
            if comp := element["components"]["entityComponent"]:
                title = comp["titleV2"]["text"]["text"]
                languages.append(title)

        return languages


================================================
FILE: staffspy/linkedin/linkedin.py
================================================
"""
staffspy.linkedin.linkedin
~~~~~~~~~~~~~~~~~~~~~~~~~~

This module contains routines to scrape LinkedIn.
"""

import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import quote, unquote

import requests

import staffspy.utils.utils as utils
from staffspy.utils.exceptions import TooManyRequests, BadCookies, GeoUrnNotFound
from staffspy.linkedin.contact_info import ContactInfoFetcher
from staffspy.linkedin.certifications import CertificationFetcher
from staffspy.linkedin.employee import EmployeeFetcher
from staffspy.linkedin.employee_bio import EmployeeBioFetcher
from staffspy.linkedin.experiences import ExperiencesFetcher
from staffspy.linkedin.languages import LanguagesFetcher
from staffspy.linkedin.schools import SchoolsFetcher
from staffspy.linkedin.skills import SkillsFetcher
from staffspy.utils.models import Staff
from staffspy.utils.utils import logger


class LinkedInScraper:
    employees_ep = "https://www.linkedin.com/voyager/api/graphql?variables=(start:{offset},query:(flagshipSearchIntent:SEARCH_SRP,{search}queryParameters:List({company_id}{location}(key:resultType,value:List(PEOPLE))),includeFiltersInResponse:false),count:{count})&queryId=voyagerSearchDashClusters.66adc6056cf4138949ca5dcb31bb1749"
    company_id_ep = "https://www.linkedin.com/voyager/api/organization/companies?q=universalName&universalName="
    company_search_ep = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSearchDashClusters.02af3bc8bc85a169bb76bb4805d05759&queryName=SearchClusterCollection&variables=(query:(flagshipSearchIntent:SEARCH_SRP,keywords:{company},includeFiltersInResponse:false,queryParameters:(keywords:List({company}),resultType:List(COMPANIES))),count:10,origin:GLOBAL_SEARCH_HEADER,start:0)"
    location_id_ep = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSearchDashReusableTypeahead.57a4fa1dd92d3266ed968fdbab2d7bf5&queryName=SearchReusableTypeaheadByType&variables=(query:(showFullLastNameForConnections:false,typeaheadFilterQuery:(geoSearchTypes:List(MARKET_AREA,COUNTRY_REGION,ADMIN_DIVISION_1,CITY))),keywords:{location},type:GEO,start:0)"
    public_user_id_ep = (
        "https://www.linkedin.com/voyager/api/identity/profiles/{user_id}/profileView"
    )
    connections_ep = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSearchDashClusters.dfcd3603c2779eddd541f572936f4324&queryName=SearchClusterCollection&variables=(query:(queryParameters:(resultType:List(FOLLOWERS)),flagshipSearchIntent:MYNETWORK_CURATION_HUB,includeFiltersInResponse:true),count:50,origin:CurationHub,start:{offset})"
    block_user_ep = "https://www.linkedin.com/voyager/api/voyagerTrustDashContentReportingForm?action=entityBlock"
    connect_to_user_ep = "https://www.linkedin.com/voyager/api/voyagerRelationshipsDashMemberRelationships?action=verifyQuotaAndCreateV2&decorationId=com.linkedin.voyager.dash.deco.relationships.InvitationCreationResultWithInvitee-1"

    def __init__(self, session: requests.Session):
        self.session = session
        (
            self.company_id,
            self.staff_count,
            self.num_staff,
            self.company_name,
            self.domain,
            self.max_results,
            self.search_term,
            self.location,
            self.raw_location,
        ) = (None, None, None, None, None, None, None, None, None)
        self.on_block = False
        self.connect_block = False
        self.certs = CertificationFetcher(self.session)
        self.skills = SkillsFetcher(self.session)
        self.employees = EmployeeFetcher(self.session)
        self.schools = SchoolsFetcher(self.session)
        self.experiences = ExperiencesFetcher(self.session)
        self.bio = EmployeeBioFetcher(self.session)
        self.languages = LanguagesFetcher(self.session)
        self.contact = ContactInfoFetcher(self.session)

    def search_companies(self, company_name: str):
        """Search LinkedIn for a company by name and return the first result's company name id."""

        company_search_ep = self.company_search_ep.format(company=quote(company_name))
        self.session.headers["x-li-graphql-pegasus-client"] = "true"
        res = self.session.get(company_search_ep)
        self.session.headers.pop("x-li-graphql-pegasus-client", "")
        if not res.ok:
            raise Exception(
                f"Failed to search for company {company_name}",
                res.status_code,
                res.text[:200],
            )
        logger.debug(
            f"Searched companies for name '{company_name}' - res code {res.status_code}"
        )
        companies = res.json()["data"]["searchDashClustersByAll"]["elements"]

        err_msg = f"No companies found for name {company_name}"
        if len(companies) < 2:
            raise Exception(err_msg)
        try:
            num_results = companies[0]["items"][0]["item"]["simpleTextV2"]["text"][
                "text"
            ]
            first_company = companies[1]["items"][0]["item"].get("entityResult")
            if not first_company and len(companies) > 2:
                first_company = companies[2]["items"][0]["item"].get("entityResult")
            if not first_company:
                raise Exception(err_msg)

            company_link = first_company["navigationUrl"]
            company_name_id = unquote(
                re.search(r"/company/([^/]+)", company_link).group(1)
            )
            company_name_new = first_company["title"]["text"]
        except Exception as e:
            raise Exception(
                f"Failed to load json in search_companies {str(e)}, Response: {res.text[:200]}"
            )

        logger.info(
            f"Searched for company '{company_name}' on LinkedIn ({num_results}), using first result with company name - '{company_name_new}' and company id - '{company_name_id}'"
        )
        return company_name_id

    def fetch_or_search_company(self, company_name):
        """Fetch the company details by name, or search if not found."""
        res = self.session.get(f"{self.company_id_ep}{company_name}")

        if res.status_code not in (200, 404):
            raise Exception(
                f"Failed to find company {company_name} (likely due to an outdated login if you know it's a valid company)",
                res.status_code,
                res.text[:200],
            )
        elif res.status_code == 404:
            logger.info(
                f"Failed to directly use company '{company_name}' as company id, now searching for the company"
            )
            company_name = self.search_companies(company_name)
            res = self.session.get(f"{self.company_id_ep}{company_name}")
            if res.status_code != 200:
                raise Exception(
                    f"Failed to find company after performing a direct and generic search for {company_name}",
                    res.status_code,
                    res.text[:200],
                )

        # Only a 200 response can reach this point, so log unconditionally
        logger.debug(f"res code {res.status_code} - fetched company")
        return res

    def _get_company_id_and_staff_count(self, company_name: str):
        """Extract company id and staff count from the company details."""
        res = self.fetch_or_search_company(company_name)

        try:
            response_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            raise Exception(
                f"Failed to load json in get_company_id_and_staff_count {res.text[:200]}"
            )

        company = response_json["elements"][0]
        self.domain = (
            utils.extract_base_domain(company["companyPageUrl"])
            if company.get("companyPageUrl")
            else None
        )
        staff_count = company["staffCount"]
        company_id = company["trackingInfo"]["objectUrn"].split(":")[-1]
        company_name = company["universalName"]

        logger.info(f"Found company '{company_name}' with {staff_count} staff")
        return company_id, staff_count

    def parse_staff(self, elements: list[dict]):
        """Parse the staff from the search results"""
        staff = []

        for elem in elements:
            for card in elem.get("items", []):
                person = card.get("item", {}).get("entityResult", {})
                if not person:
                    continue
                pattern = (
                    r"urn:li:fsd_profile:([^,]+),(?:SEARCH_SRP|MYNETWORK_CURATION_HUB)"
                )
                match = re.search(pattern, person["entityUrn"])
                linkedin_id = match.group(1) if match else None
                person_urn = person["trackingUrn"].split(":")[-1]

                name = person["title"]["text"].strip()
                headline = (
                    person.get("primarySubtitle", {}).get("text", "")
                    if person.get("primarySubtitle")
                    else ""
                )
                profile_link = person["navigationUrl"].split("?")[0]
                staff.append(
                    Staff(
                        urn=person_urn,
                        id=linkedin_id,
                        name=name,
                        headline=headline,
                        search_term=" - ".join(
                            filter(
                                None,
                                [
                                    self.company_name,
                                    self.search_term,
                                    self.raw_location,
                                ],
                            )
                        ),
                        profile_link=profile_link,
                    )
                )
        return staff

    def fetch_staff(self, offset: int):
        """Fetch the staff using LinkedIn search"""
        ep = self.employees_ep.format(
            offset=offset,
            company_id=(
                f"(key:currentCompany,value:List({self.company_id})),"
                if self.company_id
                else ""
            ),
            count=50,
            search=f"keywords:{quote(self.search_term)}," if self.search_term else "",
            location=(
                f"(key:geoUrn,value:List({self.location}))," if self.location else ""
            ),
        )
        res = self.session.get(ep)
        if not res.ok:
            logger.debug(f"employees, status code - {res.status_code}")
        if res.status_code == 400:
            raise BadCookies("Outdated login, delete the session file to log in again")
        elif res.status_code == 429:
            raise TooManyRequests("429 Too Many Requests")
        if not res.ok:
            return None, 0
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text)
            return None, 0

        try:
            elements = res_json["data"]["searchDashClustersByAll"]["elements"]
            total_count = res_json["data"]["searchDashClustersByAll"]["metadata"][
                "totalResultCount"
            ]

        except (KeyError, IndexError, TypeError):
            logger.debug(res_json)
            return None, 0
        new_staff = self.parse_staff(elements) if elements else []
        return new_staff, total_count

    def fetch_connections_page(self, offset: int):
        self.session.headers["x-li-graphql-pegasus-client"] = "true"
        res = self.session.get(self.connections_ep.format(offset=offset))
        self.session.headers.pop("x-li-graphql-pegasus-client", "")
        if not res.ok:
            logger.debug(f"connections, status code - {res.status_code}")
        if res.status_code == 400:
            raise BadCookies("Outdated login, delete the session file to log in again")
        elif res.status_code == 429:
            raise TooManyRequests("429 Too Many Requests")
        # Always return a (staff, total_count) pair so callers can unpack it
        if not res.ok:
            return None, 0
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text)
            return None, 0

        try:
            elements = res_json["data"]["searchDashClustersByAll"]["elements"]
            total_count = res_json["data"]["searchDashClustersByAll"]["metadata"][
                "totalResultCount"
            ]

        except (KeyError, IndexError, TypeError):
            logger.debug(res_json)
            return None, 0

        new_staff = self.parse_staff(elements) if elements else []
        return new_staff, total_count

    def scrape_connections(
        self,
        max_results: int = 10**8,
        extra_profile_data: bool = False,
    ):
        self.search_term = "connections"
        staff_list: list[Staff] = []

        try:
            initial_staff, total_search_result_count = self.fetch_connections_page(0)
            if initial_staff:
                staff_list.extend(initial_staff)

            self.num_staff = min(total_search_result_count, max_results)
            for offset in range(50, self.num_staff, 50):
                staff, _ = self.fetch_connections_page(offset)
                # Check for an empty page before calling len() on it
                if not staff:
                    break
                logger.debug(
                    f"Connections from search: {len(staff)} new, {len(staff_list) + len(staff)} total"
                )
                staff_list.extend(staff)
        except (BadCookies, TooManyRequests) as e:
            self.on_block = True
            logger.error(f"Exiting early due to fatal error: {str(e)}")
            return staff_list[:max_results]

        reduced_staff_list = staff_list[:max_results]

        non_restricted = list(
            filter(lambda x: x.name != "LinkedIn Member", reduced_staff_list)
        )

        if extra_profile_data:
            try:
                for i, employee in enumerate(non_restricted, start=1):
                    self.fetch_all_info_for_employee(employee, i)
            except TooManyRequests as e:
                logger.error(str(e))
        return reduced_staff_list

    def fetch_location_id(self):
        """Fetch the location id for the location to be used in LinkedIn search"""
        ep = self.location_id_ep.format(location=quote(self.raw_location))
        res = self.session.get(ep)
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            if res.reason == "INKApi Error":
                raise Exception(
                    "Delete session file and log in again",
                    res.status_code,
                    res.text[:200],
                    res.reason,
                )
            raise GeoUrnNotFound(
                "Failed to send request to get geo id",
                res.status_code,
                res.text[:200],
                res.reason,
            )

        try:
            elems = res_json["data"]["searchDashReusableTypeaheadByType"]["elements"]
        except (KeyError, IndexError, TypeError):
            # res_json is a dict and cannot be sliced; truncate its string form
            raise GeoUrnNotFound("Failed to locate geo id", str(res_json)[:200])

        geo_id = None
        if elems:
            urn = elems[0]["trackingUrn"]
            m = re.search("urn:li:geo:(.+)", urn)
            if m:
                geo_id = m.group(1)
        if not geo_id:
            raise GeoUrnNotFound("Failed to parse geo id")
        self.location = geo_id

    def scrape_staff(
        self,
        company_name: str | None,
        search_term: str,
        location: str,
        extra_profile_data: bool,
        max_results: int,
        block: bool,
        connect: bool,
    ):
        """Main function entry point to scrape LinkedIn staff"""
        self.search_term = search_term
        self.company_name = company_name
        self.max_results = max_results
        self.raw_location = location
        self.company_id = None

        if self.company_name:
            self.company_id, staff_count = self._get_company_id_and_staff_count(
                company_name
            )

        staff_list: list[Staff] = []

        if self.raw_location:
            try:
                self.fetch_location_id()
            except GeoUrnNotFound as e:
                logger.error(str(e))
                return staff_list[:max_results]

        try:
            initial_staff, total_count = self.fetch_staff(0)
            if initial_staff:
                staff_list.extend(initial_staff)
            location = f", location: '{location}'" if location else ""
            logger.info(
                f"1) Search results for company: '{company_name}'{location} - {total_count:,} staff"
            )

            self.num_staff = min(total_count, max_results, 1000)
            for offset in range(50, self.num_staff, 50):
                staff, _ = self.fetch_staff(offset)
                # fetch_staff returns None on failure, so check before len()
                if not staff:
                    break
                logger.debug(
                    f"Staff members from search: {len(staff)} new, {len(staff_list) + len(staff)} total"
                )
                staff_list.extend(staff)
            # `location` already holds the formatted suffix built before the first log line
            logger.info(
                f"2) Total results collected for company: '{company_name}'{location} - {len(staff_list)} results"
            )
        except (BadCookies, TooManyRequests) as e:
            self.on_block = True
            logger.error(f"Exiting early due to fatal error: {str(e)}")
            return staff_list[:max_results]

        reduced_staff_list = staff_list[:max_results]
        non_restricted = list(
            filter(lambda x: x.name != "LinkedIn Member", reduced_staff_list)
        )

        if extra_profile_data:
            try:
                for i, employee in enumerate(non_restricted, start=1):
                    self.fetch_all_info_for_employee(employee, i)
                    if block:
                        self.block_user(employee)
                    elif connect:
                        self.connect_user(employee)

            except TooManyRequests as e:
                logger.error(str(e))

        return reduced_staff_list

    def fetch_all_info_for_employee(self, employee: Staff, index: int):
        """Simultaneously fetch all the data for an employee"""
        logger.info(
            f"Fetching data for account {employee.id} {index:>4} / {self.num_staff} - {employee.profile_link}"
        )

        task_functions = [
            (self.employees.fetch_employee, (employee, self.domain), "employee"),
            (self.skills.fetch_skills, (employee,), "skills"),
            (self.experiences.fetch_experiences, (employee,), "experiences"),
            (self.certs.fetch_certifications, (employee,), "certifications"),
            (self.schools.fetch_schools, (employee,), "schools"),
            (self.bio.fetch_employee_bio, (employee,), "bio"),
            (self.languages.fetch_languages, (employee,), "languages"),
        ]

        with ThreadPoolExecutor(max_workers=len(task_functions)) as executor:
            tasks = {
                executor.submit(func, *args): name
                for func, args, name in task_functions
            }

            for future in as_completed(tasks):
                # result() re-raises any exception raised inside the worker
                future.result()

        if employee.is_connection:
            self.contact.fetch_contact_info(employee)

    def fetch_user_profile_data_from_public_id(self, user_id: str, key: str):
        """Fetches data given the public LinkedIn user id"""
        endpoint = self.public_user_id_ep.format(user_id=user_id)
        response = self.session.get(endpoint)

        try:
            response_json = response.json()
        except json.decoder.JSONDecodeError:
            logger.debug(response.text[:200])
            raise Exception(
                "Failed to load JSON from endpoint",
                response.status_code,
                response.reason,
            )

        keys = {
            "user_id": ("positionView", "profileId"),
            "company_id": (
                "positionView",
                "elements",
                0,
                "company",
                "miniCompany",
                "universalName",
            ),
        }

        try:
            data = response_json
            for k in keys[key]:
                data = data[k]
            urn = response_json["profile"]["miniProfile"]["objectUrn"].split(":")[-1]
            return data, urn
        except (KeyError, TypeError, IndexError) as e:
            logger.warning(f"Failed to find user_id {user_id}")
            if key == "user_id":
                return ""
            raise Exception(f"Failed to fetch '{key}' for user_id {user_id}: {e}")

    def block_user(self, employee: Staff) -> None:
        """Block a user on LinkedIn given their urn"""
        if employee.urn == "headless":
            return
        self.session.headers["Content-Type"] = (
            "application/x-protobuf2; symbol-table=voyager-20757"
        )

        urn_string = f"urn:li:member:{employee.urn}"
        length_byte = bytes([len(urn_string)])
        body = b"\x00\x01\x14\nblockeeUrn\x14" + length_byte + urn_string.encode()

        res = self.session.post(
            self.block_user_ep,
            data=body,
        )
        self.session.headers.pop("Content-Type", "")

        if res.ok:
            logger.info(f"Successfully blocked user {employee.id}")
        elif res.status_code == 403:
            logger.warning(
                f"Failed to block user - status code 403; one possible reason is that you already blocked/unblocked this person in the past 48 hours and are on cooldown: {employee.profile_link}"
            )
        else:
            logger.warning(
                f"Failed to block user - status code {res.status_code} {employee.id}: {employee.name}"
            )

    def connect_user(self, employee: Staff) -> None:
        """Connects with a user on LinkedIn given their profile id"""
        if self.connect_block:
            logger.info(
                f"Skipping connection request for user due to previous block: {employee.id} - {employee.profile_link}"
            )
            return
        if employee.urn == "headless":
            return
        if employee.is_connection != "no":
            return logger.info(
                f"Already connected or pending connection request to user {employee.id} - {employee.profile_link}"
            )
        self.session.headers["Content-Type"] = (
            "application/x-protobuf2; symbol-table=voyager-20757"
        )
        body = (
            b"\x00\x01\x03\xe2\x05\x00\x01\x03\xd3w\x00\x01\x03\xd5\x06\x14:urn:li:fsd_profile:"
            + employee.id.encode()
        )

        res = self.session.post(
            self.connect_to_user_ep,
            data=body,
        )
        self.session.headers.pop("Content-Type", "")

        if res.ok:
            logger.info(
                f"Successfully sent connection request to user {employee.id} - {employee.profile_link}"
            )
        elif res.status_code == 429:
            self.connect_block = True
            logger.warning(
                f"Failed to connect to user - status code 429 - pausing connection requests for this scrape: {employee.id} - {employee.profile_link}"
            )
        else:
            logger.warning(
                f"Failed to connect to user - status code {res.status_code} {employee.id} - {employee.profile_link}"
            )
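
The offset arithmetic used by `scrape_staff` above can be sketched as a standalone helper. This is only an illustration: `page_offsets` and its parameter names are hypothetical and not part of StaffSpy; the page size of 50 and the cap of 1000 results mirror the constants in the code above.

```python
def page_offsets(
    total_count: int,
    max_results: int,
    page_size: int = 50,   # matches count=50 in fetch_staff
    hard_cap: int = 1000,  # matches the 1000 cap in scrape_staff
) -> list[int]:
    """Offsets of the pages requested after the initial page at offset 0."""
    num_staff = min(total_count, max_results, hard_cap)
    return list(range(page_size, num_staff, page_size))


# 120 matching profiles: offset 0 is fetched first, then offsets 50 and 100
print(page_offsets(120, 1000))  # [50, 100]
```

A search that fits in one page yields an empty list, so no follow-up requests are made.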


================================================
FILE: staffspy/linkedin/schools.py
================================================
import json
import logging

from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import School
from staffspy.utils.utils import parse_dates

logger = logging.getLogger(__name__)


class SchoolsFetcher:

    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:education,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"

    def fetch_schools(self, staff):
        ep = self.endpoint.format(employee_id=staff.id)
        res = self.session.get(ep)
        logger.debug(f"schools, status code - {res.status_code}")
        if res.status_code == 429:
            raise TooManyRequests("429 Too Many Requests")

        if not res.ok:
            logger.debug(res.text[:200])
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            return False

        try:
            elements = res_json["data"]["identityDashProfileComponentsBySectionType"][
                "elements"
            ][0]["components"]["pagedListComponent"]["components"]["elements"]
        except (KeyError, IndexError, TypeError) as e:
            logger.debug(res_json)
            return False

        staff.schools = self.parse_schools(elements)
        return True

    def parse_schools(self, elements):
        schools = []
        for elem in elements:
            entity = elem["components"]["entityComponent"]
            if not entity:
                break
            years = entity["caption"]["text"] if entity["caption"] else None
            school_name = entity["titleV2"]["text"]["text"]

            # Reset per school so dates from a previous entry do not carry over
            start = end = None
            if years:
                start, end = parse_dates(years)
            degree = entity["subtitle"]["text"] if entity["subtitle"] else None
            school = School(
                start_date=start, end_date=end, school=school_name, degree=degree
            )
            schools.append(school)

        return schools


================================================
FILE: staffspy/linkedin/skills.py
================================================
import json
import logging

from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Skill, Staff

logger = logging.getLogger(__name__)


class SkillsFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:skills,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"

    def fetch_skills(self, staff: Staff):
        ep = self.endpoint.format(employee_id=staff.id)
        res = self.session.get(ep)
        logger.debug(f"skills, status code - {res.status_code}")
        if res.status_code == 429:
            raise TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text[:200])
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            return False

        if res_json.get("errors"):
            return False
        tab_comp = res_json["data"]["identityDashProfileComponentsBySectionType"][
            "elements"
        ][0]["components"]["tabComponent"]
        if tab_comp:
            sections = tab_comp["sections"]
            staff.skills = self.parse_skills(sections)
        return True

    def parse_skills(self, sections):
        names = set()
        skills = []
        for section in sections:
            elems = section["subComponent"]["components"]["pagedListComponent"][
                "components"
            ]["elements"]
            for elem in elems:
                passed_assessment, endorsements = None, 0
                entity = elem["components"]["entityComponent"]
                name = entity["titleV2"]["text"]["text"]
                if name in names:
                    continue
                names.add(name)
                components = entity["subComponents"]["components"]
                for component in components:

                    try:
                        candidate = component["components"]["insightComponent"]["text"][
                            "text"
                        ]["text"]
                        if " endorsements" in candidate:
                            endorsements = int(candidate.replace(" endorsements", ""))
                        if "Passed LinkedIn Skill Assessment" in candidate:
                            passed_assessment = True
                    except (KeyError, TypeError, ValueError):
                        # Component has no insight text or an unparsable count
                        pass

                skills.append(
                    Skill(
                        name=name,
                        endorsements=endorsements,
                        passed_assessment=passed_assessment,
                    )
                )
        return skills


================================================
FILE: staffspy/solvers/capsolver.py
================================================
import json
import time

import requests
from tenacity import retry, stop_after_attempt, retry_if_result

from staffspy.solvers.solver import Solver


def is_none(value):
    return value is None


class CapSolver(Solver):
    """https://www.capsolver.com/"""

    @retry(stop=stop_after_attempt(10), retry=retry_if_result(is_none))
    def solve(self, blob_data: str, page_url: str = None):
        from staffspy.utils.utils import logger

        logger.info(f"Waiting on CapSolver to solve captcha...")

        payload = {
            "clientKey": self.solver_api_key,
            "task": {
                "type": "FunCaptchaTaskProxyLess",
                "websitePublicKey": self.public_key,
                "websiteURL": self.page_url,
                "data": json.dumps({"blob": blob_data}) if blob_data else "",
            },
        }
        res = requests.post("https://api.capsolver.com/createTask", json=payload)
        resp = res.json()
        task_id = resp.get("taskId")
        if not task_id:
            raise Exception(
                "CapSolver failed to create task, try another captcha solver like 2Captcha if this persists or use browser sign in `pip install staffspy[browser]` and then remove the username/password params to the LinkedInAccount()",
                res.text,
            )
        logger.info(f"Received captcha solver taskId: {task_id} / Getting result...")

        while True:
            time.sleep(1)  # delay
            payload = {"clientKey": self.solver_api_key, "taskId": task_id}
            res = requests.post("https://api.capsolver.com/getTaskResult", json=payload)
            resp = res.json()
            status = resp.get("status")
            if status == "ready":
                logger.info(f"CapSolver finished solving captcha")
                return resp.get("solution", {}).get("token")
            if status == "failed" or resp.get("errorId"):
                logger.info(f"Captcha solve failed! response: {res.text}")
                return None
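
The `while True` polling loop above has no deadline, so a task that never leaves the processing state would block forever. A minimal sketch of a bounded variant follows. `poll_task` and its parameters are hypothetical and not part of StaffSpy or the CapSolver API; `get_status` stands in for the `getTaskResult` request, and the response keys are the ones read by the code above.

```python
import time
from typing import Callable, Optional


def poll_task(
    get_status: Callable[[], dict],
    timeout: float = 120.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the task is ready, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = get_status()
        status = resp.get("status")
        if status == "ready":
            return resp.get("solution", {}).get("token")
        if status == "failed" or resp.get("errorId"):
            return None
        sleep(interval)
    # Timed out without a terminal status
    return None
```

Returning `None` on timeout keeps the sketch compatible with a `retry_if_result(is_none)` policy like the one on `solve`.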


================================================
FILE: staffspy/solvers/solver.py
================================================
from abc import ABC, abstractmethod


class Solver(ABC):
    public_key = "3117BF26-4762-4F5A-8ED9-A85E69209A46"
    page_url = "https://iframe.arkoselabs.com"

    def __init__(self, solver_api_key: str):
        self.solver_api_key = solver_api_key

    @abstractmethod
    def solve(self, blob_data: str, page_url: str = None):
        pass


================================================
FILE: staffspy/solvers/solver_type.py
================================================
from enum import Enum

class SolverType(Enum):
    CAPSOLVER = 'capsolver'
    TWO_CAPTCHA = 'twocaptcha'


================================================
FILE: staffspy/solvers/two_captcha.py
================================================
from tenacity import retry_if_exception_type, stop_after_attempt, retry
from twocaptcha import TwoCaptcha, TimeoutException, ApiException, NetworkException

from staffspy.solvers.solver import Solver


class TwoCaptchaSolver(Solver):
    """https://2captcha.com/"""

    attempt = 1

    @retry(
        stop=stop_after_attempt(5),
        retry=retry_if_exception_type(
            (TimeoutException, ApiException, NetworkException)
        ),
    )
    def solve(self, blob_data: str, page_url: str = None):
        super().solve(blob_data, page_url)
        from staffspy.utils.utils import logger

        logger.info(
            f"Waiting on 2Captcha to solve captcha attempt {self.attempt} / 5 ..."
        )
        self.attempt += 1

        solver = TwoCaptcha(self.solver_api_key)

        result = solver.funcaptcha(
            sitekey=self.public_key,
            url=page_url,
            **{"data[blob]": blob_data},
            surl="https://iframe.arkoselabs.com",
        )
        logger.info(f"2Captcha finished solving captcha")
        return result["code"]


================================================
FILE: staffspy/utils/driver_type.py
================================================
from enum import Enum
from typing import Optional


class BrowserType(Enum):
    CHROME = "chrome"
    FIREFOX = "firefox"


class DriverType:
    def __init__(
        self, browser_type: BrowserType, executable_path: Optional[str] = None
    ):
        self.browser_type = browser_type
        self.executable_path = executable_path


================================================
FILE: staffspy/utils/exceptions.py
================================================
class TooManyRequests(Exception):
    """Too many requests."""


class BadCookies(Exception):
    """Login expiration."""


class GeoUrnNotFound(Exception):
    """Could not find geo urn for given location."""


class BlobException(Exception):
    """Could not find the blob needed to solve the captcha."""


================================================
FILE: staffspy/utils/models.py
================================================
from datetime import datetime, date

from pydantic import BaseModel
from datetime import datetime as dt

from staffspy.utils.utils import extract_emails_from_text


class Comment(BaseModel):
    post_id: str
    comment_id: str | None = None
    internal_profile_id: str | None = None
    public_profile_id: str | None = None
    name: str | None = None
    text: str | None = None
    num_likes: int | None = None
    created_at: dt | None = None

    def to_dict(self):
        return {
            "post_id": self.post_id,
            "comment_id": self.comment_id,
            "internal_profile_id": self.internal_profile_id,
            "public_profile_id": self.public_profile_id,
            "name": self.name,
            "text": self.text,
            "num_likes": self.num_likes,
            "created_at": self.created_at,
        }


class School(BaseModel):
    start_date: date | None = None
    end_date: date | None = None
    school: str | None = None
    degree: str | None = None

    def to_dict(self):
        return {
            "start_date": self.start_date.isoformat() if self.start_date else None,
            "end_date": self.end_date.isoformat() if self.end_date else None,
            "school": self.school,
            "degree": self.degree,
        }


class Skill(BaseModel):
    name: str | None = None
    endorsements: int | None = None
    passed_assessment: bool | None = None

    def to_dict(self):
        return {
            "name": self.name,
            "endorsements": self.endorsements if self.endorsements else 0,
            "passed_assessment": self.passed_assessment,
        }


class ContactInfo(BaseModel):
    email_address: str | None = None
    websites: list | None = None
    phone_numbers: list | None = None
    address: str | None = None
    birthday: str | None = None
    created_at: str | None = None

    def to_dict(self):
        return {
            "email_address": self.email_address,
            "websites": self.websites,
            "phone_numbers": self.phone_numbers,
            "address": self.address,
            "birthday": self.birthday,
            "created_at": self.created_at,
        }


class Certification(BaseModel):
    title: str | None = None
    issuer: str | None = None
    date_issued: str | None = None
    cert_id: str | None = None
    cert_link: str | None = None

    def to_dict(self):
        return {
            "title": self.title,
            "issuer": self.issuer,
            "date_issued": self.date_issued,
            "cert_id": self.cert_id,
            "cert_link": self.cert_link,
        }


class Experience(BaseModel):
    duration: str | None = None
    title: str | None = None
    company: str | None = None
    location: str | None = None
    emp_type: str | None = None
    start_date: date | None = None
    end_date: date | None = None

    def to_dict(self):
        return {
            "start_date": self.start_date.isoformat() if self.start_date else None,
            "end_date": self.end_date.isoformat() if self.end_date else None,
            "duration": self.duration,
            "title": self.title,
            "company": self.company,
            "location": self.location,
            "emp_type": self.emp_type,
        }


class Staff(BaseModel):
    urn: str | None = None
    search_term: str
    id: str
    name: str | None = None
    headline: str | None = None
    current_position: str | None = None

    profile_id: str | None = None
    profile_link: str | None = None
    first_name: str | None = None
    last_name: str | None = None
    potential_emails: list | None = None
    bio: str | None = None
    emails_in_bio: list[str] | None = None
    followers: int | None = None
    connections: int | None = None
    mutual_connections: int | None = None
    is_connection: str | None = None  # yes, no, pending
    location: str | None = None
    company: str | None = None
    school: str | None = None
    influencer: bool | None = None
    creator: bool | None = None
    premium: bool | None = None
    open_to_work: bool | None = None
    is_hiring: bool | None = None
    profile_photo: str | None = None
    banner_photo: str | None = None
    skills: list[Skill] | None = None
    experiences: list[Experience] | None = None
    certifications: list[Certification] | None = None
    contact_info: ContactInfo | None = None
    schools: list[School] | None = None
    languages: list[str] | None = None

    def get_top_skills(self):
        top_three_skills = []
        if self.skills:
            sorted_skills = sorted(
                # Treat a missing endorsement count as 0 so None never reaches the comparison
                self.skills, key=lambda x: x.endorsements or 0, reverse=True
            )
            top_three_skills = [skill.name for skill in sorted_skills[:3]]
        top_three_skills += [None] * (3 - len(top_three_skills))
        return top_three_skills

    def to_dict(self):
        sorted_schools = (
            sorted(
                self.schools,
                key=lambda x: (x.end_date is None, x.end_date),
                reverse=True,
            )
            if self.schools
            else []
        )

        top_three_school_names = [school.school for school in sorted_schools[:3]]
        top_three_school_names += [None] * (3 - len(top_three_school_names))
        estimated_age = self.estimate_age_based_on_education()

        sorted_experiences = (
            sorted(
                self.experiences,
                key=lambda x: (x.end_date is None, x.end_date),
                reverse=True,
            )
            if self.experiences
            else []
        )

        top_three_companies = []
        seen_companies = set()
        for exp in sorted_experiences:
            if exp.company not in seen_companies:
                top_three_companies.append(exp.company)
                seen_companies.add(exp.company)
            if len(top_three_companies) == 3:
                break

        top_three_companies += [None] * (3 - len(top_three_companies))
        top_three_skills = self.get_top_skills()
        self.emails_in_bio = extract_emails_from_text(self.bio) if self.bio else None
        self.current_position = (
            sorted_experiences[0].title
            if len(sorted_experiences) > 0 and sorted_experiences[0].end_date is None
            else None
        )

        contact_info = self.contact_info.to_dict() if self.contact_info else {}
        return {
            "search_term": self.search_term,
            "id": self.id,
            "urn": self.urn,
            "profile_link": self.profile_link,
            "profile_id": self.profile_id,
            "name": self.name,
            "first_name": self.first_name,
            "last_name": self.last_name,
            "location": self.location,
            "headline": self.headline,
            "estimated_age": estimated_age,
            "followers": self.followers,
            "connections": self.connections,
            "mutuals": self.mutual_connections,
            "is_connection": self.is_connection,
            "premium": self.premium,
            "creator": self.creator,
            "influencer": self.influencer,
            "open_to_work": self.open_to_work,
            "is_hiring": self.is_hiring,
            "current_position": self.current_position,
            "current_company": top_three_companies[0],
            "past_company_1": top_three_companies[1],
            "past_company_2": top_three_companies[2],
            "school_1": top_three_school_names[0],
            "school_2": top_three_school_names[1],
            "top_skill_1": top_three_skills[0],
            "top_skill_2": top_three_skills[1],
            "top_skill_3": top_three_skills[2],
            "bio": self.bio,
            "experiences": (
                [exp.to_dict() for exp in self.experiences]
                if self.experiences
                else None
            ),
            "schools": (
                [school.to_dict() for school in self.schools] if self.schools else None
            ),
            "skills": (
                [skill.to_dict() for skill in self.skills] if self.skills else None
            ),
            "certifications": (
                [cert.to_dict() for cert in self.certifications]
                if self.certifications
                else None
            ),
            "languages": self.languages,
            "emails_in_bio": (
                ", ".join(self.emails_in_bio) if self.emails_in_bio else None
            ),
            "potential_emails": self.potential_emails,
            "profile_photo": self.profile_photo,
            "banner_photo": self.banner_photo,
            "connection_created_at": contact_info.get("created_at"),
            "connection_email": contact_info.get("email_address"),
            "connection_phone_numbers": contact_info.get("phone_numbers"),
            "connection_websites": contact_info.get("websites"),
            "connection_street_address": contact_info.get("address"),
            "connection_birthday": contact_info.get("birthday"),
        }

    def estimate_age_based_on_education(self):
        """Adds 18 to their first college start date"""
        college_words = ["uni", "college"]

        sorted_schools = (
            sorted(
                [school for school in self.schools if school.start_date],
                key=lambda x: x.start_date,
            )
            if self.schools
            else []
        )

        current_date = datetime.now().date()
        for school in sorted_schools:
            if (
                any(word in school.school.lower() for word in college_words)
                or school.degree
            ):
                if school.start_date:
                    years_in_education = (current_date - school.start_date).days // 365
                    return int(18 + years_in_education)
        return None


================================================
FILE: staffspy/utils/utils.py
================================================
import logging
import os
import pickle
import re
from datetime import datetime

import pandas as pd
from typing import Optional
from urllib.parse import quote

import requests
import tldextract
from bs4 import BeautifulSoup
from dateutil.parser import parse
from tenacity import stop_after_attempt, retry_if_exception_type, retry, RetryError

from staffspy.solvers.solver import Solver
from staffspy.utils.driver_type import DriverType, BrowserType
from staffspy.utils.exceptions import BlobException

logger = logging.getLogger("StaffSpy")
logger.propagate = False
if not logger.handlers:
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler()
    log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    formatter = logging.Formatter(log_format)
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)


def set_csrf_token(session):
    csrf_token = session.cookies["JSESSIONID"].replace('"', "")
    session.headers.update({"Csrf-Token": csrf_token})
    return session


def extract_base_domain(url: str):
    extracted = tldextract.extract(url)
    base_domain = "{}.{}".format(extracted.domain, extracted.suffix)
    return base_domain


def create_emails(first, last, domain):
    first = "".join(filter(str.isalpha, first)).lower()
    last = "".join(filter(str.isalpha, last)).lower()
    emails = [
        f"{first}.{last}@{domain}",
        f"{first[:1]}{last}@{domain}",
        f"{first[:2]}{last}@{domain}",
        f"{first}{last[:1]}@{domain}",
        f"{first}{last[:2]}@{domain}",
    ]
    return emails


def get_webdriver(driver_type: Optional[DriverType] = None):
    try:
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service as ChromeService
        from selenium.webdriver.firefox.service import Service as FirefoxService
    except ImportError as e:
        # chain the original error so the missing module is visible in the traceback
        raise Exception(
            'install package `pip install "staffspy[browser]"` to login with browser'
        ) from e

    if driver_type:
        if str(driver_type.browser_type) == str(BrowserType.CHROME):
            if driver_type.executable_path:
                service = ChromeService(executable_path=driver_type.executable_path)
                return webdriver.Chrome(service=service)
            else:
                return webdriver.Chrome()
        elif str(driver_type.browser_type) == str(BrowserType.FIREFOX):
            if driver_type.executable_path:
                service = FirefoxService(executable_path=driver_type.executable_path)
                return webdriver.Firefox(service=service)
            else:
                return webdriver.Firefox()
    else:
        for browser in [webdriver.Chrome, webdriver.Firefox]:
            try:
                return browser()
            except Exception:
                continue
    return None


class Login:

    def __init__(
        self,
        username: str,
        password: str,
        solver: Solver,
        session_file: str,
        driver_type: DriverType = None,
    ):
        (
            self.username,
            self.password,
            self.solver,
            self.session_file,
            self.driver_type,
        ) = (username, password, solver, session_file, driver_type)

    def solve_captcha(self, session, data, payload):
        url = data["challenge_url"]
        r = session.post(url, data=payload)

        soup = BeautifulSoup(r.text, "html.parser")

        code_tag = soup.find("code", id="securedDataExchange")

        logger.info("Searching for captcha blob in linkedin to begin captcha solving")
        if code_tag:
            comment = code_tag.contents[0]
            extracted_code = str(comment).strip('<!--""-->').strip()
            logger.debug("Extracted captcha blob: %s", extracted_code)
        elif "Please choose a more secure password." in r.text:
            raise Exception(
                "linkedin is requiring a more secure password. reset pw and try again"
            )
        else:
            raise BlobException(
                "blob to solve captcha not found - rerunning the program usually solves this"
            )

        if not self.solver:
            raise Exception(
                "captcha hit - provide solver_api_key and solver_service name to solve or switch to the browser-based login with `pip install staffspy[browser]`"
            )
        token = self.solver.solve(extracted_code, url)
        if not token:
            raise Exception("failed to solve captcha after 10 attempts")

        captcha_site_key = soup.find("input", {"name": "captchaSiteKey"})["value"]
        challenge_id = soup.find("input", {"name": "challengeId"})["value"]
        challenge_data = soup.find("input", {"name": "challengeData"})["value"]
        challenge_details = soup.find("input", {"name": "challengeDetails"})["value"]
        challenge_type = soup.find("input", {"name": "challengeType"})["value"]
        challenge_source = soup.find("input", {"name": "challengeSource"})["value"]
        request_submission_id = soup.find("input", {"name": "requestSubmissionId"})[
            "value"
        ]
        display_time = soup.find("input", {"name": "displayTime"})["value"]
        page_instance = soup.find("input", {"name": "pageInstance"})["value"]
        failure_redirect_uri = soup.find("input", {"name": "failureRedirectUri"})[
            "value"
        ]
        sign_in_link = soup.find("input", {"name": "signInLink"})["value"]
        join_now_link = soup.find("input", {"name": "joinNowLink"})["value"]
        for cookie in session.cookies:
            if cookie.name == "JSESSIONID":
                jsession_value = cookie.value.split("ajax:")[1].strip('"')
                break
        else:
            raise Exception("jsessionid not found, raise issue on GitHub")
        csrf_token = f"ajax:{jsession_value}"
        payload = {
            "csrfToken": csrf_token,
            "captchaSiteKey": captcha_site_key,
            "challengeId": challenge_id,
            "language": "en-US",
            "displayTime": display_time,
            "challengeType": challenge_type,
            "challengeSource": challenge_source,
            "requestSubmissionId": request_submission_id,
            "captchaUserResponseToken": token,
            "challengeData": challenge_data,
            "pageInstance": page_instance,
            "challengeDetails": challenge_details,
            "failureRedirectUri": failure_redirect_uri,
            "signInLink": sign_in_link,
            "joinNowLink": join_now_link,
            "_s": "CONSUMER_LOGIN",
        }
        encoded_payload = {
            key: f'{quote(str(value), "")}' for key, value in payload.items()
        }
        query_string = "&".join(
            [f"{key}={value}" for key, value in encoded_payload.items()]
        )
        response = session.post(
            "https://www.linkedin.com/checkpoint/challenge/verify", data=query_string
        )

        if not response.ok:
            raise Exception(f"verify captcha failed {response.text[:200]}")

    @retry(stop=stop_after_attempt(5), retry=retry_if_exception_type(BlobException))
    def login_requests(self):

        url = "https://www.linkedin.com/uas/authenticate"

        encoded_username = quote(self.username)
        encoded_password = quote(self.password)
        session = requests.Session()
        session.headers = {
            "X-Li-User-Agent": "LIAuthLibrary:44.0.* com.linkedin.LinkedIn:9.29.8962 iPhone:17.5.1",
            "User-Agent": "LinkedIn/9.29.8962 CFNetwork/1496.0.7 Darwin/23.5.0",
            "X-User-Language": "en",
            "X-User-Locale": "en_US",
            "Accept-Language": "en-us",
        }

        response = session.get(url)
        if response.status_code != 200:
            raise Exception(
                f"failed to begin auth process: {response.status_code} {response.text}"
            )
        for cookie in session.cookies:
            if cookie.name == "JSESSIONID":
                jsession_value = cookie.value.split("ajax:")[1].strip('"')
                break
        else:
            raise Exception("jsessionid not found, raise issue on GitHub")
        session.headers["content-type"] = "application/x-www-form-urlencoded"
        csrf_token = f"ajax%3A{jsession_value}"
        payload = f"session_key={encoded_username}&session_password={encoded_password}&JSESSIONID=%22{csrf_token}%22"
        response = session.post(url, data=payload)
        data = response.json()

        if data["login_result"] == "BAD_USERNAME_OR_PASSWORD":
            raise Exception("incorrect username or password")
        elif data["login_result"] == "CHALLENGE":
            self.solve_captcha(session, data, payload)

        session = set_csrf_token(session)
        return session

    def login_browser(self):
        """Backup login method"""
        driver = get_webdriver(self.driver_type)

        if driver is None:
            logger.debug("No browser found for selenium")
            raise Exception("driver not found for selenium")

        driver.get("https://linkedin.com/login")
        input("Press enter after logged in")

        selenium_cookies = driver.get_cookies()
        driver.quit()

        session = requests.Session()
        for cookie in selenium_cookies:
            session.cookies.set(cookie["name"], cookie["value"])

        session = set_csrf_token(session)
        return session

    def save_session(self, session, session_file: str):
        data = {"cookies": session.cookies, "headers": session.headers}
        with open(session_file, "wb") as f:
            pickle.dump(data, f)

    def load_session(self):
        """Load session from session file, otherwise login"""
        session = None
        if not self.session_file or not os.path.exists(self.session_file):
            if self.username and self.password:
                try:
                    session = self.login_requests()
                except RetryError as retry_err:
                    retry_err.reraise()
            else:
                session = self.login_browser()
            if not session:
                raise Exception("Failed to log in.")
            if self.session_file:
                self.save_session(session, self.session_file)
        else:
            with open(self.session_file, "rb") as f:
                data = pickle.load(f)
                session = requests.Session()
                session.cookies.update(data["cookies"])
                session.headers.update(data["headers"])
        session.headers.update(
            {
                "User-Agent": "Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
                "X-RestLi-Protocol-Version": "2.0.0",
                "X-Li-Track": '{"clientVersion":"1.13.1665"}',
            }
        )
        if not self.check_logged_in(session):
            raise Exception(
                "Failed to log in. Likely outdated session file and cookies have expired. Best practice to delete the file and rerun the LinkedAccount() code"
            )
        return session

    def check_logged_in(self, session):
        logger.info("Testing if logged in by checking arbitrary LinkedIn company page")
        try:
            res = session.get(
                "https://www.linkedin.com/voyager/api/organization/companies?q=universalName&universalName=amazon"
            )
            if res.status_code != 200:
                logger.error(f"{res.status_code} status code returned from linkedin")
                return False
        except Exception as e:
            logger.error(f"Failed to get arbitrary company page: {e}")
            return False
        logger.info("Account successfully logged in - res code 200")
        return True


def parse_date(date_str):
    formats = ["%b %Y", "%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None


def parse_duration(duration):
    from_date = to_date = None
    dates = duration.split(" · ")
    if len(dates) > 1:
        date_range, _ = duration.split(" · ")
        dates = date_range.split(" - ")
        from_date_str = dates[0]
        to_date_str = dates[1] if dates[1] != "Present" else None
        from_date = parse_date(from_date_str) if from_date_str else None
        to_date = parse_date(to_date_str) if to_date_str else None

    return from_date, to_date


def set_logger_level(verbose: int = 0):
    """
    Adjusts the logger's level. This function allows the logging level to be changed at runtime.

    Parameters:
    - verbose: int {0, 1, 2} (default=0): 0 = WARNING, 1 = INFO, 2 = DEBUG;
      any other value falls back to INFO, and None leaves the level unchanged
    """
    if verbose is None:
        return
    level_name = {2: "DEBUG", 1: "INFO", 0: "WARNING"}.get(verbose, "INFO")
    level = getattr(logging, level_name.upper(), None)
    if level is not None:
        logger.setLevel(level)
    else:
        raise ValueError(f"Invalid log level: {level_name}")


def parse_dates(date_str):
    regex = r"(\b\w+ \d{4}|\b\d{4}|\bPresent)"
    matches = re.findall(regex, date_str)

    start_date, end_date = None, None
    if matches:
        if "Present" in matches:
            if len(matches) == 1:
                start_date = None
                end_date = None
            else:
                start_date = parse(matches[0]).date()
                end_date = None
        else:
            if len(matches) == 2:
                start_date = parse(matches[0]).date()
                end_date = parse(matches[1]).date()
            elif len(matches) == 1:
                start_date = parse(matches[0]).date()

    return start_date, end_date


def extract_emails_from_text(text: str) -> list[str] | None:
    if not text:
        return None
    email_regex = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
    return email_regex.findall(text)


def parse_company_data(json_data, search_term=None):
    company_info = json_data["elements"][0]

    company_name = company_info.get("name", "")
    staff_count = company_info.get("staffCount", None)
    company_type = company_info.get("type", "")
    description = company_info.get("description", "")

    industries_list = [
        ind.get("localizedName", "")
        for ind in company_info.get("companyIndustries", [])
    ]

    headquarter = company_info.get("headquarter", {})
    headquarter_full = f'{headquarter.get("line1", "")}, {headquarter.get("city", "")}, {headquarter.get("country", "")} {headquarter.get("postalCode", "")}'

    logo_data = company_info.get("logo", {})
    vector_image = logo_data.get("image", {}).get("com.linkedin.common.VectorImage", {})
    root_url = vector_image.get("rootUrl", "")
    artifacts = vector_image.get("artifacts", [])

    logo_url = None
    if artifacts:
        first_artifact = artifacts[0]
        file_path = first_artifact.get("fileIdentifyingUrlPathSegment", "")
        logo_url = root_url + file_path

    tracking_info = company_info.get("trackingInfo", {})
    object_urn = tracking_info.get("objectUrn", "")
    internal_id = None
    if object_urn.startswith("urn:li:company:"):
        internal_id = object_urn.split(":")[-1]

    bg_photo = company_info.get("backgroundCoverPhoto", {})
    vector_image = bg_photo.get("com.linkedin.common.VectorImage", {})
    root_url = vector_image.get("rootUrl", "")
    artifacts = vector_image.get("artifacts", [])
    banner_url = None
    if artifacts:
        chosen_artifact = artifacts[0]
        file_segment = chosen_artifact.get("fileIdentifyingUrlPathSegment", "")
        banner_url = root_url + file_segment

    company_df = pd.DataFrame(
        {
            "search_term": [search_term],
            "linkedin_company_id": [internal_id],
            "company_name": [company_name],
            "staff_count": [staff_count],
            "company_type": [company_type],
            "industries": [industries_list],
            "headquarters_address": [headquarter_full],
            "description": [description],
            "logo_url": [logo_url],
            "banner_url": [banner_url],
        }
    )
    return company_df


def clean_df(staff_df):
    if "estimated_age" in staff_df.columns:
        staff_df["estimated_age"] = staff_df["estimated_age"].astype("Int64")
    if "followers" in staff_df.columns:
        staff_df["followers"] = staff_df["followers"].astype("Int64")
    if "connections" in staff_df.columns:
        staff_df["connections"] = staff_df["connections"].astype("Int64")
    if "mutuals" in staff_df.columns:
        staff_df["mutuals"] = staff_df["mutuals"].astype("Int64")
    return staff_df


def upload_to_clay(webhook_url: str, data: pd.DataFrame):
    records = data.to_dict("records")

    responses = []
    for i, row in enumerate(records, start=1):
        try:
            response = requests.post(
                webhook_url, headers={"Accept": "application/json"}, json=row
            )
            response.raise_for_status()
            logger.info(f"Uploaded row to Clay: {i} / {len(records)}")
        except requests.exceptions.RequestException as e:
            logger.error(f"Failed to upload row to Clay: {str(e)}")
            responses.append({"error": str(e), "data": row})

    return responses


if __name__ == "__main__":
    p = parse_dates("May 2018 - Jun 2024")

SYMBOL INDEX (114 symbols across 21 files)

FILE: examples/daily_auto_connect.py
  function connect_with_staff (line 34) | def connect_with_staff():

FILE: examples/x_corp_staff.py
  function save_results (line 90) | def save_results(users: pd.DataFrame):
  function scrape_and_save (line 98) | def scrape_and_save(term=None, location=None):

FILE: staffspy/__init__.py
  class LinkedInAccount (line 28) | class LinkedInAccount:
    method __init__ (line 36) | def __init__(
    method login (line 57) | def login(self):
    method scrape_staff (line 68) | def scrape_staff(
    method scrape_users (line 109) | def scrape_users(
    method scrape_comments (line 152) | def scrape_comments(self, post_ids: list[str]) -> pd.DataFrame:
    method scrape_companies (line 174) | def scrape_companies(
    method scrape_connections (line 211) | def scrape_connections(

FILE: staffspy/linkedin/certifications.py
  class CertificationFetcher (line 10) | class CertificationFetcher:
    method __init__ (line 11) | def __init__(self, session):
    method fetch_certifications (line 15) | def fetch_certifications(self, staff):
    method parse_certifications (line 45) | def parse_certifications(self, sections):

FILE: staffspy/linkedin/comments.py
  class CommentFetcher (line 11) | class CommentFetcher:
    method __init__ (line 13) | def __init__(self, session):
    method fetch_comments (line 19) | def fetch_comments(self, post_id: str):
    method parse_comments (line 48) | def parse_comments(self, comments_json: dict):

FILE: staffspy/linkedin/contact_info.py
  class ContactInfoFetcher (line 15) | class ContactInfoFetcher:
    method __init__ (line 16) | def __init__(self, session):
    method fetch_contact_info (line 20) | def fetch_contact_info(self, base_staff):
    method parse_emp_contact_info (line 42) | def parse_emp_contact_info(self, emp: Staff, emp_dict: dict):

FILE: staffspy/linkedin/employee.py
  class EmployeeFetcher (line 12) | class EmployeeFetcher:
    method __init__ (line 13) | def __init__(self, session):
    method fetch_employee (line 19) | def fetch_employee(self, base_staff, domain):
    method parse_emp (line 44) | def parse_emp(self, emp: Staff, emp_dict: dict):

FILE: staffspy/linkedin/employee_bio.py
  class EmployeeBioFetcher (line 9) | class EmployeeBioFetcher:
    method __init__ (line 10) | def __init__(self, session):
    method fetch_employee_bio (line 14) | def fetch_employee_bio(self, base_staff):

FILE: staffspy/linkedin/experiences.py
  class ExperiencesFetcher (line 11) | class ExperiencesFetcher:
    method __init__ (line 12) | def __init__(self, session):
    method fetch_experiences (line 16) | def fetch_experiences(self, staff):
    method parse_experiences (line 51) | def parse_experiences(self, elements):
    method parse_multi_exp (line 121) | def parse_multi_exp(self, entity):

FILE: staffspy/linkedin/languages.py
  class LanguagesFetcher (line 10) | class LanguagesFetcher:
    method __init__ (line 11) | def __init__(self, session):
    method fetch_languages (line 15) | def fetch_languages(self, staff: Staff):
    method parse_languages (line 35) | def parse_languages(self, language_json: dict) -> list[str]:

FILE: staffspy/linkedin/linkedin.py
  class LinkedInScraper (line 29) | class LinkedInScraper:
    method __init__ (line 41) | def __init__(self, session: requests.Session):
    method search_companies (line 65) | def search_companies(self, company_name: str):
    method fetch_or_search_company (line 111) | def fetch_or_search_company(self, company_name):
    method _get_company_id_and_staff_count (line 138) | def _get_company_id_and_staff_count(self, company_name: str):
    method parse_staff (line 163) | def parse_staff(self, elements: list[dict]):
    method fetch_staff (line 207) | def fetch_staff(self, offset: int):
    method fetch_connections_page (line 249) | def fetch_connections_page(self, offset: int):
    method scrape_connections (line 280) | def scrape_connections(
    method fetch_location_id (line 321) | def fetch_location_id(self):
    method scrape_staff (line 357) | def scrape_staff(
    method fetch_all_info_for_employee (line 434) | def fetch_all_info_for_employee(self, employee: Staff, index: int):
    method fetch_user_profile_data_from_public_id (line 462) | def fetch_user_profile_data_from_public_id(self, user_id: str, key: str):
    method block_user (line 501) | def block_user(self, employee: Staff) -> None:
    method connect_user (line 530) | def connect_user(self, employee: Staff) -> None:

FILE: staffspy/linkedin/schools.py
  class SchoolsFetcher (line 11) | class SchoolsFetcher:
    method __init__ (line 13) | def __init__(self, session):
    method fetch_schools (line 17) | def fetch_schools(self, staff):
    method parse_schools (line 44) | def parse_schools(self, elements):

FILE: staffspy/linkedin/skills.py
  class SkillsFetcher (line 10) | class SkillsFetcher:
    method __init__ (line 11) | def __init__(self, session):
    method fetch_skills (line 15) | def fetch_skills(self, staff: Staff):
    method parse_skills (line 40) | def parse_skills(self, sections):

FILE: staffspy/solvers/capsolver.py
  function is_none (line 10) | def is_none(value):
  class CapSolver (line 14) | class CapSolver(Solver):
    method solve (line 18) | def solve(self, blob_data: str, page_url: str = None):

FILE: staffspy/solvers/solver.py
  class Solver (line 4) | class Solver(ABC):
    method __init__ (line 8) | def __init__(self, solver_api_key:str):
    method solve (line 12) | def solve(self, blob_data: str, page_ur: str=None):

FILE: staffspy/solvers/solver_type.py
  class SolverType (line 3) | class SolverType(Enum):

FILE: staffspy/solvers/two_captcha.py
  class TwoCaptchaSolver (line 7) | class TwoCaptchaSolver(Solver):
    method solve (line 18) | def solve(self, blob_data: str, page_url: str = None):

FILE: staffspy/utils/driver_type.py
  class BrowserType (line 5) | class BrowserType(Enum):
  class DriverType (line 10) | class DriverType:
    method __init__ (line 11) | def __init__(

FILE: staffspy/utils/exceptions.py
  class TooManyRequests (line 1) | class TooManyRequests(Exception):
  class BadCookies (line 5) | class BadCookies(Exception):
  class GeoUrnNotFound (line 9) | class GeoUrnNotFound(Exception):
  class BlobException (line 13) | class BlobException(Exception):

FILE: staffspy/utils/models.py
  class Comment (line 9) | class Comment(BaseModel):
    method to_dict (line 19) | def to_dict(self):
  class School (line 32) | class School(BaseModel):
    method to_dict (line 38) | def to_dict(self):
  class Skill (line 47) | class Skill(BaseModel):
    method to_dict (line 52) | def to_dict(self):
  class ContactInfo (line 60) | class ContactInfo(BaseModel):
    method to_dict (line 68) | def to_dict(self):
  class Certification (line 79) | class Certification(BaseModel):
    method to_dict (line 86) | def to_dict(self):
  class Experience (line 96) | class Experience(BaseModel):
    method to_dict (line 105) | def to_dict(self):
  class Staff (line 117) | class Staff(BaseModel):
    method get_top_skills (line 153) | def get_top_skills(self):
    method to_dict (line 163) | def to_dict(self):
    method estimate_age_based_on_education (line 269) | def estimate_age_based_on_education(self):

FILE: staffspy/utils/utils.py
  function set_csrf_token (line 32) | def set_csrf_token(session):
  function extract_base_domain (line 38) | def extract_base_domain(url: str):
  function create_emails (line 44) | def create_emails(first, last, domain):
  function get_webdriver (line 57) | def get_webdriver(driver_type: Optional[DriverType] = None):
  class Login (line 89) | class Login:
    method __init__ (line 91) | def __init__(
    method solve_captcha (line 107) | def solve_captcha(self, session, data, payload):
    method login_requests (line 192) | def login_requests(self):
    method login_browser (line 232) | def login_browser(self):
    method save_session (line 253) | def save_session(self, session, session_file: str):
    method load_session (line 258) | def load_session(self):
    method check_logged_in (line 292) | def check_logged_in(self, session):
  function parse_date (line 308) | def parse_date(date_str):
  function parse_duration (line 318) | def parse_duration(duration):
  function set_logger_level (line 332) | def set_logger_level(verbose: int = 0):
  function parse_dates (line 349) | def parse_dates(date_str):
  function extract_emails_from_text (line 372) | def extract_emails_from_text(text: str) -> list[str] | None:
  function parse_company_data (line 379) | def parse_company_data(json_data, search_term=None):
  function clean_df (line 439) | def clean_df(staff_df):
  function upload_to_clay (line 451) | def upload_to_clay(webhook_url: str, data: pd.DataFrame):

Condensed preview: 28 files, each showing path, character count, and a content snippet (the full structured content is 117K chars).

[
  {
    "path": ".github/workflows/publish-to-pypi.yml",
    "chars": 726,
    "preview": "name: Publish Python 🐍 distributions 📦 to PyPI\non: push\n\njobs:\n  build-n-publish:\n    name: Build and publish Python 🐍 d"
  },
  {
    "path": ".gitignore",
    "chars": 108,
    "preview": "/venv/\n/.idea\n**/__pycache__/\n**/.pytest_cache/\n/.ipynb_checkpoints/\n**/output/\n**/.DS_Store\n*.pyc\n.env\ndist"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 147,
    "preview": "repos:\n- repo: https://github.com/psf/black\n  rev: 24.2.0\n  hooks:\n  - id: black\n    language_version: python\n    args: "
  },
  {
    "path": "LICENSE",
    "chars": 483,
    "preview": "            DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE\n                    Version 2, December 2004\n\n Copyright (C) 200"
  },
  {
    "path": "README.md",
    "chars": 10681,
    "preview": "<img width=\"640\" alt=\"3FAD4652-488F-4F6F-A744-4C2AA5855E92\" src=\"https://github.com/user-attachments/assets/73b701ff-2db"
  },
  {
    "path": "examples/daily_auto_connect.py",
    "chars": 1396,
    "preview": "\"\"\" Script to connect with 10 software engineers daily from random tech companies \"\"\"\n\nfrom staffspy import LinkedInAcco"
  },
  {
    "path": "examples/upload_staff_to_clay.py",
    "chars": 572,
    "preview": "\"\"\"\nUploads staff to the Clay platform to then further enrich the staff (e.g. waterfall strategy to find their verified "
  },
  {
    "path": "examples/x_corp_staff.py",
    "chars": 3281,
    "preview": "\"\"\"\nCASE STUDY: X CORP EMPLOYEES\nRESULT: We retrieved 1087 profiles. Not as good as expected but still a good result for"
  },
  {
    "path": "pyproject.toml",
    "chars": 688,
    "preview": "[tool.poetry]\nname = \"staffspy\"\nversion = \"0.2.25\"\ndescription = \"Staff scraper library for LinkedIn\"\nauthors = [\"Cullen"
  },
  {
    "path": "staffspy/__init__.py",
    "chars": 8410,
    "preview": "import json\nimport pandas as pd\n\nfrom staffspy.linkedin.comments import CommentFetcher\nfrom staffspy.linkedin.linkedin i"
  },
  {
    "path": "staffspy/linkedin/certifications.py",
    "chars": 2824,
    "preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Cert"
  },
  {
    "path": "staffspy/linkedin/comments.py",
    "chars": 3126,
    "preview": "import json\nimport re\nfrom datetime import datetime as dt\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom st"
  },
  {
    "path": "staffspy/linkedin/contact_info.py",
    "chars": 3041,
    "preview": "from calendar import month_name\nfrom datetime import datetime\nimport json\nimport requests\nimport logging\n\nimport pytz\n\nf"
  },
  {
    "path": "staffspy/linkedin/employee.py",
    "chars": 5314,
    "preview": "import json\nimport logging\nimport re\n\nimport staffspy.utils.utils as utils\nfrom staffspy.utils.exceptions import TooMany"
  },
  {
    "path": "staffspy/linkedin/employee_bio.py",
    "chars": 1302,
    "preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\n\nlogger = logging.getLogger(__name__)\n"
  },
  {
    "path": "staffspy/linkedin/experiences.py",
    "chars": 5705,
    "preview": "import json\nimport logging\n\nimport staffspy.utils.utils as utils\nfrom staffspy.utils.exceptions import TooManyRequests\nf"
  },
  {
    "path": "staffspy/linkedin/languages.py",
    "chars": 1717,
    "preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Skil"
  },
  {
    "path": "staffspy/linkedin/linkedin.py",
    "chars": 23432,
    "preview": "\"\"\"\nstaffspy.linkedin.linkedin\n~~~~~~~~~~~~~~~~~~~\n\nThis module contains routines to scrape LinkedIn.\n\"\"\"\n\nimport json\ni"
  },
  {
    "path": "staffspy/linkedin/schools.py",
    "chars": 2220,
    "preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Scho"
  },
  {
    "path": "staffspy/linkedin/skills.py",
    "chars": 2884,
    "preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Skil"
  },
  {
    "path": "staffspy/solvers/capsolver.py",
    "chars": 2020,
    "preview": "import json\nimport time\n\nimport requests\nfrom tenacity import retry, stop_after_attempt, retry_if_result\n\nfrom staffspy."
  },
  {
    "path": "staffspy/solvers/solver.py",
    "chars": 337,
    "preview": "from abc import ABC,abstractmethod\n\n\nclass Solver(ABC):\n    public_key = \"3117BF26-4762-4F5A-8ED9-A85E69209A46\"\n    page"
  },
  {
    "path": "staffspy/solvers/solver_type.py",
    "chars": 106,
    "preview": "from enum import Enum\n\nclass SolverType(Enum):\n    CAPSOLVER = 'capsolver'\n    TWO_CAPTCHA = 'twocaptcha'\n"
  },
  {
    "path": "staffspy/solvers/two_captcha.py",
    "chars": 1081,
    "preview": "from tenacity import retry_if_exception_type, stop_after_attempt, retry\nfrom twocaptcha import TwoCaptcha, TimeoutExcept"
  },
  {
    "path": "staffspy/utils/driver_type.py",
    "chars": 335,
    "preview": "from enum import Enum\nfrom typing import Optional\n\n\nclass BrowserType(Enum):\n    CHROME = \"chrome\"\n    FIREFOX = \"firefo"
  },
  {
    "path": "staffspy/utils/exceptions.py",
    "chars": 307,
    "preview": "class TooManyRequests(Exception):\n    \"\"\"Too many requests.\"\"\"\n\n\nclass BadCookies(Exception):\n    \"\"\"Login expiration.\"\""
  },
  {
    "path": "staffspy/utils/models.py",
    "chars": 9856,
    "preview": "from datetime import datetime, date\n\nfrom pydantic import BaseModel\nfrom datetime import datetime as dt\n\nfrom staffspy.u"
  },
  {
    "path": "staffspy/utils/utils.py",
    "chars": 17440,
    "preview": "import logging\nimport os\nimport pickle\nimport re\nfrom datetime import datetime\n\nimport pandas as pd\nfrom typing import O"
  }
]
