Repository: cullenwatson/StaffSpy
Branch: main
Commit: 0a8a8d73a5db
Files: 28
Total size: 107.0 KB
Directory structure:
gitextract_26g2vb8c/
├── .github/
│   └── workflows/
│       └── publish-to-pypi.yml
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── examples/
│   ├── daily_auto_connect.py
│   ├── upload_staff_to_clay.py
│   └── x_corp_staff.py
├── pyproject.toml
└── staffspy/
    ├── __init__.py
    ├── linkedin/
    │   ├── certifications.py
    │   ├── comments.py
    │   ├── contact_info.py
    │   ├── employee.py
    │   ├── employee_bio.py
    │   ├── experiences.py
    │   ├── languages.py
    │   ├── linkedin.py
    │   ├── schools.py
    │   └── skills.py
    ├── solvers/
    │   ├── capsolver.py
    │   ├── solver.py
    │   ├── solver_type.py
    │   └── two_captcha.py
    └── utils/
        ├── driver_type.py
        ├── exceptions.py
        ├── models.py
        └── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/publish-to-pypi.yml
================================================
name: Publish Python 🐍 distributions 📦 to PyPI
on: push
jobs:
  build-n-publish:
    name: Build and publish Python 🐍 distributions 📦 to PyPI
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install poetry
        run: >-
          python3 -m
          pip install
          poetry
          --user
      - name: Build distribution 📦
        run: >-
          python3 -m
          poetry
          build
      - name: Publish distribution 📦 to PyPI
        if: startsWith(github.ref, 'refs/tags')
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}
================================================
FILE: .gitignore
================================================
/venv/
/.idea
**/__pycache__/
**/.pytest_cache/
/.ipynb_checkpoints/
**/output/
**/.DS_Store
*.pyc
.env
dist
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
      - id: black
        language_version: python
        args: [--line-length=88, --quiet]
================================================
FILE: LICENSE
================================================
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
================================================
FILE: README.md
================================================
<img width="640" alt="3FAD4652-488F-4F6F-A744-4C2AA5855E92" src="https://github.com/user-attachments/assets/73b701ff-2db8-4d72-9ad3-42b7e1db537f">
**StaffSpy** is a staff fetcher library for LinkedIn.
## Features
- Fetches staff from a company on **LinkedIn**
- Obtains skills, experiences, certifications & more
- Fetches individual users / comments on posts
- Export all your connections with their contact info
- Aggregates the employees in a Pandas DataFrame
### Installation
```
pip install -U "staffspy[browser]"
```
Or install the latest code directly from this repo:
```
pip install "git+https://github.com/cullenwatson/StaffSpy.git#egg=staffspy[browser]"
```
_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
### Usage
```python
from staffspy import LinkedInAccount, SolverType, DriverType, BrowserType

account = LinkedInAccount(
    # driver_type=DriverType( # if issues with webdriver, specify its exact location, download link in the FAQ
    #     browser_type=BrowserType.CHROME,
    #     executable_path="/Users/pc/chromedriver-mac-arm64/chromedriver"
    # ),
    session_file="session.pkl", # save login cookies to only log in once (lasts a week or so)
    log_level=1, # 0 for no logs
)

# search by company
staff = account.scrape_staff(
    company_name="openai",
    search_term="software engineer",
    location="london",
    extra_profile_data=True, # fetch all past experiences, schools, & skills
    max_results=50, # can go up to 1000
    # block=True, # if you want to block the user after scraping, to exclude from future search results
    # connect=True, # if you want to connect with the users until you hit your limit
)

# or fetch by user ids
users = account.scrape_users(
    user_ids=['williamhgates', 'rbranson', 'jeffweiner08'],
    # connect=True,
    # block=True,
)

# fetch all comments on two of Bill Gates' posts
comments = account.scrape_comments(
    ['7252421958540091394','7253083989547048961']
)

# fetch company details
companies = account.scrape_companies(
    company_names=['openai', 'microsoft']
)

# fetch connections (also gets their contact info if available)
connections = account.scrape_connections(
    extra_profile_data=True,
    max_results=50
)

# export any of the results to csv
staff.to_csv("staff.csv", index=False)
```
#### Browser login
If you would rather log in through a browser, install StaffSpy with the browser extra:
`pip install staffspy[browser]`
If you do not pass the `username` & `password` params, a browser opens on the first run so you can sign in to LinkedIn. Press enter after signing in to begin scraping.
### Output
| profile_id | name | first_name | last_name | location | age | position | followers | connections | company | past_company1 | past_company2 | school1 | school2 | skill1 | skill2 | skill3 | is_connection | premium | creator | potential_email | profile_link | profile_photo |
| ---------------- | -------------- | ---------- | --------- | ------------------------------- | --- | ------------------------------- | --------- | ----------- | ------- | ------------- | ------------- | ---------------------------------- | ------------------------- | -------- | ---------- | ---------- | ------------- | ------- | ------- | ------------------------------------------------ | -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| javiersierra2102 | Javier Sierra | Javier | Sierra | London, England, United Kingdom | 39 | Software Engineer | 735 | 725 | OpenAI | Meta | Oculus VR | Hult International Business School | Universidad Simón Bolívar | Java | JavaScript | C++ | FALSE | FALSE | FALSE | javier.sierra@openai.com, jsierra@openai.com | https://www.linkedin.com/in/javiersierra2102 | https://media.licdn.com/dms/image/C4D03AQHEyUg1kGT08Q/profile-displayphoto-shrink_800_800/0/1516504680512?e=1727913600&v=beta&t=3enCmNDBtJ7LxfbW6j1hDD8qNtHjO2jb2XTONECxUXw |
| dougli | Douglas Li | Douglas | Li | London, England, United Kingdom | 37 | @ OpenAI UK, previously at Meta | 583 | 401 | OpenAI | Shift Lab | Facebook | Washington University in St. Louis | | Java | Python | JavaScript | FALSE | TRUE | FALSE | douglas.li@openai.com, dli@openai.com | https://www.linkedin.com/in/dougli | https://media.licdn.com/dms/image/D4E03AQETmRyb3_GB8A/profile-displayphoto-shrink_800_800/0/1687996628597?e=1727913600&v=beta&t=HRYGJ4RxsTMcPF1YcSikXlbz99hx353csho3PWT6fOQ |
| nkartashov | Nick Kartashov | Nick | Kartashov | London, England, United Kingdom | 33 | Software Engineer | 2186 | 2182 | OpenAI | Google | DeepMind | St. Petersburg Academic University | Bioinformatics Institute | Teamwork | Java | Haskell | FALSE | FALSE | FALSE | nick.kartashov@openai.com, nkartashov@openai.com | https://www.linkedin.com/in/nkartashov | https://media.licdn.com/dms/image/D4E03AQEjOKxC5UgwWw/profile-displayphoto-shrink_800_800/0/1680706122689?e=1727913600&v=beta&t=m-JnG9nm0zxp1Z7njnInwbCoXyqa3AN-vJZntLfbzQ4 |
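
The `potential_email` values in the sample rows look like the person's name combined with the company domain. The sketch below reproduces only the two formats visible in those rows; the helper name `guess_emails` is hypothetical and this is not necessarily how StaffSpy builds the column.

```python
def guess_emails(first_name: str, last_name: str, domain: str) -> list[str]:
    """Build candidate addresses in the two formats seen in the sample output."""
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    if not first or not last:
        # Hidden or incomplete profiles give nothing to build from.
        return []
    return [
        f"{first}.{last}@{domain}",    # first.last@domain
        f"{first[0]}{last}@{domain}",  # flast@domain
    ]


print(", ".join(guess_emails("Javier", "Sierra", "openai.com")))
```

These are guesses, not verified addresses, which is presumably why the column is labelled "potential".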
### Parameters for `LinkedInAccount()`
```plaintext
Optional
├── session_file (str):
| file path to save session cookies, so only one manual login is needed.
| can use multiple profiles this way
|
| For automated login
├── username (str):
| linkedin account email
│
├── password (str):
| linkedin account password
|
├── driver_type (DriverType):
| signs in with the given BrowserType (Chrome, Firefox) and executable_path
|
├── solver_service (SolverType):
| solves the captcha using the desired service - either CapSolver, or 2Captcha (worse of the two)
|
├── solver_api_key (str):
| api key for the solver provider
│
├── log_level (int):
| Controls the verbosity of the runtime printouts
| (0 prints only errors, 1 is info, 2 is all logs. Default is 0.)
```
### Parameters for `scrape_staff()`
```plaintext
Optional
├── company_name (str):
| company identifier on linkedin, will search for that company if that company id does not exist
| e.g. openai from https://www.linkedin.com/company/openai
|
├── search_term (str):
| staff title to search for
| e.g. software engineer
|
├── location (str):
| location the staff resides
| e.g. london
│
├── extra_profile_data (bool)
| fetches educations, experiences, skills, certifications (Default false)
│
├── max_results (int):
| number of staff to fetch, default/max is 1000 for a search imposed by LinkedIn
|
├── block (bool):
| whether to block the user after scraping
|
├── connect (bool):
| whether to connect with the user after scraping
```
### Parameters for `scrape_users()`
```plaintext
├── user_ids (list):
| user ids to scrape from
| e.g. dougmcmillon from https://www.linkedin.com/in/dougmcmillon
|
├── block (bool):
| whether to block the user after scraping
|
├── connect (bool):
| whether to connect with the user after scraping
```
### Parameters for `scrape_comments()`
```plaintext
├── post_ids (list):
| post ids to scrape from
| e.g. 7252381444906364929 from https://www.linkedin.com/posts/williamhgates_technology-transformtheeveryday-activity-7252381444906364929-Bkls
```
### Parameters for `scrape_companies()`
```plaintext
├── company_names (list):
| list of company names to scrape details from
| e.g. ['openai', 'microsoft', 'google']
```
### Parameters for `scrape_connections()`
```plaintext
├── max_results (int):
| maximum number of connections to fetch
|
├── extra_profile_data (bool):
| fetches educations, experiences, skills, certifications & contact info for each connection (Default false)
```
### LinkedIn notes
- only 1000 max results per search
- `extra_profile_data` makes additional requests for every profile, so runtime grows linearly with the number of results
- if rate limited, the program will stop scraping
- if using non-browser sign in, turn off 2fa
---
## Frequently Asked Questions
---
**Q: Can I get my account banned?**
**A:** It is a possibility, although there are no recorded incidents. Let me know if you are the first. To protect you, the account object refuses to keep scraping once LinkedIn responds with a 429 (Too Many Requests).
---
**Q: Scraped 999 staff members, with 869 hidden LinkedIn Members?**
**A:** It means LinkedIn rates your account poorly. It is unclear exactly how accounts are classified, but an unverified email, a new account, and few connections all appear to factor in.
---
**Q: How to get around the 1000-result search limit?**
**A:** Check the examples folder. We can block the user after searching and try many different locations and search terms to maximize results.
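
As a rough sketch of that strategy, the list of searches can be built up front: one unfiltered search, then each search term alone, then each location alone. The helper `build_search_plan` is hypothetical; the keys match the `search_term` and `location` parameters of `scrape_staff()`.

```python
def build_search_plan(terms: list[str], locations: list[str]) -> list[dict]:
    """Return one dict of scrape_staff() keyword arguments per search to run."""
    plan = [{"search_term": None, "location": None}]  # unfiltered search first
    plan += [{"search_term": term, "location": None} for term in terms]
    plan += [{"search_term": None, "location": loc} for loc in locations]
    return plan


plan = build_search_plan(["engineer", "sales"], ["London", "Tokyo"])
for query in plan:
    print(query)
```

Each entry could then be run as `account.scrape_staff(company_name=..., block=True, **query)`, so that profiles already seen are excluded from later searches.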
---
**Q: Exception: driver not found for selenium?**
**A:** You need ChromeDriver installed (the driver, not the Chrome browser itself): https://googlechromelabs.github.io/chrome-for-testing/#stable
---
**Q: Encountering issues with your queries?**
**A:** If problems persist, [submit an issue](https://github.com/cullenwatson/StaffSpy/issues).
### Staff Schema
```plaintext
Staff
├── Personal Information
│ ├── search_term
│ ├── id
│ ├── name
│ ├── first_name
│ ├── last_name
│ ├── location
│ └── bio
│
├── Professional Details
│ ├── position
│ ├── profile_id
│ ├── profile_link
│ ├── potential_emails
│ └── estimated_age
│
├── Social Connectivity
│ ├── followers
│ ├── connections
│ └── mutuals_count
│
├── Status
│ ├── influencer
│ ├── creator
│ ├── premium
│ ├── open_to_work
│ ├── is_hiring
│ └── is_connection
│
├── Visuals
│ ├── profile_photo
│ └── banner_photo
│
├── Skills
│ ├── name
│ └── endorsements
│
├── Experiences
│ ├── from_date
│ ├── to_date
│ ├── duration
│ ├── title
│ ├── company
│ ├── location
│ └── emp_type
│
├── Certifications
│ ├── title
│ ├── issuer
│ ├── date_issued
│ ├── cert_id
│ └── cert_link
│
├── Educational Background
│ ├── years
│ ├── school
│ └── degree
│
└── Connection Info (only when a connection and enabled on their profile)
    ├── email_address
    ├── address
    ├── birthday
    ├── websites
    ├── phone_numbers
    └── created_at
```
================================================
FILE: examples/daily_auto_connect.py
================================================
""" Script to connect with 10 software engineers daily from random tech companies """
from staffspy import LinkedInAccount, DriverType, BrowserType
import random
import time
from datetime import datetime
import schedule
# List of tech companies to randomly choose from
TECH_COMPANIES = [
"microsoft",
"google",
"apple",
"meta",
"amazon",
"netflix",
"salesforce",
"adobe",
"intel",
"nvidia",
"oracle",
"ibm",
"vmware",
"twitter",
"linkedin",
"airbnb",
"uber",
"stripe",
"snowflake",
"databricks",
]
def connect_with_staff():
    print(f"Starting connection run at {datetime.now()}")
    # Initialize LinkedIn account
    account = LinkedInAccount(session_file="session.pkl", log_level=1)
    # Choose a random company
    company = random.choice(TECH_COMPANIES)
    print(f"Selected company: {company}")
    # Connect with 10 users
    account.scrape_staff(
        company_name=company,
        search_term="software engineer",
        max_results=10,
        extra_profile_data=True,
        connect=True,
    )


if __name__ == "__main__":
    # Schedule to run once a day at 10 AM
    schedule.every().day.at("10:00").do(connect_with_staff)
    # Run immediately on script start
    connect_with_staff()
    # Keep the script running
    while True:
        schedule.run_pending()
        time.sleep(60)
================================================
FILE: examples/upload_staff_to_clay.py
================================================
"""
Uploads staff to the Clay platform to then further enrich the staff (e.g. waterfall strategy to find their verified emails)
"""
from staffspy import LinkedInAccount
from staffspy.utils.utils import upload_to_clay
session_file = "session.pkl"
account = LinkedInAccount(session_file=session_file, log_level=2)
connections = account.scrape_connections(extra_profile_data=True, max_results=3)
clay_webhook_url = (
"https://api.clay.com/v3/sources/webhook/pull-in-data-from-a-webhook-XXXXXXXXXXXXXX"
)
upload_to_clay(webhook_url=clay_webhook_url, data=connections)
================================================
FILE: examples/x_corp_staff.py
================================================
"""
CASE STUDY: X CORP EMPLOYEES
RESULT: We retrieved 1087 profiles. Not as good as expected, but still a good result for a company that has 2800 employees.
final csv - https://drive.google.com/file/d/1aC-GF4RXf9wzGrpxQyGPBxlnLo2X5vm4
Strategies to get around LinkedIn 1000 result limit:
1) It blocks the user after searching to prevent it from appearing in future searches.
2) It tries various searches with department and location to get more results.
Lastly, it saves the results in CSV files and then combines them into one DataFrame at the end to view the results.
"""
import os
from datetime import datetime
import pandas as pd
import glob
from staffspy import LinkedInAccount
session_file = "session.pkl"
account = LinkedInAccount(session_file=session_file, log_level=2)
departments = [
# Leadership
"CEO",
"CFO",
"CTO",
"COO",
"executive",
"director",
"vice president",
"head",
"lead",
# Engineering/Tech
"software",
"developer",
"engineer",
"architect",
"devops",
"QA",
"data",
"IT",
"security",
# Business/Operations
"sales",
"account",
"business development",
"operations",
"project manager",
"product manager",
# Support Functions
"HR",
"recruiter",
"marketing",
"finance",
"legal",
"accounting",
"admin",
"support",
# Customer-Facing
"customer success",
"account manager",
"sales representative",
"customer support",
# Specialists
"analyst",
"consultant",
"coordinator",
"specialist",
]
locations = [
"San Francisco",
"New York",
"Los Angeles",
"Seattle",
"Miami",
"Boston",
"Austin",
"Chicago",
"Toronto",
"London",
"Singapore",
"Tokyo",
"Dublin",
]
def save_results(users: pd.DataFrame):
    output_dir = f"output/{company_name}"
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_path = f"{output_dir}/users_{timestamp}.csv"
    users.to_csv(output_path, index=False)


def scrape_and_save(term=None, location=None):
    users = account.scrape_staff(
        company_name=company_name,
        search_term=term,
        location=location,
        extra_profile_data=True,
        max_results=1000,
        block=True,
    )
    # scrape_staff returns None when the account is on cooldown
    if users is not None and not users.empty:
        save_results(users)


company_name = "x-corp"
# generic search
for _ in range(5):
    scrape_and_save()
# Search by departments
for department in departments:
    scrape_and_save(term=department)
# Search by locations
for location in locations:
    scrape_and_save(location=location)
# load all csvs into one df
files = glob.glob("output/x-corp/*.csv")
dfs = [pd.read_csv(f) for f in files]
combined_df = pd.concat(dfs, ignore_index=True)
# Filter out hidden profiles
filtered_df = combined_df[combined_df["urn"] != "headless"]
filtered_df = filtered_df[filtered_df["current_company"] == "X"]
filtered_df = filtered_df.drop_duplicates(subset="id")
filtered_urns = len(set(filtered_df["urn"]))
print(f"Total unique profiles: {filtered_urns}")
company_name = "x-corp"
filtered_df.to_csv(
f"output/{company_name}/final_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv",
index=False,
)
================================================
FILE: pyproject.toml
================================================
[tool.poetry]
name = "staffspy"
version = "0.2.25"
description = "Staff scraper library for LinkedIn"
authors = ["Cullen Watson <cullen@cullenwatson.com>"]
readme = "README.md"
[tool.poetry.dependencies]
python = "^3.10"
pydantic = "^2.7.2"
pandas = "^2.2.2"
requests = "^2.32.3"
tldextract = "^5.1.2"
selenium = { version = "^4.3.0", optional = true }
tenacity = "^8.5.0"
python-dateutil = "^2.9.0.post0"
beautifulsoup4 = "^4.12.3"
2captcha-python = "^1.2.8"
[tool.poetry.extras]
browser = ["selenium"]
[tool.poetry.group.dev.dependencies]
pre-commit = "^3.7.1"
black = "^24.4.2"
jupyter = "^1.0.0"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
================================================
FILE: staffspy/__init__.py
================================================
import json
import pandas as pd
from staffspy.linkedin.comments import CommentFetcher
from staffspy.linkedin.linkedin import LinkedInScraper
from staffspy.utils.models import Staff
from staffspy.solvers.capsolver import CapSolver
from staffspy.solvers.solver_type import SolverType
from staffspy.solvers.two_captcha import TwoCaptchaSolver
from staffspy.utils.utils import (
set_logger_level,
logger,
Login,
parse_company_data,
extract_emails_from_text,
clean_df,
)
from staffspy.utils.driver_type import DriverType, BrowserType
__all__ = [
"LinkedInAccount",
"SolverType",
"DriverType",
"BrowserType",
]
class LinkedInAccount:
    """LinkedinAccount storing cookie data and providing outer facing methods for client"""

    solver_map = {
        SolverType.CAPSOLVER: CapSolver,
        SolverType.TWO_CAPTCHA: TwoCaptchaSolver,
    }

    def __init__(
        self,
        session_file: str = None,
        username: str = None,
        password: str = None,
        log_level: int = 0,
        solver_api_key: str = None,
        solver_service: SolverType = SolverType.CAPSOLVER,
        driver_type: DriverType = None,
    ):
        self.session_file = session_file
        self.username = username
        self.password = password
        self.log_level = log_level
        self.solver = self.solver_map[solver_service](solver_api_key)
        self.driver_type = driver_type
        self.session = None
        self.linkedin_scraper = None
        self.on_block = False
        self.login()

    def login(self):
        set_logger_level(self.log_level)
        login = Login(
            self.username,
            self.password,
            self.solver,
            self.session_file,
            self.driver_type,
        )
        self.session = login.load_session()

    def scrape_staff(
        self,
        company_name: str = None,
        search_term: str = None,
        location: str = None,
        extra_profile_data: bool = False,
        max_results: int = 1000,
        block: bool = False,
        connect: bool = False,
    ):
        """Main function entry point to scrape LinkedIn staff"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        li_scraper = LinkedInScraper(self.session)
        staff = li_scraper.scrape_staff(
            company_name=company_name,
            extra_profile_data=extra_profile_data,
            search_term=search_term,
            location=location,
            max_results=max_results,
            block=block,
            connect=connect,
        )
        if li_scraper.on_block:
            self.on_block = True
        # distinct loop variable so the `staff` list is not shadowed
        staff_dicts = [member.to_dict() for member in staff]
        staff_df = pd.DataFrame(staff_dicts)
        if staff_df.empty:
            return staff_df
        staff_df = clean_df(staff_df)
        # list visible profiles first, hidden "LinkedIn Member" rows last
        linkedin_member_df = staff_df[staff_df["name"] == "LinkedIn Member"]
        non_linkedin_member_df = staff_df[staff_df["name"] != "LinkedIn Member"]
        staff_df = pd.concat([non_linkedin_member_df, linkedin_member_df])
        logger.info(
            f"3) Staff from {company_name}: {len(staff_df)} total, {len(linkedin_member_df)} hidden, {len(staff_df) - len(linkedin_member_df)} visible"
        )
        return staff_df.reset_index(drop=True)

    def scrape_users(
        self, user_ids: list[str], block: bool = False, connect: bool = False
    ) -> pd.DataFrame | None:
        """Scrape users from Linkedin by user IDs"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        li_scraper = LinkedInScraper(self.session)
        li_scraper.num_staff = len(user_ids)
        users = [
            Staff(
                id="",
                search_term="manual",
                profile_id=user_id,
                profile_link=f"https://www.linkedin.com/in/{user_id}",
            )
            for user_id in user_ids
        ]
        for i, user in enumerate(users, start=1):
            user.id, user.urn = li_scraper.fetch_user_profile_data_from_public_id(
                user.profile_id, "user_id"
            )
            if user.id:
                li_scraper.fetch_all_info_for_employee(user, i)
                if block:
                    li_scraper.block_user(user)
                elif connect:
                    li_scraper.connect_user(user)
        users_dicts = [user.to_dict() for user in users if user.id]
        users_df = pd.DataFrame(users_dicts)
        if users_df.empty:
            return users_df
        linkedin_member_df = users_df[users_df["name"] == "LinkedIn Member"]
        non_linkedin_member_df = users_df[users_df["name"] != "LinkedIn Member"]
        users_df = pd.concat([non_linkedin_member_df, linkedin_member_df])
        logger.info(f"Scraped {len(users_df)} users")
        return users_df

    def scrape_comments(self, post_ids: list[str]) -> pd.DataFrame:
        """Scrape comments from Linkedin by post IDs"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        comment_fetcher = CommentFetcher(self.session)
        all_comments = []
        for post_id in post_ids:
            comments = comment_fetcher.fetch_comments(post_id)
            all_comments.extend(comments)
        comment_dict = [comment.to_dict() for comment in all_comments]
        comment_df = pd.DataFrame(comment_dict)
        if not comment_df.empty:
            comment_df["emails"] = comment_df["text"].apply(extract_emails_from_text)
            comment_df = comment_df.sort_values(by="created_at", ascending=False)
        return comment_df

    def scrape_companies(
        self,
        company_names: list[str] = None,
    ) -> pd.DataFrame:
        """Scrape company details from Linkedin"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        if not company_names:
            raise ValueError("company_names list cannot be empty")
        li_scraper = LinkedInScraper(self.session)
        company_dfs = []
        for company_name in company_names:
            try:
                company_res = li_scraper.fetch_or_search_company(company_name)
                try:
                    company_data = company_res.json()
                except json.decoder.JSONDecodeError:
                    logger.error(f"Failed to fetch company data for {company_name}")
                    continue
                company_df = parse_company_data(company_data, search_term=company_name)
                company_dfs.append(company_df)
            except Exception as e:
                logger.error(f"Failed to process company {company_name}: {str(e)}")
                continue
        if not company_dfs:
            return pd.DataFrame()
        return pd.concat(company_dfs, ignore_index=True)

    def scrape_connections(
        self,
        max_results: int = 10**8,
        extra_profile_data: bool = False,
    ) -> pd.DataFrame:
        """Scrape connections from Linkedin"""
        if self.on_block:
            return logger.error(
                "Account is on cooldown as a safety precaution after receiving a 429 (TooManyRequests) from LinkedIn. Please recreate a new LinkedInAccount to proceed."
            )
        li_scraper = LinkedInScraper(self.session)
        connections = li_scraper.scrape_connections(
            max_results=max_results,
            extra_profile_data=extra_profile_data,
        )
        connections_df = pd.DataFrame()
        if connections:
            staff_dicts = [staff.to_dict() for staff in connections]
            connections_df = pd.DataFrame(staff_dicts)
            connections_df = clean_df(connections_df)
        return connections_df
================================================
FILE: staffspy/linkedin/certifications.py
================================================
import json
import logging
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Certification
logger = logging.getLogger(__name__)
class CertificationFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:certifications,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"

    def fetch_certifications(self, staff):
        ep = self.endpoint.format(employee_id=staff.id)
        res = self.session.get(ep)
        logger.debug(f"certs, status code - {res.status_code}")
        if res.status_code == 429:
            raise TooManyRequests("429 Too Many Requests")
        if not res.ok:
            logger.debug(res.text[:200])
            return False
        try:
            res_json = res.json()
        except json.decoder.JSONDecodeError:
            logger.debug(res.text[:200])
            return False
        try:
            elems = res_json["data"]["identityDashProfileComponentsBySectionType"][
                "elements"
            ]
        except (KeyError, IndexError, TypeError):
            logger.debug(res_json)
            return False
        if elems:
            cert_elems = elems[0]["components"]["pagedListComponent"]["components"][
                "elements"
            ]
            staff.certifications = self.parse_certifications(cert_elems)
        return True

    def parse_certifications(self, sections):
        certs = []
        for section in sections:
            elem = section["components"]["entityComponent"]
            if not elem:
                break
            title = elem["titleV2"]["text"]["text"]
            issuer = elem["subtitle"]["text"] if elem["subtitle"] else None
            date_issued = (
                elem["caption"]["text"].replace("Issued ", "")
                if elem["caption"]
                else None
            )
            cert_id = (
                elem["metadata"]["text"].replace("Credential ID ", "")
                if elem["metadata"]
                else None
            )
            try:
                subcomp = elem["subComponents"]["components"][0]
                cert_link = subcomp["components"]["actionComponent"]["action"][
                    "navigationAction"
                ]["actionTarget"]
            except (KeyError, IndexError, TypeError):
                # the credential link is optional; a bare except would also hide real bugs
                cert_link = None
            cert = Certification(
                title=title,
                issuer=issuer,
                date_issued=date_issued,
                cert_link=cert_link,
                cert_id=cert_id,
            )
            certs.append(cert)
        return certs
================================================
FILE: staffspy/linkedin/comments.py
================================================
import json
import re
from datetime import datetime as dt
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Comment
from staffspy.utils.utils import logger
class CommentFetcher:
    def __init__(self, session):
        self.session = session
        self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSocialDashComments.8cb29aedde780600a7ad17fc7ebb8277&queryName=SocialDashCommentsBySocialDetail&variables=(origins:List(),count:100,socialDetailUrn:urn%3Ali%3Afsd_socialDetail%3A%28urn%3Ali%3Aactivity%3A{post_id}%2Curn%3Ali%3Aactivity%3A7254884361622208512%2Curn%3Ali%3AhighlightedReply%3A-%29,sortOrder:REVERSE_CHRONOLOGICAL,start:{start})"
        self.post_id = None
        self.num_comments = 100  # page size, matches count:100 in the endpoint

    def fetch_comments(self, post_id: str):
        all_comments = []
        self.post_id = post_id
        for i in range(0, 200_000, self.num_comments):
            logger.info(f"Fetching comments for post {post_id}, start {i}")
            ep = self.endpoint.format(post_id=post_id, start=i)
            res = self.session.get(ep)
            logger.debug(f"comments info, status code - {res.status_code}")
            if res.status_code == 429:
                raise TooManyRequests("429 Too Many Requests")
            if not res.ok:
                logger.debug(res.text[:200])
                # return what was collected so far; the caller extends a list with it
                return all_comments
            try:
                comments_json = res.json()
            except json.decoder.JSONDecodeError:
                logger.debug(res.text[:200])
                return all_comments
            comments, num_results = self.parse_comments(comments_json)
            all_comments.extend(comments)
            if not num_results:
                break
        return all_comments

    def parse_comments(self, comments_json: dict):
        """Parse the comment data from the comments response."""
        comments = []
        for element in (
            results := comments_json.get("data", {})
            .get("socialDashCommentsBySocialDetail", {})
            .get("elements", [])
        ):
            internal_profile_id = (commenter := element["commenter"])[
                "commenterProfileId"
            ]
            name = commenter["title"]["text"]
            linkedin_id_match = re.search("/in/(.+)", commenter["navigationUrl"])
            linkedin_id = linkedin_id_match.group(1) if linkedin_id_match else None
            commentary = element.get("commentary", {}).get("text", "")
            comment_id = element["urn"].split(",")[-1].rstrip(")")
            num_likes = element["socialDetail"]["totalSocialActivityCounts"]["numLikes"]
            comment = Comment(
                post_id=self.post_id,
                comment_id=comment_id,
                internal_profile_id=internal_profile_id,
                public_profile_id=linkedin_id,
                name=name,
                text=commentary,
                num_likes=num_likes,
                created_at=dt.utcfromtimestamp(element["createdAt"] / 1000),
            )
            comments.append(comment)
        return comments, len(results)
================================================
FILE: staffspy/linkedin/contact_info.py
================================================
from calendar import month_name
from datetime import datetime
import json
import requests
import logging
import pytz
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import ContactInfo, Staff
logger = logging.getLogger(__name__)
class ContactInfoFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfiles.13618f886ce95bf503079f49245fbd6f&queryName=ProfilesByMemberIdentity&variables=(memberIdentity:{employee_id},count:1)"
def fetch_contact_info(self, base_staff):
ep = self.endpoint.format(employee_id=base_staff.id)
try:
res = self.session.get(ep)
except requests.exceptions.TooManyRedirects as e:
logger.error("Too many redirects encountered: %s", e)
return None
logger.debug(f"bio info, status code - {res.status_code}")
if res.status_code == 429:
return TooManyRequests("429 Too Many Requests")
if not res.ok:
logger.debug(res.text)
return False
try:
employee_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text)
return False
self.parse_emp_contact_info(base_staff, employee_json)
return True
def parse_emp_contact_info(self, emp: Staff, emp_dict: dict):
"""Parse the employee data from the employee profile."""
contact_info = ContactInfo()
emp_dict = emp_dict["data"]["identityDashProfilesByMemberIdentity"]["elements"][
0
]
try:
contact_info.email_address = emp_dict["emailAddress"]["emailAddress"]
except (KeyError, IndexError, TypeError):
pass
try:
contact_info.address = emp_dict["address"]
except (KeyError, IndexError, TypeError):
pass
try:
month = month_name[emp_dict["birthDateOn"]["month"]]
day = emp_dict["birthDateOn"]["day"]
contact_info.birthday = f"{month} {day}"
except (KeyError, IndexError, TypeError):
pass
try:
contact_info.websites = [x["url"] for x in emp_dict["websites"]]
except (KeyError, IndexError, TypeError):
pass
try:
contact_info.phone_numbers = [
x["phoneNumber"]["number"] for x in emp_dict["phoneNumbers"]
]
except (KeyError, IndexError, TypeError):
pass
try:
created_at = emp_dict["memberRelationship"][
"memberRelationshipDataResolutionResult"
]["connection"]["createdAt"]
timezone = pytz.timezone("UTC")
dt = datetime.fromtimestamp(created_at / 1000, tz=timezone)
contact_info.created_at = dt.strftime("%Y-%m-%d %H:%M:%S %Z")
except (KeyError, IndexError, TypeError):
pass
emp.contact_info = contact_info
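
The date handling in `parse_emp_contact_info` can be sketched in isolation. This is a minimal illustration only: it uses the standard library's `timezone.utc` in place of `pytz`, and the helper names `format_created_at` and `format_birthday` are hypothetical, not part of StaffSpy.

```python
from calendar import month_name
from datetime import datetime, timezone


def format_created_at(created_at_ms: int) -> str:
    """Convert an epoch timestamp in milliseconds to a UTC string."""
    # The API value is in milliseconds, so divide by 1000 to get seconds.
    dt = datetime.fromtimestamp(created_at_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S %Z")


def format_birthday(birth_date_on: dict) -> str:
    """Render a {'month': int, 'day': int} mapping as '<Month> <day>'."""
    # month_name is 1-indexed; index 0 is an empty string.
    return f"{month_name[birth_date_on['month']]} {birth_date_on['day']}"
```

A missing or malformed field would still raise `KeyError` or `TypeError` here, which is why the original wraps each lookup in its own `try` block.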
================================================
FILE: staffspy/linkedin/employee.py
================================================
import json
import logging
import re
import staffspy.utils.utils as utils
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Staff
logger = logging.getLogger(__name__)
class EmployeeFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/voyagerIdentityDashProfiles?count=1&decorationId=com.linkedin.voyager.dash.deco.identity.profile.TopCardComplete-138&memberIdentity={employee_id}&q=memberIdentity"
self.domain = None
def fetch_employee(self, base_staff, domain):
self.domain = domain
ep = self.endpoint.format(employee_id=base_staff.id)
res = self.session.get(ep)
logger.debug(f"basic info, status code - {res.status_code}")
if res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
logger.debug(res.text[:200])
return False
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text[:200])
return False
try:
employee_json = res_json["elements"][0]
except (KeyError, IndexError, TypeError):
logger.debug(res_json)
return False
self.parse_emp(base_staff, employee_json)
return True
def parse_emp(self, emp: Staff, emp_dict: dict):
"""Parse the employee data from the employee profile."""
def get_photo_url(emp_dict: dict, key: str):
try:
photo_data = emp_dict[key]["displayImageReference"]["vectorImage"]
photo_base_url = photo_data["rootUrl"]
photo_ext_url = photo_data["artifacts"][-1][
"fileIdentifyingUrlPathSegment"
]
return f"{photo_base_url}{photo_ext_url}"
except (KeyError, TypeError, IndexError, ValueError):
return None
emp.profile_photo = get_photo_url(emp_dict, "profilePicture")
emp.banner_photo = get_photo_url(emp_dict, "backgroundPicture")
emp.profile_id = emp_dict["publicIdentifier"]
try:
emp.headline = emp_dict.get("headline")
if not emp.headline:
emp.headline = emp_dict["memberRelationship"]["memberRelationshipData"][
"noInvitation"
]["targetInviteeResolutionResult"]["headline"]
except (KeyError, TypeError, AttributeError):
pass
union_type = next(
iter(emp_dict["memberRelationship"]["memberRelationshipUnion"])
)
emp.is_connection = "no"
if union_type == "connection":
emp.is_connection = "yes"
elif union_type == "noConnection":
invitation = (
emp_dict["memberRelationship"]["memberRelationshipUnion"][
"noConnection"
]
.get("invitationUnion", {})
.get("invitation", {})
)
if invitation and invitation.get("invitationState") == "PENDING":
emp.is_connection = "pending"
emp.open_to_work = emp_dict["profilePicture"].get("frameType") == "OPEN_TO_WORK"
emp.is_hiring = emp_dict["profilePicture"].get("frameType") == "HIRING"
emp.first_name = emp_dict["firstName"]
emp.last_name = emp_dict["lastName"].split(",")[0]
if not emp.name:
name = filter(None, [emp.first_name, emp.last_name])
emp.name = " ".join(name)
emp.potential_emails = (
utils.create_emails(emp.first_name, emp.last_name, self.domain)
if self.domain
else None
)
emp.followers = emp_dict.get("followingState", {}).get("followerCount")
emp.connections = emp_dict["connections"]["paging"]["total"]
emp.location = (
emp_dict.get("geoLocation", {}).get("geo", {}).get("defaultLocalizedName")
)
# Handle empty elements case for company
top_positions = emp_dict.get("profileTopPosition", {}).get("elements", [])
if top_positions:
emp.company = top_positions[0].get("companyName", None)
else:
emp.company = None
edu_cards = emp_dict.get("profileTopEducation", {}).get("elements", [])
if edu_cards:
emp.school = edu_cards[0].get(
"schoolName", edu_cards[0].get("school", {}).get("name")
)
emp.influencer = emp_dict.get("influencer", False)
emp.creator = emp_dict.get("creator", False)
emp.premium = emp_dict.get("premium", False)
emp.mutual_connections = 0
try:
profile_insight = emp_dict.get("profileInsight", {}).get("elements", [])
if profile_insight:
mutual_connections_str = profile_insight[0]["text"]["text"]
match = re.search(r"\d+", mutual_connections_str)
if match:
emp.mutual_connections = int(match.group()) + 2
else:
emp.mutual_connections = (
2 if " and " in mutual_connections_str else 1
)
except (KeyError, TypeError, IndexError, ValueError):
pass
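
The mutual-connection heuristic at the end of `parse_emp` can be read as a small pure function. The sketch below restates that logic on its own; the function name `count_mutual_connections` and the sample strings in the usage note are assumptions for illustration, not text confirmed to come from LinkedIn.

```python
import re


def count_mutual_connections(insight_text: str) -> int:
    """Estimate the mutual-connection count from a profile insight string."""
    match = re.search(r"\d+", insight_text)
    if match:
        # Assumed shape: two named people plus "N other" connections.
        return int(match.group()) + 2
    # No number present: " and " suggests two named people, otherwise one.
    return 2 if " and " in insight_text else 1
```

For example, under these assumptions `count_mutual_connections("Ann Lee, Bob Ray, and 5 other mutual connections")` gives 7, while a string naming a single person gives 1.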
================================================
FILE: staffspy/linkedin/employee_bio.py
================================================
import json
import logging
from staffspy.utils.exceptions import TooManyRequests
logger = logging.getLogger(__name__)
class EmployeeBioFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileCards.9ad2590cb61a073ad514922fa752f566&queryName=ProfileTabInitialCards&variables=(count:50,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id})"
def fetch_employee_bio(self, base_staff):
ep = self.endpoint.format(employee_id=base_staff.id)
res = self.session.get(ep)
logger.debug(f"bio info, status code - {res.status_code}")
if res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
logger.debug(res.text)
return False
try:
data = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text)
return False
try:
base_staff.bio = data["data"]["identityDashProfileCardsByInitialCards"][
"elements"
][3]["topComponents"][1]["components"]["textComponent"]["text"]["text"]
except (KeyError, IndexError, TypeError):
return False
return True
================================================
FILE: staffspy/linkedin/experiences.py
================================================
import json
import logging
import staffspy.utils.utils as utils
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Experience
logger = logging.getLogger(__name__)
class ExperiencesFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:experience,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"
def fetch_experiences(self, staff):
ep = self.endpoint.format(employee_id=staff.id)
res = self.session.get(ep)
logger.debug(f"exps, status code - {res.status_code}")
if res.reason == "INKApi Error":
raise Exception(
"Delete session file and log in again",
res.status_code,
res.text[:200],
res.reason,
)
elif res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
elif not res.ok:
logger.debug(res.text[:200])
return False
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text[:200])
return False
try:
experiences_json = res_json["data"][
"identityDashProfileComponentsBySectionType"
]["elements"][0]["components"]["pagedListComponent"]["components"][
"elements"
]
except (KeyError, IndexError, TypeError):
logger.debug(res_json)
return False
staff.experiences = self.parse_experiences(experiences_json)
return True
def parse_experiences(self, elements):
exps = []
for elem in elements:
try:
components = elem.get("components")
if components is None:
continue
entity = components.get("entityComponent")
if entity is None:
continue
sub_components = entity.get("subComponents")
if (
sub_components is None
or len(sub_components.get("components", [])) == 0
or sub_components["components"][0].get("components") is None
or sub_components["components"][0]["components"].get(
"pagedListComponent"
)
is None
):
emp_type = start_date = end_date = None
caption = entity.get("caption")
duration = caption.get("text") if caption else None
if duration:
start_date, end_date = utils.parse_dates(duration)
from_date, to_date = utils.parse_duration(duration)
if from_date:
duration_parts = duration.split(" · ")
if len(duration_parts) > 1:
duration = duration_parts[1]
subtitle = entity.get("subtitle")
company = subtitle.get("text") if subtitle else None
titleV2 = entity.get("titleV2")
title_text = titleV2.get("text") if titleV2 else None
title = title_text.get("text") if title_text else None
metadata = entity.get("metadata")
location = metadata.get("text") if metadata else None
if company:
parts = company.split(" · ")
if len(parts) > 1:
company = parts[0]
emp_type = parts[-1].lower()
exp = Experience(
duration=duration,
title=title,
company=company,
emp_type=emp_type,
start_date=start_date,
end_date=end_date,
location=location,
)
exps.append(exp)
else:
multi_exps = self.parse_multi_exp(entity)
exps += multi_exps
except Exception as e:
logger.exception(e)
return exps
def parse_multi_exp(self, entity):
exps = []
company = entity["titleV2"]["text"]["text"]
elements = entity["subComponents"]["components"][0]["components"][
"pagedListComponent"
]["components"]["elements"]
for elem in elements:
entity = elem["components"]["entityComponent"]
duration = entity["caption"]["text"]
title = entity["titleV2"]["text"]["text"]
emp_type = (
entity["subtitle"]["text"].lower() if entity["subtitle"] else None
)
location = entity["metadata"]["text"] if entity["metadata"] else None
start_date, end_date = utils.parse_dates(duration)
from_date, to_date = utils.parse_duration(duration)
if from_date:
duration = duration.split(" · ")[1]
exp = Experience(
duration=duration,
title=title,
company=company,
emp_type=emp_type,
start_date=start_date,
end_date=end_date,
location=location,
)
exps.append(exp)
return exps
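
The subtitle handling in `parse_experiences` splits a combined "company · employment type" string on the ` · ` separator. A minimal sketch of that step follows; the helper name `split_company_subtitle` is hypothetical and the sample values in the usage note are made up for illustration.

```python
from typing import Optional, Tuple


def split_company_subtitle(subtitle: str) -> Tuple[str, Optional[str]]:
    """Split a subtitle into (company, emp_type) on the ' · ' separator."""
    parts = subtitle.split(" · ")
    if len(parts) > 1:
        # First part is the company, last part the employment type.
        return parts[0], parts[-1].lower()
    # No separator: the whole subtitle is the company name.
    return subtitle, None
```

With this sketch, `split_company_subtitle("Acme · Full-time")` yields `("Acme", "full-time")`, and a subtitle without the separator is returned unchanged with `None` as the employment type.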
================================================
FILE: staffspy/linkedin/languages.py
================================================
import json
import logging
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Skill, Staff
logger = logging.getLogger(__name__)
class LanguagesFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.9117695ef207012719e3e0681c667e14&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:languages,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"
def fetch_languages(self, staff: Staff):
ep = self.endpoint.format(employee_id=staff.id)
res = self.session.get(ep)
logger.debug(f"languages, status code - {res.status_code}")
if res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
logger.debug(res.text)
return False
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text)
return False
if res_json.get("errors"):
return False
staff.languages = self.parse_languages(res_json)
return True
def parse_languages(self, language_json: dict) -> list[str]:
languages = []
elements = language_json["data"]["identityDashProfileComponentsBySectionType"][
"elements"
][0]["components"]["pagedListComponent"]["components"]["elements"]
for element in elements:
if comp := element["components"]["entityComponent"]:
title = comp["titleV2"]["text"]["text"]
languages.append(title)
return languages
================================================
FILE: staffspy/linkedin/linkedin.py
================================================
"""
staffspy.linkedin.linkedin
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape LinkedIn.
"""
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import quote, unquote
import requests
import staffspy.utils.utils as utils
from staffspy.utils.exceptions import TooManyRequests, BadCookies, GeoUrnNotFound
from staffspy.linkedin.contact_info import ContactInfoFetcher
from staffspy.linkedin.certifications import CertificationFetcher
from staffspy.linkedin.employee import EmployeeFetcher
from staffspy.linkedin.employee_bio import EmployeeBioFetcher
from staffspy.linkedin.experiences import ExperiencesFetcher
from staffspy.linkedin.languages import LanguagesFetcher
from staffspy.linkedin.schools import SchoolsFetcher
from staffspy.linkedin.skills import SkillsFetcher
from staffspy.utils.models import Staff
from staffspy.utils.utils import logger
class LinkedInScraper:
employees_ep = "https://www.linkedin.com/voyager/api/graphql?variables=(start:{offset},query:(flagshipSearchIntent:SEARCH_SRP,{search}queryParameters:List({company_id}{location}(key:resultType,value:List(PEOPLE))),includeFiltersInResponse:false),count:{count})&queryId=voyagerSearchDashClusters.66adc6056cf4138949ca5dcb31bb1749"
company_id_ep = "https://www.linkedin.com/voyager/api/organization/companies?q=universalName&universalName="
company_search_ep = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSearchDashClusters.02af3bc8bc85a169bb76bb4805d05759&queryName=SearchClusterCollection&variables=(query:(flagshipSearchIntent:SEARCH_SRP,keywords:{company},includeFiltersInResponse:false,queryParameters:(keywords:List({company}),resultType:List(COMPANIES))),count:10,origin:GLOBAL_SEARCH_HEADER,start:0)"
location_id_ep = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSearchDashReusableTypeahead.57a4fa1dd92d3266ed968fdbab2d7bf5&queryName=SearchReusableTypeaheadByType&variables=(query:(showFullLastNameForConnections:false,typeaheadFilterQuery:(geoSearchTypes:List(MARKET_AREA,COUNTRY_REGION,ADMIN_DIVISION_1,CITY))),keywords:{location},type:GEO,start:0)"
public_user_id_ep = (
"https://www.linkedin.com/voyager/api/identity/profiles/{user_id}/profileView"
)
connections_ep = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerSearchDashClusters.dfcd3603c2779eddd541f572936f4324&queryName=SearchClusterCollection&variables=(query:(queryParameters:(resultType:List(FOLLOWERS)),flagshipSearchIntent:MYNETWORK_CURATION_HUB,includeFiltersInResponse:true),count:50,origin:CurationHub,start:{offset})"
block_user_ep = "https://www.linkedin.com/voyager/api/voyagerTrustDashContentReportingForm?action=entityBlock"
connect_to_user_ep = "https://www.linkedin.com/voyager/api/voyagerRelationshipsDashMemberRelationships?action=verifyQuotaAndCreateV2&decorationId=com.linkedin.voyager.dash.deco.relationships.InvitationCreationResultWithInvitee-1"
def __init__(self, session: requests.Session):
self.session = session
(
self.company_id,
self.staff_count,
self.num_staff,
self.company_name,
self.domain,
self.max_results,
self.search_term,
self.location,
self.raw_location,
) = (None, None, None, None, None, None, None, None, None)
self.on_block = False
self.connect_block = False
self.certs = CertificationFetcher(self.session)
self.skills = SkillsFetcher(self.session)
self.employees = EmployeeFetcher(self.session)
self.schools = SchoolsFetcher(self.session)
self.experiences = ExperiencesFetcher(self.session)
self.bio = EmployeeBioFetcher(self.session)
self.languages = LanguagesFetcher(self.session)
self.contact = ContactInfoFetcher(self.session)
def search_companies(self, company_name: str):
"""Search LinkedIn for the company name and return the universal name of the first matching company."""
company_search_ep = self.company_search_ep.format(company=quote(company_name))
self.session.headers["x-li-graphql-pegasus-client"] = "true"
res = self.session.get(company_search_ep)
self.session.headers.pop("x-li-graphql-pegasus-client", "")
if not res.ok:
raise Exception(
f"Failed to search for company {company_name}",
res.status_code,
res.text[:200],
)
logger.debug(
f"Searched companies for name '{company_name}' - res code {res.status_code}"
)
companies = res.json()["data"]["searchDashClustersByAll"]["elements"]
err_msg = f"No companies found for name {company_name}"
if len(companies) < 2:
raise Exception(err_msg)
try:
num_results = companies[0]["items"][0]["item"]["simpleTextV2"]["text"][
"text"
]
first_company = companies[1]["items"][0]["item"].get("entityResult")
if not first_company and len(companies) > 2:
first_company = companies[2]["items"][0]["item"].get("entityResult")
if not first_company:
raise Exception(err_msg)
company_link = first_company["navigationUrl"]
company_name_id = unquote(
re.search(r"/company/([^/]+)", company_link).group(1)
)
company_name_new = first_company["title"]["text"]
except Exception as e:
raise Exception(
f"Failed to load json in search_companies {str(e)}, Response: {res.text[:200]}"
)
logger.info(
f"Searched for company {company_name} on LinkedIn ({num_results}), using first result with company name - '{company_name_new}' and company id - '{company_name_id}'"
)
return company_name_id
def fetch_or_search_company(self, company_name):
"""Fetch the company details by name, or search if not found."""
res = self.session.get(f"{self.company_id_ep}{company_name}")
if res.status_code not in (200, 404):
raise Exception(
f"Failed to find company {company_name} (likely due to outdated login if you know it's a valid company)",
res.status_code,
res.text[:200],
)
elif res.status_code == 404:
logger.info(
f"Failed to directly use company '{company_name}' as company id, now searching for the company"
)
company_name = self.search_companies(company_name)
res = self.session.get(f"{self.company_id_ep}{company_name}")
if res.status_code != 200:
raise Exception(
f"Failed to find company after performing a direct and generic search for {company_name}",
res.status_code,
res.text[:200],
)
if not res.ok:
logger.debug(f"res code {res.status_code} - fetched company ")
return res
def _get_company_id_and_staff_count(self, company_name: str):
"""Extract company id and staff count from the company details."""
res = self.fetch_or_search_company(company_name)
try:
response_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text[:200])
raise Exception(
f"Failed to load json in get_company_id_and_staff_count {res.text[:200]}"
)
company = response_json["elements"][0]
self.domain = (
utils.extract_base_domain(company["companyPageUrl"])
if company.get("companyPageUrl")
else None
)
staff_count = company["staffCount"]
company_id = company["trackingInfo"]["objectUrn"].split(":")[-1]
company_name = company["universalName"]
logger.info(f"Found company '{company_name}' with {staff_count} staff")
return company_id, staff_count
def parse_staff(self, elements: list[dict]):
"""Parse the staff from the search results"""
staff = []
for elem in elements:
for card in elem.get("items", []):
person = card.get("item", {}).get("entityResult", {})
if not person:
continue
pattern = (
r"urn:li:fsd_profile:([^,]+),(?:SEARCH_SRP|MYNETWORK_CURATION_HUB)"
)
match = re.search(pattern, person["entityUrn"])
linkedin_id = match.group(1) if match else None
person_urn = person["trackingUrn"].split(":")[-1]
name = person["title"]["text"].strip()
headline = (
person.get("primarySubtitle", {}).get("text", "")
if person.get("primarySubtitle")
else ""
)
profile_link = person["navigationUrl"].split("?")[0]
staff.append(
Staff(
urn=person_urn,
id=linkedin_id,
name=name,
headline=headline,
search_term=" - ".join(
filter(
None,
[
self.company_name,
self.search_term,
self.raw_location,
],
)
),
profile_link=profile_link,
)
)
return staff
def fetch_staff(self, offset: int):
"""Fetch the staff using LinkedIn search"""
ep = self.employees_ep.format(
offset=offset,
company_id=(
f"(key:currentCompany,value:List({self.company_id})),"
if self.company_id
else ""
),
count=50,
search=f"keywords:{quote(self.search_term)}," if self.search_term else "",
location=(
f"(key:geoUrn,value:List({self.location}))," if self.location else ""
),
)
res = self.session.get(ep)
if not res.ok:
logger.debug(f"employees, status code - {res.status_code}")
if res.status_code == 400:
raise BadCookies("Outdated login, delete the session file to log in again")
elif res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
return None, 0
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text)
return None, 0
try:
elements = res_json["data"]["searchDashClustersByAll"]["elements"]
total_count = res_json["data"]["searchDashClustersByAll"]["metadata"][
"totalResultCount"
]
except (KeyError, IndexError, TypeError):
logger.debug(res_json)
return None, 0
new_staff = self.parse_staff(elements) if elements else []
return new_staff, total_count
def fetch_connections_page(self, offset: int):
self.session.headers["x-li-graphql-pegasus-client"] = "true"
res = self.session.get(self.connections_ep.format(offset=offset))
self.session.headers.pop("x-li-graphql-pegasus-client", "")
if not res.ok:
logger.debug(f"employees, status code - {res.status_code}")
if res.status_code == 400:
raise BadCookies("Outdated login, delete the session file to log in again")
elif res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
return None, 0
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text)
return None, 0
try:
elements = res_json["data"]["searchDashClustersByAll"]["elements"]
total_count = res_json["data"]["searchDashClustersByAll"]["metadata"][
"totalResultCount"
]
except (KeyError, IndexError, TypeError):
logger.debug(res_json)
return None, 0
new_staff = self.parse_staff(elements) if elements else []
return new_staff, total_count
def scrape_connections(
self,
max_results: int = 10**8,
extra_profile_data: bool = False,
):
self.search_term = "connections"
staff_list: list[Staff] = []
try:
initial_staff, total_search_result_count = self.fetch_connections_page(0)
if initial_staff:
staff_list.extend(initial_staff)
self.num_staff = min(total_search_result_count, max_results)
for offset in range(50, self.num_staff, 50):
staff, _ = self.fetch_connections_page(offset)
# Stop before logging: a failed page yields no staff, and len() would fail on None
if not staff:
break
logger.debug(
f"Connections from search: {len(staff)} new, {len(staff_list) + len(staff)} total"
)
staff_list.extend(staff)
except (BadCookies, TooManyRequests) as e:
self.on_block = True
logger.error(f"Exiting early due to fatal error: {str(e)}")
return staff_list[:max_results]
reduced_staff_list = staff_list[:max_results]
non_restricted = list(
filter(lambda x: x.name != "LinkedIn Member", reduced_staff_list)
)
if extra_profile_data:
try:
for i, employee in enumerate(non_restricted, start=1):
self.fetch_all_info_for_employee(employee, i)
except TooManyRequests as e:
logger.error(str(e))
return reduced_staff_list
def fetch_location_id(self):
"""Fetch the location id for the location to be used in LinkedIn search"""
ep = self.location_id_ep.format(location=quote(self.raw_location))
res = self.session.get(ep)
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
if res.reason == "INKApi Error":
raise Exception(
"Delete session file and log in again",
res.status_code,
res.text[:200],
res.reason,
)
raise GeoUrnNotFound(
"Failed to send request to get geo id",
res.status_code,
res.text[:200],
res.reason,
)
try:
elems = res_json["data"]["searchDashReusableTypeaheadByType"]["elements"]
except (KeyError, IndexError, TypeError):
raise GeoUrnNotFound("Failed to locate geo id", str(res_json)[:200])
geo_id = None
if elems:
urn = elems[0]["trackingUrn"]
m = re.search("urn:li:geo:(.+)", urn)
if m:
geo_id = m.group(1)
if not geo_id:
raise GeoUrnNotFound("Failed to parse geo id")
self.location = geo_id
def scrape_staff(
self,
company_name: str | None,
search_term: str,
location: str,
extra_profile_data: bool,
max_results: int,
block: bool,
connect: bool,
):
"""Main function entry point to scrape LinkedIn staff"""
self.search_term = search_term
self.company_name = company_name
self.max_results = max_results
self.raw_location = location
self.company_id = None
if self.company_name:
self.company_id, staff_count = self._get_company_id_and_staff_count(
company_name
)
staff_list: list[Staff] = []
if self.raw_location:
try:
self.fetch_location_id()
except GeoUrnNotFound as e:
logger.error(str(e))
return staff_list[:max_results]
try:
initial_staff, total_count = self.fetch_staff(0)
if initial_staff:
staff_list.extend(initial_staff)
location = f", location: '{location}'" if location else ""
logger.info(
f"1) Search results for company: '{company_name}'{location} - {total_count:,} staff"
)
self.num_staff = min(total_count, max_results, 1000)
for offset in range(50, self.num_staff, 50):
staff, _ = self.fetch_staff(offset)
# fetch_staff returns (None, 0) on failure, so check before calling len()
if not staff:
break
logger.debug(
f"Staff members from search: {len(staff)} new, {len(staff_list) + len(staff)} total"
)
staff_list.extend(staff)
logger.info(
f"2) Total results collected for company: '{company_name}'{location} - {len(staff_list)} results"
)
except (BadCookies, TooManyRequests) as e:
self.on_block = True
logger.error(f"Exiting early due to fatal error: {str(e)}")
return staff_list[:max_results]
reduced_staff_list = staff_list[:max_results]
non_restricted = list(
filter(lambda x: x.name != "LinkedIn Member", reduced_staff_list)
)
if extra_profile_data:
try:
for i, employee in enumerate(non_restricted, start=1):
self.fetch_all_info_for_employee(employee, i)
if block:
self.block_user(employee)
elif connect:
self.connect_user(employee)
except TooManyRequests as e:
logger.error(str(e))
return reduced_staff_list
def fetch_all_info_for_employee(self, employee: Staff, index: int):
"""Simultaneously fetch all the data for an employee"""
logger.info(
f"Fetching data for account {employee.id} {index:>4} / {self.num_staff} - {employee.profile_link}"
)
task_functions = [
(self.employees.fetch_employee, (employee, self.domain), "employee"),
(self.skills.fetch_skills, (employee,), "skills"),
(self.experiences.fetch_experiences, (employee,), "experiences"),
(self.certs.fetch_certifications, (employee,), "certifications"),
(self.schools.fetch_schools, (employee,), "schools"),
(self.bio.fetch_employee_bio, (employee,), "bio"),
(self.languages.fetch_languages, (employee,), "languages"),
]
with ThreadPoolExecutor(max_workers=len(task_functions)) as executor:
tasks = {
executor.submit(func, *args): name
for func, args, name in task_functions
}
for future in as_completed(tasks):
result = future.result()
if employee.is_connection:
self.contact.fetch_contact_info(employee)
def fetch_user_profile_data_from_public_id(self, user_id: str, key: str):
"""Fetches data given the public LinkedIn user id"""
endpoint = self.public_user_id_ep.format(user_id=user_id)
response = self.session.get(endpoint)
try:
response_json = response.json()
except json.decoder.JSONDecodeError:
logger.debug(response.text[:200])
raise Exception(
"Failed to load JSON from endpoint",
response.status_code,
response.reason,
)
keys = {
"user_id": ("positionView", "profileId"),
"company_id": (
"positionView",
"elements",
0,
"company",
"miniCompany",
"universalName",
),
}
try:
data = response_json
for k in keys[key]:
data = data[k]
urn = response_json["profile"]["miniProfile"]["objectUrn"].split(":")[-1]
return data, urn
except (KeyError, TypeError, IndexError) as e:
logger.warning(f"Failed to find user_id {user_id}")
if key == "user_id":
return ""
raise Exception(f"Failed to fetch '{key}' for user_id {user_id}: {e}")
def block_user(self, employee: Staff) -> None:
"""Block a user on LinkedIn given their urn"""
if employee.urn == "headless":
return
self.session.headers["Content-Type"] = (
"application/x-protobuf2; symbol-table=voyager-20757"
)
urn_string = f"urn:li:member:{employee.urn}"
length_byte = bytes([len(urn_string)])
body = b"\x00\x01\x14\nblockeeUrn\x14" + length_byte + urn_string.encode()
res = self.session.post(
self.block_user_ep,
data=body,
)
self.session.headers.pop("Content-Type", "")
if res.ok:
logger.info(f"Successfully blocked user {employee.id}")
elif res.status_code == 403:
logger.warning(
f"Failed to block user - status code 403, one possible reason is that you have already blocked/unblocked this person in the past 48 hours and are on cooldown: {employee.profile_link}"
)
else:
logger.warning(
f"Failed to block user - status code {res.status_code} {employee.id}: {employee.name}"
)
def connect_user(self, employee: Staff) -> None:
"""Connects with a user on LinkedIn given their profile id"""
if self.connect_block:
return logger.info(
f"Skipping connection request for user due to previous block: {employee.id} - {employee.profile_link}"
)
if employee.urn == "headless":
return
if employee.is_connection != "no":
return logger.info(
f"Already connected or pending connection request to user {employee.id} - {employee.profile_link}"
)
self.session.headers["Content-Type"] = (
"application/x-protobuf2; symbol-table=voyager-20757"
)
body = (
b"\x00\x01\x03\xe2\x05\x00\x01\x03\xd3w\x00\x01\x03\xd5\x06\x14:urn:li:fsd_profile:"
+ employee.id.encode()
)
res = self.session.post(
self.connect_to_user_ep,
data=body,
)
self.session.headers.pop("Content-Type", "")
if res.ok:
logger.info(
f"Successfully sent connection request to user {employee.id} - {employee.profile_link}"
)
elif res.status_code == 429:
self.connect_block = True
logger.warning(
f"Failed to connect to user - status code 429 - pausing connection requests for this scrape: {employee.id} - {employee.profile_link}"
)
else:
logger.warning(
f"Failed to connect to user - status code {res.status_code} {employee.id} - {employee.profile_link}"
)
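
The paging loop in `scrape_staff` fetches the first page at offset 0 and then walks the remaining offsets in steps of 50, capped by the total result count, `max_results`, and a hard limit of 1000. The sketch below isolates that arithmetic; the function name `remaining_page_offsets` and its keyword parameters are hypothetical, chosen only to make the calculation easy to try out.

```python
def remaining_page_offsets(
    total_count: int,
    max_results: int,
    page_size: int = 50,
    hard_cap: int = 1000,
) -> list[int]:
    """Return the offsets to request after the initial page at offset 0."""
    # Mirrors: self.num_staff = min(total_count, max_results, 1000)
    num_staff = min(total_count, max_results, hard_cap)
    # Mirrors: for offset in range(50, self.num_staff, 50)
    return list(range(page_size, num_staff, page_size))
```

For instance, a search reporting 120 results with a generous `max_results` needs two further requests, at offsets 50 and 100, while a search with 30 results needs none.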
================================================
FILE: staffspy/linkedin/schools.py
================================================
import json
import logging
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import School
from staffspy.utils.utils import parse_dates
logger = logging.getLogger(__name__)
class SchoolsFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:education,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"
def fetch_schools(self, staff):
ep = self.endpoint.format(employee_id=staff.id)
res = self.session.get(ep)
logger.debug(f"schools, status code - {res.status_code}")
if res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
logger.debug(res.text[:200])
return False
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text[:200])
return False
try:
elements = res_json["data"]["identityDashProfileComponentsBySectionType"][
"elements"
][0]["components"]["pagedListComponent"]["components"]["elements"]
except (KeyError, IndexError, TypeError) as e:
logger.debug(res_json)
return False
staff.schools = self.parse_schools(elements)
return True
def parse_schools(self, elements):
schools = []
start = end = None
for elem in elements:
entity = elem["components"]["entityComponent"]
if not entity:
break
years = entity["caption"]["text"] if entity["caption"] else None
school_name = entity["titleV2"]["text"]["text"]
if years:
start, end = parse_dates(years)
degree = entity["subtitle"]["text"] if entity["subtitle"] else None
school = School(
start_date=start, end_date=end, school=school_name, degree=degree
)
schools.append(school)
return schools
================================================
FILE: staffspy/linkedin/skills.py
================================================
import json
import logging
from staffspy.utils.exceptions import TooManyRequests
from staffspy.utils.models import Skill, Staff
logger = logging.getLogger(__name__)
class SkillsFetcher:
def __init__(self, session):
self.session = session
self.endpoint = "https://www.linkedin.com/voyager/api/graphql?queryId=voyagerIdentityDashProfileComponents.277ba7d7b9afffb04683953cede751fb&queryName=ProfileComponentsBySectionType&variables=(tabIndex:0,sectionType:skills,profileUrn:urn%3Ali%3Afsd_profile%3A{employee_id},count:50)"
def fetch_skills(self, staff: Staff):
ep = self.endpoint.format(employee_id=staff.id)
res = self.session.get(ep)
logger.debug(f"skills, status code - {res.status_code}")
if res.status_code == 429:
raise TooManyRequests("429 Too Many Requests")
if not res.ok:
logger.debug(res.text[:200])
return False
try:
res_json = res.json()
except json.decoder.JSONDecodeError:
logger.debug(res.text[:200])
return False
if res_json.get("errors"):
return False
tab_comp = res_json["data"]["identityDashProfileComponentsBySectionType"][
"elements"
][0]["components"]["tabComponent"]
if tab_comp:
sections = tab_comp["sections"]
staff.skills = self.parse_skills(sections)
return True
def parse_skills(self, sections):
names = set()
skills = []
for section in sections:
elems = section["subComponent"]["components"]["pagedListComponent"][
"components"
]["elements"]
for elem in elems:
passed_assessment, endorsements = None, 0
entity = elem["components"]["entityComponent"]
name = entity["titleV2"]["text"]["text"]
if name in names:
continue
names.add(name)
components = entity["subComponents"]["components"]
for component in components:
try:
candidate = component["components"]["insightComponent"]["text"][
"text"
]["text"]
if " endorsements" in candidate:
endorsements = int(candidate.replace(" endorsements", ""))
if "Passed LinkedIn Skill Assessment" in candidate:
passed_assessment = True
except (KeyError, TypeError, ValueError):
pass
skills.append(
Skill(
name=name,
endorsements=endorsements,
passed_assessment=passed_assessment,
)
)
return skills
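# The endorsement parsing inside parse_skills can be tried in isolation. This is a
# minimal sketch, not the library's implementation: parse_endorsements is a hypothetical
# helper, and the singular "1 endorsement" and comma-separated forms are assumptions
# about how LinkedIn phrases the insight text.

```python
import re

# Hypothetical helper: a leading count followed by "endorsement" or "endorsements".
_ENDORSEMENT_RE = re.compile(r"^\s*(\d[\d,]*)\s+endorsements?\b")


def parse_endorsements(candidate: str) -> int:
    """Return the endorsement count found in an insight string, or 0 if there is none."""
    match = _ENDORSEMENT_RE.match(candidate)
    if not match:
        return 0
    # Drop thousands separators before converting.
    return int(match.group(1).replace(",", ""))


print(parse_endorsements("12 endorsements"))
print(parse_endorsements("1 endorsement"))
print(parse_endorsements("Passed LinkedIn Skill Assessment"))
```

# Unlike a plain str.replace, the regex never raises on unexpected text; it falls back to 0.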
================================================
FILE: staffspy/solvers/capsolver.py
================================================
import json
import time
import requests
from tenacity import retry, stop_after_attempt, retry_if_result
from staffspy.solvers.solver import Solver
def is_none(value):
return value is None
class CapSolver(Solver):
"""https://www.capsolver.com/"""
@retry(stop=stop_after_attempt(10), retry=retry_if_result(is_none))
def solve(self, blob_data: str, page_url: str = None):
from staffspy.utils.utils import logger
logger.info("Waiting on CapSolver to solve captcha...")
payload = {
"clientKey": self.solver_api_key,
"task": {
"type": "FunCaptchaTaskProxyLess",
"websitePublicKey": self.public_key,
"websiteURL": self.page_url,
"data": json.dumps({"blob": blob_data}) if blob_data else "",
},
}
res = requests.post("https://api.capsolver.com/createTask", json=payload)
resp = res.json()
task_id = resp.get("taskId")
if not task_id:
raise Exception(
"CapSolver failed to create task. If this persists, try another captcha solver such as 2Captcha, or use browser sign-in: `pip install staffspy[browser]`, then remove the username/password params from LinkedInAccount()",
res.text,
)
logger.info(f"Received captcha solver taskId: {task_id} / Getting result...")
while True:
time.sleep(1) # delay
payload = {"clientKey": self.solver_api_key, "taskId": task_id}
res = requests.post("https://api.capsolver.com/getTaskResult", json=payload)
resp = res.json()
status = resp.get("status")
if status == "ready":
logger.info("CapSolver finished solving captcha")
return resp.get("solution", {}).get("token")
if status == "failed" or resp.get("errorId"):
logger.info(f"Captcha solve failed! response: {res.text}")
return None
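# The polling loop in CapSolver.solve waits for as long as the task stays unfinished.
# A bounded variant might look like the sketch below. poll_until, its timeout, and its
# interval are hypothetical names and values chosen for illustration; they are not part
# of StaffSpy or of the CapSolver API.

```python
import time
from typing import Callable, Optional


def poll_until(
    fetch: Callable[[], Optional[str]],
    timeout: float = 120.0,
    interval: float = 1.0,
) -> Optional[str]:
    """Call fetch() until it returns a non-None value or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result is not None:
            return result
        time.sleep(interval)
    return None  # the caller decides whether to retry or raise


# Stand-in for the getTaskResult request: the token is ready on the third call.
calls = iter([None, None, "token-abc"])
print(poll_until(lambda: next(calls), timeout=5.0, interval=0.01))
```

# Returning None on timeout fits the existing retry_if_result(is_none) decorator, which
# retries whenever solve returns None.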
================================================
FILE: staffspy/solvers/solver.py
================================================
from abc import ABC, abstractmethod
class Solver(ABC):
public_key = "3117BF26-4762-4F5A-8ED9-A85E69209A46"
page_url = "https://iframe.arkoselabs.com"
def __init__(self, solver_api_key: str):
self.solver_api_key = solver_api_key
@abstractmethod
def solve(self, blob_data: str, page_url: str = None):
pass
================================================
FILE: staffspy/solvers/solver_type.py
================================================
from enum import Enum
class SolverType(Enum):
CAPSOLVER = 'capsolver'
TWO_CAPTCHA = 'twocaptcha'
================================================
FILE: staffspy/solvers/two_captcha.py
================================================
from tenacity import retry_if_exception_type, stop_after_attempt, retry
from twocaptcha import TwoCaptcha, TimeoutException, ApiException, NetworkException
from staffspy.solvers.solver import Solver
class TwoCaptchaSolver(Solver):
"""https://2captcha.com/"""
attempt = 1
@retry(
stop=stop_after_attempt(5),
retry=retry_if_exception_type(
(TimeoutException, ApiException, NetworkException)
),
)
def solve(self, blob_data: str, page_url: str = None):
super().solve(blob_data, page_url)
from staffspy.utils.utils import logger
logger.info(
f"Waiting on 2Captcha to solve captcha attempt {self.attempt} / 5 ..."
)
self.attempt += 1
solver = TwoCaptcha(self.solver_api_key)
result = solver.funcaptcha(
sitekey=self.public_key,
url=page_url,
**{"data[blob]": blob_data},
surl="https://iframe.arkoselabs.com",
)
logger.info("2Captcha finished solving captcha")
return result["code"]
================================================
FILE: staffspy/utils/driver_type.py
================================================
from enum import Enum
from typing import Optional
class BrowserType(Enum):
CHROME = "chrome"
FIREFOX = "firefox"
class DriverType:
def __init__(
self, browser_type: BrowserType, executable_path: Optional[str] = None
):
self.browser_type = browser_type
self.executable_path = executable_path
================================================
FILE: staffspy/utils/exceptions.py
================================================
class TooManyRequests(Exception):
"""Too many requests."""
class BadCookies(Exception):
"""Login expiration."""
class GeoUrnNotFound(Exception):
"""Could not find geo urn for given location."""
class BlobException(Exception):
"""Could not find the blob needed to solve the captcha."""
================================================
FILE: staffspy/utils/models.py
================================================
from datetime import datetime, date
from pydantic import BaseModel
from datetime import datetime as dt
from staffspy.utils.utils import extract_emails_from_text
class Comment(BaseModel):
post_id: str
comment_id: str | None = None
internal_profile_id: str | None = None
public_profile_id: str | None = None
name: str | None = None
text: str | None = None
num_likes: int | None = None
created_at: dt | None = None
def to_dict(self):
return {
"post_id": self.post_id,
"comment_id": self.comment_id,
"internal_profile_id": self.internal_profile_id,
"public_profile_id": self.public_profile_id,
"name": self.name,
"text": self.text,
"num_likes": self.num_likes,
"created_at": self.created_at,
}
class School(BaseModel):
start_date: date | None = None
end_date: date | None = None
school: str | None = None
degree: str | None = None
def to_dict(self):
return {
"start_date": self.start_date.isoformat() if self.start_date else None,
"end_date": self.end_date.isoformat() if self.end_date else None,
"school": self.school,
"degree": self.degree,
}
class Skill(BaseModel):
name: str | None = None
endorsements: int | None = None
passed_assessment: bool | None = None
def to_dict(self):
return {
"name": self.name,
"endorsements": self.endorsements if self.endorsements else 0,
"passed_assessment": self.passed_assessment,
}
class ContactInfo(BaseModel):
email_address: str | None = None
websites: list | None = None
phone_numbers: list | None = None
address: str | None = None
birthday: str | None = None
created_at: str | None = None
def to_dict(self):
return {
"email_address": self.email_address,
"websites": self.websites,
"phone_numbers": self.phone_numbers,
"address": self.address,
"birthday": self.birthday,
"created_at": self.created_at,
}
class Certification(BaseModel):
title: str | None = None
issuer: str | None = None
date_issued: str | None = None
cert_id: str | None = None
cert_link: str | None = None
def to_dict(self):
return {
"title": self.title,
"issuer": self.issuer,
"date_issued": self.date_issued,
"cert_id": self.cert_id,
"cert_link": self.cert_link,
}
class Experience(BaseModel):
duration: str | None = None
title: str | None = None
company: str | None = None
location: str | None = None
emp_type: str | None = None
start_date: date | None = None
end_date: date | None = None
def to_dict(self):
return {
"start_date": self.start_date.isoformat() if self.start_date else None,
"end_date": self.end_date.isoformat() if self.end_date else None,
"duration": self.duration,
"title": self.title,
"company": self.company,
"location": self.location,
"emp_type": self.emp_type,
}
class Staff(BaseModel):
urn: str | None = None
search_term: str
id: str
name: str | None = None
headline: str | None = None
current_position: str | None = None
profile_id: str | None = None
profile_link: str | None = None
first_name: str | None = None
last_name: str | None = None
potential_emails: list | None = None
bio: str | None = None
emails_in_bio: list[str] | None = None
followers: int | None = None
connections: int | None = None
mutual_connections: int | None = None
is_connection: str | None = None # yes, no, pending
location: str | None = None
company: str | None = None
school: str | None = None
influencer: bool | None = None
creator: bool | None = None
premium: bool | None = None
open_to_work: bool | None = None
is_hiring: bool | None = None
profile_photo: str | None = None
banner_photo: str | None = None
skills: list[Skill] | None = None
experiences: list[Experience] | None = None
certifications: list[Certification] | None = None
contact_info: ContactInfo | None = None
schools: list[School] | None = None
languages: list[str] | None = None
def get_top_skills(self):
top_three_skills = []
if self.skills:
sorted_skills = sorted(
self.skills, key=lambda x: x.endorsements or 0, reverse=True
)
top_three_skills = [skill.name for skill in sorted_skills[:3]]
top_three_skills += [None] * (3 - len(top_three_skills))
return top_three_skills
def to_dict(self):
sorted_schools = (
sorted(
self.schools,
key=lambda x: (x.end_date is None, x.end_date),
reverse=True,
)
if self.schools
else []
)
top_three_school_names = [school.school for school in sorted_schools[:3]]
top_three_school_names += [None] * (3 - len(top_three_school_names))
estimated_age = self.estimate_age_based_on_education()
sorted_experiences = (
sorted(
self.experiences,
key=lambda x: (x.end_date is None, x.end_date),
reverse=True,
)
if self.experiences
else []
)
top_three_companies = []
seen_companies = set()
for exp in sorted_experiences:
if exp.company not in seen_companies:
top_three_companies.append(exp.company)
seen_companies.add(exp.company)
if len(top_three_companies) == 3:
break
top_three_companies += [None] * (3 - len(top_three_companies))
top_three_skills = self.get_top_skills()
self.emails_in_bio = extract_emails_from_text(self.bio) if self.bio else None
self.current_position = (
sorted_experiences[0].title
if len(sorted_experiences) > 0 and sorted_experiences[0].end_date is None
else None
)
contact_info = self.contact_info.to_dict() if self.contact_info else {}
return {
"search_term": self.search_term,
"id": self.id,
"urn": self.urn,
"profile_link": self.profile_link,
"profile_id": self.profile_id,
"name": self.name,
"first_name": self.first_name,
"last_name": self.last_name,
"location": self.location,
"headline": self.headline,
"estimated_age": estimated_age,
"followers": self.followers,
"connections": self.connections,
"mutuals": self.mutual_connections,
"is_connection": self.is_connection,
"premium": self.premium,
"creator": self.creator,
"influencer": self.influencer,
"open_to_work": self.open_to_work,
"is_hiring": self.is_hiring,
"current_position": self.current_position,
"current_company": top_three_companies[0],
"past_company_1": top_three_companies[1],
"past_company_2": top_three_companies[2],
"school_1": top_three_school_names[0],
"school_2": top_three_school_names[1],
"top_skill_1": top_three_skills[0],
"top_skill_2": top_three_skills[1],
"top_skill_3": top_three_skills[2],
"bio": self.bio,
"experiences": (
[exp.to_dict() for exp in self.experiences]
if self.experiences
else None
),
"schools": (
[school.to_dict() for school in self.schools] if self.schools else None
),
"skills": (
[skill.to_dict() for skill in self.skills] if self.skills else None
),
"certifications": (
[cert.to_dict() for cert in self.certifications]
if self.certifications
else None
),
"languages": self.languages,
"emails_in_bio": (
", ".join(self.emails_in_bio) if self.emails_in_bio else None
),
"potential_emails": self.potential_emails,
"profile_photo": self.profile_photo,
"banner_photo": self.banner_photo,
"connection_created_at": contact_info.get("created_at"),
"connection_email": contact_info.get("email_address"),
"connection_phone_numbers": contact_info.get("phone_numbers"),
"connection_websites": contact_info.get("websites"),
"connection_street_address": contact_info.get("address"),
"connection_birthday": contact_info.get("birthday"),
}
def estimate_age_based_on_education(self):
"""Estimates age as 18 plus the whole years elapsed since the first college start date."""
college_words = ["uni", "college"]
sorted_schools = (
sorted(
[school for school in self.schools if school.start_date],
key=lambda x: x.start_date,
)
if self.schools
else []
)
current_date = datetime.now().date()
for school in sorted_schools:
if (
any(word in (school.school or "").lower() for word in college_words)
or school.degree
):
if school.start_date:
years_in_education = (current_date - school.start_date).days // 365
return int(18 + years_in_education)
return None
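# The arithmetic behind estimate_age_based_on_education can be checked on its own.
# A minimal sketch, assuming the person was 18 at the first college start date;
# estimate_age is a hypothetical standalone function, not a method of Staff.

```python
from datetime import date
from typing import Optional


def estimate_age(first_college_start: Optional[date], today: date) -> Optional[int]:
    """Assume age 18 at college start, then add the whole years elapsed since."""
    if first_college_start is None:
        return None
    years_since_start = (today - first_college_start).days // 365
    return 18 + years_since_start


# Started college in September 2010, evaluated on 1 September 2024.
print(estimate_age(date(2010, 9, 1), date(2024, 9, 1)))  # 32
```

# Dividing by 365 ignores leap days, so the result can tick over a few days early.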
================================================
FILE: staffspy/utils/utils.py
================================================
import logging
import os
import pickle
import re
from datetime import datetime
import pandas as pd
from typing import Optional
from urllib.parse import quote
import requests
import tldextract
from bs4 import BeautifulSoup
from dateutil.parser import parse
from tenacity import stop_after_attempt, retry_if_exception_type, retry, RetryError
from staffspy.solvers.solver import Solver
from staffspy.utils.driver_type import DriverType, BrowserType
from staffspy.utils.exceptions import BlobException
logger = logging.getLogger("StaffSpy")
logger.propagate = False
if not logger.handlers:
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
formatter = logging.Formatter(log_format)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
def set_csrf_token(session):
csrf_token = session.cookies["JSESSIONID"].replace('"', "")
session.headers.update({"Csrf-Token": csrf_token})
return session
def extract_base_domain(url: str):
extracted = tldextract.extract(url)
base_domain = "{}.{}".format(extracted.domain, extracted.suffix)
return base_domain
def create_emails(first, last, domain):
first = "".join(filter(str.isalpha, first)).lower()
last = "".join(filter(str.isalpha, last)).lower()
emails = [
f"{first}.{last}@{domain}",
f"{first[:1]}{last}@{domain}",
f"{first[:2]}{last}@{domain}",
f"{first}{last[:1]}@{domain}",
f"{first}{last[:2]}@{domain}",
]
return emails
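# To see which address patterns create_emails produces, the function is repeated below
# so the block runs on its own. The names and the example.com domain are illustrative
# assumptions.

```python
def create_emails(first, last, domain):
    # Keep letters only, so "Mary-Anne" becomes "maryanne".
    first = "".join(filter(str.isalpha, first)).lower()
    last = "".join(filter(str.isalpha, last)).lower()
    return [
        f"{first}.{last}@{domain}",
        f"{first[:1]}{last}@{domain}",
        f"{first[:2]}{last}@{domain}",
        f"{first}{last[:1]}@{domain}",
        f"{first}{last[:2]}@{domain}",
    ]


for email in create_emails("Mary-Anne", "O'Neil", "example.com"):
    print(email)
```

# These are guesses at common corporate address formats; none of them is verified to exist.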
def get_webdriver(driver_type: Optional[DriverType] = None):
try:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.firefox.service import Service as FirefoxService
except ImportError as e:
raise Exception(
'install package `pip install "staffspy[browser]"` to login with browser'
) from e
if driver_type:
if str(driver_type.browser_type) == str(BrowserType.CHROME):
if driver_type.executable_path:
service = ChromeService(executable_path=driver_type.executable_path)
return webdriver.Chrome(service=service)
else:
return webdriver.Chrome()
elif str(driver_type.browser_type) == str(BrowserType.FIREFOX):
if driver_type.executable_path:
service = FirefoxService(executable_path=driver_type.executable_path)
return webdriver.Firefox(service=service)
else:
return webdriver.Firefox()
else:
for browser in [webdriver.Chrome, webdriver.Firefox]:
try:
return browser()
except Exception:
continue
return None
class Login:
def __init__(
self,
username: str,
password: str,
solver: Solver,
session_file: str,
driver_type: DriverType = None,
):
(
self.username,
self.password,
self.solver,
self.session_file,
self.driver_type,
) = (username, password, solver, session_file, driver_type)
def solve_captcha(self, session, data, payload):
url = data["challenge_url"]
r = session.post(url, data=payload)
soup = BeautifulSoup(r.text, "html.parser")
code_tag = soup.find("code", id="securedDataExchange")
logger.info("Searching for captcha blob in linkedin to begin captcha solving")
if code_tag:
comment = code_tag.contents[0]
extracted_code = str(comment).strip('<!--""-->').strip()
logger.debug("Extracted captcha blob: %s", extracted_code)
elif "Please choose a more secure password." in r.text:
raise Exception(
"linkedin is requiring a more secure password. reset pw and try again"
)
else:
raise BlobException(
"blob to solve captcha not found - rerunning the program usually solves this"
)
if not self.solver:
raise Exception(
"captcha hit - provide solver_api_key and solver_service name to solve or switch to the browser-based login with `pip install staffspy[browser]`"
)
token = self.solver.solve(extracted_code, url)
if not token:
raise Exception("failed to solve captcha after 10 attempts")
captcha_site_key = soup.find("input", {"name": "captchaSiteKey"})["value"]
challenge_id = soup.find("input", {"name": "challengeId"})["value"]
challenge_data = soup.find("input", {"name": "challengeData"})["value"]
challenge_details = soup.find("input", {"name": "challengeDetails"})["value"]
challenge_type = soup.find("input", {"name": "challengeType"})["value"]
challenge_source = soup.find("input", {"name": "challengeSource"})["value"]
request_submission_id = soup.find("input", {"name": "requestSubmissionId"})[
"value"
]
display_time = soup.find("input", {"name": "displayTime"})["value"]
page_instance = soup.find("input", {"name": "pageInstance"})["value"]
failure_redirect_uri = soup.find("input", {"name": "failureRedirectUri"})[
"value"
]
sign_in_link = soup.find("input", {"name": "signInLink"})["value"]
join_now_link = soup.find("input", {"name": "joinNowLink"})["value"]
for cookie in session.cookies:
if cookie.name == "JSESSIONID":
jsession_value = cookie.value.split("ajax:")[1].strip('"')
break
else:
raise Exception("jsessionid not found, raise issue on GitHub")
csrf_token = f"ajax:{jsession_value}"
payload = {
"csrfToken": csrf_token,
"captchaSiteKey": captcha_site_key,
"challengeId": challenge_id,
"language": "en-US",
"displayTime": display_time,
"challengeType": challenge_type,
"challengeSource": challenge_source,
"requestSubmissionId": request_submission_id,
"captchaUserResponseToken": token,
"challengeData": challenge_data,
"pageInstance": page_instance,
"challengeDetails": challenge_details,
"failureRedirectUri": failure_redirect_uri,
"signInLink": sign_in_link,
"joinNowLink": join_now_link,
"_s": "CONSUMER_LOGIN",
}
encoded_payload = {
key: f'{quote(str(value), "")}' for key, value in payload.items()
}
query_string = "&".join(
[f"{key}={value}" for key, value in encoded_payload.items()]
)
response = session.post(
"https://www.linkedin.com/checkpoint/challenge/verify", data=query_string
)
if not response.ok:
raise Exception(f"verify captcha failed {response.text[:200]}")
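# The hand-built query string in solve_captcha appears to match what the standard
# library produces with urlencode when quote is used instead of quote_plus. A minimal
# sketch with made-up payload values; the real payload carries the challenge fields.

```python
from urllib.parse import quote, urlencode

# Illustrative values only.
payload = {"csrfToken": "ajax:123", "language": "en-US", "_s": "CONSUMER_LOGIN"}

# Manual form, as in solve_captcha: percent-encode each value, then join with "&".
manual = "&".join(f"{key}={quote(str(value), '')}" for key, value in payload.items())

# safe="" with quote_via=quote percent-encodes every reserved character, ":" included.
encoded = urlencode(payload, safe="", quote_via=quote)

print(manual)
print(manual == encoded)
```

# The two forms agree here because the keys contain no characters that need escaping.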
@retry(stop=stop_after_attempt(5), retry=retry_if_exception_type(BlobException))
def login_requests(self):
url = "https://www.linkedin.com/uas/authenticate"
encoded_username = quote(self.username)
encoded_password = quote(self.password)
session = requests.Session()
session.headers = {
"X-Li-User-Agent": "LIAuthLibrary:44.0.* com.linkedin.LinkedIn:9.29.8962 iPhone:17.5.1",
"User-Agent": "LinkedIn/9.29.8962 CFNetwork/1496.0.7 Darwin/23.5.0",
"X-User-Language": "en",
"X-User-Locale": "en_US",
"Accept-Language": "en-us",
}
response = session.get(url)
if response.status_code != 200:
raise Exception(
f"failed to begin auth process: {response.status_code} {response.text}"
)
for cookie in session.cookies:
if cookie.name == "JSESSIONID":
jsession_value = cookie.value.split("ajax:")[1].strip('"')
break
else:
raise Exception("jsessionid not found, raise issue on GitHub")
session.headers["content-type"] = "application/x-www-form-urlencoded"
csrf_token = f"ajax%3A{jsession_value}"
payload = f"session_key={encoded_username}&session_password={encoded_password}&JSESSIONID=%22{csrf_token}%22"
response = session.post(url, data=payload)
data = response.json()
if data["login_result"] == "BAD_USERNAME_OR_PASSWORD":
raise Exception("incorrect username or password")
elif data["login_result"] == "CHALLENGE":
self.solve_captcha(session, data, payload)
session = set_csrf_token(session)
return session
def login_browser(self):
"""Backup login method"""
driver = get_webdriver(self.driver_type)
if driver is None:
logger.debug("No browser found for selenium")
raise Exception("driver not found for selenium")
driver.get("https://linkedin.com/login")
input("Press enter after logged in")
selenium_cookies = driver.get_cookies()
driver.quit()
session = requests.Session()
for cookie in selenium_cookies:
session.cookies.set(cookie["name"], cookie["value"])
session = set_csrf_token(session)
return session
def save_session(self, session, session_file: str):
data = {"cookies": session.cookies, "headers": session.headers}
with open(session_file, "wb") as f:
pickle.dump(data, f)
def load_session(self):
"""Load session from session file, otherwise login"""
session = None
if not self.session_file or not os.path.exists(self.session_file):
if self.username and self.password:
try:
session = self.login_requests()
except RetryError as retry_err:
retry_err.reraise()
else:
session = self.login_browser()
if not session:
raise Exception("Failed to log in.")
if self.session_file:
self.save_session(session, self.session_file)
else:
with open(self.session_file, "rb") as f:
data = pickle.load(f)
session = requests.Session()
session.cookies.update(data["cookies"])
session.headers.update(data["headers"])
session.headers.update(
{
"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
"X-RestLi-Protocol-Version": "2.0.0",
"X-Li-Track": '{"clientVersion":"1.13.1665"}',
}
)
if not self.check_logged_in(session):
raise Exception(
"Failed to log in. The session file is likely outdated and its cookies have expired. Delete the file and rerun the LinkedInAccount() code"
)
return session
def check_logged_in(self, session):
logger.info("Testing if logged in by checking arbitrary LinkedIn company page")
try:
res = session.get(
"https://www.linkedin.com/voyager/api/organization/companies?q=universalName&universalName=amazon"
)
if res.status_code != 200:
logger.error(f"{res.status_code} status code returned from linkedin")
return False
except Exception as e:
logger.error(f"Failed to get arbitrary company page: {e}")
return False
logger.info("Account successfully logged in - res code 200")
return True
def parse_date(date_str):
formats = ["%b %Y", "%Y"]
for fmt in formats:
try:
return datetime.strptime(date_str, fmt)
except ValueError:
continue
return None
def parse_duration(duration):
from_date = to_date = None
dates = duration.split(" · ")
if len(dates) > 1:
date_range = dates[0]
dates = date_range.split(" - ")
from_date_str = dates[0]
to_date_str = dates[1] if dates[1] != "Present" else None
from_date = parse_date(from_date_str) if from_date_str else None
to_date = parse_date(to_date_str) if to_date_str else None
return from_date, to_date
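# A standalone sketch of the duration parsing above, using only the standard library.
# split_duration and parse_month_year are hypothetical names, and the sample strings are
# assumed to follow LinkedIn's "start - end · length" caption format. Unlike
# parse_duration, this variant also parses captions that lack the " · " part.

```python
from datetime import datetime
from typing import Optional, Tuple


def parse_month_year(text: str) -> Optional[datetime]:
    """Try 'Jan 2020' first, then a bare year such as '2020'."""
    for fmt in ("%b %Y", "%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def split_duration(duration: str) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (start, end) for a caption; end is None when the role is ongoing."""
    date_range = duration.split(" · ")[0]
    start_str, sep, end_str = date_range.partition(" - ")
    start = parse_month_year(start_str)
    end = parse_month_year(end_str) if sep and end_str != "Present" else None
    return start, end


print(split_duration("Jan 2020 - Present · 4 yrs 2 mos"))
print(split_duration("May 2018 - Jun 2024 · 6 yrs 2 mos"))
```

# The "%b" directive depends on the process locale, so month abbreviations are assumed
# to be English.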
def set_logger_level(verbose: int = 0):
"""
Adjusts the logger's level. This function allows the logging level to be changed at runtime.
Parameters:
- verbose: int {0, 1, 2} (0 = WARNING, 1 = INFO, 2 = DEBUG; default 0)
"""
if verbose is None:
return
level_name = {2: "DEBUG", 1: "INFO", 0: "WARNING"}.get(verbose, "INFO")
level = getattr(logging, level_name.upper(), None)
if level is not None:
logger.setLevel(level)
else:
raise ValueError(f"Invalid log level: {level_name}")
def parse_dates(date_str):
regex = r"(\b\w+ \d{4}|\b\d{4}|\bPresent)"
matches = re.findall(regex, date_str)
start_date, end_date = None, None
if matches:
if "Present" in matches:
if len(matches) == 1:
start_date = None
end_date = None
else:
start_date = parse(matches[0]).date()
end_date = None
else:
if len(matches) == 2:
start_date = parse(matches[0]).date()
end_date = parse(matches[1]).date()
elif len(matches) == 1:
start_date = parse(matches[0]).date()
return start_date, end_date
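# The token regex in parse_dates decides how many dates are found. The sketch below
# reuses the same pattern on a few sample captions to show what it extracts; the
# captions themselves are illustrative assumptions.

```python
import re

DATE_TOKEN = r"(\b\w+ \d{4}|\b\d{4}|\bPresent)"  # same pattern as in parse_dates

for caption in ("May 2018 - Jun 2024", "2015 - 2019", "Jan 2020 - Present"):
    print(caption, "->", re.findall(DATE_TOKEN, caption))
```

# A "Present" token leaves end_date as None, and a lone "Present" yields no dates at all.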
def extract_emails_from_text(text: str) -> list[str] | None:
if not text:
return None
email_regex = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
return email_regex.findall(text)
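# A short check of the email pattern used by extract_emails_from_text. The bio text is
# made up for illustration.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

bio = "Hiring! Reach me at jane.doe@example.com or j@x.io"
print(EMAIL_RE.findall(bio))
print(EMAIL_RE.findall("no addresses in this bio"))
```

# findall returns an empty list when nothing matches, so extract_emails_from_text yields
# None only for empty input.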
def parse_company_data(json_data, search_term=None):
company_info = json_data["elements"][0]
company_name = company_info.get("name", "")
staff_count = company_info.get("staffCount", None)
company_type = company_info.get("type", "")
description = company_info.get("description", "")
industries_list = [
ind.get("localizedName", "")
for ind in company_info.get("companyIndustries", [])
]
headquarter = company_info.get("headquarter", {})
headquarter_full = f'{headquarter.get("line1", "")}, {headquarter.get("city", "")}, {headquarter.get("country", "")} {headquarter.get("postalCode", "")}'
logo_data = company_info.get("logo", {})
vector_image = logo_data.get("image", {}).get("com.linkedin.common.VectorImage", {})
root_url = vector_image.get("rootUrl", "")
artifacts = vector_image.get("artifacts", [])
logo_url = None
if artifacts:
first_artifact = artifacts[0]
file_path = first_artifact.get("fileIdentifyingUrlPathSegment", "")
logo_url = root_url + file_path
tracking_info = company_info.get("trackingInfo", {})
object_urn = tracking_info.get("objectUrn", "")
internal_id = None
if object_urn.startswith("urn:li:company:"):
internal_id = object_urn.split(":")[-1]
bg_photo = company_info.get("backgroundCoverPhoto", {})
vector_image = bg_photo.get("com.linkedin.common.VectorImage", {})
root_url = vector_image.get("rootUrl", "")
artifacts = vector_image.get("artifacts", [])
banner_url = None
if artifacts:
chosen_artifact = artifacts[0]
file_segment = chosen_artifact.get("fileIdentifyingUrlPathSegment", "")
banner_url = root_url + file_segment
company_df = pd.DataFrame(
{
"search_term": [search_term],
"linkedin_company_id": [internal_id],
"company_name": [company_name],
"staff_count": [staff_count],
"company_type": [company_type],
"industries": [industries_list],
"headquarters_address": [headquarter_full],
"description": [description],
"logo_url": [logo_url],
"banner_url": [banner_url],
}
)
return company_df
def clean_df(staff_df):
if "estimated_age" in staff_df.columns:
staff_df["estimated_age"] = staff_df["estimated_age"].astype("Int64")
if "followers" in staff_df.columns:
staff_df["followers"] = staff_df["followers"].astype("Int64")
if "connections" in staff_df.columns:
staff_df["connections"] = staff_df["connections"].astype("Int64")
if "mutuals" in staff_df.columns:
staff_df["mutuals"] = staff_df["mutuals"].astype("Int64")
return staff_df
def upload_to_clay(webhook_url: str, data: pd.DataFrame):
records = data.to_dict("records")
responses = []
for i, row in enumerate(records, start=1):
try:
response = requests.post(
webhook_url, headers={"Accept": "application/json"}, json=row
)
response.raise_for_status()
logger.info(f"Uploaded row to Clay: {i} / {len(records)}")
except requests.exceptions.RequestException as e:
logger.error(f"Failed to upload row to Clay: {str(e)}")
responses.append({"error": str(e), "data": row})
return responses
if __name__ == "__main__":
p = parse_dates("May 2018 - Jun 2024")
SYMBOL INDEX (114 symbols across 21 files)
FILE: examples/daily_auto_connect.py
function connect_with_staff (line 34) | def connect_with_staff():
FILE: examples/x_corp_staff.py
function save_results (line 90) | def save_results(users: pd.DataFrame):
function scrape_and_save (line 98) | def scrape_and_save(term=None, location=None):
FILE: staffspy/__init__.py
class LinkedInAccount (line 28) | class LinkedInAccount:
method __init__ (line 36) | def __init__(
method login (line 57) | def login(self):
method scrape_staff (line 68) | def scrape_staff(
method scrape_users (line 109) | def scrape_users(
method scrape_comments (line 152) | def scrape_comments(self, post_ids: list[str]) -> pd.DataFrame:
method scrape_companies (line 174) | def scrape_companies(
method scrape_connections (line 211) | def scrape_connections(
FILE: staffspy/linkedin/certifications.py
class CertificationFetcher (line 10) | class CertificationFetcher:
method __init__ (line 11) | def __init__(self, session):
method fetch_certifications (line 15) | def fetch_certifications(self, staff):
method parse_certifications (line 45) | def parse_certifications(self, sections):
FILE: staffspy/linkedin/comments.py
class CommentFetcher (line 11) | class CommentFetcher:
method __init__ (line 13) | def __init__(self, session):
method fetch_comments (line 19) | def fetch_comments(self, post_id: str):
method parse_comments (line 48) | def parse_comments(self, comments_json: dict):
FILE: staffspy/linkedin/contact_info.py
class ContactInfoFetcher (line 15) | class ContactInfoFetcher:
method __init__ (line 16) | def __init__(self, session):
method fetch_contact_info (line 20) | def fetch_contact_info(self, base_staff):
method parse_emp_contact_info (line 42) | def parse_emp_contact_info(self, emp: Staff, emp_dict: dict):
FILE: staffspy/linkedin/employee.py
class EmployeeFetcher (line 12) | class EmployeeFetcher:
method __init__ (line 13) | def __init__(self, session):
method fetch_employee (line 19) | def fetch_employee(self, base_staff, domain):
method parse_emp (line 44) | def parse_emp(self, emp: Staff, emp_dict: dict):
FILE: staffspy/linkedin/employee_bio.py
class EmployeeBioFetcher (line 9) | class EmployeeBioFetcher:
method __init__ (line 10) | def __init__(self, session):
method fetch_employee_bio (line 14) | def fetch_employee_bio(self, base_staff):
FILE: staffspy/linkedin/experiences.py
class ExperiencesFetcher (line 11) | class ExperiencesFetcher:
method __init__ (line 12) | def __init__(self, session):
method fetch_experiences (line 16) | def fetch_experiences(self, staff):
method parse_experiences (line 51) | def parse_experiences(self, elements):
method parse_multi_exp (line 121) | def parse_multi_exp(self, entity):
FILE: staffspy/linkedin/languages.py
class LanguagesFetcher (line 10) | class LanguagesFetcher:
method __init__ (line 11) | def __init__(self, session):
method fetch_languages (line 15) | def fetch_languages(self, staff: Staff):
method parse_languages (line 35) | def parse_languages(self, language_json: dict) -> list[str]:
FILE: staffspy/linkedin/linkedin.py
class LinkedInScraper (line 29) | class LinkedInScraper:
method __init__ (line 41) | def __init__(self, session: requests.Session):
method search_companies (line 65) | def search_companies(self, company_name: str):
method fetch_or_search_company (line 111) | def fetch_or_search_company(self, company_name):
method _get_company_id_and_staff_count (line 138) | def _get_company_id_and_staff_count(self, company_name: str):
method parse_staff (line 163) | def parse_staff(self, elements: list[dict]):
method fetch_staff (line 207) | def fetch_staff(self, offset: int):
method fetch_connections_page (line 249) | def fetch_connections_page(self, offset: int):
method scrape_connections (line 280) | def scrape_connections(
method fetch_location_id (line 321) | def fetch_location_id(self):
method scrape_staff (line 357) | def scrape_staff(
method fetch_all_info_for_employee (line 434) | def fetch_all_info_for_employee(self, employee: Staff, index: int):
method fetch_user_profile_data_from_public_id (line 462) | def fetch_user_profile_data_from_public_id(self, user_id: str, key: str):
method block_user (line 501) | def block_user(self, employee: Staff) -> None:
method connect_user (line 530) | def connect_user(self, employee: Staff) -> None:
FILE: staffspy/linkedin/schools.py
class SchoolsFetcher (line 11) | class SchoolsFetcher:
method __init__ (line 13) | def __init__(self, session):
method fetch_schools (line 17) | def fetch_schools(self, staff):
method parse_schools (line 44) | def parse_schools(self, elements):
FILE: staffspy/linkedin/skills.py
class SkillsFetcher (line 10) | class SkillsFetcher:
method __init__ (line 11) | def __init__(self, session):
method fetch_skills (line 15) | def fetch_skills(self, staff: Staff):
method parse_skills (line 40) | def parse_skills(self, sections):
FILE: staffspy/solvers/capsolver.py
function is_none (line 10) | def is_none(value):
class CapSolver (line 14) | class CapSolver(Solver):
method solve (line 18) | def solve(self, blob_data: str, page_url: str = None):
FILE: staffspy/solvers/solver.py
class Solver (line 4) | class Solver(ABC):
method __init__ (line 8) | def __init__(self, solver_api_key:str):
method solve (line 12) | def solve(self, blob_data: str, page_ur: str=None):
FILE: staffspy/solvers/solver_type.py
class SolverType (line 3) | class SolverType(Enum):
FILE: staffspy/solvers/two_captcha.py
class TwoCaptchaSolver (line 7) | class TwoCaptchaSolver(Solver):
method solve (line 18) | def solve(self, blob_data: str, page_url: str = None):
FILE: staffspy/utils/driver_type.py
class BrowserType (line 5) | class BrowserType(Enum):
class DriverType (line 10) | class DriverType:
method __init__ (line 11) | def __init__(
FILE: staffspy/utils/exceptions.py
class TooManyRequests (line 1) | class TooManyRequests(Exception):
class BadCookies (line 5) | class BadCookies(Exception):
class GeoUrnNotFound (line 9) | class GeoUrnNotFound(Exception):
class BlobException (line 13) | class BlobException(Exception):
FILE: staffspy/utils/models.py
class Comment (line 9) | class Comment(BaseModel):
method to_dict (line 19) | def to_dict(self):
class School (line 32) | class School(BaseModel):
method to_dict (line 38) | def to_dict(self):
class Skill (line 47) | class Skill(BaseModel):
method to_dict (line 52) | def to_dict(self):
class ContactInfo (line 60) | class ContactInfo(BaseModel):
method to_dict (line 68) | def to_dict(self):
class Certification (line 79) | class Certification(BaseModel):
method to_dict (line 86) | def to_dict(self):
class Experience (line 96) | class Experience(BaseModel):
method to_dict (line 105) | def to_dict(self):
class Staff (line 117) | class Staff(BaseModel):
method get_top_skills (line 153) | def get_top_skills(self):
method to_dict (line 163) | def to_dict(self):
method estimate_age_based_on_education (line 269) | def estimate_age_based_on_education(self):
FILE: staffspy/utils/utils.py
function set_csrf_token (line 32) | def set_csrf_token(session):
function extract_base_domain (line 38) | def extract_base_domain(url: str):
function create_emails (line 44) | def create_emails(first, last, domain):
function get_webdriver (line 57) | def get_webdriver(driver_type: Optional[DriverType] = None):
class Login (line 89) | class Login:
method __init__ (line 91) | def __init__(
method solve_captcha (line 107) | def solve_captcha(self, session, data, payload):
method login_requests (line 192) | def login_requests(self):
method login_browser (line 232) | def login_browser(self):
method save_session (line 253) | def save_session(self, session, session_file: str):
method load_session (line 258) | def load_session(self):
method check_logged_in (line 292) | def check_logged_in(self, session):
function parse_date (line 308) | def parse_date(date_str):
function parse_duration (line 318) | def parse_duration(duration):
function set_logger_level (line 332) | def set_logger_level(verbose: int = 0):
function parse_dates (line 349) | def parse_dates(date_str):
function extract_emails_from_text (line 372) | def extract_emails_from_text(text: str) -> list[str] | None:
function parse_company_data (line 379) | def parse_company_data(json_data, search_term=None):
function clean_df (line 439) | def clean_df(staff_df):
function upload_to_clay (line 451) | def upload_to_clay(webhook_url: str, data: pd.DataFrame):
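The index entries above follow a regular shape, so they can be turned into a lookup table. The following is a minimal sketch, assuming each entry is either `FILE: <path>` or `<kind> <name> (line <n>) | <signature>`, as the lines above suggest. The names `parse_symbol_index`, `FILE_RE`, and `SYMBOL_RE` are hypothetical helpers written for this illustration and are not part of StaffSpy.

```python
import re
from collections import defaultdict

# Assumed line shapes, inferred from the index above (not a documented format):
#   FILE: <path>
#   <kind> <name> (line <n>) | <signature>
FILE_RE = re.compile(r"^\s*FILE:\s+(?P<path>\S+)\s*$")
SYMBOL_RE = re.compile(
    r"^\s*(?P<kind>class|method|function)\s+(?P<name>\w+)\s+"
    r"\(line\s+(?P<line>\d+)\)\s+\|\s+(?P<signature>.+)$"
)


def parse_symbol_index(text):
    """Group index entries by file path; lines matching neither pattern are skipped."""
    symbols = defaultdict(list)
    current_file = None
    for raw in text.splitlines():
        file_match = FILE_RE.match(raw)
        if file_match:
            current_file = file_match.group("path")
            continue
        symbol_match = SYMBOL_RE.match(raw)
        if symbol_match and current_file is not None:
            symbols[current_file].append(
                (
                    symbol_match.group("kind"),
                    symbol_match.group("name"),
                    int(symbol_match.group("line")),
                    symbol_match.group("signature").strip(),
                )
            )
    return dict(symbols)


# A few entries copied from the index above.
sample = """FILE: staffspy/solvers/solver_type.py
class SolverType (line 3) | class SolverType(Enum):
FILE: staffspy/utils/exceptions.py
class TooManyRequests (line 1) | class TooManyRequests(Exception):
class BadCookies (line 5) | class BadCookies(Exception):
"""

index = parse_symbol_index(sample)
for path, entries in index.items():
    print(path, [name for _, name, _, _ in entries])
```

Signatures that the index truncates at an opening parenthesis (for example `def scrape_staff(`) are kept as-is by this sketch rather than completed.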
Condensed preview: 28 files, each showing path, character count, and a content snippet (full structured content: 117K chars).
[
{
"path": ".github/workflows/publish-to-pypi.yml",
"chars": 726,
"preview": "name: Publish Python 🐍 distributions 📦 to PyPI\non: push\n\njobs:\n build-n-publish:\n name: Build and publish Python 🐍 d"
},
{
"path": ".gitignore",
"chars": 108,
"preview": "/venv/\n/.idea\n**/__pycache__/\n**/.pytest_cache/\n/.ipynb_checkpoints/\n**/output/\n**/.DS_Store\n*.pyc\n.env\ndist"
},
{
"path": ".pre-commit-config.yaml",
"chars": 147,
"preview": "repos:\n- repo: https://github.com/psf/black\n rev: 24.2.0\n hooks:\n - id: black\n language_version: python\n args: "
},
{
"path": "LICENSE",
"chars": 483,
"preview": " DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE\n Version 2, December 2004\n\n Copyright (C) 200"
},
{
"path": "README.md",
"chars": 10681,
"preview": "<img width=\"640\" alt=\"3FAD4652-488F-4F6F-A744-4C2AA5855E92\" src=\"https://github.com/user-attachments/assets/73b701ff-2db"
},
{
"path": "examples/daily_auto_connect.py",
"chars": 1396,
"preview": "\"\"\" Script to connect with 10 software engineers daily from random tech companies \"\"\"\n\nfrom staffspy import LinkedInAcco"
},
{
"path": "examples/upload_staff_to_clay.py",
"chars": 572,
"preview": "\"\"\"\nUploads staff to the Clay platform to then further enrich the staff (e.g. waterfall strategy to find their verified "
},
{
"path": "examples/x_corp_staff.py",
"chars": 3281,
"preview": "\"\"\"\nCASE STUDY: X CORP EMPLOYEES\nRESULT: We retrieved 1087 profiles. Not as good as expected but still a good result for"
},
{
"path": "pyproject.toml",
"chars": 688,
"preview": "[tool.poetry]\nname = \"staffspy\"\nversion = \"0.2.25\"\ndescription = \"Staff scraper library for LinkedIn\"\nauthors = [\"Cullen"
},
{
"path": "staffspy/__init__.py",
"chars": 8410,
"preview": "import json\nimport pandas as pd\n\nfrom staffspy.linkedin.comments import CommentFetcher\nfrom staffspy.linkedin.linkedin i"
},
{
"path": "staffspy/linkedin/certifications.py",
"chars": 2824,
"preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Cert"
},
{
"path": "staffspy/linkedin/comments.py",
"chars": 3126,
"preview": "import json\nimport re\nfrom datetime import datetime as dt\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom st"
},
{
"path": "staffspy/linkedin/contact_info.py",
"chars": 3041,
"preview": "from calendar import month_name\nfrom datetime import datetime\nimport json\nimport requests\nimport logging\n\nimport pytz\n\nf"
},
{
"path": "staffspy/linkedin/employee.py",
"chars": 5314,
"preview": "import json\nimport logging\nimport re\n\nimport staffspy.utils.utils as utils\nfrom staffspy.utils.exceptions import TooMany"
},
{
"path": "staffspy/linkedin/employee_bio.py",
"chars": 1302,
"preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\n\nlogger = logging.getLogger(__name__)\n"
},
{
"path": "staffspy/linkedin/experiences.py",
"chars": 5705,
"preview": "import json\nimport logging\n\nimport staffspy.utils.utils as utils\nfrom staffspy.utils.exceptions import TooManyRequests\nf"
},
{
"path": "staffspy/linkedin/languages.py",
"chars": 1717,
"preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Skil"
},
{
"path": "staffspy/linkedin/linkedin.py",
"chars": 23432,
"preview": "\"\"\"\nstaffspy.linkedin.linkedin\n~~~~~~~~~~~~~~~~~~~\n\nThis module contains routines to scrape LinkedIn.\n\"\"\"\n\nimport json\ni"
},
{
"path": "staffspy/linkedin/schools.py",
"chars": 2220,
"preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Scho"
},
{
"path": "staffspy/linkedin/skills.py",
"chars": 2884,
"preview": "import json\nimport logging\n\nfrom staffspy.utils.exceptions import TooManyRequests\nfrom staffspy.utils.models import Skil"
},
{
"path": "staffspy/solvers/capsolver.py",
"chars": 2020,
"preview": "import json\nimport time\n\nimport requests\nfrom tenacity import retry, stop_after_attempt, retry_if_result\n\nfrom staffspy."
},
{
"path": "staffspy/solvers/solver.py",
"chars": 337,
"preview": "from abc import ABC,abstractmethod\n\n\nclass Solver(ABC):\n public_key = \"3117BF26-4762-4F5A-8ED9-A85E69209A46\"\n page"
},
{
"path": "staffspy/solvers/solver_type.py",
"chars": 106,
"preview": "from enum import Enum\n\nclass SolverType(Enum):\n CAPSOLVER = 'capsolver'\n TWO_CAPTCHA = 'twocaptcha'\n"
},
{
"path": "staffspy/solvers/two_captcha.py",
"chars": 1081,
"preview": "from tenacity import retry_if_exception_type, stop_after_attempt, retry\nfrom twocaptcha import TwoCaptcha, TimeoutExcept"
},
{
"path": "staffspy/utils/driver_type.py",
"chars": 335,
"preview": "from enum import Enum\nfrom typing import Optional\n\n\nclass BrowserType(Enum):\n CHROME = \"chrome\"\n FIREFOX = \"firefo"
},
{
"path": "staffspy/utils/exceptions.py",
"chars": 307,
"preview": "class TooManyRequests(Exception):\n \"\"\"Too many requests.\"\"\"\n\n\nclass BadCookies(Exception):\n \"\"\"Login expiration.\"\""
},
{
"path": "staffspy/utils/models.py",
"chars": 9856,
"preview": "from datetime import datetime, date\n\nfrom pydantic import BaseModel\nfrom datetime import datetime as dt\n\nfrom staffspy.u"
},
{
"path": "staffspy/utils/utils.py",
"chars": 17440,
"preview": "import logging\nimport os\nimport pickle\nimport re\nfrom datetime import datetime\n\nimport pandas as pd\nfrom typing import O"
}
]
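The preview array above is plain JSON with `path`, `chars`, and `preview` keys, so it can be summarised with the standard library. The following is a minimal sketch, assuming the array has been saved to a string; `summarize_preview` is a hypothetical helper, and the `preview` values in the sample are shortened placeholders rather than the real snippets.

```python
import json

# Three entries copied from the array above; the preview text is a placeholder.
preview_json = """
[
  {"path": "staffspy/linkedin/linkedin.py", "chars": 23432, "preview": "(shortened)"},
  {"path": "staffspy/utils/utils.py", "chars": 17440, "preview": "(shortened)"},
  {"path": "staffspy/solvers/solver_type.py", "chars": 106, "preview": "(shortened)"}
]
"""


def summarize_preview(entries):
    """Return the total character count and the path of the largest file."""
    total_chars = sum(entry["chars"] for entry in entries)
    largest = max(entries, key=lambda entry: entry["chars"])
    return total_chars, largest["path"]


entries = json.loads(preview_json)
total_chars, largest_path = summarize_preview(entries)
print(f"{len(entries)} files, {total_chars} chars, largest: {largest_path}")
```

Run against the full 28-entry array, the same function gives a quick view of where most of the code lives without opening each file.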
About this extraction
This page contains the full source code of the cullenwatson/StaffSpy GitHub repository, extracted and formatted as plain text. The extraction includes 28 files (107.0 KB), approximately 25.1k tokens, and a symbol index with 114 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract.