Showing preview only (1,925K chars total). Download the full file or copy to clipboard to get everything.
Repository: Olshansk/rss-feeds
Branch: main
Commit: 3c0aa368aa54
Files: 96
Total size: 1.8 MB
Directory structure:
gitextract_o0vedygl/
├── .agents/
│ └── skills/
│ ├── cmd-rss-feed-generator/
│ │ └── SKILL.md
│ └── rss-feed-review/
│ └── SKILL.md
├── .editorconfig
├── .github/
│ ├── CODEOWNERS
│ ├── FUNDING.yml
│ ├── ISSUE_TEMPLATE/
│ │ └── request_rss_feed.md
│ ├── PULL_REQUEST_TEMPLATE/
│ │ └── add_new_feed.md
│ ├── dependabot.yml
│ ├── pull_request_template.md
│ └── workflows/
│ ├── cleanup_deprecated_feeds.yml
│ ├── label_new_feed.yml
│ ├── lint.yml
│ ├── run_feeds.yml
│ ├── run_selenium_feeds.yml
│ ├── test_feed.yml
│ └── validate_feeds.yml
├── .gitignore
├── .markdownlint.json
├── .pre-commit-config.yaml
├── AGENTS.md
├── CLAUDE.md
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── README.md
├── cache/
│ └── .gitkeep
├── feed_generators/
│ ├── ai_first_podcast.py
│ ├── anthropic_eng_blog.py
│ ├── anthropic_news_blog.py
│ ├── anthropic_red_blog.py
│ ├── anthropic_research_blog.py
│ ├── blogsurgeai_feed_generator.py
│ ├── chanderramesh_blog.py
│ ├── claude_blog.py
│ ├── cleanup_deprecated_feeds.py
│ ├── cohere_blog.py
│ ├── cursor_blog.py
│ ├── dagster_blog.py
│ ├── deeplearningai_the_batch.py
│ ├── deprecate_feed.py
│ ├── google_ai_blog.py
│ ├── groq_blog.py
│ ├── meta_ai_blog.py
│ ├── mistral_blog.py
│ ├── models.py
│ ├── ollama_blog.py
│ ├── paulgraham_blog.py
│ ├── perplexity_hub.py
│ ├── pinecone_blog.py
│ ├── run_all_feeds.py
│ ├── thinkingmachines_blog.py
│ ├── utils.py
│ ├── validate_feeds.py
│ ├── weaviate_blog.py
│ ├── windsurf_blog.py
│ ├── windsurf_changelog.py
│ ├── windsurf_next_changelog.py
│ └── xainews_blog.py
├── feeds/
│ ├── .gitkeep
│ ├── feed_ai_first_podcast.xml
│ ├── feed_anthropic_changelog_claude_code.xml
│ ├── feed_anthropic_engineering.xml
│ ├── feed_anthropic_news.xml
│ ├── feed_anthropic_red.xml
│ ├── feed_anthropic_research.xml
│ ├── feed_blogsurgeai.xml
│ ├── feed_chanderramesh.xml
│ ├── feed_claude.xml
│ ├── feed_cohere.xml
│ ├── feed_cursor.xml
│ ├── feed_dagster.xml
│ ├── feed_google_ai.xml
│ ├── feed_groq.xml
│ ├── feed_hamel.xml
│ ├── feed_meta_ai.xml
│ ├── feed_mistral.xml
│ ├── feed_ollama.xml
│ ├── feed_openai_research.xml
│ ├── feed_paulgraham.xml
│ ├── feed_perplexity_hub.xml
│ ├── feed_pinecone.xml
│ ├── feed_the_batch.xml
│ ├── feed_thinkingmachines.xml
│ ├── feed_weaviate.xml
│ ├── feed_windsurf_blog.xml
│ ├── feed_windsurf_changelog.xml
│ ├── feed_windsurf_next_changelog.xml
│ └── feed_xainews.xml
├── feeds.yaml
├── makefiles/
│ ├── ci.mk
│ ├── colors.mk
│ ├── common.mk
│ ├── dev.mk
│ ├── env.mk
│ └── feeds.mk
└── pyproject.toml
================================================
FILE CONTENTS
================================================
================================================
FILE: .agents/skills/cmd-rss-feed-generator/SKILL.md
================================================
---
name: cmd-rss-feed-generator
description: Generate Python RSS feed scrapers from blog websites, integrated with hourly GitHub Actions
disable-model-invocation: false
context: fork
agent: general-purpose
---
# RSS Feed Generator Command
You are the **RSS Feed Generator Agent**, specialized in creating Python scripts that convert blog websites without RSS feeds into properly formatted RSS/XML feeds.
The script will automatically be included in the hourly GitHub Actions workflow once merged. Always reference existing generators in `feed_generators/` as your primary guide.
## Table of Contents <!-- omit in toc -->
- [Project Context](#project-context)
- [Workflow](#workflow)
- [Step 0: Classify the URL](#step-0-classify-the-url)
- [Step 1: Review Existing Feed Generators](#step-1-review-existing-feed-generators)
- [Step 2: Analyze the Blog Source](#step-2-analyze-the-blog-source)
- [Step 3: Create the Feed Generator Script](#step-3-create-the-feed-generator-script)
- [Step 4: Update feeds.yaml](#step-4-update-feedsyaml)
- [Step 5: Add Makefile Target](#step-5-add-makefile-target)
- [Step 6: Update README](#step-6-update-readme)
- [Step 7: Test and Verify](#step-7-test-and-verify)
- [Reference Examples by Type](#reference-examples-by-type)
- [Common Patterns](#common-patterns)
- [Troubleshooting](#troubleshooting)
## Project Context
This project generates RSS feeds for blogs that don't provide them natively. The system uses:
- Python scripts in `feed_generators/` to scrape and convert blog content
- `feeds.yaml` as the single source of truth for the feed registry
- GitHub Actions for automated hourly updates
- Makefile targets for easy testing and execution
## Workflow
### Step 0: Classify the URL
**Before doing anything else**, determine which of the four cases applies. Each has a different exit path.
---
#### Case A: GitHub repo URL (`https://github.com/{owner}/{repo}`)
GitHub provides native Atom feeds — no scraper needed. Ask the user which to track:
> "This is a GitHub repo. GitHub provides native Atom feeds — no scraper needed. Which would you like to track?
>
> 1. **Releases** — `https://github.com/{owner}/{repo}/releases.atom`
> 2. **Tags** — `https://github.com/{owner}/{repo}/tags.atom`
> 3. **Commits (specific branch)** — `https://github.com/{owner}/{repo}/commits/{branch}.atom` _(ask which branch)_
> 4. **Commits (main)** — `https://github.com/{owner}/{repo}/commits/main.atom`"
Once the user picks:
- Construct the final Atom URL.
- **Go directly to [Step 6: Update README](#step-6-update-readme)** using `[Official RSS]` format.
- Do **not** create a script, add to `feeds.yaml`, or add a Makefile target.
---
#### Case B: Site has a native RSS/Atom feed
Fetch the page and check for a native feed **before writing any code**:
1. Look for `<link rel="alternate" type="application/rss+xml">` or `type="application/atom+xml"` in `<head>`.
2. Try common feed paths: `/feed`, `/rss.xml`, `/atom.xml`, `/feed.xml`, `/rss`, `/blog/feed`.
3. If a working feed URL is found:
- **Go directly to [Step 6: Update README](#step-6-update-readme)** using `[Official RSS]` format.
- Do **not** create a script, add to `feeds.yaml`, or add a Makefile target.
---
#### Case C: Static site (HTML served without JavaScript rendering)
Signals that `requests` + BeautifulSoup will work:
- Page HTML contains article content when fetched with `curl` or `requests`
- No heavy JS framework signals in the HTML (no `<div id="__next">`, no `<div id="app">` with empty body)
- Articles are visible in `view-source:`
**Reference generator:** `feed_generators/ollama_blog.py` (simplest), `feed_generators/blogsurgeai_feed_generator.py` (more complete), `feed_generators/paulgraham_blog.py`
Use `type: requests` in `feeds.yaml`. Proceed to Step 1.
---
#### Case D: Dynamic site (JavaScript-rendered content)
Signals that Selenium is required:
- `curl`/`requests` returns a near-empty body or a loading spinner
- HTML contains `<div id="__next">`, `<div id="root">`, or similar SPA shell
- Content only appears after JS execution
**Reference generators:** `feed_generators/xainews_blog.py` (Selenium + cache), `feed_generators/anthropic_news_blog.py` (Selenium + cache + incremental), `feed_generators/mistral_blog.py`
Use `type: selenium` in `feeds.yaml`. Proceed to Step 1.
---
### Step 1: Review Existing Feed Generators
**Always read the reference generator(s) for your case before writing any code:**
```bash
# For static sites
cat feed_generators/ollama_blog.py
cat feed_generators/blogsurgeai_feed_generator.py
# For dynamic/Selenium sites
cat feed_generators/xainews_blog.py
cat feed_generators/anthropic_news_blog.py
```
Study these to understand:
- Import structure and shared `utils` helpers
- `FEED_NAME` and `BLOG_URL` constants
- Date parsing patterns and fallback chains
- Article extraction logic and CSS selectors
- Cache + incremental update pattern (Selenium generators)
- Error handling approaches
### Step 2: Analyze the Blog Source
1. **Fetch the page** (use `fetch_page` from utils for static; Selenium for dynamic).
2. **Examine the HTML structure** to identify:
- Article container CSS selectors
- Title elements (h2, h3, h4, or custom)
- Date formats and locations
- Links to full articles
- Description/summary text
3. **Handle access issues**:
- If the site blocks automated requests (403/429), work with a local HTML file first
- The user can provide HTML via browser's "Save Page As"
- Support both local file and web fetching modes in the final script
### Step 3: Create the Feed Generator Script
Create `feed_generators/<name>_blog.py` following the reference for your case.
**Naming conventions:**
- Script: `feed_generators/{site_name}_blog.py` (e.g. `acme_blog.py`)
- Feed output: `feeds/feed_{site_name}.xml` (e.g. `feed_acme.xml`)
- `FEED_NAME` constant: `"{site_name}"` (e.g. `"acme"`)
**Required for all generators:**
- `FEED_NAME` and `BLOG_URL` constants at module level
- `setup_logging()` from utils
- Robust date parsing with multiple format fallback (see `xainews_blog.py`)
- Article deduplication (track seen links with a set)
- Per-article error handling: log warning and continue, never crash the full run
- Articles sorted newest-first before feed generation
**Additional requirements for Selenium generators:**
- Use `setup_selenium_driver()` from utils
- Use `load_cache()` / `save_cache()` / `merge_entries()` from utils for incremental updates
- Support `--full` flag via `argparse` for full-reset runs (see `anthropic_news_blog.py`)
- Use `sort_posts_for_feed()` from utils
See [Reference Examples by Type](#reference-examples-by-type) for full structural details.
### Step 4: Update feeds.yaml
Add an entry to `feeds.yaml` in alphabetical order by key:
**For static (requests) sites:**
```yaml
site_name:
script: site_name_blog.py
type: requests
blog_url: https://example.com/blog
```
**For dynamic (Selenium) sites:**
```yaml
site_name:
script: site_name_blog.py
type: selenium
blog_url: https://example.com/blog
```
### Step 5: Add Makefile Target
Add targets to `makefiles/feeds.mk` in alphabetical order.
**For static (requests) sites:**
```makefile
.PHONY: feeds_site_name
feeds_site_name: ## Generate RSS feed for Site Name
$(call check_venv)
$(call print_info,Generating Site Name feed)
$(Q)uv run feed_generators/site_name_blog.py
$(call print_success,Site Name feed generated)
```
**For dynamic (Selenium) sites — always include both incremental and full-reset targets:**
```makefile
.PHONY: feeds_site_name
feeds_site_name: ## Generate RSS feed for Site Name (incremental)
$(call check_venv)
$(call print_info,Generating Site Name feed)
$(Q)uv run feed_generators/site_name_blog.py
$(call print_success,Site Name feed generated)
.PHONY: feeds_site_name_full
feeds_site_name_full: ## Generate RSS feed for Site Name (full reset)
$(call check_venv)
$(call print_info,Generating Site Name feed - FULL RESET)
$(Q)uv run feed_generators/site_name_blog.py --full
$(call print_success,Site Name feed generated - full reset)
```
### Step 6: Update README
Add a row to the table in `README.md` in **alphabetical order** by blog name.
**For scraped feeds** (Cases C and D):
```markdown
| [Site Name](https://example.com/blog) | [feed_site_name.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_site_name.xml) |
```
**For native/official feeds** (Cases A and B):
```markdown
| [Site Name](https://example.com) | [Official RSS](https://example.com/feed.xml) |
```
The raw GitHub URL format must be exactly:
`https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_{name}.xml`
### Step 7: Test and Verify
**Run the generator:**
```bash
# Static sites
uv run feed_generators/site_name_blog.py
# Dynamic sites (incremental)
uv run feed_generators/site_name_blog.py
# Dynamic sites (full reset)
uv run feed_generators/site_name_blog.py --full
```
**Verify output:**
```bash
ls -la feeds/feed_site_name.xml
head -50 feeds/feed_site_name.xml
```
**Validate the feed:**
```bash
uv run feed_generators/validate_feeds.py
```
**Run via Makefile:**
```bash
make feeds_site_name
```
**Integration checklist before declaring done:**
- [ ] Script follows naming pattern: `feed_generators/{name}_blog.py`
- [ ] Output file follows pattern: `feeds/feed_{name}.xml`
- [ ] Entry added to `feeds.yaml` with correct `type`
- [ ] Makefile target(s) added to `makefiles/feeds.mk` (Selenium: both incremental + `_full`)
- [ ] README row added in alphabetical order with correct raw GitHub URL
- [ ] `validate_feeds.py` passes with no errors
- [ ] Articles are sorted newest-first
- [ ] Duplicate articles are filtered out
- [ ] Individual article failures are caught and logged (don't crash the run)
## Reference Examples by Type
### Type 1: Static (requests + BeautifulSoup)
**Simplest:** `feed_generators/ollama_blog.py`
- Minimal imports, straightforward `fetch_page` + BeautifulSoup
- Good starting point when the HTML structure is clean
**More complete:** `feed_generators/blogsurgeai_feed_generator.py`
- `fetch_page` + BeautifulSoup + `dateutil.parser`
- Better date handling, good error patterns
**Complex static with local-file fallback:** `feed_generators/paulgraham_blog.py`
### Type 2: Dynamic (Selenium + cache)
**Selenium + cache, no local-file fallback:** `feed_generators/mistral_blog.py`
- Minimal Selenium setup
- Good for simple JS-rendered pages
**Selenium + cache + incremental + argparse:** `feed_generators/xainews_blog.py`
- Full incremental update pattern with `--full` reset flag
- Use this as the base template for most Selenium generators
**Selenium + cache + incremental + multiple entry points:** `feed_generators/anthropic_news_blog.py`
- Same as xainews but handles multiple sections from one site
- Reference when a single domain has multiple feeds (e.g. `/news`, `/research`, `/engineering`)
### Type 3: Multiple feeds from one site
**Reference:** `feed_generators/anthropic_eng_blog.py`, `feed_generators/anthropic_research_blog.py`
- Each section gets its own `FEED_NAME` and script
- Share the Selenium driver setup pattern
- Add separate `feeds.yaml` entries and Makefile targets per feed
## Common Patterns
### Official RSS Detection (Case B — run before writing any code)
```python
import requests
from bs4 import BeautifulSoup
def check_native_feed(url):
resp = requests.get(url, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
link = soup.find("link", rel="alternate", type=lambda t: t and "rss" in t or "atom" in t)
if link:
return link.get("href")
# Try common paths
for path in ["/feed", "/rss.xml", "/atom.xml", "/feed.xml", "/rss"]:
probe = requests.head(url.rstrip("/") + path, timeout=5)
if probe.status_code == 200:
return url.rstrip("/") + path
return None
```
### Incremental Updates (Selenium generators)
See `feed_generators/anthropic_news_blog.py` for the `get_existing_links_from_feed()` + `load_cache()` + `merge_entries()` pattern that avoids re-fetching already-seen articles.
### Robust Date Parsing
```python
DATE_FORMATS = [
"%B %d, %Y", # January 15, 2024
"%b %d, %Y", # Jan 15, 2024
"%Y-%m-%d", # 2024-01-15
"%d %B %Y", # 15 January 2024
"%B %Y", # January 2024
]
def parse_date(date_text):
for fmt in DATE_FORMATS:
with contextlib.suppress(ValueError):
return datetime.strptime(date_text.strip(), fmt).replace(tzinfo=pytz.UTC)
return stable_fallback_date() # from utils
```
### Local File Fallback (for blocked sites)
```python
import argparse, sys
def main():
parser = argparse.ArgumentParser()
parser.add_argument("html_file", nargs="?", help="Local HTML file (optional)")
args = parser.parse_args()
if args.html_file:
with open(args.html_file) as f:
html = f.read()
else:
html = fetch_page(BLOG_URL)
...
```
## Troubleshooting
### No articles found
- Verify CSS selectors match actual HTML structure
- Check if content is dynamically loaded → switch to Selenium (Case D)
- Add debug logging to show what selectors find
### Date parsing failures
- Add the specific format to `DATE_FORMATS` list
- Use `stable_fallback_date()` from utils as the final fallback
### Blocked requests (403/429 errors)
- Save page locally with browser "Save Page As"
- Use local file mode for development
- Try different `User-Agent` headers in `fetch_page`
- If consistently blocked, switch to Selenium (Case D)
================================================
FILE: .agents/skills/rss-feed-review/SKILL.md
================================================
---
name: cmd-rss-feed-review
description: Review RSS feed generators and their XML output for broken selectors, missing error handling, stale cache logic, feed link conventions, empty/malformed feeds, and duplicate entries. Use when asked to "review feed", "check feed quality", "audit feeds", or after creating/modifying a feed generator.
disable-model-invocation: true
---
# RSS Feed Review
Review RSS feed generators and their output XML for correctness, robustness, and adherence to project conventions.
## Instructions
1. **Determine scope** — review all feed generators by default, or a specific one if the user specifies.
2. **Read the target generator(s)** and their corresponding `feeds/feed_*.xml` output files.
3. **Read `feed_generators/utils.py`** to understand shared helpers.
4. **Evaluate against the checklists below.** For every finding, cite `file_path:line_number`.
5. **If everything looks good**, say so briefly.
## Generator Code Review
### Selectors & Parsing
- Are CSS selectors specific enough to survive minor site redesigns?
- Are selectors targeting semantic elements (article, h2) over generated class names?
- Is there fallback logic if a selector returns no results?
### Error Handling
- Does `fetch_*` use `timeout=` on requests?
- Are HTTP errors handled (`response.raise_for_status()` or status check)?
- Are Selenium waits using explicit waits (`WebDriverWait`) rather than `time.sleep()`?
- Is the Selenium driver properly closed in a `finally` block?
### Feed Link Setup
**Critical convention** (from AGENTS.md):
```python
from utils import setup_feed_links
setup_feed_links(fg, blog_url="https://...", feed_name="...")
```
- The main `<link>` must point to the original blog URL, NOT the feed URL
- `rel="self"` must be set **first**, `rel="alternate"` must be set **last**
- Generators should use the `setup_feed_links()` helper from `utils.py`
- Flag any generator that sets links manually instead of using the helper
### Cache Logic (Pagination & Selenium patterns only)
- Is cache loaded before fetching new articles?
- Are articles deduped by URL before saving?
- Is the cache sorted by date descending?
- Does the `--full` flag correctly bypass incremental logic?
### Pattern Compliance
- **Simple Static**: No cache needed, fetches all posts each run
- **Pagination + Caching**: URL-based pagination with JSON cache in `cache/`
- **Selenium + Click**: Uses `undetected-chromedriver`, clicks load-more buttons, caches results
- Is the generator using the right pattern for how the target site loads content?
## Feed XML Output Review
### Structure
- Does the feed have a `<title>`, `<link>`, and `<description>` in `<channel>`?
- Does every `<item>` have at least `<title>`, `<link>`, and `<pubDate>`?
- Is there an `<atom:link rel="self">` pointing to the feed URL?
- Does the main `<link>` point to the blog (not the feed file)?
### Content Quality
- Are there 0 items? (EMPTY — likely broken scraper)
- Is the newest item older than 60 days? (STALE — selectors may have broken)
- Are there duplicate `<link>` values across items?
- Are dates parseable as RFC 2822 (`pubDate` format)?
- Are titles non-empty and non-duplicated?
### Encoding
- Is the XML declaration present with `encoding="utf-8"`?
- Are special characters properly escaped in titles and descriptions?
## Output Format
For each finding:
```
[SEVERITY] file_path:line_number — description
```
Severities: `ERROR` (broken/will fail), `WARN` (fragile/convention violation), `INFO` (suggestion)
End with a summary table:
| Feed | Generator | XML Output | Issues |
|------|-----------|------------|--------|
| name | OK/WARN/ERROR | OK/WARN/ERROR | brief note |
================================================
FILE: .editorconfig
================================================
root = true
[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
[*.py]
indent_style = space
indent_size = 4
[*.{yml,yaml}]
indent_style = space
indent_size = 2
[Makefile]
indent_style = tab
[*.mk]
indent_style = tab
[*.md]
trim_trailing_whitespace = false
================================================
FILE: .github/CODEOWNERS
================================================
* @Olshansk @oborchers
================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms
github: olshansk
github: oborchers
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
lfx_crowdfunding: # Replace with a single LFX Crowdfunding project-name e.g., cloud-foundry
polar: # Replace with a single Polar username
buy_me_a_coffee: olshansky
thanks_dev: # Replace with a single thanks.dev username
custom: grove.city/olshansky
================================================
FILE: .github/ISSUE_TEMPLATE/request_rss_feed.md
================================================
---
name: Request a new RSS feed
about: Request an RSS feed for a new blog
title: "[RSS Feed Request] Blog Name"
labels: enhancement
assignees: ''
---
## Blog Information
**Blog Name:**
<!-- Please provide the name of the blog -->
**Blog URL:**
<!-- Please provide the URL of the blog -->
## Additional Information
**Description:**
<!-- Any additional information or description about the blog -->
**Note:**
Please ensure that you provide the link to the actual blog.
================================================
FILE: .github/PULL_REQUEST_TEMPLATE/add_new_feed.md
================================================
---
name: Add a new RSS feed
about: Contribute a new RSS feed to the repository
title: "[New RSS Feed] <Feed Name>"
labels: new-feed
assignees: ''
---
## Checklist
- [ ] Add the `new-feed` label
- [ ] Update the `Makefile` with a new target for generating the feed
- [ ] Ensure the title of the pull request is `[New RSS Feed] <Feed Name>`
- [ ] Do anything else you deem proper or idiomatic for a good developer experience
## Description
Please provide a brief description of the new RSS feed and any additional information that might be relevant.
================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
groups:
actions:
patterns: ["*"]
- package-ecosystem: "uv"
directory: "/"
schedule:
interval: "weekly"
groups:
minor-patch:
update-types: ["minor", "patch"]
================================================
FILE: .github/pull_request_template.md
================================================
## Summary
<!-- What does this PR do and why? -->
## Changes
<!-- Key changes, bullet points preferred -->
-
## Test plan
- [ ] Ran affected feed generators locally
- [ ] Validated feed XML output (`uv run feed_generators/validate_feeds.py`)
- [ ] No existing feeds broken
================================================
FILE: .github/workflows/cleanup_deprecated_feeds.yml
================================================
name: Cleanup Deprecated Feeds
# Stage 2 of the feed retirement lifecycle (Stage 1 is human-driven in
# deprecate_feed.py + scraper removal). This workflow deletes feed XMLs whose
# tombstone notice is older than 90 days, then pushes the deletion directly to
# main. Requires main to accept bot pushes from github-actions[bot]; if branch
# protection is changed to require PR review, convert this step to open a PR
# (peter-evans/create-pull-request) instead.
on:
schedule:
# First of every month, noon UTC.
- cron: "0 12 1 * *"
workflow_dispatch:
concurrency:
group: cleanup-deprecated-feeds
cancel-in-progress: false
permissions:
contents: write
jobs:
cleanup:
runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- name: Checkout repository
uses: actions/checkout@v6
- name: Install uv and Python
uses: astral-sh/setup-uv@v8.0.0
with:
enable-cache: true
python-version: "3.11"
- name: Delete expired deprecated feeds
run: |
set -e
uv sync
uv run feed_generators/cleanup_deprecated_feeds.py --apply
- name: Commit and push deletions
if: github.ref == 'refs/heads/main'
run: |
git config --global user.name 'github-actions[bot]'
git config --global user.email 'github-actions[bot]@users.noreply.github.com'
git add -A feeds/
if git diff --staged --quiet; then
echo "No deprecated feeds expired; nothing to commit"
else
git commit -m 'Auto-cleanup: remove expired deprecated feeds'
git push || { git pull --rebase && git push; }
fi
================================================
FILE: .github/workflows/label_new_feed.yml
================================================
name: Label New Feed PRs
on:
pull_request_target:
types: [opened, edited, reopened]
jobs:
add-label:
runs-on: ubuntu-latest
steps:
- name: Check PR title
id: check_title
run: |
if echo '${{ github.event.pull_request.title }}' | grep -q '\[New RSS Feed\]'; then
echo "has_tag=true" >> $GITHUB_OUTPUT
else
echo "has_tag=false" >> $GITHUB_OUTPUT
fi
- name: Add label
if: steps.check_title.outputs.has_tag == 'true'
uses: actions/github-script@v7
with:
script: |
await github.rest.issues.addLabels({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
labels: ['new-feed']
})
================================================
FILE: .github/workflows/lint.yml
================================================
name: Lint
on:
pull_request:
paths:
- "**.py"
- "pyproject.toml"
push:
branches: [main]
paths:
- "**.py"
- "pyproject.toml"
jobs:
lint:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Checkout repository
uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install uv
uses: astral-sh/setup-uv@v8.0.0
with:
enable-cache: true
- name: Lint and format check
run: |
uv sync --group dev
uv run ruff check .
uv run ruff format --check .
================================================
FILE: .github/workflows/run_feeds.yml
================================================
name: Run Feeds
on:
schedule:
- cron: "0 * * * *"
workflow_dispatch:
concurrency:
group: request-feeds
cancel-in-progress: true
jobs:
run-feeds:
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- name: Checkout repository
uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install uv
uses: astral-sh/setup-uv@v8.0.0
with:
enable-cache: true
- name: Install dependencies and run feeds
timeout-minutes: 20
run: |
set -e
uv sync
uv run feed_generators/run_all_feeds.py --skip-selenium
- name: Commit and push feeds
if: github.ref == 'refs/heads/main'
run: |
git config --global user.name 'github-actions[bot]'
git config --global user.email 'github-actions[bot]@users.noreply.github.com'
git add feeds/*.xml
git stash
git pull --rebase || { echo "Rebase failed, aborting and retrying with merge"; git rebase --abort 2>/dev/null; git pull; }
git stash pop || true
git add feeds/*.xml
if git diff --staged --quiet; then
echo "No changes to commit"
else
git commit -m 'Update RSS feeds'
git push || { git pull --rebase && git push; }
fi
================================================
FILE: .github/workflows/run_selenium_feeds.yml
================================================
name: Run Selenium Feeds
on:
schedule:
- cron: "30 * * * *"
workflow_dispatch:
concurrency:
group: selenium-feeds
cancel-in-progress: true
jobs:
run-selenium-feeds:
runs-on: ubuntu-latest
timeout-minutes: 60
steps:
- name: Checkout repository
uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install uv
uses: astral-sh/setup-uv@v8.0.0
with:
enable-cache: true
- name: Install Chrome for Selenium
uses: browser-actions/setup-chrome@v2
with:
chrome-version: stable
install-chromedriver: true
- name: Install dependencies and run Selenium feeds
timeout-minutes: 45
run: |
set -e
uv sync
uv run feed_generators/run_all_feeds.py --selenium-only
- name: Commit and push feeds
if: github.ref == 'refs/heads/main'
run: |
git config --global user.name 'github-actions[bot]'
git config --global user.email 'github-actions[bot]@users.noreply.github.com'
git add feeds/*.xml
git stash
git pull --rebase || { echo "Rebase failed, aborting and retrying with merge"; git rebase --abort 2>/dev/null; git pull; }
git stash pop || true
git add feeds/*.xml
if git diff --staged --quiet; then
echo "No changes to commit"
else
git commit -m 'Update RSS feeds (Selenium)'
git push || { git pull --rebase && git push; }
fi
================================================
FILE: .github/workflows/test_feed.yml
================================================
name: Test Feed Generation
on:
workflow_dispatch:
jobs:
test-feed:
runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- name: Checkout repository
uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install uv
uses: astral-sh/setup-uv@v8.0.0
with:
enable-cache: true
- name: Install dependencies and run test feed
run: |
set -e
uv sync
uv run feed_generators/ollama_blog.py
- name: Upload test feed artifact
uses: actions/upload-artifact@v7
with:
name: feed_test
path: feeds/feed_ollama.xml
================================================
FILE: .github/workflows/validate_feeds.yml
================================================
name: Validate Feeds
on:
workflow_run:
workflows: ["Run Feeds", "Run Selenium Feeds"]
types: [completed]
workflow_dispatch:
jobs:
validate:
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'workflow_dispatch' }}
runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- name: Checkout repository
uses: actions/checkout@v6
with:
ref: main
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install uv
uses: astral-sh/setup-uv@v8.0.0
with:
enable-cache: true
- name: Install dependencies and validate all feeds
run: |
uv sync
uv run feed_generators/validate_feeds.py
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
# PyPI configuration file
.pypirc
# Claude Sync
.claudesync
# Visual Studio
.vscode
# HTML
*.html
# Cache (keep .gitkeep but ignore JSON cache files)
cache/*.json
# Agent tool configs (ignore contents, whitelist skill symlinks)
.claude/*
!.claude/skills/
.claude/skills/*
!.claude/skills/cmd-rss-feed-generator
!.claude/skills/rss-feed-review
.codex/*
!.codex/skills/
.codex/skills/*
!.codex/skills/cmd-rss-feed-generator
!.codex/skills/rss-feed-review
.codex-home/*
!.codex-home/skills/
.codex-home/skills/*
!.codex-home/skills/cmd-rss-feed-generator
!.codex-home/skills/rss-feed-review
# Agent skills evals (never commit)
.agents/skills/*/evals/
PR_DESCRIPTION.md
================================================
FILE: .markdownlint.json
================================================
{
"MD033": {
"allowed_elements": [
"Tabs",
"TabItem",
"ReactPlayer",
"details",
"summary",
"div",
"br",
"img",
"a",
"h1",
"h2",
"h3",
"h4",
"h5",
"h6"
]
},
"MD013": false,
"MD046": false,
"MD036": false
}
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
- id: check-toml
- id: check-merge-conflict
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.10
hooks:
- id: ruff
args: [--fix]
- id: ruff-format
================================================
FILE: AGENTS.md
================================================
# AGENTS.md <!-- omit in toc -->
Instructions for Claude Code and contributors working on this repository.
## Table of Contents <!-- omit in toc -->
- [Project Overview](#project-overview)
- [Commands](#commands)
- [Architecture](#architecture)
- [Feed Generator Patterns](#feed-generator-patterns)
- [When to Use Each Pattern](#when-to-use-each-pattern)
- [Feed Link Setup (Important)](#feed-link-setup-important)
- [Adding a New Feed](#adding-a-new-feed)
- [Step 1: Analyze the Target Blog](#step-1-analyze-the-target-blog)
- [Step 2: Download HTML Sample](#step-2-download-html-sample)
- [Step 3: Generate the Feed Script](#step-3-generate-the-feed-script)
- [Step 4: Test Locally](#step-4-test-locally)
- [Step 5: Register the Feed](#step-5-register-the-feed)
- [Step 6: PR Checklist](#step-6-pr-checklist)
- [Deprecating a Feed](#deprecating-a-feed)
- [Troubleshooting](#troubleshooting)
- [GitHub Actions](#github-actions)
## Project Overview
RSS Feed Generator creates RSS feeds for blogs that don't provide them natively. Feed generators scrape blog pages and output `feed_*.xml` files to the `feeds/` directory. A GitHub Action runs hourly to regenerate and commit updated feeds.
## Commands
```bash
# Environment setup
make env_setup # Install dependencies (uses uv sync)
make dev_setup # Install dev dependencies + pre-commit hooks
# Generate feeds
make feeds_generate_all # Run all feed generators
make feeds_<name> # Run specific feed (e.g., feeds_ollama, feeds_anthropic_news)
# Development
make dev_lint # Check code with ruff
make dev_lint_fix # Auto-fix and format with ruff
make dev_format # Alias for dev_lint_fix
make dev_test_feed # Run test feed generator
# Run single generator directly
uv run feed_generators/ollama_blog.py
# CI/CD
make ci_trigger_feeds_workflow # Trigger GitHub Action manually
make ci_run_feeds_workflow_local # Test workflow locally with act
```
## Architecture
```
feed_generators/ # Python scripts that scrape blogs and generate RSS
run_all_feeds.py # Orchestrator that runs all generators
utils.py # Shared utilities (setup_feed_links, get_project_root, etc.)
<source>_blog.py # Individual feed generators
feeds/ # Output directory for feed_*.xml files
cache/ # JSON cache for paginated/dynamic feeds
makefiles/ # Modular Makefile includes (feeds.mk, env.mk, dev.mk, ci.mk)
```
### Feed Generator Patterns
Three patterns exist based on how the target site loads content:
#### 1. Simple Static (Default) <!-- omit in toc -->
For blogs where all content loads on first request.
**Examples**: `ollama_blog.py`, `paulgraham_blog.py`, `hamel_blog.py`
**Key functions**:
- `fetch_blog_content(url)` - HTTP request with User-Agent header
- `parse_blog_html(html)` - BeautifulSoup parsing for posts
- `generate_rss_feed(posts)` - Create feed using `feedgen`
- `save_rss_feed(fg, name)` - Write to `feeds/feed_{name}.xml`
**Cache**: Not needed (all posts fetched each run)
#### 2. Pagination + Caching <!-- omit in toc -->
For blogs with "Load More" or pagination that uses URL query params (`?page=2`).
**Examples**: `cursor_blog.py`, `dagster_blog.py`
**Key functions**:
- `load_cache()` / `save_cache(posts)` - JSON persistence in `cache/<source>_posts.json`
- `merge_posts(new, cached)` - Dedupe by URL, merge, sort by date
- `fetch_all_pages()` - Follow pagination until no next link
**Cache behavior**:
- **First run / `--full` flag**: Fetch all pages, populate cache
- **Incremental (default)**: Fetch page 1 only, merge with cache
- **Dedupe**: By URL, sorted by date descending
#### 3. Selenium + Click "Load More" <!-- omit in toc -->
For JS-heavy sites where content loads dynamically via JavaScript button clicks.
**Examples**: `anthropic_news_blog.py` (reference implementation), `anthropic_research_blog.py`, `openai_research_blog.py`, `xainews_blog.py`
**Key functions**:
- `setup_selenium_driver()` - Headless Chrome with `undetected-chromedriver`
- `fetch_news_content(max_clicks)` - Load page, click buttons, return final HTML
- `load_cache()` / `save_cache(articles)` - JSON persistence in `cache/<source>_posts.json`
- `merge_articles(new, cached)` - Dedupe by link, merge, sort by date
**Selenium specifics**:
- Uses `undetected-chromedriver` to avoid bot detection
- Clicks "See more"/"Load more" button repeatedly
- Waits for content to load between clicks
- `max_clicks` parameter controls depth (20 for full, 2-3 for incremental)
**Cache behavior** (see `anthropic_news_blog.py` for reference):
- **First run / `--full` flag**: Click up to 20 times, fetch all articles, populate cache
- **Incremental (default)**: Click 2-3 times (recent articles), merge with cache
- **Dedupe**: By URL, sorted by date descending
### When to Use Each Pattern
| Site Behavior | Pattern | Example | Cache? |
|--------------|---------|---------|--------|
| All posts on single page | Simple Static | `ollama_blog.py` | No |
| URL-based pagination (`?page=2`) | Pagination + Caching | `dagster_blog.py` | Yes |
| JS button loads more content | Selenium + Click | `anthropic_news_blog.py` | Yes |
| JS-rendered page (curl returns empty shell) | Selenium + Wait | `xainews_blog.py` | Yes |
**Key libraries**: `requests`, `beautifulsoup4`, `feedgen`, `selenium`, `undetected-chromedriver`
### Feed Link Setup (Important)
The main `<link>` element must point to the original blog, not the feed URL. Use the helper:
```python
from utils import setup_feed_links
fg = FeedGenerator()
# ... set title, description, etc.
setup_feed_links(fg, blog_url="https://example.com/blog", feed_name="example")
```
**Why this matters**: In `feedgen`, link order determines which URL becomes the main `<link>`:
- `rel="self"` must be set **first** → becomes `<atom:link rel="self">`
- `rel="alternate"` must be set **last** → becomes the main `<link>`
Wrong order produces `<link>https://.../feed_example.xml</link>` instead of the blog URL.
## Adding a New Feed
### Step 1: Analyze the Target Blog
Before writing code, determine which pattern to use:
1. **Open the blog** in your browser
2. **Check for pagination**:
- URL changes to `?page=2` or `/page/2` → **Pattern 2 (Pagination)**
- No URL change but "Load More" button exists → **Pattern 3 (Selenium)**
- All posts visible on single page → **Pattern 1 (Simple Static)**
3. **Check for JavaScript loading**:
- Open DevTools → Network tab → Reload
- If posts appear after JS execution (XHR requests) → **Pattern 3 (Selenium)**
- If posts are in initial HTML → **Pattern 1 or 2**
### Step 2: Download HTML Sample
```bash
# For static sites (Pattern 1 or 2)
curl -o sample.html "https://example.com/blog"
# For JS-heavy sites (Pattern 3)
# Use browser: View Page Source won't work
# Instead: DevTools → Elements → Copy outer HTML after page loads
```
### Step 3: Generate the Feed Script
Use Claude Code with the generator prompt:
```bash
Use /cmd-rss-feed-generator to convert @sample.html to a RSS feed for https://example.com/blog
```
Claude will:
- Analyze the HTML structure
- Choose the appropriate pattern
- Generate `feed_generators/<source>_blog.py`
### Step 4: Test Locally
```bash
# Install dependencies
make env_setup
# Run the generator
uv run feed_generators/<source>_blog.py
# Verify output
cat feeds/feed_<source>.xml | head -50
# For paginated feeds, test full fetch
uv run feed_generators/<source>_blog.py --full
```
**Verify**:
- [ ] Feed XML is valid (no parsing errors)
- [ ] `<link>` points to blog URL, not feed URL
- [ ] Posts have titles, dates, and links
- [ ] Dates are in correct order (newest first)
### Step 5: Register the Feed
1. **Add to `feeds.yaml`** (the feed registry):
```yaml
<source>:
script: <source>_blog.py
type: requests # or "selenium" for JS-heavy sites
blog_url: https://example.com/blog
```
2. **Add Make target** in `makefiles/feeds.mk`:
```makefile
.PHONY: feeds_<source>
feeds_<source>: ## Generate RSS feed for <Source Name>
$(call check_venv)
$(call print_info,Generating <Source Name> feed)
$(Q)uv run feed_generators/<source>_blog.py
$(call print_success,<Source Name> feed generated)
```
3. **Update README.md table** (alphabetical order):
```markdown
| [Source Name](https://example.com/blog) | [feed_<source>.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_<source>.xml) |
```
### Step 6: PR Checklist
Before submitting your PR, verify:
- [ ] `make dev_format` passes (code formatting)
- [ ] `uv run feed_generators/<source>_blog.py` runs without errors
- [ ] `feeds/feed_<source>.xml` is generated and valid
- [ ] Feed registered in `feeds.yaml`
- [ ] Make target added to `makefiles/feeds.mk`
- [ ] README.md table updated
- [ ] For paginated/dynamic feeds: cache file created in `cache/` on first run
- [ ] Feed `<link>` points to original blog (not the XML feed URL)
## Deprecating a Feed
When a blog launches an official RSS feed (or we otherwise decide to retire a scraper), follow the two-stage retirement process. Stage 1 is manual and lands in a single PR. Stage 2 is automated.
### Stage 1: Inject the notice and tear down the code (manual, one PR)
1. **Inject a sunset notice into the feed XML**:
```bash
uv run feed_generators/deprecate_feed.py \
--feed=<name> \
--message="Site X now publishes an official RSS feed." \
--alternative="https://example.com/feed.xml"
```
This adds a single `<item>` at the top of `feeds/feed_<name>.xml` with a stable GUID (so repeated runs are idempotent). Subscribers see the notice in their reader the next time they poll the feed.
2. **Remove everything except the XML**, in the same PR:
- Delete `feed_generators/<name>_blog.py`.
- Remove the `<name>:` entry from `feeds.yaml`.
- Remove the `feeds_<name>` target (and any `_full` variant) from `makefiles/feeds.mk`.
- Remove the `<name>` row from the README table (or update it to point at the official feed only).
- `cache/<name>_posts.json` is gitignored; nothing to do there.
3. **Leave `feeds/feed_<name>.xml`** in place. It now carries the notice as its newest `<item>` plus the historical posts. Subscribers can read both.
### Stage 2: Automatic deletion (workflow, ~90 days later)
`.github/workflows/cleanup_deprecated_feeds.yml` runs monthly. It invokes `feed_generators/cleanup_deprecated_feeds.py --apply`, which scans `feeds/feed_*.xml` for the `deprecation-notice-<name>` GUID, parses the notice's `<pubDate>`, and deletes any XML whose notice is older than 90 days. The deletion is committed to `main` directly; git history preserves the file for recovery.
To preview what would be removed without touching anything:
```bash
uv run feed_generators/cleanup_deprecated_feeds.py
```
To force-test deletion locally (reversible with `git checkout`):
```bash
uv run feed_generators/cleanup_deprecated_feeds.py --apply --threshold-days=0
```
## Troubleshooting
**"No posts found" or empty feed**
- HTML structure may have changed; re-download sample and update selectors
- For Selenium: increase wait times or check if site blocks headless browsers
**Feed `<link>` shows XML URL instead of blog URL**
- Use `setup_feed_links()` helper from `utils.py`
- Ensure `rel="self"` is set before `rel="alternate"`
**Selenium bot detection**
- `undetected-chromedriver` should handle most cases
- Try increasing wait times between clicks
- Some sites may require additional headers or cookies
**Cache not updating**
- Delete `cache/<source>_posts.json` and run with `--full`
- Check `merge_posts()` deduplication logic
**Date parsing errors**
- Add the date format to the `date_formats` list
- Use `stable_fallback_date()` for entries without parseable dates
**Empty feed after Selenium run (0 items)**
- The site is JS-rendered but `curl` returns a minimal HTML shell — confirm with `curl -sL <url> | wc -c` (< 10KB = JS-rendered)
- Capture Selenium page source to a file and inspect actual selectors: element classes on JS-rendered pages often differ from View Source
- Always call `deserialize_entries()` on cached data before passing to `merge_entries()` — ISO strings don't sort correctly as datetimes
## GitHub Actions
- `run_feeds.yml` - Runs hourly, executes `run_all_feeds.py`, commits updated XML files
- `test_feed.yml` - Tests feed generation on PRs (runs `ollama_blog.py`)
================================================
FILE: CLAUDE.md
================================================
# CLAUDE.md
⚠️ This file is intentionally minimal.
**Authoritative project instructions live in `AGENTS.md`.**
You must:
1. Open and follow `AGENTS.md` before doing any work.
2. Treat `AGENTS.md` as the single source of truth for all operations.
3. Update `AGENTS.md` (not this file) when guidelines/architecture/standards change.
➡️ Read now: [AGENTS.md](./AGENTS.md)
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing
## Dev Setup
```bash
uv sync --group dev
pre-commit install
```
Run `make help` to see all available targets with descriptions.
## Running Feeds
**Run all request-based feeds:**
```bash
uv run feed_generators/run_all_feeds.py --skip-selenium
```
**Run a single feed by name** (from `feeds.yaml` registry):
```bash
uv run feed_generators/run_all_feeds.py --feed=ollama
uv run feed_generators/run_all_feeds.py --feed=dagster --full # full reset
```
**Or run the script directly:**
```bash
uv run feed_generators/ollama_blog.py
uv run feed_generators/dagster_blog.py --full
```
## Code Style
This project uses [Ruff](https://docs.astral.sh/ruff/) for linting and formatting, enforced via pre-commit hooks and a [CI workflow](.github/workflows/lint.yml).
**Check only:**
```bash
make dev_lint
```
**Auto-fix + format:**
```bash
make dev_lint_fix
```
## Adding a New Feed
See [AGENTS.md](./AGENTS.md) for the complete guide on creating feed generators.
**Recommended workflow**: Use [Claude Code](https://claude.com/claude-code) with the [Playwright MCP](https://github.com/microsoft/playwright-mcp) to inspect the target site, understand its structure, and generate the scraper.
**When to write a custom scraper**: Only if the site has no official RSS feed, or if a custom parser adds significant value over the official feed (e.g., full content extraction, structured metadata). Simple filtering (e.g., category-only views) does not justify a custom scraper. Check the README for sites that already have official feeds.
### Agent Skills
This repo includes two [Claude Code skills](.agents/skills/) to streamline feed development:
- **`/cmd-rss-feed-generator`** — Generate a new feed scraper from a blog URL or HTML sample. Analyzes the site, picks the right pattern, and scaffolds the generator + Makefile target.
- **`/rss-feed-review`** — Review feed generators and their XML output for broken selectors, missing error handling, stale feeds, and convention violations.
## Pull Requests
1. Branch from `main`
2. Follow the existing generator patterns in `feed_generators/`
3. Test your feed locally before submitting
4. Reference any related issues in the PR description
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2025 Daniel Olshansky
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: Makefile
================================================
#########################
### Makefile (root) ###
#########################
.DEFAULT_GOAL := help
# Patterns for classified help categories
HELP_PATTERNS := \
'^help:' \
'^env_.*:' \
'^feeds_.*:' \
'^dev_.*:' \
'^ci_.*:' \
'^clean_.*:' \
'^debug_vars:'
.PHONY: help
help: ## Show all available targets with descriptions
@printf "\n"
@printf "$(BOLD)$(CYAN)📋 RSS Feed Generator - Makefile Targets$(RESET)\n"
@printf "\n"
@printf "$(BOLD)=== 📋 Information & Discovery ===$(RESET)\n"
@grep -h -E '^(help|help-unclassified):.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}'
@printf "\n"
@printf "$(BOLD)=== 🐍 Environment Setup ===$(RESET)\n"
@grep -h -E '^env_.*:.*?## .*$$' $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}' | sort -u
@printf "\n"
@printf "$(BOLD)=== 🛠️ Development ===$(RESET)\n"
@grep -h -E '^dev_.*:.*?## .*$$' $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}' | sort -u
@printf "\n"
@printf "$(BOLD)=== 🚀 CI/CD ===$(RESET)\n"
@grep -h -E '^ci_.*:.*?## .*$$' $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}' | sort -u
@printf "\n"
@printf "$(BOLD)=== 🧹 Cleaning ===$(RESET)\n"
@grep -h -E '^clean_.*:.*?## .*$$' $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}' | sort -u
@printf "\n"
@printf "$(BOLD)=== 📡 RSS Feed Generation ===$(RESET)\n"
@grep -h -E '^feeds_.*:.*?## .*$$' $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}' | sort -u
@printf "\n"
@printf "$(YELLOW)Usage:$(RESET) make <target>\n"
@printf "\n"
.PHONY: help-unclassified
help-unclassified: ## Show all unclassified targets
@printf "\n"
@printf "$(BOLD)$(CYAN)📦 Unclassified Targets$(RESET)\n"
@printf "\n"
@grep -h -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | sed 's/:.*//g' | sort -u > /tmp/all_targets.txt
@( \
for pattern in $(HELP_PATTERNS); do \
grep -h -E "$pattern.*?## .*\$$" $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null || true; \
done \
) | sed 's/:.*//g' | sort -u > /tmp/classified_targets.txt
@comm -23 /tmp/all_targets.txt /tmp/classified_targets.txt | while read target; do \
grep -h -E "^$$target:.*?## .*\$$" $(MAKEFILE_LIST) ./makefiles/*.mk 2>/dev/null | awk 'BEGIN {FS = ":.*?## "}; {printf "$(CYAN)%-40s$(RESET) %s\n", $$1, $$2}'; \
done
@rm -f /tmp/all_targets.txt /tmp/classified_targets.txt
@printf "\n"
################
### Imports ###
################
include ./makefiles/colors.mk
include ./makefiles/common.mk
include ./makefiles/env.mk
include ./makefiles/feeds.mk
include ./makefiles/dev.mk
include ./makefiles/ci.mk
############################
### Legacy Target Aliases ##
############################
# Maintain backwards compatibility with existing targets
.PHONY: check-env
check-env: ## (Legacy) Check if virtual environment is activated
$(call check_venv)
.PHONY: env_create
env_create: env_setup ## (Legacy) Create virtual environment
.PHONY: uvx_install
uvx_install: env_setup ## (Legacy) Install dependencies
.PHONY: clean
clean: clean_env clean_feeds ## (Legacy) Clean all generated files
.PHONY: py_format
py_format: dev_format ## (Legacy) Format Python code
.PHONY: generate_all_feeds
generate_all_feeds: feeds_generate_all ## (Legacy) Generate all RSS feeds
.PHONY: generate_anthropic_news_feed
generate_anthropic_news_feed: feeds_anthropic_news ## (Legacy) Generate Anthropic News feed
.PHONY: generate_anthropic_engineering_feed
generate_anthropic_engineering_feed: feeds_anthropic_engineering ## (Legacy) Generate Anthropic Engineering feed
.PHONY: generate_anthropic_research_feed
generate_anthropic_research_feed: feeds_anthropic_research ## (Legacy) Generate Anthropic Research feed
.PHONY: generate_anthropic_changelog_claude_code_feed
generate_anthropic_changelog_claude_code_feed: feeds_anthropic_changelog_claude_code ## (Legacy) Generate Claude Code changelog feed
.PHONY: generate_google_ai_feed
generate_google_ai_feed: feeds_google_ai ## (Legacy) Generate Google AI feed
.PHONY: generate_openai_research_feed
generate_openai_research_feed: feeds_openai_research ## (Legacy) Generate OpenAI Research feed
.PHONY: generate_ollama_feed
generate_ollama_feed: feeds_ollama ## (Legacy) Generate Ollama feed
.PHONY: generate_paulgraham_feed
generate_paulgraham_feed: feeds_paulgraham ## (Legacy) Generate Paul Graham feed
.PHONY: generate_blogsurgeai_feed
generate_blogsurgeai_feed: feeds_blogsurgeai ## (Legacy) Generate Surge AI Blog feed
.PHONY: generate_xainews_feed
generate_xainews_feed: feeds_xainews ## (Legacy) Generate xAI News feed
.PHONY: generate_thinkingmachines_feed
generate_thinkingmachines_feed: feeds_thinkingmachines ## (Legacy) Generate Thinking Machines Lab feed
.PHONY: test_feed_workflow
test_feed_workflow: ci_test_workflow_local ## (Legacy) Test feed workflow locally
.PHONY: test_feed_generate
test_feed_generate: dev_test_feed ## (Legacy) Run test feed generator
.PHONY: act_run_feeds_workflow
act_run_feeds_workflow: ci_run_feeds_workflow_local ## (Legacy) Run feeds workflow locally
.PHONY: gh_run_feeds_workflow
gh_run_feeds_workflow: ci_trigger_feeds_workflow ## (Legacy) Trigger feeds workflow on GitHub
.PHONY: generate_the_batch_feed
generate_the_batch_feed: feeds_the_batch ## (Legacy) Generate The Batch feed
================================================
FILE: README.md
================================================
# RSS Feed Generator <!-- omit in toc -->
> [!TIP]
> This project is maintained by [@oborchers](https://github.com/oborchers) and [@Olshansk](https://github.com/Olshansk). If you gut any value out of it, consider sponsoring us on GitHub!
> [!NOTE]
> Read the blog post about this repo: [No RSS Feed? No Problem. Using Claude to automate RSS feeds.](https://olshansky.substack.com/p/no-rss-feed-no-problem-using-claude)
## tl;dr Available RSS Feeds <!-- omit in toc -->
Scraped feeds are generated hourly. "Official RSS" rows point to native feeds the blog now publishes directly.
| Blog | Feed |
| ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| [AI at Meta Blog](https://ai.meta.com/blog/) | [feed_meta_ai.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_meta_ai.xml) |
| [AI FIRST Podcast](https://ai-first.ai/podcast) (German) | [feed_ai_first_podcast.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_ai_first_podcast.xml) |
| [Anthropic Engineering](https://www.anthropic.com/engineering) | [feed_anthropic_engineering.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_anthropic_engineering.xml) |
| [Anthropic Frontier Red Team](https://red.anthropic.com/) | [feed_anthropic_red.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_anthropic_red.xml) |
| [Anthropic News](https://www.anthropic.com/news) | [feed_anthropic_news.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_anthropic_news.xml) |
| [Anthropic Research](https://www.anthropic.com/research) | [feed_anthropic_research.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_anthropic_research.xml) |
| [Chander Ramesh's Writing](https://chanderramesh.com/writing) | [feed_chanderramesh.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_chanderramesh.xml) |
| [Claude Blog](https://claude.com/blog) | [feed_claude.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_claude.xml) |
| [Claude Code Changelog](https://code.claude.com/docs/en/changelog) | [Official RSS](https://code.claude.com/docs/en/changelog/rss.xml) |
| [Cloudflare skills (commits/main)](https://github.com/cloudflare/skills) | [Official RSS](https://github.com/cloudflare/skills/commits/main.atom) |
| [Cohere Blog](https://cohere.com/blog) | [feed_cohere.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_cohere.xml) |
| [Cursor Blog](https://cursor.com/blog) | [feed_cursor.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_cursor.xml) |
| [Dagster Blog](https://dagster.io/blog) | [feed_dagster.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_dagster.xml) |
| [Google DeepMind Blog](https://deepmind.google/blog/) | [Official RSS](https://deepmind.google/blog/rss.xml) |
| [Google Developers Blog - AI](https://developers.googleblog.com/search/?technology_categories=AI) | [feed_google_ai.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_google_ai.xml) |
| [Groq Blog](https://groq.com/blog/) | [feed_groq.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_groq.xml) |
| [Hamel Husain's Blog](https://hamel.dev/) | [Official RSS](https://hamel.dev/index.xml) |
| [Interconnected (Matt Webb)](https://interconnected.org/home) | [Official RSS](https://interconnected.org/home/feed) |
| [Mistral AI News](https://mistral.ai/news) | [feed_mistral.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_mistral.xml) |
| [Ollama Blog](https://ollama.com/blog) | [feed_ollama.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_ollama.xml) |
| [OpenAI Engineering](https://openai.com/news/engineering/) | [Official RSS](https://openai.com/news/engineering/rss.xml) |
| [OpenAI Research](https://openai.com/news/research/) | [Official RSS](https://openai.com/blog/rss.xml) |
| [Paul Graham's Articles](https://www.paulgraham.com/articles.html) | [feed_paulgraham.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_paulgraham.xml) |
| [Perplexity Hub](https://www.perplexity.ai/hub) | [feed_perplexity_hub.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_perplexity_hub.xml) |
| [Pinecone Blog](https://www.pinecone.io/blog/) | [feed_pinecone.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_pinecone.xml) |
| [Simon Willison's Blog (Tools)](https://simonwillison.net/) | [Official RSS](https://simonwillison.net/atom/beats/tool/) |
| [Supabase Blog](https://supabase.com/blog) | [Official RSS](https://supabase.com/rss.xml) |
| [Surge AI Blog](https://www.surgehq.ai/blog) | [feed_blogsurgeai.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_blogsurgeai.xml) |
| [The Batch by DeepLearning.AI](https://www.deeplearning.ai/the-batch/) | [feed_the_batch.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_the_batch.xml) |
| [Thinking Machines Lab](https://thinkingmachines.ai/blog/) | [feed_thinkingmachines.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_thinkingmachines.xml) |
| [Weaviate Blog](https://weaviate.io/blog) | [feed_weaviate.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_weaviate.xml) |
| [Windsurf Blog](https://windsurf.com/blog) | [feed_windsurf_blog.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_windsurf_blog.xml) |
| [Windsurf Changelog](https://windsurf.com/changelog) | [feed_windsurf_changelog.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_windsurf_changelog.xml) |
| [Windsurf Next Changelog](https://windsurf.com/changelog/windsurf-next) | [feed_windsurf_next_changelog.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_windsurf_next_changelog.xml) |
| [xAI News](https://x.ai/news) | [feed_xainews.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_xainews.xml) |
### Planned <!-- omit in toc -->
| Blog | Status |
| -------------------------------------------------------------- | --------- |
| [David Crawshaw](https://crawshaw.io/) | _planned_ |
| [Engineering.fyi](https://engineering.fyi/) | _planned_ |
| [Patrick Collison's Blog](https://patrickcollison.com/culture) | _planned_ |
### What is this?
You know that blog you like that doesn't have an RSS feed and might never will?
🙌 **You can use this repo to create a RSS feed for it!** 🙌
## Table of Contents <!-- omit in toc -->
- [Quick Start](#quick-start)
- [Subscribe to a Feed](#subscribe-to-a-feed)
- [Request a new Feed](#request-a-new-feed)
- [Create a new a Feed](#create-a-new-a-feed)
- [Star History](#star-history)
- [Ideas](#ideas)
- [How It Works](#how-it-works)
- [For Developers 👀 only](#for-developers--only)
## Quick Start
### Subscribe to a Feed
- Go to the [feeds directory](./feeds).
- Find the feed you want to subscribe to.
- Use the **raw** link for your RSS reader. Example:
```text
https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_ollama.xml
```
- Use your RSS reader of choice to subscribe to the feed (e.g., [Blogtrottr](https://blogtrottr.com/)).
### Request a new Feed
Want me to create a feed for you?
[Open a GitHub issue](https://github.com/Olshansk/rss-feeds/issues/new?template=request_rss_feed.md) and include the blog URL.
If I do, consider supporting my 🌟🧋 addiction by [buying me a coffee](https://buymeacoffee.com/olshansky).
## Create a new a Feed
1. Download the HTML of the blog you want to create a feed for.
2. Open Claude Code CLI
3. Tell claude to:
```bash
Use /cmd-rss-feed-generator to convert @<html_file>.html to a RSS feed for <blog_url>.
```
## Star History
[](https://star-history.com/#Olshansk/rss-feeds&Date)
## Ideas
- **X RSS Feed**: Going to `x.com/{USER}/index.xml` should give an RSS feed of the user's tweets.
## How It Works
```mermaid
flowchart TB
subgraph GitHub["GitHub Repository"]
action[[GitHub Action<br/>Hourly Cron Job]]
runner{{"run_all_feeds.py"}}
feeds["Feed Generators<br/>(*.py files)"]
xml["Generated RSS Feeds<br/>(feed_*.xml)"]
end
subgraph External["External Services"]
blogtrottr["Blogtrottr"]
rssreaders["Other RSS Readers"]
end
action -->|"Triggers"| runner
runner -->|"Executes"| feeds
feeds -->|"Scrapes"| websites[("Blog Websites<br/>(HTML Content)")]
websites -->|"Content"| feeds
feeds -->|"Generates"| xml
xml -->|"Updates"| repo["GitHub Repository<br/>Main Branch"]
repo -->|"Pulls Feed"| blogtrottr
repo -->|"Pulls Feed"| rssreaders
style GitHub fill:#e6f3ff,stroke:#0066cc
style External fill:#f9f9f9,stroke:#666666
style action fill:#ddf4dd,stroke:#28a745,color:#000000
style runner fill:#fff3cd,stroke:#ffc107,color:#000000
style feeds fill:#f8d7da,stroke:#dc3545,color:#000000
style xml fill:#d1ecf1,stroke:#17a2b8,color:#000000
style websites fill:#e2e3e5,stroke:#383d41,color:#000000
```
### For Developers 👀 only
- Open source and community-driven 🙌
- Simple Python + GitHub Actions 🐍
- AI tooling for easy contributions 🤖
- Learn and contribute together 🧑🎓
- Streamlines the use of Claude, Claude Projects, and Claude Sync
================================================
FILE: cache/.gitkeep
================================================
================================================
FILE: feed_generators/ai_first_podcast.py
================================================
"""Generate RSS feed for the AI FIRST Podcast (https://ai-first.ai/podcast).
Two-stage scraper: the listing page gives link + title, each episode page
then provides the date and description via a JSON-LD PodcastEpisode schema.
German-language podcast.
"""
import argparse
import json
import time
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import (
DEFAULT_HEADERS,
deserialize_entries,
fetch_page,
load_cache,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "ai_first_podcast"
BLOG_URL = "https://ai-first.ai/podcast"
BASE_URL = "https://ai-first.ai"
DETAIL_FETCH_DELAY_SECONDS = 0.5
def parse_listing_page(html_content: str) -> list[dict]:
"""Extract (link, title) pairs from the podcast listing page."""
soup = BeautifulSoup(html_content, "html.parser")
episodes: list[dict] = []
seen_hrefs: set[str] = set()
for link in soup.select('a[href^="/podcast/"]'):
href = link.get("href", "")
if href.rstrip("/") == "/podcast" or href in seen_hrefs:
continue
seen_hrefs.add(href)
# Prefer a heading inside the anchor (just the episode title). Fall back
# to aria-label, then to separator-joined text -- the anchor contains
# multiple sibling text nodes (episode number, guest, role) that must
# not be concatenated without whitespace.
title = None
heading = link.select_one("h1, h2, h3, h4, h5, h6")
if heading:
title = heading.get_text(separator=" ", strip=True)
if not title:
aria = link.get("aria-label", "").strip()
if aria:
title = aria.removeprefix("Podcast: ").strip()
if not title:
text = link.get_text(separator=" ", strip=True)
if text and len(text) > 5:
title = text[:200]
if len(text) > 200:
logger.debug(f"Fallback title for {href} truncated from {len(text)} chars")
if not title:
continue
episodes.append({"link": f"{BASE_URL}{href}", "title": title})
logger.info(f"Found {len(episodes)} episode links on listing page")
return episodes
def fetch_episode_details(url: str) -> tuple[datetime | None, str]:
"""Return (date, description) for a single episode page."""
try:
html = fetch_page(url, timeout=15, headers=DEFAULT_HEADERS)
except Exception as e:
logger.warning(f"Failed to fetch episode page {url}: {e}")
return None, ""
soup = BeautifulSoup(html, "html.parser")
# Primary: JSON-LD PodcastEpisode schema
for script in soup.select('script[type="application/ld+json"]'):
try:
data = json.loads(script.string or "")
except (json.JSONDecodeError, TypeError):
continue
if data.get("@type") != "PodcastEpisode":
continue
date = None
date_str = data.get("datePublished")
if date_str:
try:
date = datetime.fromisoformat(date_str)
if date.tzinfo is None:
date = date.replace(tzinfo=pytz.UTC)
except ValueError:
pass
return date, data.get("description", "")
# Fallback: <time datetime="..."> element
time_elem = soup.select_one("time[datetime]")
if time_elem and time_elem.get("datetime"):
try:
date = datetime.fromisoformat(time_elem["datetime"].replace("Z", "+00:00"))
if date.tzinfo is None:
date = date.replace(tzinfo=pytz.UTC)
return date, ""
except ValueError:
pass
return None, ""
def enrich_episodes(stub_episodes: list[dict]) -> list[dict]:
"""Fetch detail page for each stub and return full episode dicts."""
enriched = []
for i, stub in enumerate(stub_episodes):
date, description = fetch_episode_details(stub["link"])
if not date:
date = stable_fallback_date(stub["link"])
enriched.append(
{
"title": stub["title"],
"link": stub["link"],
"date": date,
"description": description or stub["title"],
}
)
if i < len(stub_episodes) - 1:
time.sleep(DETAIL_FETCH_DELAY_SECONDS)
if (i + 1) % 10 == 0:
logger.info(f"Fetched {i + 1}/{len(stub_episodes)} episode details")
return enriched
def generate_rss_feed(episodes: list[dict]) -> FeedGenerator:
fg = FeedGenerator()
fg.title("AI FIRST Podcast")
fg.description(
"Der AI FIRST Podcast: Erfahre jeden Freitag aus erster Hand, wie Unternehmer und Führungskräfte AI einsetzen."
)
fg.language("de")
fg.author({"name": "AI FIRST"})
fg.logo("https://ai-first.ai/images/og/og-default.png")
fg.subtitle("KI-Transformation, Produktivität und die Zukunft der Arbeit")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for ep in sort_posts_for_feed(episodes, date_field="date"):
fe = fg.add_entry()
fe.title(ep["title"])
fe.description(ep["description"])
fe.link(href=ep["link"])
fe.id(ep["link"])
if ep.get("date"):
fe.published(ep["date"])
logger.info(f"Generated RSS feed with {len(episodes)} entries")
return fg
def main(full_reset: bool = False) -> bool:
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
cached_links = {ep["link"] for ep in cached_entries}
html = fetch_page(BLOG_URL, timeout=15, headers=DEFAULT_HEADERS)
listing = parse_listing_page(html)
if not listing:
logger.warning("No episodes found on listing page.")
return False
if full_reset:
stubs_to_fetch = listing
logger.info(f"Full reset: fetching details for all {len(stubs_to_fetch)} episodes")
all_episodes = enrich_episodes(stubs_to_fetch)
else:
stubs_to_fetch = [ep for ep in listing if ep["link"] not in cached_links]
logger.info(f"Incremental: {len(stubs_to_fetch)} new episode(s) to fetch")
new_episodes = enrich_episodes(stubs_to_fetch)
all_episodes = list(cached_entries) + new_episodes
all_episodes = sort_posts_for_feed(all_episodes, date_field="date")
save_cache(FEED_NAME, all_episodes)
feed = generate_rss_feed(all_episodes)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate AI FIRST Podcast RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (re-fetch every episode)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/anthropic_eng_blog.py
================================================
import re
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import fetch_page, save_rss_feed, setup_feed_links, setup_logging, sort_posts_for_feed
logger = setup_logging()
FEED_NAME = "anthropic_engineering"
BLOG_URL = "https://www.anthropic.com/engineering"
def fetch_engineering_content(url=BLOG_URL):
"""Fetch engineering page content from Anthropic's website."""
try:
return fetch_page(url)
except Exception as e:
logger.error(f"Error fetching engineering content: {e!s}")
raise
def validate_article(article):
"""Validate article has required fields."""
if not article.get("title") or len(article["title"]) < 5:
return False
if not article.get("link") or not article["link"].startswith("http"):
return False
return bool(article.get("date"))
def parse_engineering_html(html_content):
"""Parse the engineering HTML content and extract article information from embedded JSON."""
try:
soup = BeautifulSoup(html_content, "html.parser")
articles = []
# Find the Next.js script tag containing article data
script_tag = None
for script in soup.find_all("script"):
if script.string and "publishedOn" in script.string and "engineeringArticle" in script.string:
script_tag = script
break
if not script_tag:
logger.error("Could not find Next.js data script containing article information")
return []
script_content = script_tag.string
# Extract article data from the escaped JSON in the Next.js script
# Pattern matches: publishedOn, slug, title, and summary fields
pattern = r'\\"publishedOn\\":\\"([^"]+?)\\",\\"slug\\":\{[^}]*?\\"current\\":\\"([^"]+?)\\"'
matches = re.findall(pattern, script_content)
logger.info(f"Found {len(matches)} articles from JSON data")
for published_date, slug in matches:
try:
# Construct the full URL from the slug
link = f"https://www.anthropic.com/engineering/{slug}"
# Find the article object containing this slug to get title and summary
# Search for the section containing this slug
slug_pos = script_content.find(f'\\"current\\":\\"{slug}\\"')
if slug_pos == -1:
continue
# Search forward from slug position to find the title and summary
# The structure is: ...publishedOn, slug, ...other fields..., summary, title}
search_section = script_content[slug_pos : slug_pos + 2000]
# Extract title and summary (they appear AFTER the slug in the data)
# Use negative lookbehind to handle escaped quotes correctly
title_match = re.search(r'\\"title\\":\\"(.*?)(?<!\\)\\"', search_section)
title = title_match.group(1) if title_match else slug.replace("-", " ").title()
# Unescape the title using re.sub to handle all escaped characters
title = re.sub(r"\\(.)", r"\1", title) if title else title
# Extract summary/description
summary_match = re.search(r'\\"summary\\":\\"(.*?)(?<!\\)\\"', search_section)
description = summary_match.group(1) if summary_match else title
# Unescape the description
description = re.sub(r"\\(.)", r"\1", description) if description else description
# Parse the date
date = datetime.strptime(published_date, "%Y-%m-%d")
date = date.replace(hour=0, minute=0, second=0, tzinfo=pytz.UTC)
article = {
"title": title,
"link": link,
"description": description if description else title,
"date": date,
"category": "Engineering",
}
if validate_article(article):
articles.append(article)
logger.info(f"Found article: {title} ({published_date})")
except Exception as e:
logger.warning(f"Error parsing article {slug}: {e!s}")
continue
logger.info(f"Successfully parsed {len(articles)} articles from JSON data")
return articles
except Exception as e:
logger.error(f"Error parsing HTML content: {e!s}")
raise
def generate_rss_feed(articles, feed_name=FEED_NAME):
"""Generate RSS feed from engineering articles."""
try:
fg = FeedGenerator()
fg.title("Anthropic Engineering Blog")
fg.description("Latest engineering articles and insights from Anthropic's engineering team")
setup_feed_links(fg, BLOG_URL, feed_name)
fg.language("en")
# Set feed metadata
fg.author({"name": "Anthropic Engineering Team"})
fg.logo("https://www.anthropic.com/images/icons/apple-touch-icon.png")
fg.subtitle("Inside the team building reliable AI systems")
# Sort articles for correct feed order (newest first in output)
articles_sorted = sort_posts_for_feed(articles, date_field="date")
# Add entries
for article in articles_sorted:
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
fe.published(article["date"])
fe.category(term=article["category"])
fe.id(article["link"])
logger.info("Successfully generated RSS feed")
return fg
except Exception as e:
logger.error(f"Error generating RSS feed: {e!s}")
raise
def main(feed_name=FEED_NAME):
"""Main function to generate RSS feed from Anthropic's engineering page."""
try:
# Fetch engineering content
html_content = fetch_engineering_content()
# Parse articles from HTML
articles = parse_engineering_html(html_content)
if not articles:
logger.warning("No articles found on the engineering page")
return False
# Generate RSS feed
feed = generate_rss_feed(articles, feed_name)
# Save feed to file
save_rss_feed(feed, feed_name)
logger.info(f"Successfully generated RSS feed with {len(articles)} articles")
return True
except Exception as e:
logger.error(f"Failed to generate RSS feed: {e!s}")
return False
if __name__ == "__main__":
main()
================================================
FILE: feed_generators/anthropic_news_blog.py
================================================
import argparse
import contextlib
import xml.etree.ElementTree as ET
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from utils import (
deserialize_entries,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
setup_selenium_driver,
sort_posts_for_feed,
stable_fallback_date,
)
FEED_NAME = "anthropic_news"
BLOG_URL = "https://www.anthropic.com/news"
logger = setup_logging()
def fetch_news_content(url=BLOG_URL, max_clicks=20):
"""Fetch the fully loaded HTML content of the news page using Selenium.
Args:
url: The URL to fetch
max_clicks: Maximum number of "See more" button clicks.
Use 20 for full fetch, 2-3 for incremental updates.
"""
driver = None
try:
logger.info(f"Fetching content from URL: {url} (max_clicks={max_clicks})")
driver = setup_selenium_driver()
driver.get(url)
# Wait for news articles to be present
try:
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/news/']")))
logger.info("News articles loaded successfully")
except Exception:
logger.warning("Could not confirm articles loaded, proceeding anyway...")
# Click "See more" button repeatedly until it's no longer available
clicks = 0
while clicks < max_clicks:
try:
# Look for the "See more" button using multiple selectors
see_more_button = None
selectors = [
"[class*='seeMore']",
"[class*='see-more']",
"button[class*='More']",
]
for selector in selectors:
try:
see_more_button = driver.find_element(By.CSS_SELECTOR, selector)
if see_more_button and see_more_button.is_displayed():
break
see_more_button = None
except Exception:
continue
# Also try finding by text content using XPath
if not see_more_button:
with contextlib.suppress(Exception):
see_more_button = driver.find_element(
By.XPATH,
"//*[contains(text(), 'See more') or contains(text(), 'Load more')]",
)
if see_more_button and see_more_button.is_displayed():
count_before = len(driver.find_elements(By.CSS_SELECTOR, "a[href*='/news/']"))
logger.info(f"Clicking 'See more' button (click {clicks + 1})...")
driver.execute_script("arguments[0].click();", see_more_button)
clicks += 1
# Wait for new articles to appear after click
with contextlib.suppress(Exception):
WebDriverWait(driver, 5).until(
lambda d, n=count_before: len(d.find_elements(By.CSS_SELECTOR, "a[href*='/news/']")) > n
)
else:
logger.info(f"No more 'See more' button found after {clicks} clicks")
break
except Exception as e:
# No more "See more" button found
logger.info(f"No more 'See more' button found after {clicks} clicks: {e}")
break
html_content = driver.page_source
logger.info("Successfully fetched HTML content")
return html_content
except Exception as e:
logger.error(f"Error fetching content: {e}")
raise
finally:
if driver:
driver.quit()
def extract_title(card):
"""Extract title using multiple fallback selectors."""
selectors = [
# New FeaturedGrid layout
"h2[class*='featuredTitle']",
"h4[class*='title']",
# New PublicationList layout
"span[class*='title']",
# Legacy selectors
"h3.PostCard_post-heading__Ob1pu",
"h3.Card_headline__reaoT",
"h3[class*='headline']",
"h3[class*='heading']",
"h2[class*='headline']",
"h2[class*='heading']",
"h3",
"h2",
]
for selector in selectors:
elem = card.select_one(selector)
if elem and elem.text.strip():
return elem.text.strip()
return None
def extract_date(card):
"""Extract date using multiple fallback selectors and formats."""
selectors = [
# New layout selectors - time element is most reliable
"time[class*='date']",
"time",
# Legacy selectors
"p.detail-m",
"div.PostList_post-date__djrOA",
"p[class*='date']",
"div[class*='date']",
]
date_formats = [
"%b %d, %Y",
"%B %d, %Y",
"%b %d %Y",
"%B %d %Y",
"%Y-%m-%d",
"%m/%d/%Y",
]
for selector in selectors:
# Use select() to get all matching elements, not just the first one
elems = card.select(selector)
for elem in elems:
date_text = elem.text.strip()
# Try to parse it as a date
for date_format in date_formats:
try:
date = datetime.strptime(date_text, date_format)
return date.replace(tzinfo=pytz.UTC)
except ValueError:
continue
return None
def extract_category(card, date_elem_text=None):
"""Extract category using multiple fallback selectors."""
selectors = [
# New layout selectors
"span[class*='subject']", # PublicationList layout
"span.caption.bold", # FeaturedGrid layout (category before date)
# Legacy selectors
"span.text-label",
"p.detail-m",
"span[class*='category']",
"div[class*='category']",
]
for selector in selectors:
elem = card.select_one(selector)
if elem:
text = elem.text.strip()
# Skip if this is the date element
if date_elem_text and text == date_elem_text:
continue
# Skip if it looks like a date
if any(
month in text
for month in [
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec",
]
):
continue
return text
return "News"
def validate_article(article):
"""Validate that article has all required fields with reasonable values."""
if not article.get("title") or len(article["title"]) < 5:
logger.warning(f"Invalid title for article: {article.get('link', 'unknown')}")
return False
if not article.get("link") or not article["link"].startswith("http"):
logger.warning(f"Invalid link for article: {article.get('title', 'unknown')}")
return False
if not article.get("date"):
logger.warning(f"Missing date for article: {article.get('title', 'unknown')}")
return False
return True
def parse_news_html(html_content):
"""Parse the news HTML content and extract article information."""
try:
soup = BeautifulSoup(html_content, "html.parser")
articles = []
seen_links = set()
unknown_structures = 0
# Find all links that point to news articles
# Use flexible selectors to catch current and future card types
# Handle both relative (/news/...) and absolute (https://www.anthropic.com/news/...) URLs
all_news_links = soup.select('a[href*="/news/"], a[href*="anthropic.com/news/"]')
logger.info(f"Found {len(all_news_links)} potential news article links")
for card in all_news_links:
href = card.get("href", "")
if not href:
continue
# Build full URL
link = "https://www.anthropic.com" + href if href.startswith("/") else href
# Skip duplicates
if link in seen_links:
continue
# Skip the main news page link and anchor links
if link.endswith("/news") or link.endswith("/news/") or "/news#" in link:
continue
seen_links.add(link)
# Extract title using fallback chain
title = extract_title(card)
if not title:
logger.debug(f"Could not extract title for link: {link}")
logger.debug(f"Card HTML preview: {str(card)[:200]}")
unknown_structures += 1
continue
# Extract date using fallback chain
date = extract_date(card)
if not date:
logger.warning(f"Could not extract date for article: {title}")
date = stable_fallback_date(link)
# Extract category
category = extract_category(card)
# Create article object
article = {
"title": title,
"link": link,
"date": date,
"category": category,
"description": title, # Using title as description fallback
}
# Validate article before adding
if validate_article(article):
articles.append(article)
else:
unknown_structures += 1
if unknown_structures > 0:
logger.warning(f"Encountered {unknown_structures} links with unknown or invalid structures")
logger.info(f"Successfully parsed {len(articles)} valid articles")
return articles
except Exception as e:
logger.error(f"Error parsing HTML content: {e!s}")
raise
def generate_rss_feed(articles):
"""Generate RSS feed from news articles."""
try:
fg = FeedGenerator()
fg.title("Anthropic News")
fg.description("Latest news and updates from Anthropic")
fg.language("en")
# Set feed metadata
fg.author({"name": "Anthropic News"})
fg.logo("https://www.anthropic.com/images/icons/apple-touch-icon.png")
fg.subtitle("Latest updates from Anthropic's newsroom")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
# Sort articles for correct feed order (newest first in output)
articles_sorted = sort_posts_for_feed(articles, date_field="date")
# Add entries
for article in articles_sorted:
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
fe.published(article["date"])
fe.category(term=article["category"])
fe.id(article["link"])
logger.info("Successfully generated RSS feed")
return fg
except Exception as e:
logger.error(f"Error generating RSS feed: {e!s}")
raise
def get_existing_links_from_feed(feed_path):
"""Parse the existing RSS feed and return a set of all article links."""
existing_links = set()
try:
if not feed_path.exists():
return existing_links
tree = ET.parse(feed_path)
root = tree.getroot()
# RSS 2.0: items under channel/item
for item in root.findall("./channel/item"):
link_elem = item.find("link")
if link_elem is not None and link_elem.text:
existing_links.add(link_elem.text.strip())
except Exception as e:
logger.warning(f"Failed to parse existing feed for deduplication: {e!s}")
return existing_links
def main(full_reset=False):
"""Main function to generate RSS feed from Anthropic's news page.
Args:
full_reset: If True, fetch all articles (click "See more" up to 20 times).
If False, do incremental update (click 2-3 times, merge with cache).
"""
try:
cache = load_cache(FEED_NAME)
cached_articles = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_articles:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
html_content = fetch_news_content(max_clicks=20)
articles = parse_news_html(html_content)
else:
logger.info("Running incremental update (2 clicks only)")
html_content = fetch_news_content(max_clicks=2)
new_articles = parse_news_html(html_content)
logger.info(f"Found {len(new_articles)} articles from recent pages")
articles = merge_entries(new_articles, cached_articles)
if not articles:
logger.warning("No articles found. Please check the HTML structure.")
return False
# Save to cache
save_cache(FEED_NAME, articles)
# Generate RSS feed with all articles
feed = generate_rss_feed(articles)
# Save feed to file
save_rss_feed(feed, FEED_NAME)
logger.info(f"Successfully generated RSS feed with {len(articles)} articles")
return True
except Exception as e:
logger.error(f"Failed to generate RSS feed: {e!s}")
return False
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate Anthropic News RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch all articles)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/anthropic_red_blog.py
================================================
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import fetch_page, save_rss_feed, setup_feed_links, setup_logging, sort_posts_for_feed, stable_fallback_date
logger = setup_logging()
FEED_NAME = "anthropic_red"
BLOG_URL = "https://red.anthropic.com/"
def fetch_red_content(url=BLOG_URL):
"""Fetch content from Anthropic's red team blog."""
try:
return fetch_page(url)
except Exception as e:
logger.error(f"Error fetching red team blog content: {e!s}")
raise
def parse_date(date_text):
"""Parse date text from article pages (e.g., 'November 12, 2025', 'September 29, 2025')."""
date_formats = [
"%B %d, %Y", # November 12, 2025
"%b %d, %Y", # Nov 12, 2025
"%B %Y", # November 2025 (fallback)
"%b %Y", # Nov 2025 (fallback)
]
for date_format in date_formats:
try:
date = datetime.strptime(date_text, date_format)
return date.replace(tzinfo=pytz.UTC)
except ValueError:
continue
logger.warning(f"Could not parse date: {date_text}")
return None
def fetch_article_date(article_url):
"""Fetch the publication date from an individual article page."""
try:
html = fetch_page(article_url)
soup = BeautifulSoup(html, "html.parser")
# Look for date in d-article section
article_section = soup.select_one("d-article")
if article_section:
# The date is typically in the first <p> tag
first_p = article_section.select_one("p")
if first_p:
date_text = first_p.text.strip()
date = parse_date(date_text)
if date:
logger.debug(f"Found date '{date_text}' for {article_url}")
return date
logger.warning(f"Could not find date in article: {article_url}")
return None
except Exception as e:
logger.warning(f"Error fetching article date from {article_url}: {e!s}")
return None
def parse_red_html(html_content):
"""Parse the red team blog HTML content and extract article information."""
try:
soup = BeautifulSoup(html_content, "html.parser")
articles = []
seen_links = set()
# Find all article links across the entire page (TOC + body sections)
all_notes = soup.select("a.note")
logger.info(f"Found {len(all_notes)} potential article links")
# Build a map of date dividers for context
date_sections = {}
for date_div in soup.select("div.date"):
date_text = date_div.text.strip()
parsed = parse_date(date_text)
if parsed:
date_sections[date_text] = parsed
for article_link in all_notes:
# Extract article information
href = article_link.get("href", "")
if not href:
continue
# Build full URL
if href.startswith("http"):
link = href
elif href.startswith("/"):
link = f"https://red.anthropic.com{href}"
else:
link = f"https://red.anthropic.com/{href}"
# Skip duplicates
if link in seen_links:
continue
seen_links.add(link)
# Extract title
title_elem = article_link.select_one("h3")
if not title_elem:
logger.warning(f"Could not extract title for link: {link}")
continue
title = title_elem.text.strip()
# Extract description
description_elem = article_link.select_one("div.description")
description = description_elem.text.strip() if description_elem else title
# Fetch actual publication date from the article page
article_date = fetch_article_date(link)
# Fallback to stable date if fetching fails
if not article_date:
article_date = stable_fallback_date(link)
logger.warning(f"Using fallback date for article: {title}")
# Create article object
article = {
"title": title,
"link": link,
"date": article_date,
"description": description,
}
articles.append(article)
logger.debug(f"Found article: {title} (date: {article_date})")
logger.info(f"Successfully parsed {len(articles)} articles")
return articles
except Exception as e:
logger.error(f"Error parsing HTML content: {e!s}")
raise
def generate_rss_feed(articles, feed_name=FEED_NAME):
"""Generate RSS feed from red team blog articles."""
try:
fg = FeedGenerator()
fg.title("Anthropic Frontier Red Team Blog")
fg.description(
"Research from Anthropic's Frontier Red Team on what frontier AI models mean for national security"
)
setup_feed_links(fg, BLOG_URL, feed_name)
fg.language("en")
# Set feed metadata
fg.author({"name": "Anthropic Frontier Red Team"})
fg.logo("https://www.anthropic.com/images/icons/apple-touch-icon.png")
fg.subtitle(
"Evidence-based analysis about AI's implications for cybersecurity, biosecurity, and autonomous systems"
)
# Sort articles for correct feed order (newest first in output)
sorted_articles = sort_posts_for_feed(articles, date_field="date")
# Add entries
for article in sorted_articles:
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
fe.published(article["date"])
fe.id(article["link"])
logger.info("Successfully generated RSS feed")
return fg
except Exception as e:
logger.error(f"Error generating RSS feed: {e!s}")
raise
def main(feed_name=FEED_NAME):
"""Main function to generate RSS feed from Anthropic's red team blog."""
try:
# Fetch blog content
html_content = fetch_red_content()
# Parse articles from HTML
articles = parse_red_html(html_content)
if not articles:
logger.warning("No articles found")
return False
# Generate RSS feed
feed = generate_rss_feed(articles, feed_name)
# Save feed to file
save_rss_feed(feed, feed_name)
logger.info(f"Successfully generated RSS feed with {len(articles)} articles")
return True
except Exception as e:
logger.error(f"Failed to generate RSS feed: {e!s}")
return False
if __name__ == "__main__":
main()
================================================
FILE: feed_generators/anthropic_research_blog.py
================================================
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from utils import (
deserialize_entries,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
setup_selenium_driver,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "anthropic_research"
BLOG_URL = "https://www.anthropic.com/research"
def fetch_research_content_selenium(url=BLOG_URL):
"""Fetch the fully loaded HTML content of the research page using Selenium."""
driver = None
try:
logger.info(f"Fetching content from URL: {url}")
driver = setup_selenium_driver()
driver.get(url)
# Wait for research articles to load
try:
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/research/']")))
logger.info("Research articles loaded successfully")
except Exception:
logger.warning("Could not confirm articles loaded, proceeding anyway...")
html_content = driver.page_source
logger.info("Successfully fetched HTML content")
return html_content
except Exception as e:
logger.error(f"Error fetching content: {e}")
raise
finally:
if driver:
driver.quit()
def extract_title(card):
"""Extract title using multiple fallback selectors."""
selectors = [
"h3",
"h2",
"h1",
".Card_headline__reaoT",
"h3[class*='headline']",
"h2[class*='headline']",
"h3[class*='title']",
"h2[class*='title']",
]
for selector in selectors:
elem = card.select_one(selector)
if elem and elem.text.strip():
title = elem.text.strip()
# Clean up whitespace
title = " ".join(title.split())
if len(title) >= 5:
return title
# Try using link text as last resort
if hasattr(card, "text"):
text = card.text.strip()
text = " ".join(text.split())
if len(text) >= 5:
return text
return None
def extract_date(card):
"""Extract date using multiple fallback selectors and formats."""
selectors = [
"p.detail-m", # Current format on listing page
".detail-m",
"time",
"[class*='timestamp']",
"[class*='date']",
".PostDetail_post-timestamp__TBJ0Z",
".text-label",
]
date_formats = [
"%b %d, %Y",
"%B %d, %Y",
"%Y-%m-%d",
"%m/%d/%Y",
"%d %b %Y",
"%d %B %Y",
"%b %d %Y",
"%B %d %Y",
]
# Look for date in the card and its parents
elements_to_check = [card]
if hasattr(card, "parent") and card.parent:
elements_to_check.append(card.parent)
if card.parent.parent:
elements_to_check.append(card.parent.parent)
for element in elements_to_check:
for selector in selectors:
date_elem = element.select_one(selector)
if date_elem:
date_text = date_elem.text.strip()
for date_format in date_formats:
try:
date = datetime.strptime(date_text, date_format)
return date.replace(tzinfo=pytz.UTC)
except ValueError:
continue
return None
def validate_article(article):
"""Validate that article has all required fields with reasonable values."""
if not article.get("title") or len(article["title"]) < 5:
return False
# Date can be None for research articles
return bool(article.get("link") and article["link"].startswith("http"))
def parse_research_html(html_content):
"""Parse the research HTML content and extract article information."""
try:
soup = BeautifulSoup(html_content, "html.parser")
articles = []
seen_links = set()
# Look for research article links using flexible selector
research_links = soup.select("a[href*='/research/']")
logger.info(f"Found {len(research_links)} potential research article links")
for link in research_links:
try:
href = link.get("href", "")
if not href:
continue
# Skip the main research page
if href == "/research" or href.endswith("/research/"):
continue
# Construct full URL
if href.startswith("https://"):
full_url = href
elif href.startswith("/"):
full_url = "https://www.anthropic.com" + href
else:
continue
# Skip duplicates
if full_url in seen_links:
continue
seen_links.add(full_url)
# Extract title
title = extract_title(link)
if not title:
logger.debug(f"Could not extract title for link: {full_url}")
continue
# Extract date, fall back to stable hash-based date
date = extract_date(link)
if date:
logger.info(f"Found article: {title} - {date}")
else:
logger.warning(f"No date found for article: {title}, using fallback")
date = stable_fallback_date(full_url)
# Determine category from URL
category = "Research"
if "/news/" in href:
category = "News"
article = {
"title": title,
"link": full_url,
"date": date, # Can be None
"category": category,
"description": title,
}
# Validate article
if validate_article(article):
articles.append(article)
else:
logger.debug(f"Article failed validation: {full_url}")
except Exception as e:
logger.warning(f"Error parsing research link: {e!s}")
continue
logger.info(f"Successfully parsed {len(articles)} unique research articles")
return articles
except Exception as e:
logger.error(f"Error parsing HTML content: {e!s}")
raise
def generate_rss_feed(articles):
"""Generate RSS feed from research articles."""
try:
fg = FeedGenerator()
fg.title("Anthropic Research")
fg.description("Latest research papers and updates from Anthropic")
fg.language("en")
# Set feed metadata
fg.author({"name": "Anthropic Research Team"})
fg.logo("https://www.anthropic.com/images/icons/apple-touch-icon.png")
fg.subtitle("Latest research from Anthropic")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
# Sort articles for correct feed order (newest first in output)
# Articles without dates will appear at the end
articles_sorted = sort_posts_for_feed(articles, date_field="date")
# Add entries
for article in articles_sorted:
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
# Only set published date if we have a valid date
if article["date"]:
fe.published(article["date"])
fe.category(term=article["category"])
fe.id(article["link"])
logger.info("Successfully generated RSS feed")
return fg
except Exception as e:
logger.error(f"Error generating RSS feed: {e!s}")
raise
def main(full_reset=False):
"""Main function to generate RSS feed from Anthropic's research page.
Args:
full_reset: If True, fetch all articles. If False, merge with cache.
"""
try:
cache = load_cache(FEED_NAME)
cached_articles = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_articles:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
else:
logger.info("Running incremental update")
# Fetch research content using Selenium
html_content = fetch_research_content_selenium()
# Parse articles from HTML
new_articles = parse_research_html(html_content)
if not new_articles and not cached_articles:
logger.warning("No articles found. Please check the HTML structure.")
return False
# Merge with cache or use fresh articles
if cached_articles and not full_reset:
articles = merge_entries(new_articles, cached_articles)
else:
articles = new_articles
# Save to cache
save_cache(FEED_NAME, articles)
# Generate RSS feed
feed = generate_rss_feed(articles)
# Save feed to file
save_rss_feed(feed, FEED_NAME)
logger.info(f"Successfully generated RSS feed with {len(articles)} articles")
return True
except Exception as e:
logger.error(f"Failed to generate RSS feed: {e!s}")
return False
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Generate Anthropic Research RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch all articles)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/blogsurgeai_feed_generator.py
================================================
#!/usr/bin/env python3
"""
RSS Feed Generator for Surge AI Blog
Scrapes https://www.surgehq.ai/blog and generates an RSS feed
"""
import pytz
from bs4 import BeautifulSoup
from dateutil import parser
from feedgen.feed import FeedGenerator
from utils import fetch_page, save_rss_feed, setup_feed_links, setup_logging, stable_fallback_date
logger = setup_logging()
FEED_NAME = "blogsurgeai"
BLOG_URL = "https://www.surgehq.ai/blog"
def generate_blogsurgeai_feed():
"""Generate RSS feed for Surge AI blog"""
# Initialize feed generator
fg = FeedGenerator()
fg.id(BLOG_URL)
fg.title("Surge AI Blog")
fg.author({"name": "Surge AI", "email": "team@surgehq.ai"})
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
fg.language("en")
fg.description(
"New methods, current trends & software infrastructure for NLP. Articles written by our senior engineering leads from Google, Facebook, Twitter, Harvard, MIT, and Y Combinator"
)
# Fetch the blog page
try:
html = fetch_page(BLOG_URL)
except Exception as e:
logger.error(f"Error fetching blog page: {e}")
return
# Parse HTML
soup = BeautifulSoup(html, "html.parser")
# Find all blog post items
blog_items = soup.find_all("div", class_="blog-hero-cms-item")
logger.info(f"Found {len(blog_items)} blog posts")
# Process each blog post
for item in blog_items:
try:
# Find the title
title_element = item.find("div", class_="blog-hero-cms-item-title")
if not title_element:
continue
title = title_element.get_text(strip=True)
# Find the link
link_element = item.find("a", class_="blog-hero-cms-item-link")
if not link_element:
continue
link = link_element.get("href")
if not link.startswith("http"):
link = "https://www.surgehq.ai" + link
# Find the description
desc_element = item.find("div", class_="blog-hero-cms-item-desc")
description = desc_element.get_text(strip=True) if desc_element else title
# Find the date
date_element = item.find("div", class_="blog-hero-cms-item-date")
pub_date = None # Will be set by parsing or fallback
if date_element:
# Find the visible date element (the one without w-condition-invisible)
date_texts = date_element.find_all("div", class_="txt fs-12 inline")
for date_text in date_texts:
if "w-condition-invisible" not in date_text.get("class", []):
date_str = date_text.get_text(strip=True)
try:
# Parse the date string (e.g., "October 10, 2025")
pub_date = parser.parse(date_str)
# Make timezone-aware
if pub_date.tzinfo is None:
pub_date = pytz.UTC.localize(pub_date)
break
except Exception as e:
logger.warning(f"Could not parse date '{date_str}': {e}")
# Use stable fallback if no date was parsed
if pub_date is None:
pub_date = stable_fallback_date(link)
# Create feed entry
fe = fg.add_entry()
fe.id(link)
fe.title(title)
fe.link(href=link)
fe.published(pub_date)
# Set description
fe.description(description)
logger.info(f"Added: {title}")
except Exception as e:
logger.error(f"Error processing blog item: {e}")
continue
# Generate RSS feed
save_rss_feed(fg, FEED_NAME)
if __name__ == "__main__":
generate_blogsurgeai_feed()
================================================
FILE: feed_generators/chanderramesh_blog.py
================================================
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import fetch_page, save_rss_feed, setup_feed_links, setup_logging, sort_posts_for_feed, stable_fallback_date
logger = setup_logging()
FEED_NAME = "chanderramesh"
BLOG_URL = "https://chanderramesh.com/writing"
def parse_date(date_str):
"""Parse date string in format 'Month DD, YYYY'."""
try:
# Parse date like "June 12, 2025" or "February 8, 2025"
date = datetime.strptime(date_str.strip(), "%B %d, %Y")
return date.replace(tzinfo=pytz.UTC)
except ValueError as e:
logger.warning(f"Could not parse date: {date_str} - {e!s}")
return None
def parse_writing_page(html_content, base_url="https://chanderramesh.com"):
"""Parse the writing page and extract blog post information."""
try:
soup = BeautifulSoup(html_content, "html.parser")
blog_posts = []
# Find all essay cards - they are links with classes "group" and "masonry-item"
# Note: class_ parameter must be a list when searching for multiple classes
essay_links = soup.find_all("a", class_=["group", "masonry-item"])
logger.info(f"Found {len(essay_links)} essays")
for link in essay_links:
# Extract the URL
href = link.get("href")
if not href:
continue
full_url = f"{base_url}{href}" if href.startswith("/") else href
# Extract date
date_elem = link.find("p", class_="text-muted-foreground mb-2 text-sm")
date_str = date_elem.get_text(strip=True) if date_elem else None
# Extract title
title_elem = link.find("h3", class_="font-semibold tracking-tight mb-3 text-xl font-serif")
title = title_elem.get_text(strip=True) if title_elem else "Untitled"
# Extract description
desc_elem = link.find("p", class_="leading-relaxed text-muted-foreground")
description = desc_elem.get_text(strip=True) if desc_elem else ""
# Parse date
pub_date = (parse_date(date_str) if date_str else None) or stable_fallback_date(full_url)
blog_post = {
"title": title,
"link": full_url,
"description": description,
"date": pub_date,
}
blog_posts.append(blog_post)
logger.info(f"Parsed: {title} ({date_str})")
# Sort for correct feed order (newest first in output)
blog_posts = sort_posts_for_feed(blog_posts)
logger.info(f"Successfully parsed {len(blog_posts)} blog posts")
return blog_posts
except Exception as e:
logger.error(f"Error parsing HTML content: {e!s}")
raise
def generate_rss_feed(blog_posts):
"""Generate RSS feed from blog posts."""
try:
fg = FeedGenerator()
fg.title("Chander Ramesh - Writing")
fg.description("Essays by Chander Ramesh covering software, startups, investing, and philosophy")
fg.language("en")
# Set feed metadata
fg.author({"name": "Chander Ramesh"})
fg.subtitle("Essays covering software, startups, investing, and philosophy")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
# Add entries
for post in blog_posts:
fe = fg.add_entry()
fe.title(post["title"])
fe.description(post["description"])
fe.link(href=post["link"])
fe.published(post["date"])
fe.id(post["link"])
logger.info("Successfully generated RSS feed")
return fg
except Exception as e:
logger.error(f"Error generating RSS feed: {e!s}")
raise
def main():
"""Main function to generate RSS feed from blog URL."""
try:
# Fetch blog content
html_content = fetch_page(BLOG_URL)
# Parse blog posts
blog_posts = parse_writing_page(html_content)
# Generate RSS feed
feed = generate_rss_feed(blog_posts)
# Save feed to file
save_rss_feed(feed, FEED_NAME)
return True
except Exception as e:
logger.error(f"Failed to generate RSS feed: {e!s}")
return False
if __name__ == "__main__":
main()
================================================
FILE: feed_generators/claude_blog.py
================================================
#!/usr/bin/env python3
"""Generate RSS feed for Claude Blog (claude.com/blog)."""
import argparse
import html
import re
from datetime import datetime
import pytz
import requests
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import (
deserialize_entries,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
)
logger = setup_logging()
BLOG_URL = "https://claude.com/blog"
FEED_NAME = "claude"
BASE_URL = "https://claude.com"
DATE_PATTERN = re.compile(
r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}"
)
# Claude blog requires a custom header for Webflow/Finsweet
CLAUDE_HEADERS = {
"X-Webflow-App-ID": "finsweet",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}
def fetch_page(url):
"""Fetch a single page HTML with Finsweet header."""
headers = CLAUDE_HEADERS
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
return response.text
def extract_pagination_ids(html_content):
"""Extract pagination collection IDs from the HTML."""
pattern = r"\?([a-f0-9]+)_page=\d+"
matches = re.findall(pattern, html_content)
return list(set(matches))
def parse_date(date_str):
"""Parse date string like 'January 12, 2026' to datetime."""
try:
return datetime.strptime(date_str, "%B %d, %Y")
except ValueError:
return None
def parse_posts(html_content):
"""Parse the blog HTML content and extract post information.
Returns a list of unique posts, deduplicated by URL.
"""
soup = BeautifulSoup(html_content, "html.parser")
posts_by_url = {}
for item in soup.select(".w-dyn-item"):
link = item.select_one('a[href^="/blog/"]')
if not link:
continue
href = link.get("href", "")
if "/blog/category/" in href or not href:
continue
full_url = f"{BASE_URL}{href}"
# Skip if we already have this post (keep the one with most data)
if full_url in posts_by_url:
existing = posts_by_url[full_url]
# Only update if existing has no date and this one does
item_text = item.get_text()
date_match = DATE_PATTERN.search(item_text)
if not existing.get("date") and date_match:
pass # Continue to update
else:
continue # Keep existing
# Extract title
title = None
h2 = item.select_one("h2")
if h2:
title = h2.get_text(strip=True)
if not title:
title = link.get("data-cta-copy", "")
if not title:
for tag in ["h3", "h4", ".u-text-style-h6"]:
el = item.select_one(tag)
if el:
title = el.get_text(strip=True)
break
# Extract date
date_obj = None
item_text = item.get_text()
date_match = DATE_PATTERN.search(item_text)
if date_match:
date_obj = parse_date(date_match.group(0))
# Extract category
category = None
category_el = item.select_one('[fs-list-field="category"]')
if category_el:
category = category_el.get_text(strip=True)
if not category:
data_category = item.get("data-category")
if data_category:
category = data_category
# Extract description
description = None
desc_el = item.select_one(".card_blog_description, .u-text-style-body-2, p")
if desc_el:
description = desc_el.get_text(strip=True)
if title and href:
title = html.unescape(title)
if description:
description = html.unescape(description)
posts_by_url[full_url] = {
"link": full_url,
"title": title,
"date": date_obj.strftime("%Y-%m-%d") if date_obj else None,
"category": category,
"description": description or title,
}
return list(posts_by_url.values())
def fetch_all_pages():
"""Follow pagination until no new posts. Returns all posts."""
logger.info(f"Fetching main page: {BLOG_URL}")
html_content = fetch_page(BLOG_URL)
all_posts = parse_posts(html_content)
logger.info(f"Found {len(all_posts)} posts on main page")
# Get unique post URLs to track duplicates
seen_urls = {p["link"] for p in all_posts}
# Extract pagination collection IDs
collection_ids = extract_pagination_ids(html_content)
logger.info(f"Found pagination IDs: {collection_ids}")
for collection_id in collection_ids:
page = 2
consecutive_empty = 0
while consecutive_empty < 2:
page_url = f"{BLOG_URL}?{collection_id}_page={page}"
logger.info(f"Fetching: {page_url}")
try:
page_html = fetch_page(page_url)
except requests.RequestException as e:
logger.warning(f"Failed to fetch page {page}: {e}")
break
page_posts = parse_posts(page_html)
new_posts = [p for p in page_posts if p["link"] not in seen_urls]
if not new_posts:
consecutive_empty += 1
logger.info(f" No new posts (attempt {consecutive_empty})")
else:
consecutive_empty = 0
logger.info(f" Found {len(new_posts)} new posts")
all_posts.extend(new_posts)
seen_urls.update(p["link"] for p in new_posts)
page += 1
if page > 50:
logger.info(" Reached page limit, stopping")
break
# Sort for correct feed order (newest first in output)
sorted_posts = sort_posts_for_feed(all_posts, date_field="date")
logger.info(f"Total unique posts across all pages: {len(sorted_posts)}")
return sorted_posts
def generate_rss_feed(posts):
"""Generate RSS feed from blog posts."""
fg = FeedGenerator()
fg.title("Claude Blog")
fg.description(
"Get practical guidance and best practices for building with Claude. "
"Technical guides, real-world examples, and insights from Anthropic's "
"engineering and research teams."
)
fg.language("en")
fg.author({"name": "Anthropic", "email": "blog@anthropic.com"})
fg.subtitle("Latest updates from Claude Blog")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for post in posts:
fe = fg.add_entry()
fe.title(post["title"])
fe.description(post["description"])
fe.link(href=post["link"])
fe.id(post["link"])
if post.get("category"):
fe.category(term=post["category"])
if post.get("date"):
try:
dt = post["date"] if isinstance(post["date"], datetime) else datetime.strptime(post["date"], "%Y-%m-%d")
if dt.tzinfo is None:
dt = dt.replace(tzinfo=pytz.UTC)
fe.published(dt)
except (ValueError, TypeError):
pass
logger.info(f"Generated RSS feed with {len(posts)} entries")
return fg
def main(full_reset=False):
"""Main function to generate RSS feed from blog URL.
Args:
full_reset: If True, fetch all pages. If False, only fetch page 1
and merge with cached posts.
"""
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_entries:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
posts = fetch_all_pages()
else:
logger.info("Running incremental update (page 1 only)")
html_content = fetch_page(BLOG_URL)
new_posts = parse_posts(html_content)
logger.info(f"Found {len(new_posts)} posts on page 1")
posts = merge_entries(new_posts, cached_entries)
save_cache(FEED_NAME, posts)
feed = generate_rss_feed(posts)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate Claude Blog RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch all pages)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/cleanup_deprecated_feeds.py
================================================
"""Delete RSS feed XML files whose deprecation notice is older than the threshold.
A feed is considered "retired" once ``deprecate_feed.py`` has injected a
sunset ``<item>`` (GUID prefix ``deprecation-notice-``) and the human has
removed the generator, registry entry, Make target, and README row. This
script handles the final step: deleting the tombstone XML after enough time
has passed that existing subscribers have almost certainly seen the notice.
Default mode is dry-run: prints a punch list of eligible files. Use
``--apply`` to actually delete. The GitHub Actions workflow
``cleanup_deprecated_feeds.yml`` runs this with ``--apply`` on a monthly cron.
"""
import argparse
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta
from pathlib import Path
import pytz
from utils import get_feeds_dir, setup_logging
logger = setup_logging()
DEPRECATION_GUID_PREFIX = "deprecation-notice-"
RFC822_FORMAT = "%a, %d %b %Y %H:%M:%S %z"
DEFAULT_THRESHOLD_DAYS = 90
def find_deprecation_notice(feed_file: Path) -> datetime | None:
"""Return the pubDate of the deprecation <item> in ``feed_file``, or None."""
try:
tree = ET.parse(feed_file)
except ET.ParseError as e:
logger.warning(f"Could not parse {feed_file}: {e}")
return None
channel = tree.getroot().find("channel")
if channel is None:
return None
# A feed should only ever carry one tombstone, but keep looking if the
# first match is malformed rather than failing the whole file.
for item in channel.findall("item"):
guid = item.find("guid")
if guid is None or not guid.text or not guid.text.startswith(DEPRECATION_GUID_PREFIX):
continue
pub_date_elem = item.find("pubDate")
if pub_date_elem is None or not pub_date_elem.text:
logger.warning(f"Deprecation notice in {feed_file} has no pubDate; skipping item")
continue
try:
return datetime.strptime(pub_date_elem.text, RFC822_FORMAT)
except ValueError as e:
logger.warning(f"Could not parse pubDate in {feed_file} ({e}); skipping item")
continue
return None
def find_eligible_feeds(threshold_days: int) -> list[tuple[Path, int]]:
"""Return (path, age_days) for every feed XML whose notice is older than threshold_days."""
now = datetime.now(pytz.UTC)
cutoff = now - timedelta(days=threshold_days)
eligible: list[tuple[Path, int]] = []
for feed_file in sorted(get_feeds_dir().glob("feed_*.xml")):
pub_date = find_deprecation_notice(feed_file)
if pub_date is None:
continue
age_days = (now - pub_date).days
if pub_date < cutoff:
eligible.append((feed_file, age_days))
return eligible
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
parser.add_argument(
"--threshold-days",
type=int,
default=DEFAULT_THRESHOLD_DAYS,
help=f"Age in days after which a deprecated feed XML is deleted (default: {DEFAULT_THRESHOLD_DAYS})",
)
parser.add_argument(
"--apply",
action="store_true",
help="Actually delete eligible files (default is dry-run)",
)
args = parser.parse_args()
eligible = find_eligible_feeds(args.threshold_days)
if not eligible:
logger.info(f"No deprecated feeds older than {args.threshold_days} days")
return 0
logger.info(f"Found {len(eligible)} deprecated feed(s) older than {args.threshold_days} days:")
for feed_file, age_days in eligible:
logger.info(f" {feed_file.name} (notice is {age_days} days old)")
if args.apply:
for feed_file, _ in eligible:
feed_file.unlink()
logger.info(f"Deleted {feed_file}")
else:
logger.info("Dry run. Re-run with --apply to delete these files.")
return 0
if __name__ == "__main__":
raise SystemExit(main())
================================================
FILE: feed_generators/cohere_blog.py
================================================
"""Generate RSS feed for the Cohere Blog (https://cohere.com/blog).
The Cohere blog is built on Ghost CMS. We fetch posts directly from the Ghost
Content API instead of scraping HTML.
"""
import argparse
from datetime import datetime
import pytz
import requests
from feedgen.feed import FeedGenerator
from utils import (
deserialize_entries,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "cohere"
BLOG_URL = "https://cohere.com/blog"
GHOST_API_URL = "https://cohere-ai.ghost.io/ghost/api/content/posts/"
# Ghost Content API keys are intentionally public (like a Stripe publishable
# key). This is the key the cohere.com/blog front-end itself uses; it is
# read-only and rate-limited by Ghost.
GHOST_API_KEY = "572d288a9364f8e4186af1d60a"
MAX_POSTS_FULL = 50
MAX_POSTS_INCREMENTAL = 15
def fetch_posts_page(limit: int, page: int) -> dict:
"""Fetch a single page of posts from the Ghost Content API."""
params = {
"key": GHOST_API_KEY,
"limit": limit,
"page": page,
"include": "tags,authors",
"order": "published_at desc",
}
headers = {
"User-Agent": "Mozilla/5.0 (compatible; RSS Feed Generator)",
"Accept": "application/json",
}
response = requests.get(GHOST_API_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()
return response.json()
def parse_api_posts(api_data: dict) -> list[dict]:
"""Extract post dicts from a Ghost API response."""
posts = []
for post in api_data.get("posts", []):
title = (post.get("title") or "").strip()
if not title:
continue
slug = post.get("slug", "")
link = f"https://cohere.com/blog/{slug}"
date = None
published_at = post.get("published_at")
if published_at:
try:
date = datetime.fromisoformat(published_at)
if date.tzinfo is None:
date = date.replace(tzinfo=pytz.UTC)
except ValueError:
logger.warning(f"Could not parse date for: {title}")
if not date:
date = stable_fallback_date(link)
description = post.get("custom_excerpt") or title
tags = post.get("tags") or []
category = tags[0]["name"] if tags else "Blog"
posts.append(
{
"title": title,
"link": link,
"date": date,
"description": description,
"category": category,
}
)
return posts
def fetch_all_posts(max_posts: int = MAX_POSTS_FULL) -> list[dict]:
"""Fetch posts across Ghost API pages until max_posts is reached."""
all_posts = []
page = 1
per_page = min(max_posts, 15)
while len(all_posts) < max_posts:
logger.info(f"Fetching page {page} (limit={per_page})")
api_data = fetch_posts_page(limit=per_page, page=page)
posts = parse_api_posts(api_data)
if not posts:
logger.info(f"No posts returned on page {page}, stopping")
break
all_posts.extend(posts)
logger.info(f"Page {page}: {len(posts)} posts (total: {len(all_posts)})")
pagination = api_data.get("meta", {}).get("pagination", {})
if not pagination.get("next"):
logger.info("No more pages available")
break
page += 1
return all_posts[:max_posts]
def generate_rss_feed(posts: list[dict]) -> FeedGenerator:
fg = FeedGenerator()
fg.title("The Cohere Blog")
fg.description("Latest news, research, and product updates from Cohere")
fg.language("en")
fg.author({"name": "Cohere"})
fg.logo("https://cohere.com/favicon.ico")
fg.subtitle("Enterprise AI research and product updates from Cohere")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for post in sort_posts_for_feed(posts, date_field="date"):
fe = fg.add_entry()
fe.title(post["title"])
fe.description(post["description"])
fe.link(href=post["link"])
fe.id(post["link"])
fe.category(term=post["category"])
if post.get("date"):
fe.published(post["date"])
logger.info(f"Generated RSS feed with {len(posts)} entries")
return fg
def main(full_reset: bool = False) -> bool:
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_entries:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
new_posts = fetch_all_posts(max_posts=MAX_POSTS_FULL)
posts = sort_posts_for_feed(new_posts, date_field="date")
else:
logger.info("Running incremental update")
api_data = fetch_posts_page(limit=MAX_POSTS_INCREMENTAL, page=1)
new_posts = parse_api_posts(api_data)
logger.info(f"Fetched {len(new_posts)} posts from API")
posts = merge_entries(new_posts, cached_entries)
if not posts:
logger.warning("No posts found. Check the Ghost API response.")
return False
save_cache(FEED_NAME, posts)
feed = generate_rss_feed(posts)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate Cohere Blog RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch up to 50 posts)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/cursor_blog.py
================================================
import argparse
import re
from datetime import datetime
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import (
deserialize_entries,
fetch_page,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
)
logger = setup_logging()
BLOG_URL = "https://cursor.com/blog"
FEED_NAME = "cursor"
def parse_posts(html):
"""Extract posts from HTML. Returns (posts, next_page_url or None)."""
soup = BeautifulSoup(html, "html.parser")
posts = []
for card in soup.find_all("a", class_=re.compile(r"card")):
href = card.get("href", "")
if "/blog/" not in href or "/topic/" in href or "/page/" in href:
continue
# Make URL absolute
if href.startswith("/"):
href = f"https://cursor.com{href}"
ps = card.find_all("p")
title = ps[0].get_text(strip=True) if ps else ""
description = ps[1].get_text(strip=True) if len(ps) > 1 else ""
time_el = card.find("time")
date = time_el.get("datetime", "") if time_el else ""
category_el = card.find("span", class_="capitalize")
category = category_el.get_text(strip=True).rstrip(" ·") if category_el else ""
posts.append(
{
"link": href,
"title": title,
"description": description,
"date": date,
"category": category,
}
)
# Find next page link - look for links containing "Next" or "Older"
next_link = None
for link in soup.find_all("a", href=re.compile(r"/blog/page/\d+")):
link_text = link.get_text(strip=True)
if "Next" in link_text or "Older" in link_text:
next_link = link
break
next_url = None
if next_link:
href = next_link.get("href")
# Make relative URLs absolute
if href.startswith("/"):
next_url = f"https://cursor.com{href}"
else:
next_url = href
return posts, next_url
def fetch_all_pages():
"""Follow pagination until no Next link. Returns all posts."""
all_posts = []
url = BLOG_URL
page_num = 1
while url:
logger.info(f"Fetching page {page_num}: {url}")
html = fetch_page(url)
posts, next_url = parse_posts(html)
all_posts.extend(posts)
logger.info(f"Found {len(posts)} posts on page {page_num}")
url = next_url
page_num += 1
# Dedupe by URL (in case of overlaps)
seen = set()
unique_posts = []
for post in all_posts:
if post["link"] not in seen:
unique_posts.append(post)
seen.add(post["link"])
# Sort for correct feed order (newest first in output)
sorted_posts = sort_posts_for_feed(unique_posts, date_field="date")
logger.info(f"Total unique posts across all pages: {len(sorted_posts)}")
return sorted_posts
def generate_rss_feed(posts):
"""Generate RSS feed from posts."""
fg = FeedGenerator()
fg.title("Cursor Blog")
fg.description("The AI Code Editor")
fg.language("en")
fg.author({"name": "Cursor"})
fg.logo("https://cursor.com/favicon.ico")
fg.subtitle("Latest updates from Cursor")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for post in posts:
fe = fg.add_entry()
fe.title(post["title"])
fe.description(post["description"])
fe.link(href=post["link"])
fe.id(post["link"])
if post.get("date"):
try:
dt = datetime.fromisoformat(post["date"].replace("Z", "+00:00"))
fe.published(dt)
except ValueError:
pass
if post.get("category"):
fe.category(term=post["category"])
logger.info(f"Generated RSS feed with {len(posts)} entries")
return fg
def main(full_reset=False):
"""Main function to generate RSS feed."""
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_entries:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
posts = fetch_all_pages()
else:
logger.info("Running incremental update (page 1 only)")
html = fetch_page(BLOG_URL)
new_posts, _ = parse_posts(html)
logger.info(f"Found {len(new_posts)} posts on page 1")
posts = merge_entries(new_posts, cached_entries)
save_cache(FEED_NAME, posts)
feed = generate_rss_feed(posts)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate Cursor Blog RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch all pages)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/dagster_blog.py
================================================
import argparse
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import (
deserialize_entries,
fetch_page,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
)
logger = setup_logging()
BLOG_URL = "https://dagster.io/blog"
FEED_NAME = "dagster"
# Dagster uses Webflow CMS pagination with this query param
PAGINATION_PARAM = "a17fdf47_page"
def parse_posts(html_content):
"""Parse the blog HTML content and extract post information.
Returns (posts, has_next_page).
"""
soup = BeautifulSoup(html_content, "html.parser")
blog_posts = []
# Parse the featured blog post (if present)
featured_post = soup.select_one("div.featured_blog_link")
if featured_post:
title_elem = featured_post.select_one("h2.heading-style-h5")
date_elem = featured_post.select_one("p.text-color-neutral-500")
description_elem = featured_post.select_one("p.text-color-neutral-700")
link_elem = featured_post.select_one("a.clickable_link")
if title_elem and date_elem and link_elem:
title = title_elem.text.strip()
date_str = date_elem.text.strip()
try:
date_obj = datetime.strptime(date_str, "%B %d, %Y")
except ValueError:
logger.warning(f"Could not parse featured post date: {date_str}")
date_obj = None
if date_obj:
description = description_elem.text.strip() if description_elem else ""
link = link_elem.get("href", "")
if link.startswith("/"):
link = f"https://dagster.io{link}"
if link:
blog_posts.append(
{
"link": link,
"title": title,
"date": date_obj.strftime("%Y-%m-%d"),
"description": description,
}
)
# Find all regular blog post cards
posts = soup.select("div.blog_card")
for post in posts:
title_elem = post.select_one("h3.blog_card_title")
if not title_elem:
continue
title = title_elem.text.strip()
date_elem = post.select_one("p.text-color-neutral-500.text-size-small")
if not date_elem:
continue
date_str = date_elem.text.strip()
try:
date_obj = datetime.strptime(date_str, "%B %d, %Y")
except ValueError:
logger.warning(f"Could not parse date: {date_str}")
continue
description_elem = post.select_one('p[fs-cmsfilter-field="description"]')
description = description_elem.text.strip() if description_elem else ""
link_elem = post.select_one("a.clickable_link")
if not link_elem or not link_elem.get("href"):
continue
link = link_elem["href"]
if link.startswith("/"):
link = f"https://dagster.io{link}"
blog_posts.append(
{
"link": link,
"title": title,
"date": date_obj.strftime("%Y-%m-%d"),
"description": description,
}
)
# Check for "Load more" / next page link
next_link = soup.select_one("a.w-pagination-next")
has_next_page = next_link is not None and next_link.get("href")
return blog_posts, has_next_page
def fetch_all_pages():
"""Follow pagination until no next link. Returns all posts."""
all_posts = []
page_num = 1
while True:
if page_num == 1:
url = BLOG_URL
else:
url = f"{BLOG_URL}?{PAGINATION_PARAM}={page_num}"
logger.info(f"Fetching page {page_num}: {url}")
html = fetch_page(url)
posts, has_next_page = parse_posts(html)
all_posts.extend(posts)
logger.info(f"Found {len(posts)} posts on page {page_num}")
if not has_next_page:
break
page_num += 1
# Dedupe by URL
seen = set()
unique_posts = []
for post in all_posts:
if post["link"] not in seen:
unique_posts.append(post)
seen.add(post["link"])
# Sort for correct feed order (newest first in output)
sorted_posts = sort_posts_for_feed(unique_posts, date_field="date")
logger.info(f"Total unique posts across all pages: {len(sorted_posts)}")
return sorted_posts
def generate_rss_feed(posts):
"""Generate RSS feed from blog posts."""
fg = FeedGenerator()
fg.title("Dagster Blog")
fg.description(
"Read the latest from the Dagster team: insights, tutorials, and updates on data engineering, orchestration, and building better pipelines."
)
fg.language("en")
fg.author({"name": "Dagster"})
fg.subtitle("Latest updates from Dagster")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for post in posts:
fe = fg.add_entry()
fe.title(post["title"])
fe.description(post["description"])
fe.link(href=post["link"])
fe.id(post["link"])
if post.get("date"):
try:
dt = post["date"] if isinstance(post["date"], datetime) else datetime.strptime(post["date"], "%Y-%m-%d")
if dt.tzinfo is None:
dt = dt.replace(tzinfo=pytz.UTC)
fe.published(dt)
except (ValueError, TypeError):
pass
logger.info(f"Generated RSS feed with {len(posts)} entries")
return fg
def main(full_reset=False):
"""Main function to generate RSS feed from blog URL.
Args:
full_reset: If True, fetch all pages. If False, only fetch page 1
and merge with cached posts.
"""
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_entries:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
posts = fetch_all_pages()
else:
logger.info("Running incremental update (page 1 only)")
html = fetch_page(BLOG_URL)
new_posts, _ = parse_posts(html)
logger.info(f"Found {len(new_posts)} posts on page 1")
posts = merge_entries(new_posts, cached_entries)
save_cache(FEED_NAME, posts)
feed = generate_rss_feed(posts)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate Dagster Blog RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch all pages)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/deeplearningai_the_batch.py
================================================
import argparse
import re
import pytz
import requests
from bs4 import BeautifulSoup
from dateutil import parser as date_parser
from feedgen.feed import FeedGenerator
from utils import (
deserialize_entries,
fetch_page,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "the_batch"
BLOG_URL = "https://www.deeplearning.ai/the-batch/"
MAX_PAGES = 30 # Safety limit for pagination
def parse_date(value: str | None, fallback_id: str = ""):
"""Parse date text/datetime strings into timezone-aware datetime."""
if not value:
return stable_fallback_date(fallback_id)
try:
dt = date_parser.parse(value)
if dt.tzinfo is None:
dt = dt.replace(tzinfo=pytz.UTC)
return dt
except (ValueError, TypeError) as exc:
logger.warning("Unable to parse date %r (%s); using fallback", value, exc)
return stable_fallback_date(fallback_id)
def clean_text(text: str | None) -> str | None:
if text is None:
return None
return " ".join(text.split())
def is_valid_article_link(href: str) -> bool:
"""Check if href is a valid article link (not a tag, category, or page link)."""
if not href:
return False
# Skip tag links, page links, and the main batch page
if "/tag/" in href or "/page/" in href:
return False
if href in ("/the-batch/", "/the-batch"):
return False
# Must be a the-batch article link
return href.startswith("/the-batch/") or "deeplearning.ai/the-batch/" in href
def normalize_link(href: str) -> str:
"""Convert relative URL to absolute URL."""
if href.startswith("/"):
return f"https://www.deeplearning.ai{href}"
return href
def extract_date_text(element) -> str | None:
"""Extract date text from element or its children.
Looks for:
- <time> elements with datetime attribute
- Tag links like <a href="/the-batch/tag/jan-16-2026/">Jan 16, 2026</a>
- Plain text matching date patterns
"""
if element is None:
return None
# Check for time element
time_el = element.find("time")
if time_el:
return time_el.get("datetime") or time_el.get_text(" ", strip=True)
# Check for date in tag links (new format)
for anchor in element.find_all("a", href=True):
href = anchor.get("href", "")
if "/tag/" in href:
text = anchor.get_text(" ", strip=True)
if text:
return text
# Date pattern for plain text (e.g., "Dec 26, 2025" or "January 16, 2026")
date_pattern = re.compile(
r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{1,2},?\s+\d{4}",
re.I,
)
for tag in element.find_all(["a", "div", "span", "p"]):
text = tag.get_text(" ", strip=True)
match = date_pattern.search(text or "")
if match:
return match.group(0)
# Check element's own text
text = element.get_text(" ", strip=True) if hasattr(element, "get_text") else str(element)
match = date_pattern.search(text or "")
if match:
return match.group(0)
return None
def extract_description(element) -> str | None:
"""Extract description/excerpt from element or its parent context."""
if element is None:
return None
# Prefer visible snippet if present (line clamp text)
summary = element.find(
lambda tag: (
tag.name in {"div", "p"}
and tag.get("class")
and any("line-clamp" in cls for cls in (tag.get("class") or []))
)
)
if summary:
return clean_text(summary.get_text(" ", strip=True))
# Check parent for description
parent = element.parent
if parent:
summary = parent.find(
lambda tag: (
tag.name in {"div", "p"}
and tag.get("class")
and any("line-clamp" in cls for cls in (tag.get("class") or []))
)
)
if summary:
return clean_text(summary.get_text(" ", strip=True))
first_para = parent.find("p")
if first_para:
text = clean_text(first_para.get_text(" ", strip=True))
# Skip if it looks like just a date
if text and len(text) > 20:
return text
return None
def parse_articles_from_html(html_content: str) -> list[dict]:
"""Parse articles from HTML content string.
The site uses a card-based layout without <article> tags. Articles are
identified by finding links to /the-batch/issue-* URLs and extracting
title/date from the link context.
"""
soup = BeautifulSoup(html_content, "lxml")
articles = []
seen_links = set()
# Find all links that point to article pages
for anchor in soup.find_all("a", href=True):
href = anchor["href"]
if not is_valid_article_link(href):
continue
link = normalize_link(href)
if link in seen_links:
continue
seen_links.add(link)
# Extract title from heading within the link or nearby
heading = anchor.find(["h1", "h2", "h3", "h4"])
if not heading:
# Try parent element for title
parent = anchor.parent
if parent:
heading = parent.find(["h1", "h2", "h3", "h4"])
if not heading:
# Use link text as fallback
text = clean_text(anchor.get_text(" ", strip=True))
if text and len(text) > 10:
title = text
else:
continue
else:
title = clean_text(heading.get_text(" ", strip=True))
if not title:
continue
# Extract date - look for tag links or date patterns near the link
date_text = extract_date_text(anchor)
if not date_text:
# Check parent/sibling elements
parent = anchor.parent
if parent:
date_text = extract_date_text(parent)
date = parse_date(date_text, fallback_id=link)
# Extract description from nearby paragraph or use title
description = extract_description(anchor) or title
articles.append(
{
"title": title,
"link": link,
"date": date,
"description": description,
}
)
logger.info(f"Parsed {len(articles)} articles from HTML")
return articles
def fetch_all_articles(max_pages: int = MAX_PAGES) -> list[dict]:
"""Fetch all articles by iterating through paginated pages."""
all_articles = []
seen_links = set()
for page_num in range(1, max_pages + 1):
# Construct page URL
if page_num == 1:
url = BLOG_URL
else:
url = f"{BLOG_URL}page/{page_num}/"
try:
html_content = fetch_page(url)
except requests.exceptions.HTTPError as e:
if e.response.status_code == 404:
logger.info(f"Page {page_num} not found (404), stopping pagination")
else:
logger.info(f"Error fetching page {page_num}: {e}")
break
except Exception as e:
logger.info(f"Error fetching page {page_num}, stopping pagination: {e}")
break
# Check for 404-like conditions (page not found)
if "Page not found" in html_content or "404" in html_content[:1000]:
logger.info(f"Page {page_num} not found, stopping pagination")
break
# Parse articles from current page
page_articles = parse_articles_from_html(html_content)
if not page_articles:
logger.info(f"No articles found on page {page_num}, stopping pagination")
break
# Deduplicate and add new articles
new_count = 0
for article in page_articles:
if article["link"] not in seen_links:
seen_links.add(article["link"])
all_articles.append(article)
new_count += 1
logger.info(f"Page {page_num}: Found {len(page_articles)} articles, {new_count} new")
if new_count == 0:
logger.info("No new articles found, stopping pagination")
break
logger.info(f"Total articles fetched: {len(all_articles)}")
return all_articles
def build_feed(articles: list[dict]) -> FeedGenerator:
fg = FeedGenerator()
fg.title("The Batch | DeepLearning.AI")
fg.description("Weekly AI news and insights from DeepLearning.AI's The Batch.")
fg.language("en")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
# Sort articles for correct feed order (newest first in output)
articles_sorted = sort_posts_for_feed(articles, date_field="date")
for article in articles_sorted:
entry = fg.add_entry()
entry.title(article["title"])
entry.link(href=article["link"])
entry.id(article["link"])
entry.published(article["date"])
entry.description(article["description"])
return fg
def main(full_reset=False):
"""Main function to generate RSS feed.
Args:
full_reset: If True, fetch all pages. If False, fetch only first 3 pages and merge with cache.
"""
cache = load_cache(FEED_NAME)
cached_articles = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_articles:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
articles = fetch_all_articles(max_pages=MAX_PAGES)
else:
logger.info("Running incremental update (3 pages only)")
new_articles = fetch_all_articles(max_pages=3)
logger.info(f"Found {len(new_articles)} articles from recent pages")
articles = merge_entries(new_articles, cached_articles)
if not articles:
logger.warning("No articles found")
return False
# Save to cache
save_cache(FEED_NAME, articles)
feed = build_feed(articles)
save_rss_feed(feed, FEED_NAME)
logger.info(f"Successfully generated RSS feed with {len(articles)} articles")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate DeepLearning.AI The Batch RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (fetch all pages)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/deprecate_feed.py
================================================
"""Inject a deprecation notice into a feed XML.
Used when a scraper is being retired (e.g., the site launched an official RSS feed).
The notice shows up as the newest entry in the feed, so subscribers see it in their
RSS reader rather than silently losing updates.
Usage:
uv run feed_generators/deprecate_feed.py \\
--feed=openai_research \\
--message="OpenAI now provides an official RSS feed." \\
--alternative="https://openai.com/blog/rss.xml"
After running, in the same PR, remove the generator script, the ``<name>:`` entry
from ``feeds.yaml``, the ``feeds_<name>`` Make target, and the README row. Only
``feeds/feed_<name>.xml`` (now carrying the tombstone notice) stays in place;
it is deleted automatically after ~90 days by the
``cleanup_deprecated_feeds.yml`` workflow.
"""
import argparse
from datetime import datetime
import pytz
from lxml import etree as ET
from utils import get_feeds_dir, setup_logging
logger = setup_logging()
DEPRECATION_GUID_PREFIX = "deprecation-notice-"
DEPRECATION_TITLE = "[NOTICE] This feed is no longer maintained"
# lxml.etree is used (not the stdlib xml.etree.ElementTree) because the stdlib
# parser drops unused namespace declarations and rewrites unregistered
# namespace prefixes to ns0/ns1/... on round-trip. That silently corrupts
# feedgen's <atom:link rel="self"> and xmlns:content declarations. lxml
# preserves the original xmlns bindings verbatim.
# RFC 822 day-of-week and month tokens. Python's strftime("%a"/"%b") honors the
# current system locale, which breaks feed readers on non-English CI runners.
# Build the pubDate explicitly to keep the round-trip locale-independent.
RFC822_WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
RFC822_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
def format_rfc822(dt: datetime) -> str:
"""Format a datetime as RFC 822 pubDate without relying on system locale."""
day = RFC822_WEEKDAYS[dt.weekday()]
month = RFC822_MONTHS[dt.month - 1]
return f"{day}, {dt.day:02d} {month} {dt.year} {dt.hour:02d}:{dt.minute:02d}:{dt.second:02d} +0000"
def deprecate_feed(feed_name: str, message: str, alternative_url: str | None = None) -> bool:
"""Inject a deprecation <item> into feeds/feed_<feed_name>.xml.
The entry uses a stable GUID (``deprecation-notice-<feed_name>``) so repeated
runs do not duplicate the notice. Returns True on success, False otherwise.
"""
feed_file = get_feeds_dir() / f"feed_{feed_name}.xml"
if not feed_file.exists():
logger.error(f"Feed file not found: {feed_file}")
return False
tree = ET.parse(feed_file)
root = tree.getroot()
channel = root.find("channel")
if channel is None:
logger.error("No <channel> element found in feed XML")
return False
guid_value = f"{DEPRECATION_GUID_PREFIX}{feed_name}"
for item in channel.findall("item"):
guid = item.find("guid")
if guid is not None and guid.text == guid_value:
logger.info(f"Deprecation notice already present in {feed_file}, skipping")
return True
body = message
if alternative_url:
body += f"\n\nRecommended alternative: {alternative_url}"
pub_date = format_rfc822(datetime.now(pytz.UTC))
notice = ET.Element("item")
ET.SubElement(notice, "title").text = DEPRECATION_TITLE
ET.SubElement(notice, "description").text = body
ET.SubElement(notice, "guid", isPermaLink="false").text = guid_value
ET.SubElement(notice, "pubDate").text = pub_date
if alternative_url:
ET.SubElement(notice, "link").text = alternative_url
first_item = channel.find("item")
if first_item is not None:
idx = list(channel).index(first_item)
channel.insert(idx, notice)
else:
channel.append(notice)
tree.write(str(feed_file), xml_declaration=True, encoding="UTF-8", pretty_print=False)
logger.info(f"Added deprecation notice to {feed_file}")
logger.info(
f"Next: remove the `{feed_name}:` entry from feeds.yaml, the feeds_{feed_name} Make "
"target, and any README row; leave the XML in place."
)
return True
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
parser.add_argument("--feed", required=True, help="Feed name (e.g., 'openai_research')")
parser.add_argument("--message", required=True, help="Notice body text")
parser.add_argument("--alternative", default=None, help="Optional alternative feed URL")
args = parser.parse_args()
success = deprecate_feed(args.feed, args.message, args.alternative)
raise SystemExit(0 if success else 1)
if __name__ == "__main__":
main()
================================================
FILE: feed_generators/google_ai_blog.py
================================================
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import fetch_page, save_rss_feed, setup_feed_links, setup_logging, sort_posts_for_feed
# TODO_IMPROVE: Add caching (Pattern 2) and "Load More" pagination support.
# Currently only fetches the first page of results. Should:
# 1. Add cache file (cache/google_ai_posts.json) with load_cache()/save_cache()
# 2. Implement pagination to fetch all pages (check for "Load more" or page params)
# 3. Support --full flag for full reset vs incremental updates
# See cursor_blog.py or dagster_blog.py for reference implementation.
logger = setup_logging()
FEED_NAME = "google_ai"
BLOG_URL = "https://developers.googleblog.com/search/?technology_categories=AI"
def fetch_blog_content(url=BLOG_URL):
"""Fetch the HTML content of the Google Developers Blog AI page."""
try:
logger.info(f"Fetching content from URL: {url}")
html = fetch_page(url)
logger.info("Content fetched successfully")
return html
except Exception as e:
logger.error(f"Error fetching content: {e}")
raise
def parse_date(date_str):
"""Parse date string like 'DEC. 19, 2025' to datetime object."""
try:
# Remove the period after the month abbreviation and normalize case
# e.g., "MARCH 23, 2026" -> "March 23, 2026", "DEC. 19, 2025" -> "Dec 19, 2025"
date_str = date_str.replace(".", "").strip().title()
# Try abbreviated month first, then full month name
for fmt in ("%b %d, %Y", "%B %d, %Y"):
try:
dt = datetime.strptime(date_str, fmt)
break
except ValueError:
continue
else:
raise ValueError(f"No matching date format for '{date_str}'")
# Make it timezone-aware (UTC)
return dt.replace(tzinfo=pytz.UTC)
except Exception as e:
logger.warning(f"Could not parse date '{date_str}': {e}")
return None
def parse_blog_posts(html_content):
"""Parse blog posts from the HTML content."""
soup = BeautifulSoup(html_content, "html.parser")
posts = []
# Find all search result items
search_results = soup.find_all("li", class_="search-result")
logger.info(f"Found {len(search_results)} blog posts")
for result in search_results:
try:
# Extract eyebrow (contains date and category)
eyebrow = result.find("p", class_="search-result__eyebrow")
if not eyebrow:
logger.warning("No eyebrow found, skipping post")
continue
eyebrow_text = eyebrow.get_text(strip=True)
# Split by ' / ' to get date and category
parts = eyebrow_text.split(" / ")
if len(parts) < 1:
logger.warning(f"Could not parse eyebrow: {eyebrow_text}")
continue
date_str = parts[0]
category = parts[1] if len(parts) > 1 else "Uncategorized"
# Extract title and link
title_elem = result.find("h3", class_="search-result__title")
if not title_elem:
logger.warning("No title found, skipping post")
continue
link_elem = title_elem.find("a")
if not link_elem:
logger.warning("No link found in title, skipping post")
continue
title = link_elem.get_text(strip=True)
relative_url = link_elem.get("href", "")
# Make absolute URL
if relative_url.startswith("/"):
link = f"https://developers.googleblog.com{relative_url}"
else:
link = relative_url
# Extract summary
summary_elem = result.find("p", class_="search-result__summary")
summary = summary_elem.get_text(strip=True) if summary_elem else ""
# Extract featured image
img_elem = result.find("img", class_="search-result__featured-img")
image_url = img_elem.get("src", "") if img_elem else ""
# Parse date
pub_date = parse_date(date_str)
post = {
"title": title,
"link": link,
"summary": summary,
"date": pub_date,
"category": category,
"image_url": image_url,
}
posts.append(post)
logger.debug(f"Parsed post: {title}")
except Exception as e:
logger.error(f"Error parsing post: {e}")
continue
logger.info(f"Successfully parsed {len(posts)} posts")
return posts
def create_rss_feed(posts):
"""Create an RSS feed from the blog posts."""
fg = FeedGenerator()
fg.title("Google Developers Blog - AI")
fg.description("Latest AI-related posts from Google Developers Blog")
setup_feed_links(fg, BLOG_URL, FEED_NAME)
fg.language("en")
# Sort posts for correct feed output (oldest first, feedgen reverses it)
sorted_posts = sort_posts_for_feed(posts, date_field="date")
# Add entries to feed
for post in sorted_posts:
fe = fg.add_entry()
fe.title(post["title"])
fe.link(href=post["link"])
# Build description with summary and image
description = ""
if post.get("image_url"):
description += f'<img src="{post["image_url"]}" alt="Featured image" /><br/><br/>'
description += post["summary"]
fe.description(description)
if post.get("date"):
fe.published(post["date"])
fe.updated(post["date"])
if post.get("category"):
fe.category(term=post["category"])
return fg
def main():
"""Main function to generate the RSS feed."""
try:
# Fetch blog content
html_content = fetch_blog_content()
# Parse blog posts
posts = parse_blog_posts(html_content)
if not posts:
logger.warning("No posts found to add to the feed")
return
# Create and save RSS feed
fg = create_rss_feed(posts)
save_rss_feed(fg, FEED_NAME)
logger.info("RSS feed generation completed successfully!")
except Exception as e:
logger.error(f"Error in main: {e}")
raise
if __name__ == "__main__":
main()
================================================
FILE: feed_generators/groq_blog.py
================================================
"""Generate RSS feed for the Groq Blog (https://groq.com/blog/).
Simple static HTML scraper. Cards are rendered server-side in <article class="card">
elements; no pagination or JavaScript. No cache needed.
"""
import argparse
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from utils import (
fetch_page,
save_rss_feed,
setup_feed_links,
setup_logging,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "groq"
BLOG_URL = "https://groq.com/blog/"
def parse_blog_html(html_content: str) -> list[dict]:
"""Extract articles from Groq's blog listing page."""
soup = BeautifulSoup(html_content, "html.parser")
articles = []
seen_links = set()
for card in soup.select("article.card"):
title_link = card.select_one("h2.card__title a")
if not title_link:
continue
href = title_link.get("href", "")
if not href or href.rstrip("/") == "/blog":
continue
link = f"https://groq.com{href}" if href.startswith("/") else href
if link in seen_links:
continue
seen_links.add(link)
title = title_link.get_text(strip=True)
if not title:
continue
date = None
time_elem = card.select_one("time.card__eyebrow")
if time_elem:
datetime_attr = time_elem.get("datetime")
if datetime_attr:
try:
date = datetime.fromisoformat(datetime_attr.replace("Z", "+00:00"))
if date.tzinfo is None:
date = date.replace(tzinfo=pytz.UTC)
except ValueError:
logger.warning(f"Could not parse datetime attribute: {datetime_attr}")
if not date:
date = stable_fallback_date(link)
articles.append(
{
"title": title,
"link": link,
"date": date,
"description": title,
}
)
logger.info(f"Parsed {len(articles)} articles")
return articles
def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
fg = FeedGenerator()
fg.title("Groq Blog")
fg.description("Latest news and updates from Groq")
fg.language("en")
fg.author({"name": "Groq"})
fg.subtitle("LPU inference, AI infrastructure, and developer updates")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for article in sort_posts_for_feed(articles, date_field="date"):
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
fe.id(article["link"])
if article.get("date"):
fe.published(article["date"])
logger.info(f"Generated RSS feed with {len(articles)} entries")
return fg
def main() -> bool:
logger.info(f"Fetching {BLOG_URL}")
html = fetch_page(BLOG_URL)
articles = parse_blog_html(html)
if not articles:
logger.warning("No articles found. Check the HTML structure.")
return False
feed = generate_rss_feed(articles)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate Groq Blog RSS feed")
# --full is accepted for orchestrator compatibility even though the generator has no cache.
parser.add_argument("--full", action="store_true", help="No-op (Groq has no cache)")
parser.parse_args()
main()
================================================
FILE: feed_generators/meta_ai_blog.py
================================================
"""Generate RSS feed for AI at Meta Blog (https://ai.meta.com/blog/).
React SPA with a "Load more" button. The page renders three distinct card
layouts (hero, Latest News grid, "More from AI at Meta" grid) that this
parser handles independently.
Closes upstream issue #61.
"""
import argparse
import contextlib
import re
import time
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from utils import (
deserialize_entries,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
setup_selenium_driver,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "meta_ai"
BLOG_URL = "https://ai.meta.com/blog/"
DATE_PATTERN = re.compile(
r"(January|February|March|April|May|June|July|August"
r"|September|October|November|December)\s+\d{1,2},\s+\d{4}"
)
# Meta AI's layout uses hashed CSS-module class names (_amto, _amcy, _amda, _amde,
# _amsu, ...). These rotate when Meta rebuilds the site, so selector breakage is
# the failure mode to expect. Mitigations: the parser walks three layouts
# independently and falls back from class-based selectors to aria-label and
# finally to separator-joined text. When a layout change lands, capture the new
# page with ``curl`` or Selenium and update the class constants below.
CATEGORIES = {
"featured",
"ml applications",
"open source",
"research",
"computer vision",
"hardware",
"natural language processing",
"generative ai",
}
def fetch_blog_content(url: str = BLOG_URL, max_clicks: int = 20) -> str:
"""Fetch the blog HTML after clicking "Load more" up to max_clicks times."""
driver = None
try:
logger.info(f"Fetching content from {url} (max_clicks={max_clicks})")
driver = setup_selenium_driver()
driver.get(url)
time.sleep(5)
try:
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href*="/blog/"]')))
logger.info("Blog articles loaded")
except Exception:
logger.warning("Could not confirm articles loaded, proceeding anyway")
clicks = 0
while clicks < max_clicks:
load_more = None
with contextlib.suppress(Exception):
candidate = driver.find_element(By.CSS_SELECTOR, "button._amto")
if candidate.is_displayed():
load_more = candidate
if not load_more:
with contextlib.suppress(Exception):
load_more = driver.find_element(By.XPATH, "//button[contains(text(), 'Load more')]")
if load_more and load_more.is_displayed():
logger.info(f"Clicking 'Load more' button (click {clicks + 1})")
driver.execute_script("arguments[0].click();", load_more)
clicks += 1
time.sleep(2)
else:
logger.info(f"No more 'Load more' button after {clicks} clicks")
break
return driver.page_source
finally:
if driver:
driver.quit()
def parse_date(date_text: str) -> datetime | None:
"""Parse 'Month DD, YYYY' into a tz-aware datetime."""
date_text = date_text.strip()
for fmt in ("%B %d, %Y", "%b %d, %Y"):
try:
return datetime.strptime(date_text, fmt).replace(tzinfo=pytz.UTC)
except ValueError:
continue
return None
def _extract_date_from_elements(elements, article_href: str) -> tuple[datetime | None, str]:
"""Walk elements looking for a date match (long or short month). Returns (date, matched_text)."""
for elem in elements:
text = elem.get_text(strip=True)
date_match = DATE_PATTERN.search(text)
if date_match:
parsed = parse_date(date_match.group())
if parsed:
return parsed, text
for elem in elements:
text = elem.get_text(strip=True)
parsed = parse_date(text)
if parsed:
return parsed, text
return None, ""
def _append_article(articles, seen, href, title, date, category, description):
"""Append an article to the list if href is unseen. Mutates both collections."""
if href in seen or href in ("/blog/", "/blog"):
return
seen.add(href)
if not date:
date = stable_fallback_date(href)
articles.append(
{
"title": title,
"link": href,
"date": date,
"category": category,
"description": description,
}
)
def _absolute_meta_url(href: str) -> str:
return f"https://ai.meta.com{href}" if href.startswith("/") else href
def extract_articles(soup: BeautifulSoup) -> list[dict]:
"""Extract articles from the three card layouts on the Meta AI blog."""
articles: list[dict] = []
seen: set[str] = set()
# Hero card (featured, div._amcy)
hero = soup.select_one("div._amcy")
if hero:
link = hero.find("a", href=True)
if link:
href = _absolute_meta_url(link.get("href", ""))
title_elem = hero.find("div", class_="_amd1")
title = title_elem.get_text(strip=True) if title_elem else ""
if not title:
aria = link.get("aria-label", "")
title = aria.removeprefix("Read ").strip() if aria.startswith("Read ") else ""
if title:
# The hero's date container class has rotated (was _amdj, then
# _amun, ...), so scan every <div> inside the hero with the
# DATE_PATTERN regex instead of pinning to a single class.
# Without this we fall through to stable_fallback_date(), which
# (relying on Python's randomized hash()) buries the newest
# post under a bogus pubDate.
date, _ = _extract_date_from_elements(hero.find_all("div"), href)
# Category: try the legacy explicit class, then the current
# "FEATURED"-style badge, then default. Empty strings are
# treated as missing so we don't emit empty <category/>.
category = "AI"
for cls in ("_amug", "_amd5"):
cat_elem = hero.find("div", class_=cls)
cat_text = cat_elem.get_text(strip=True) if cat_elem else ""
if cat_text:
category = cat_text.title() if cat_text.isupper() else cat_text
break
_append_article(articles, seen, href, title, date, category, title)
# Latest News grid (div._amda)
for card in soup.select("div._amda"):
link = card.find("a", href=True)
if not link:
continue
href = _absolute_meta_url(link.get("href", ""))
title_elem = card.find("div", class_="_amde")
title = title_elem.get_text(strip=True) if title_elem else ""
if not title:
aria = link.get("aria-label", "")
title = aria.removeprefix("Read ").strip() if aria.startswith("Read ") else ""
if not title:
continue
amdj_elems = card.select("div._amdj")
date, matched_date_text = _extract_date_from_elements(amdj_elems, href)
category = "AI"
for elem in amdj_elems:
text = elem.get_text(strip=True)
if text == matched_date_text:
continue
if text.lower() in CATEGORIES:
category = text
break
description = title
desc_elem = card.find("p", class_="text-secondary") or card.find("p", class_="_amt3")
if desc_elem:
description = desc_elem.get_text(strip=True)[:300]
_append_article(articles, seen, href, title, date, category, description)
# "More from AI at Meta" grid (div._amsu)
for card in soup.select("div._amsu"):
link = card.find("a", href=True)
if not link:
continue
href = _absolute_meta_url(link.get("href", ""))
title_elem = card.find("p", class_="_amt2")
title = title_elem.get_text(strip=True) if title_elem else ""
if not title:
continue
cat_elem = card.find("p", class_="_amt0")
category = cat_elem.get_text(strip=True) if cat_elem else "AI"
date_elem = card.find("p", class_="_amt4")
date, _ = _extract_date_from_elements([date_elem] if date_elem else [], href)
desc_elem = card.find("p", class_="_amt3")
description = desc_elem.get_text(strip=True)[:300] if desc_elem else title
_append_article(articles, seen, href, title, date, category, description)
logger.info(f"Parsed {len(articles)} articles")
return articles
def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
fg = FeedGenerator()
fg.title("AI at Meta Blog")
fg.description("Latest AI news and research from Meta")
fg.language("en")
fg.author({"name": "Meta AI"})
fg.subtitle("AI research, open source, and applications from Meta")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for article in sort_posts_for_feed(articles, date_field="date"):
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
fe.id(article["link"])
fe.category(term=article["category"])
if article.get("date"):
fe.published(article["date"])
logger.info(f"Generated RSS feed with {len(articles)} entries")
return fg
def main(full_reset: bool = False) -> bool:
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
if full_reset or not cached_entries:
mode = "full reset" if full_reset else "no cache exists"
logger.info(f"Running full fetch ({mode})")
html = fetch_blog_content(max_clicks=20)
else:
logger.info("Running incremental update (3 clicks only)")
html = fetch_blog_content(max_clicks=3)
soup = BeautifulSoup(html, "html.parser")
new_articles = extract_articles(soup)
if cached_entries and not full_reset:
articles = merge_entries(new_articles, cached_entries)
else:
articles = sort_posts_for_feed(new_articles, date_field="date")
if not articles:
logger.warning("No articles found. Check the HTML structure.")
return False
save_cache(FEED_NAME, articles)
feed = generate_rss_feed(articles)
save_rss_feed(feed, FEED_NAME)
logger.info("Done!")
return True
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate AI at Meta Blog RSS feed")
parser.add_argument("--full", action="store_true", help="Force full reset (click Load more up to 20 times)")
args = parser.parse_args()
main(full_reset=args.full)
================================================
FILE: feed_generators/mistral_blog.py
================================================
"""Generate RSS feed for Mistral AI News (https://mistral.ai/news).
Selenium-driven numbered pagination. Unlike "Load more" SPAs that append content,
Mistral replaces the article grid on each page navigation, so we parse after
each click before advancing to the next page.
"""
import argparse
import time
from datetime import datetime
import pytz
from bs4 import BeautifulSoup
from feedgen.feed import FeedGenerator
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from utils import (
deserialize_entries,
load_cache,
merge_entries,
save_cache,
save_rss_feed,
setup_feed_links,
setup_logging,
setup_selenium_driver,
sort_posts_for_feed,
stable_fallback_date,
)
logger = setup_logging()
FEED_NAME = "mistral"
BLOG_URL = "https://mistral.ai/news"
MAX_PAGES_FULL = 6
MAX_PAGES_INCREMENTAL = 1
def parse_page_articles(html: str) -> list[dict]:
"""Extract articles from a single page. Returns a deduped list per page.
Page 1 has a hero card with <h1>; grid cards use <h2>. Cards live inside
<a href="/news/..."> wrappers containing an <article> element.
"""
soup = BeautifulSoup(html, "html.parser")
articles = []
seen_links = set()
for card in soup.select('a[href^="/news/"]'):
href = card.get("href", "")
if not href or href.rstrip("/") == "/news":
continue
link = f"https://mistral.ai{href}"
if link in seen_links:
continue
article_elem = card.find("article")
if not article_elem:
continue
seen_links.add(link)
title_elem = article_elem.find("h1") or article_elem.find("h2")
if not title_elem:
continue
title = title_elem.get_text(strip=True)
if len(title) < 3:
continue
category = "News"
for span in article_elem.find_all("span"):
classes = " ".join(span.get("class", []))
if "rounded-full" in classes and "border" in classes:
cat_text = span.get_text(strip=True)
if cat_text:
category = cat_text
break
description = title
for p in article_elem.find_all("p"):
classes = " ".join(p.get("class", []))
if "opacity" in classes or "text-black/50" in classes:
desc_text = p.get_text(strip=True)
if desc_text:
description = desc_text[:300]
break
date = None
for div in article_elem.find_all("div"):
if "text-sm" not in " ".join(div.get("class", [])):
continue
date_text = div.get_text(strip=True)
for fmt in ("%b %d, %Y", "%B %d, %Y"):
try:
date = datetime.strptime(date_text, fmt).replace(tzinfo=pytz.UTC)
break
except ValueError:
continue
if date:
break
if not date:
logger.warning(f"Could not parse date for article: {title}")
date = stable_fallback_date(link)
articles.append(
{
"title": title,
"link": link,
"date": date,
"category": category,
"description": description,
}
)
logger.info(f"Parsed {len(articles)} articles from page")
return articles
def fetch_all_articles(max_pages: int = MAX_PAGES_FULL) -> list[dict]:
"""Fetch articles across numbered pages using Selenium."""
driver = None
all_articles: list[dict] = []
seen_links: set[str] = set()
try:
logger.info(f"Fetching articles from {BLOG_URL} (max_pages={max_pages})")
driver = setup_selenium_driver()
driver.get(BLOG_URL)
time.sleep(5)
try:
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href^="/news/"]')))
except Exception:
logger.warning("Could not confirm articles loaded, proceeding anyway")
for page_num in range(1, max_pages + 1):
logger.info(f"Extracting articles from page {page_num}")
page_articles = parse_page_articles(driver.page_source)
new_count = 0
for article in page_articles:
if article["link"] not in seen_links:
all_articles.append(article)
seen_links.add(article["link"])
new_count += 1
logger.info(f"Page {page_num}: {new_count} new articles (total: {len(all_articles)})")
if page_num >= max_pages:
break
# The next-page arrow is the last button in the pagination row.
next_btn = None
pagination_buttons = driver.find_elements(By.CSS_SELECTOR, "button.size-8, button[class*='size-8']")
if pagination_buttons:
candidate = pagination_buttons[-1]
try:
candidate.find_element(By.TAG_NAME, "svg")
next_btn = candidate
except Exception:
next_btn = None
if not next_btn or not next_btn.is_displayed():
logger.info(f"No next button found after page {page_num}")
break
logger.info(f"Clicking next button to page {page_num + 1}")
driver.execute_script("arguments[0].click();", next_btn)
time.sleep(3)
try:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href^="/news/"]')))
except Exception:
logger.warning("Timeout waiting for next page content")
logger.info(f"Total articles fetched: {len(all_articles)}")
return all_articles
finally:
if driver:
driver.quit()
def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
fg = FeedGenerator()
fg.title("Mistral AI News")
fg.description("Latest news and updates from Mistral AI")
fg.language("en")
fg.author({"name": "Mistral AI"})
fg.subtitle("News, research, and product updates from Mistral AI")
setup_feed_links(fg, blog_url=BLOG_URL, feed_name=FEED_NAME)
for article in sort_posts_for_feed(articles, date_field="date"):
fe = fg.add_entry()
fe.title(article["title"])
fe.description(article["description"])
fe.link(href=article["link"])
fe.id(article["link"])
fe.category(term=article["category"])
if article.get("date"):
fe.published(article["date"])
logger.info(f"Generated RSS feed with {len(articles)} entries")
return fg
def main(full_reset: bool = False) -> bool:
cache = load_cache(FEED_NAME)
cached_entries = deserialize_entries(cache.get("entries", []))
pages = MAX_PAGES_FULL if (full_reset or not cached_entries) else MAX_PAGES_INCREMENTAL
gitextract_o0vedygl/ ├── .agents/ │ └── skills/ │ ├── cmd-rss-feed-generator/ │ │ └── SKILL.md │ └── rss-feed-review/ │ └── SKILL.md ├── .editorconfig ├── .github/ │ ├── CODEOWNERS │ ├── FUNDING.yml │ ├── ISSUE_TEMPLATE/ │ │ └── request_rss_feed.md │ ├── PULL_REQUEST_TEMPLATE/ │ │ └── add_new_feed.md │ ├── dependabot.yml │ ├── pull_request_template.md │ └── workflows/ │ ├── cleanup_deprecated_feeds.yml │ ├── label_new_feed.yml │ ├── lint.yml │ ├── run_feeds.yml │ ├── run_selenium_feeds.yml │ ├── test_feed.yml │ └── validate_feeds.yml ├── .gitignore ├── .markdownlint.json ├── .pre-commit-config.yaml ├── AGENTS.md ├── CLAUDE.md ├── CONTRIBUTING.md ├── LICENSE ├── Makefile ├── README.md ├── cache/ │ └── .gitkeep ├── feed_generators/ │ ├── ai_first_podcast.py │ ├── anthropic_eng_blog.py │ ├── anthropic_news_blog.py │ ├── anthropic_red_blog.py │ ├── anthropic_research_blog.py │ ├── blogsurgeai_feed_generator.py │ ├── chanderramesh_blog.py │ ├── claude_blog.py │ ├── cleanup_deprecated_feeds.py │ ├── cohere_blog.py │ ├── cursor_blog.py │ ├── dagster_blog.py │ ├── deeplearningai_the_batch.py │ ├── deprecate_feed.py │ ├── google_ai_blog.py │ ├── groq_blog.py │ ├── meta_ai_blog.py │ ├── mistral_blog.py │ ├── models.py │ ├── ollama_blog.py │ ├── paulgraham_blog.py │ ├── perplexity_hub.py │ ├── pinecone_blog.py │ ├── run_all_feeds.py │ ├── thinkingmachines_blog.py │ ├── utils.py │ ├── validate_feeds.py │ ├── weaviate_blog.py │ ├── windsurf_blog.py │ ├── windsurf_changelog.py │ ├── windsurf_next_changelog.py │ └── xainews_blog.py ├── feeds/ │ ├── .gitkeep │ ├── feed_ai_first_podcast.xml │ ├── feed_anthropic_changelog_claude_code.xml │ ├── feed_anthropic_engineering.xml │ ├── feed_anthropic_news.xml │ ├── feed_anthropic_red.xml │ ├── feed_anthropic_research.xml │ ├── feed_blogsurgeai.xml │ ├── feed_chanderramesh.xml │ ├── feed_claude.xml │ ├── feed_cohere.xml │ ├── feed_cursor.xml │ ├── feed_dagster.xml │ ├── feed_google_ai.xml │ ├── feed_groq.xml │ ├── feed_hamel.xml │ ├── feed_meta_ai.xml │ ├── feed_mistral.xml │ ├── feed_ollama.xml │ ├── feed_openai_research.xml │ ├── feed_paulgraham.xml │ ├── feed_perplexity_hub.xml │ ├── feed_pinecone.xml │ ├── feed_the_batch.xml │ ├── feed_thinkingmachines.xml │ ├── feed_weaviate.xml │ ├── feed_windsurf_blog.xml │ ├── feed_windsurf_changelog.xml │ ├── feed_windsurf_next_changelog.xml │ └── feed_xainews.xml ├── feeds.yaml ├── makefiles/ │ ├── ci.mk │ ├── colors.mk │ ├── common.mk │ ├── dev.mk │ ├── env.mk │ └── feeds.mk └── pyproject.toml
SYMBOL INDEX (172 symbols across 32 files)
FILE: feed_generators/ai_first_podcast.py
function parse_listing_page (line 38) | def parse_listing_page(html_content: str) -> list[dict]:
function fetch_episode_details (line 77) | def fetch_episode_details(url: str) -> tuple[datetime | None, str]:
function enrich_episodes (line 122) | def enrich_episodes(stub_episodes: list[dict]) -> list[dict]:
function generate_rss_feed (line 144) | def generate_rss_feed(episodes: list[dict]) -> FeedGenerator:
function main (line 169) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/anthropic_eng_blog.py
function fetch_engineering_content (line 16) | def fetch_engineering_content(url=BLOG_URL):
function validate_article (line 25) | def validate_article(article):
function parse_engineering_html (line 34) | def parse_engineering_html(html_content):
function generate_rss_feed (line 116) | def generate_rss_feed(articles, feed_name=FEED_NAME):
function main (line 151) | def main(feed_name=FEED_NAME):
FILE: feed_generators/anthropic_news_blog.py
function fetch_news_content (line 32) | def fetch_news_content(url=BLOG_URL, max_clicks=20):
function extract_title (line 111) | def extract_title(card):
function extract_date (line 136) | def extract_date(card):
function extract_category (line 174) | def extract_category(card, date_elem_text=None):
function validate_article (line 218) | def validate_article(article):
function parse_news_html (line 235) | def parse_news_html(html_content):
function generate_rss_feed (line 311) | def generate_rss_feed(articles):
function get_existing_links_from_feed (line 346) | def get_existing_links_from_feed(feed_path):
function main (line 364) | def main(full_reset=False):
FILE: feed_generators/anthropic_red_blog.py
function fetch_red_content (line 15) | def fetch_red_content(url=BLOG_URL):
function parse_date (line 24) | def parse_date(date_text):
function fetch_article_date (line 44) | def fetch_article_date(article_url):
function parse_red_html (line 71) | def parse_red_html(html_content):
function generate_rss_feed (line 147) | def generate_rss_feed(articles, feed_name=FEED_NAME):
function main (line 185) | def main(feed_name=FEED_NAME):
FILE: feed_generators/anthropic_research_blog.py
function fetch_research_content_selenium (line 29) | def fetch_research_content_selenium(url=BLOG_URL):
function extract_title (line 56) | def extract_title(card):
function extract_date (line 88) | def extract_date(card):
function validate_article (line 133) | def validate_article(article):
function parse_research_html (line 141) | def parse_research_html(html_content):
function generate_rss_feed (line 220) | def generate_rss_feed(articles):
function main (line 260) | def main(full_reset=False):
FILE: feed_generators/blogsurgeai_feed_generator.py
function generate_blogsurgeai_feed (line 20) | def generate_blogsurgeai_feed():
FILE: feed_generators/chanderramesh_blog.py
function parse_date (line 15) | def parse_date(date_str):
function parse_writing_page (line 26) | def parse_writing_page(html_content, base_url="https://chanderramesh.com"):
function generate_rss_feed (line 81) | def generate_rss_feed(blog_posts):
function main (line 111) | def main():
FILE: feed_generators/claude_blog.py
function fetch_page (line 42) | def fetch_page(url):
function extract_pagination_ids (line 50) | def extract_pagination_ids(html_content):
function parse_date (line 57) | def parse_date(date_str):
function parse_posts (line 65) | def parse_posts(html_content):
function fetch_all_pages (line 147) | def fetch_all_pages():
function generate_rss_feed (line 199) | def generate_rss_feed(posts):
function main (line 237) | def main(full_reset=False):
FILE: feed_generators/cleanup_deprecated_feeds.py
function find_deprecation_notice (line 30) | def find_deprecation_notice(feed_file: Path) -> datetime | None:
function find_eligible_feeds (line 61) | def find_eligible_feeds(threshold_days: int) -> list[tuple[Path, int]]:
function main (line 76) | def main() -> int:
FILE: feed_generators/cohere_blog.py
function fetch_posts_page (line 39) | def fetch_posts_page(limit: int, page: int) -> dict:
function parse_api_posts (line 57) | def parse_api_posts(api_data: dict) -> list[dict]:
function fetch_all_posts (line 96) | def fetch_all_posts(max_posts: int = MAX_POSTS_FULL) -> list[dict]:
function generate_rss_feed (line 122) | def generate_rss_feed(posts: list[dict]) -> FeedGenerator:
function main (line 146) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/cursor_blog.py
function parse_posts (line 26) | def parse_posts(html):
function fetch_all_pages (line 80) | def fetch_all_pages():
function generate_rss_feed (line 110) | def generate_rss_feed(posts):
function main (line 142) | def main(full_reset=False):
FILE: feed_generators/dagster_blog.py
function parse_posts (line 28) | def parse_posts(html_content):
function fetch_all_pages (line 116) | def fetch_all_pages():
function generate_rss_feed (line 151) | def generate_rss_feed(posts):
function main (line 184) | def main(full_reset=False):
FILE: feed_generators/deeplearningai_the_batch.py
function parse_date (line 30) | def parse_date(value: str | None, fallback_id: str = ""):
function clean_text (line 44) | def clean_text(text: str | None) -> str | None:
function is_valid_article_link (line 50) | def is_valid_article_link(href: str) -> bool:
function normalize_link (line 63) | def normalize_link(href: str) -> str:
function extract_date_text (line 70) | def extract_date_text(element) -> str | None:
function extract_description (line 114) | def extract_description(element) -> str | None:
function parse_articles_from_html (line 153) | def parse_articles_from_html(html_content: str) -> list[dict]:
function fetch_all_articles (line 220) | def fetch_all_articles(max_pages: int = MAX_PAGES) -> list[dict]:
function build_feed (line 274) | def build_feed(articles: list[dict]) -> FeedGenerator:
function main (line 295) | def main(full_reset=False):
FILE: feed_generators/deprecate_feed.py
function format_rfc822 (line 46) | def format_rfc822(dt: datetime) -> str:
function deprecate_feed (line 53) | def deprecate_feed(feed_name: str, message: str, alternative_url: str | ...
function main (line 107) | def main() -> None:
FILE: feed_generators/google_ai_blog.py
function fetch_blog_content (line 22) | def fetch_blog_content(url=BLOG_URL):
function parse_date (line 34) | def parse_date(date_str):
function parse_blog_posts (line 56) | def parse_blog_posts(html_content):
function create_rss_feed (line 134) | def create_rss_feed(posts):
function main (line 169) | def main():
FILE: feed_generators/groq_blog.py
function parse_blog_html (line 29) | def parse_blog_html(html_content: str) -> list[dict]:
function generate_rss_feed (line 81) | def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
function main (line 103) | def main() -> bool:
FILE: feed_generators/meta_ai_blog.py
function fetch_blog_content (line 64) | def fetch_blog_content(url: str = BLOG_URL, max_clicks: int = 20) -> str:
function parse_date (line 105) | def parse_date(date_text: str) -> datetime | None:
function _extract_date_from_elements (line 116) | def _extract_date_from_elements(elements, article_href: str) -> tuple[da...
function _append_article (line 133) | def _append_article(articles, seen, href, title, date, category, descrip...
function _absolute_meta_url (line 151) | def _absolute_meta_url(href: str) -> str:
function extract_articles (line 155) | def extract_articles(soup: BeautifulSoup) -> list[dict]:
function generate_rss_feed (line 253) | def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
function main (line 276) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/mistral_blog.py
function parse_page_articles (line 40) | def parse_page_articles(html: str) -> list[dict]:
function fetch_all_articles (line 121) | def fetch_all_articles(max_pages: int = MAX_PAGES_FULL) -> list[dict]:
function generate_rss_feed (line 182) | def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
function main (line 205) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/models.py
class FeedType (line 11) | class FeedType(StrEnum):
class FeedConfig (line 16) | class FeedConfig(BaseModel):
method script_must_exist (line 26) | def script_must_exist(cls, v: str) -> str:
class GlobalSettings (line 34) | class GlobalSettings(BaseSettings):
function load_feed_registry (line 45) | def load_feed_registry() -> dict[str, FeedConfig]:
FILE: feed_generators/ollama_blog.py
function fetch_blog_content (line 15) | def fetch_blog_content(url=BLOG_URL):
function parse_blog_html (line 24) | def parse_blog_html(html_content):
function generate_rss_feed (line 73) | def generate_rss_feed(blog_posts, feed_name=FEED_NAME):
function main (line 104) | def main(blog_url=BLOG_URL, feed_name=FEED_NAME):
FILE: feed_generators/paulgraham_blog.py
function extract_date_from_text (line 16) | def extract_date_from_text(text):
function get_article_content (line 47) | def get_article_content(article_html):
function parse_essays_page (line 73) | def parse_essays_page(html_content, base_url="https://paulgraham.com", m...
function generate_rss_feed (line 135) | def generate_rss_feed(blog_posts):
function main (line 165) | def main():
FILE: feed_generators/perplexity_hub.py
function _force_english_locale (line 52) | def _force_english_locale(driver) -> None:
function fetch_hub_content (line 64) | def fetch_hub_content(url: str = BLOG_URL) -> str:
function _canonicalize_link (line 88) | def _canonicalize_link(href: str) -> str:
function _extract_title (line 101) | def _extract_title(card) -> str | None:
function _extract_date (line 110) | def _extract_date(card) -> datetime | None:
function _extract_category (line 125) | def _extract_category(card) -> str:
function validate_article (line 137) | def validate_article(article: dict) -> bool:
function parse_hub_html (line 150) | def parse_hub_html(html_content: str) -> list[dict]:
function generate_rss_feed (line 195) | def generate_rss_feed(articles: list[dict]) -> FeedGenerator:
function main (line 218) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/pinecone_blog.py
function fetch_blog_content (line 39) | def fetch_blog_content(max_clicks: int = MAX_CLICKS_FULL) -> str:
function _parse_short_date (line 75) | def _parse_short_date(text: str) -> datetime | None:
function parse_blog_html (line 84) | def parse_blog_html(html: str) -> list[dict]:
function generate_rss_feed (line 156) | def generate_rss_feed(posts: list[dict]) -> FeedGenerator:
function main (line 180) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/run_all_feeds.py
function run_feed (line 14) | def run_feed(feed_name: str, config: FeedConfig, full: bool = False) -> ...
function run_all_feeds (line 40) | def run_all_feeds(
FILE: feed_generators/thinkingmachines_blog.py
function parse_date (line 25) | def parse_date(date_text):
function extract_articles (line 58) | def extract_articles(soup):
function parse_html (line 127) | def parse_html(html_content):
function generate_rss_feed (line 137) | def generate_rss_feed(articles):
function main (line 168) | def main(html_file=None):
FILE: feed_generators/utils.py
function setup_logging (line 31) | def setup_logging(name: str | None = None) -> logging.Logger:
function get_project_root (line 56) | def get_project_root() -> Path:
function get_cache_dir (line 61) | def get_cache_dir() -> Path:
function get_feeds_dir (line 68) | def get_feeds_dir() -> Path:
function get_cache_file (line 75) | def get_cache_file(feed_name: str) -> Path:
function fetch_page (line 92) | def fetch_page(url: str, timeout: int = 30, headers: dict | None = None)...
function stable_fallback_date (line 115) | def stable_fallback_date(identifier: str) -> datetime:
function load_cache (line 132) | def load_cache(feed_name: str, entries_key: str = "entries") -> dict:
function save_cache (line 155) | def save_cache(feed_name: str, entries: list[dict], entries_key: str = "...
function deserialize_entries (line 181) | def deserialize_entries(entries: list[dict], date_field: str = "date") -...
function merge_entries (line 203) | def merge_entries(
function setup_feed_links (line 239) | def setup_feed_links(fg: FeedGenerator, blog_url: str, feed_name: str) -...
function sort_posts_for_feed (line 263) | def sort_posts_for_feed(posts: list[dict[str, Any]], date_field: str = "...
function save_rss_feed (line 285) | def save_rss_feed(fg: FeedGenerator, feed_name: str) -> Path:
function get_chrome_major_version (line 307) | def get_chrome_major_version() -> int | None:
function setup_selenium_driver (line 333) | def setup_selenium_driver():
FILE: feed_generators/validate_feeds.py
function validate_feed (line 13) | def validate_feed(feed_path):
function main (line 85) | def main():
FILE: feed_generators/weaviate_blog.py
function parse_posts (line 33) | def parse_posts(html_content: str) -> tuple[list[dict], bool]:
function fetch_all_pages (line 79) | def fetch_all_pages(max_pages: int = MAX_PAGES_FULL) -> list[dict]:
function generate_rss_feed (line 103) | def generate_rss_feed(posts: list[dict]) -> FeedGenerator:
function main (line 128) | def main(full_reset: bool = False) -> bool:
FILE: feed_generators/windsurf_blog.py
function fetch_blog_posts (line 15) | def fetch_blog_posts():
function parse_blog_posts (line 31) | def parse_blog_posts(api_response):
function generate_rss_feed (line 84) | def generate_rss_feed(blog_posts, feed_name=FEED_NAME):
function main (line 119) | def main(feed_name=FEED_NAME):
FILE: feed_generators/windsurf_changelog.py
function fetch_changelog_content (line 16) | def fetch_changelog_content(url=BLOG_URL):
function parse_date (line 25) | def parse_date(date_text):
function parse_changelog_html (line 48) | def parse_changelog_html(html_content):
function generate_rss_feed (line 132) | def generate_rss_feed(changelog_entries, feed_name=FEED_NAME):
function main (line 164) | def main(feed_name=FEED_NAME):
FILE: feed_generators/windsurf_next_changelog.py
function fetch_changelog_content (line 16) | def fetch_changelog_content(url=BLOG_URL):
function parse_date (line 25) | def parse_date(date_text):
function parse_changelog_html (line 48) | def parse_changelog_html(html_content):
function generate_rss_feed (line 132) | def generate_rss_feed(changelog_entries, feed_name=FEED_NAME):
function main (line 164) | def main(feed_name=FEED_NAME):
FILE: feed_generators/xainews_blog.py
function fetch_news_content (line 30) | def fetch_news_content(url=BLOG_URL):
function parse_date (line 61) | def parse_date(date_text):
function looks_like_date (line 100) | def looks_like_date(text):
function extract_articles (line 105) | def extract_articles(soup):
function parse_news_html (line 201) | def parse_news_html(html_content):
function generate_rss_feed (line 211) | def generate_rss_feed(articles):
function main (line 238) | def main(full_reset=False):
Condensed preview — 96 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,955K chars).
[
{
"path": ".agents/skills/cmd-rss-feed-generator/SKILL.md",
"chars": 13650,
"preview": "---\nname: cmd-rss-feed-generator\ndescription: Generate Python RSS feed scrapers from blog websites, integrated with hour"
},
{
"path": ".agents/skills/rss-feed-review/SKILL.md",
"chars": 3702,
"preview": "---\nname: cmd-rss-feed-review\ndescription: Review RSS feed generators and their XML output for broken selectors, missing"
},
{
"path": ".editorconfig",
"chars": 307,
"preview": "root = true\n\n[*]\ncharset = utf-8\nend_of_line = lf\ninsert_final_newline = true\ntrim_trailing_whitespace = true\n\n[*.py]\nin"
},
{
"path": ".github/CODEOWNERS",
"chars": 23,
"preview": "* @Olshansk @oborchers\n"
},
{
"path": ".github/FUNDING.yml",
"chars": 780,
"preview": "# These are supported funding model platforms\n\ngithub: olshansk\ngithub: oborchers\npatreon: # Replace with a single Patre"
},
{
"path": ".github/ISSUE_TEMPLATE/request_rss_feed.md",
"chars": 475,
"preview": "---\nname: Request a new RSS feed\nabout: Request an RSS feed for a new blog\ntitle: \"[RSS Feed Request] Blog Name\"\nlabels:"
},
{
"path": ".github/PULL_REQUEST_TEMPLATE/add_new_feed.md",
"chars": 554,
"preview": "---\nname: Add a new RSS feed\nabout: Contribute a new RSS feed to the repository\ntitle: \"[New RSS Feed] <Feed Name>\"\nlabe"
},
{
"path": ".github/dependabot.yml",
"chars": 328,
"preview": "version: 2\nupdates:\n - package-ecosystem: \"github-actions\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n "
},
{
"path": ".github/pull_request_template.md",
"chars": 279,
"preview": "## Summary\n\n<!-- What does this PR do and why? -->\n\n## Changes\n\n<!-- Key changes, bullet points preferred -->\n\n-\n\n## Tes"
},
{
"path": ".github/workflows/cleanup_deprecated_feeds.yml",
"chars": 1688,
"preview": "name: Cleanup Deprecated Feeds\n\n# Stage 2 of the feed retirement lifecycle (Stage 1 is human-driven in\n# deprecate_feed."
},
{
"path": ".github/workflows/label_new_feed.yml",
"chars": 824,
"preview": "name: Label New Feed PRs\n\non:\n pull_request_target:\n types: [opened, edited, reopened]\n\njobs:\n add-label:\n runs-"
},
{
"path": ".github/workflows/lint.yml",
"chars": 684,
"preview": "name: Lint\n\non:\n pull_request:\n paths:\n - \"**.py\"\n - \"pyproject.toml\"\n push:\n branches: [main]\n pat"
},
{
"path": ".github/workflows/run_feeds.yml",
"chars": 1418,
"preview": "name: Run Feeds\n\non:\n schedule:\n - cron: \"0 * * * *\"\n workflow_dispatch:\n\nconcurrency:\n group: request-feeds\n can"
},
{
"path": ".github/workflows/run_selenium_feeds.yml",
"chars": 1631,
"preview": "name: Run Selenium Feeds\n\non:\n schedule:\n - cron: \"30 * * * *\"\n workflow_dispatch:\n\nconcurrency:\n group: selenium-"
},
{
"path": ".github/workflows/test_feed.yml",
"chars": 739,
"preview": "name: Test Feed Generation\n\non:\n workflow_dispatch:\n\njobs:\n test-feed:\n runs-on: ubuntu-latest\n timeout-minutes:"
},
{
"path": ".github/workflows/validate_feeds.yml",
"chars": 796,
"preview": "name: Validate Feeds\n\non:\n workflow_run:\n workflows: [\"Run Feeds\", \"Run Selenium Feeds\"]\n types: [completed]\n wo"
},
{
"path": ".gitignore",
"chars": 4054,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": ".markdownlint.json",
"chars": 313,
"preview": "{\n \"MD033\": {\n \"allowed_elements\": [\n \"Tabs\",\n \"TabItem\",\n \"ReactPlayer\",\n \"details\",\n \"sum"
},
{
"path": ".pre-commit-config.yaml",
"chars": 380,
"preview": "repos:\n - repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v6.0.0\n hooks:\n - id: trailing-whitespa"
},
{
"path": "AGENTS.md",
"chars": 12544,
"preview": "# AGENTS.md <!-- omit in toc -->\n\nInstructions for Claude Code and contributors working on this repository.\n\n## Table of"
},
{
"path": "CLAUDE.md",
"chars": 374,
"preview": "# CLAUDE.md\n\n⚠️ This file is intentionally minimal.\n\n**Authoritative project instructions live in `AGENTS.md`.**\n\nYou mu"
},
{
"path": "CONTRIBUTING.md",
"chars": 2211,
"preview": "# Contributing\n\n## Dev Setup\n\n```bash\nuv sync --group dev\npre-commit install\n```\n\nRun `make help` to see all available t"
},
{
"path": "LICENSE",
"chars": 1073,
"preview": "MIT License\n\nCopyright (c) 2025 Daniel Olshansky\n\nPermission is hereby granted, free of charge, to any person obtaining "
},
{
"path": "Makefile",
"chars": 5631,
"preview": "#########################\n### Makefile (root) ###\n#########################\n\n.DEFAULT_GOAL := help\n\n# Patterns for cla"
},
{
"path": "README.md",
"chars": 12855,
"preview": "# RSS Feed Generator <!-- omit in toc -->\n\n> [!TIP]\n> This project is maintained by [@oborchers](https://github.com/obor"
},
{
"path": "cache/.gitkeep",
"chars": 0,
"preview": ""
},
{
"path": "feed_generators/ai_first_podcast.py",
"chars": 6959,
"preview": "\"\"\"Generate RSS feed for the AI FIRST Podcast (https://ai-first.ai/podcast).\n\nTwo-stage scraper: the listing page gives "
},
{
"path": "feed_generators/anthropic_eng_blog.py",
"chars": 6676,
"preview": "import re\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerato"
},
{
"path": "feed_generators/anthropic_news_blog.py",
"chars": 14126,
"preview": "import argparse\nimport contextlib\nimport xml.etree.ElementTree as ET\nfrom datetime import datetime\n\nimport pytz\nfrom bs4"
},
{
"path": "feed_generators/anthropic_red_blog.py",
"chars": 6862,
"preview": "from datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerator\n\nfrom ut"
},
{
"path": "feed_generators/anthropic_research_blog.py",
"chars": 9957,
"preview": "from datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerator\nfrom sel"
},
{
"path": "feed_generators/blogsurgeai_feed_generator.py",
"chars": 3934,
"preview": "#!/usr/bin/env python3\n\"\"\"\nRSS Feed Generator for Surge AI Blog\nScrapes https://www.surgehq.ai/blog and generates an RSS"
},
{
"path": "feed_generators/chanderramesh_blog.py",
"chars": 4351,
"preview": "from datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerator\n\nfrom ut"
},
{
"path": "feed_generators/claude_blog.py",
"chars": 8593,
"preview": "#!/usr/bin/env python3\n\"\"\"Generate RSS feed for Claude Blog (claude.com/blog).\"\"\"\n\nimport argparse\nimport html\nimport re"
},
{
"path": "feed_generators/cleanup_deprecated_feeds.py",
"chars": 3981,
"preview": "\"\"\"Delete RSS feed XML files whose deprecation notice is older than the threshold.\n\nA feed is considered \"retired\" once "
},
{
"path": "feed_generators/cohere_blog.py",
"chars": 5701,
"preview": "\"\"\"Generate RSS feed for the Cohere Blog (https://cohere.com/blog).\n\nThe Cohere blog is built on Ghost CMS. We fetch pos"
},
{
"path": "feed_generators/cursor_blog.py",
"chars": 5001,
"preview": "import argparse\nimport re\nfrom datetime import datetime\n\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGene"
},
{
"path": "feed_generators/dagster_blog.py",
"chars": 6909,
"preview": "import argparse\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGe"
},
{
"path": "feed_generators/deeplearningai_the_batch.py",
"chars": 10596,
"preview": "import argparse\nimport re\n\nimport pytz\nimport requests\nfrom bs4 import BeautifulSoup\nfrom dateutil import parser as date"
},
{
"path": "feed_generators/deprecate_feed.py",
"chars": 4758,
"preview": "\"\"\"Inject a deprecation notice into a feed XML.\n\nUsed when a scraper is being retired (e.g., the site launched an offici"
},
{
"path": "feed_generators/google_ai_blog.py",
"chars": 6399,
"preview": "from datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerator\n\nfrom ut"
},
{
"path": "feed_generators/groq_blog.py",
"chars": 3618,
"preview": "\"\"\"Generate RSS feed for the Groq Blog (https://groq.com/blog/).\n\nSimple static HTML scraper. Cards are rendered server-"
},
{
"path": "feed_generators/meta_ai_blog.py",
"chars": 11173,
"preview": "\"\"\"Generate RSS feed for AI at Meta Blog (https://ai.meta.com/blog/).\n\nReact SPA with a \"Load more\" button. The page ren"
},
{
"path": "feed_generators/mistral_blog.py",
"chars": 8057,
"preview": "\"\"\"Generate RSS feed for Mistral AI News (https://mistral.ai/news).\n\nSelenium-driven numbered pagination. Unlike \"Load m"
},
{
"path": "feed_generators/models.py",
"chars": 1679,
"preview": "\"\"\"Pydantic models for feed configuration and settings.\"\"\"\n\nfrom enum import StrEnum\nfrom pathlib import Path\n\nimport ya"
},
{
"path": "feed_generators/ollama_blog.py",
"chars": 3645,
"preview": "from datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerator\n\nfrom ut"
},
{
"path": "feed_generators/paulgraham_blog.py",
"chars": 5725,
"preview": "import re\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerato"
},
{
"path": "feed_generators/perplexity_hub.py",
"chars": 8177,
"preview": "\"\"\"Generate RSS feed for the Perplexity Hub (https://www.perplexity.ai/hub).\n\nThe hub is a Framer-built SPA that renders"
},
{
"path": "feed_generators/pinecone_blog.py",
"chars": 6549,
"preview": "\"\"\"Generate RSS feed for the Pinecone Blog (https://www.pinecone.io/blog/).\n\nSelenium \"Load More\" pagination. Two card l"
},
{
"path": "feed_generators/run_all_feeds.py",
"chars": 5020,
"preview": "import argparse\nimport logging\nimport os\nimport subprocess\nimport sys\n\nfrom models import FeedConfig, FeedType, load_fee"
},
{
"path": "feed_generators/thinkingmachines_blog.py",
"chars": 6726,
"preview": "import os\nimport sys\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import F"
},
{
"path": "feed_generators/utils.py",
"chars": 11802,
"preview": "\"\"\"Shared utilities for feed generators.\"\"\"\n\nimport json\nimport logging\nimport re\nimport subprocess\nfrom datetime import"
},
{
"path": "feed_generators/validate_feeds.py",
"chars": 3531,
"preview": "\"\"\"Validate all RSS feeds for empty content and stale items.\"\"\"\n\nimport sys\nimport xml.etree.ElementTree as ET\nfrom date"
},
{
"path": "feed_generators/weaviate_blog.py",
"chars": 4994,
"preview": "\"\"\"Generate RSS feed for the Weaviate Blog (https://weaviate.io/blog).\n\nDocusaurus-based blog with /page/N pagination. S"
},
{
"path": "feed_generators/windsurf_blog.py",
"chars": 4255,
"preview": "from datetime import datetime\n\nimport pytz\nimport requests\nfrom feedgen.feed import FeedGenerator\n\nfrom utils import sav"
},
{
"path": "feed_generators/windsurf_changelog.py",
"chars": 6407,
"preview": "import re\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerato"
},
{
"path": "feed_generators/windsurf_next_changelog.py",
"chars": 6470,
"preview": "import re\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGenerato"
},
{
"path": "feed_generators/xainews_blog.py",
"chars": 8909,
"preview": "import argparse\nfrom datetime import datetime\n\nimport pytz\nfrom bs4 import BeautifulSoup\nfrom feedgen.feed import FeedGe"
},
{
"path": "feeds/.gitkeep",
"chars": 0,
"preview": ""
},
{
"path": "feeds/feed_ai_first_podcast.xml",
"chars": 44191,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_anthropic_changelog_claude_code.xml",
"chars": 174441,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_anthropic_engineering.xml",
"chars": 14704,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_anthropic_news.xml",
"chars": 100753,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_anthropic_red.xml",
"chars": 12227,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_anthropic_research.xml",
"chars": 8394,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_blogsurgeai.xml",
"chars": 24540,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_chanderramesh.xml",
"chars": 4565,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_claude.xml",
"chars": 61948,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_cohere.xml",
"chars": 26732,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_cursor.xml",
"chars": 7138,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_dagster.xml",
"chars": 85503,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_google_ai.xml",
"chars": 11810,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_groq.xml",
"chars": 10926,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_hamel.xml",
"chars": 18297,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_meta_ai.xml",
"chars": 53069,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_mistral.xml",
"chars": 23753,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_ollama.xml",
"chars": 22728,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_openai_research.xml",
"chars": 4139,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_paulgraham.xml",
"chars": 181315,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_perplexity_hub.xml",
"chars": 54335,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_pinecone.xml",
"chars": 57754,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_the_batch.xml",
"chars": 225078,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_thinkingmachines.xml",
"chars": 2772,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_weaviate.xml",
"chars": 29663,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_windsurf_blog.xml",
"chars": 88108,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_windsurf_changelog.xml",
"chars": 121960,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_windsurf_next_changelog.xml",
"chars": 119708,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds/feed_xainews.xml",
"chars": 15294,
"preview": "<?xml version='1.0' encoding='UTF-8'?>\n<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:content=\"http://purl.org/rss/"
},
{
"path": "feeds.yaml",
"chars": 3344,
"preview": "# Feed Registry -- single source of truth for all feed generators.\n# run_all_feeds.py reads this file instead of scannin"
},
{
"path": "makefiles/ci.mk",
"chars": 2315,
"preview": "##########################\n### CI/CD Workflows ###\n##########################\n\n.PHONY: ci_test_workflow_local\nci_test"
},
{
"path": "makefiles/colors.mk",
"chars": 651,
"preview": "# Basic ANSI colors & print helpers\n\nGREEN := \\033[0;32m\nYELLOW := \\033[1;33m\nBLUE := \\033[0;34m\nCYAN := \\033[0;36m\nRED "
},
{
"path": "makefiles/common.mk",
"chars": 1316,
"preview": "# Strict shell + sane make defaults\n\nSHELL := /bin/bash\n.SHELLFLAGS := -eu -o pipefail -c\nMAKEFLAGS += --warn-undefined-"
},
{
"path": "makefiles/dev.mk",
"chars": 1557,
"preview": "##########################\n### Development Tools ###\n##########################\n\n.PHONY: dev_setup\ndev_setup: ## Instal"
},
{
"path": "makefiles/env.mk",
"chars": 585,
"preview": "##########################\n### Environment Setup ###\n##########################\n\n.PHONY: env_setup\nenv_setup: ## Create"
},
{
"path": "makefiles/feeds.mk",
"chars": 10137,
"preview": "##########################\n### RSS Feed Generation ##\n##########################\n\n.PHONY: feeds_generate_all\nfeeds_gener"
},
{
"path": "pyproject.toml",
"chars": 1088,
"preview": "[project]\nname = \"rss-feeds\"\nversion = \"0.1.0\"\ndescription = \"RSS feed generator for blogs that don't have one\"\nreadme ="
}
]
About this extraction
This page contains the full source code of the Olshansk/rss-feeds GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 96 files (1.8 MB), approximately 520.3k tokens, and a symbol index with 172 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.