Deploy in 30 seconds: say goodbye to endless scrolling and see only the news you truly care about
### **Smart Push Strategies**
**Three Push Modes**:
| Mode | Target Users | Push Feature |
|------|--------------|--------------|
| **Daily Summary** (daily) | Managers/Regular Users | Push all matched news of the day (includes previously pushed) |
| **Current Rankings** (current) | Content Creators | Push current ranking matches (continuously ranked news appear each time) |
| **Incremental Monitor** (incremental) | Traders/Investors | Push only new content, zero duplication |
> 💡 **Quick Selection Guide:**
> - Don't want duplicate news → Use `incremental`
> - Want complete ranking trends → Use `current`
> - Need daily summary reports → Use `daily`
>
> For detailed comparison and configuration, see [Configuration Guide - Push Mode Details](#3-push-mode-details)
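
The difference between the three modes can be sketched as a selection rule over matched titles. This is only an illustration under simplifying assumptions, not TrendRadar's actual implementation; `select_for_push` and its arguments are hypothetical names.

```python
from typing import Iterable, List, Set


def select_for_push(
    mode: str,
    matched_today: Iterable[str],
    on_board_now: Set[str],
    already_pushed: Set[str],
) -> List[str]:
    """Pick which matched titles go into the next push (hypothetical helper)."""
    matched = list(matched_today)
    if mode == "daily":
        # Everything matched today, including items pushed earlier.
        return matched
    if mode == "current":
        # Only items still on the ranking now; they repeat while they stay ranked.
        return [t for t in matched if t in on_board_now]
    if mode == "incremental":
        # Only items never pushed before, so nothing is repeated.
        return [t for t in matched if t not in already_pushed]
    raise ValueError(f"unknown push mode: {mode}")
```

With the same inputs, `daily` returns the largest set, `current` drops items that have left the ranking, and `incremental` drops anything already delivered.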
**Additional Features** (Optional):
| Feature | Description | Default |
|---------|-------------|---------|
| **Scheduling System** | Per-day-of-week scheduling: assign different time periods, push modes, and AI analysis strategies to each day (Mon–Sun). **Each period can independently set its filter method (keyword/AI) and interest focus**, enabling different content at different times. 5 built-in presets (always_on / morning_evening / office_hours / night_owl / custom), or define your own. Supports weekday vs weekend differentiation, cross-midnight periods, per-period once-only dedup, and overlap conflict detection (v6.0.0 + v6.5.0) | morning_evening |
| **Content Order Configuration** | Use `display.region_order` to adjust display order of all regions (hotlist, new items, RSS, standalone, AI analysis); use `display.regions` to toggle each region on/off (v5.2.0) | See config |
| **Display Mode Switch** | `keyword`=group by keyword, `platform`=group by platform (v4.6.0 new) | keyword |
> 💡 For detailed configuration, see [Configuration Guide - Report Configuration](#7-report-configuration) and [Configuration Guide - Scheduling System](#8-when-will-i-receive-pushes)
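
Cross-midnight periods and overlap detection both come down to simple time comparisons. The sketch below shows one way to express them; `in_period` and `periods_overlap` are hypothetical helpers, and the scheduler's real logic may differ.

```python
from datetime import time


def in_period(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls in [start, end); the period may cross midnight."""
    if start <= end:
        return start <= now < end
    # Cross-midnight period, e.g. 22:00-02:00: match late evening or early morning.
    return now >= start or now < end


def periods_overlap(a_start: time, a_end: time, b_start: time, b_end: time) -> bool:
    """Overlap check for two same-day periods that do not cross midnight."""
    return a_start < b_end and b_start < a_end
```

Treating the end of a period as exclusive means back-to-back periods (08:00-10:00 and 10:00-11:00) are not reported as overlapping.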
### **Precise Content Filtering**
Set personal keywords (e.g., AI, BYD, Education Policy) to receive only relevant trending news, filtering out noise.
> 💡 **Basic Configuration**: [Keyword Configuration - Basic Syntax](#keyword-basic-syntax)
>
> 💡 **Advanced Configuration**: [Keyword Configuration - Advanced Settings](#keyword-advanced-settings)
>
> 💡 You can also skip filtering and receive all trending news (leave frequency_words.txt empty)
### **AI Smart News Filtering** (v6.5.0 New)
Describe your interests in natural language and let AI automatically classify news — replacing traditional keyword matching
- **Natural Language Interests**: Write your focus areas in everyday language in `ai_interests.txt`, no keyword syntax to learn
- **Two-Stage Smart Processing**: AI first extracts structured tags from interest descriptions, then batch-classifies and scores news against those tags
- **Score Threshold Control**: Fine-tune push quality with `ai_filter.min_score` — only highly relevant news gets delivered
- **Auto Fallback**: Automatically falls back to keyword matching if AI filtering fails, ensuring uninterrupted push delivery
- **Smart Tag Updates**: When interests change, AI evaluates the change magnitude to decide incremental or full reclassification
- **Flexible Switching**: `filter.method` supports `keyword` (default) and `ai` modes, and the Timeline can override it per time period
- **Per-Period Personalization**: Different time periods can use different keyword files or AI interest descriptions. For example: mornings use a "tech keyword list" for quick filtering, evenings switch to "finance interests" for AI deep filtering
```yaml
# config.yaml quick enable example
filter:
  method: ai # keyword (default) | ai
ai_filter:
  min_score: 6 # Minimum push score threshold (1-10)
```
> 💡 AI filtering shares model config with AI analysis/translation — just configure `ai.api_key` once
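
The score threshold and the automatic fallback can be pictured with a small sketch. It assumes a scoring callable that stands in for the AI classification step; `filter_news` and `score_titles` are hypothetical names, not part of the project.

```python
from typing import Callable, Dict, List


def filter_news(
    titles: List[str],
    score_titles: Callable[[List[str]], Dict[str, int]],
    min_score: int,
    keywords: List[str],
) -> List[str]:
    """Keep titles scoring >= min_score; fall back to keyword matching on failure."""
    try:
        scores = score_titles(titles)  # stand-in for the AI classification call
    except Exception:
        # Fallback: plain case-insensitive keyword matching keeps pushes flowing.
        lowered = [k.lower() for k in keywords]
        return [t for t in titles if any(k in t.lower() for k in lowered)]
    return [t for t in titles if scores.get(t, 0) >= min_score]
```

Raising `min_score` narrows the push to the most relevant items, while the `except` branch shows why a failed AI call does not have to interrupt delivery.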
### **Trending Analysis**
Real-time tracking of news popularity changes helps you understand not just "what's trending" but "how trends evolve."
- **Timeline Tracking**: Records complete time span from first to last appearance
- **Popularity Changes**: Tracks ranking changes and appearance frequency across time periods
- **New Detection**: Real-time identification of emerging topics, marked with 🆕
- **Continuity Analysis**: Distinguishes between one-time hot topics and continuously developing news
- **Cross-Platform Comparison**: Same news across different platforms, showing media attention differences
> 💡 Push format reference: [Configuration Guide - Push Format Reference](#5-push-format-reference)
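
Timeline tracking and new-item detection can be sketched as a per-title history that is updated on every crawl. The `TitleHistory` class and `record` function are hypothetical and only illustrate the idea; the project's storage layout may look quite different.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TitleHistory:
    first_seen: str
    last_seen: str
    ranks: List[int] = field(default_factory=list)

    @property
    def count(self) -> int:
        """Number of crawls in which the title appeared."""
        return len(self.ranks)


def record(history: Dict[str, TitleHistory], title: str, rank: int, crawl_time: str) -> bool:
    """Update `history` with one sighting; return True if the title is new."""
    entry = history.get(title)
    if entry is None:
        history[title] = TitleHistory(crawl_time, crawl_time, [rank])
        return True
    entry.last_seen = crawl_time
    entry.ranks.append(rank)
    return False
```

A title whose `count` stays at 1 is a one-off, whereas a growing `ranks` list with a widening `first_seen` to `last_seen` span indicates a story that keeps developing.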
### **Personalized Trending Algorithm**
Instead of following each platform's own ranking algorithm, TrendRadar re-ranks trending items from all platforms using its own adjustable weights.
> 💡 Weight adjustment guide: [Configuration Guide - Advanced Configuration](#4-advanced-configuration---hotspot-weight-adjustment)
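
As a rough illustration of weight adjustment, the sketch below combines three component scores with a weighted sum. The default weights (0.6 / 0.3 / 0.1 for rank, frequency, and hotness) are the fallback values that appear in the project's `get_current_config` code; how each component score is actually computed is not shown in this document, so the inputs here are simply assumed to be normalized to the range 0 to 1.

```python
def weighted_score(
    rank_score: float,
    frequency_score: float,
    hotness_score: float,
    rank_weight: float = 0.6,
    frequency_weight: float = 0.3,
    hotness_weight: float = 0.1,
) -> float:
    """Combine three component scores (each assumed to be in [0, 1]) into one."""
    return (
        rank_weight * rank_score
        + frequency_weight * frequency_score
        + hotness_weight * hotness_score
    )
```

Shifting weight from `rank_weight` to `frequency_weight` would favor items that keep reappearing over items that briefly reach a high position.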
### **Multi-Channel Multi-Account Push**
Supports **WeWork** (+ WeChat push solution), **Feishu**, **DingTalk**, **Telegram**, **Email**, **ntfy**, **Bark**, **Slack**, **Generic Webhook** (connect to Discord, IFTTT, or any platform) — messages delivered directly to phone and email.
> 💡 For detailed configuration, see [Configuration Guide - Multi-Account Push Configuration](#10-multiple-account-push-configuration)
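
Each channel expects its own message format. The MCP server code shipped with the project notes, for example, that Slack messages are converted to mrkdwn (`**` to `*`, `~~` to `~`, Markdown links to Slack links). A minimal sketch of such a conversion, covering only these three rules and not taken from the project itself, might look like this:

```python
import re


def markdown_to_mrkdwn(text: str) -> str:
    """Convert a small subset of Markdown to Slack mrkdwn (illustrative only)."""
    # [text](url) -> <url|text>
    text = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"<\2|\1>", text)
    # **bold** -> *bold*
    text = re.sub(r"\*\*(.+?)\*\*", r"*\1*", text)
    # ~~strike~~ -> ~strike~
    text = re.sub(r"~~(.+?)~~", r"~\1~", text)
    return text
```

Links are converted first so that the later bold and strikethrough rules never touch the URL part.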
### **AI Multi-Language Translation** (v5.2.0 New)
Translate push content into any language, breaking language barriers — whether reading domestic trends or subscribing to international news via RSS, access everything in your native language
- **One-Click Translation**: Set `ai_translation.enabled: true` and target language in `config.yaml`
- **Multi-Language Support**: Supports English, Korean, Japanese, French, and any other language
- **Smart Batch Processing**: Automatically batches translations to reduce API calls and save costs
- **Custom Style**: Customize translation style and terminology via `ai_translation_prompt.txt`
- **Shared Model Config**: Shares the `ai` config section with AI analysis feature
```yaml
# config.yaml quick enable example
ai_translation:
  enabled: true
  language: "English" # Target translation language
```
> 💡 Translation shares model config with AI analysis — just configure `ai.api_key` once to use both features
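
Batching reduces API calls because each request carries several titles instead of one. The helper below is a generic chunking sketch; `batched` and the batch size are hypothetical, and the project's real batching rules are not described here.

```python
from typing import Iterator, List


def batched(titles: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` titles."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(titles), batch_size):
        yield titles[start:start + batch_size]
```

With 50 titles and a batch size of 10, this yields 5 chunks, so a translator that sends one request per chunk makes 5 calls instead of 50.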
**RSS Source References**: Here are some RSS feed collections for your reference
- [awesome-tech-rss](https://github.com/tuan3w/awesome-tech-rss) - Tech, startup, and programming blogs & media
- [awesome-rss-feeds](https://github.com/plenaryapp/awesome-rss-feeds) - Mainstream news media RSS from countries worldwide
> ⚠️ Some international media content may involve sensitive topics that AI models might refuse to translate. Please filter subscription sources based on your actual needs
### **Flexible Storage Architecture (v4.0.0 Major Update)**
**Multi-Backend Support**:
- **Remote Cloud Storage**: Default in the GitHub Actions environment; supports S3-compatible protocols (R2/OSS/COS, etc.) and stores data in the cloud, keeping the repository clean
- **Local SQLite**: Traditional SQLite database, stable and efficient (Docker/local deployment)
- **Auto Selection**: Auto-selects appropriate backend based on runtime environment
> 💡 For storage configuration details, see [Configuration Details - Storage Configuration](#11-storage-configuration-v400-new)
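
Automatic backend selection can be approximated by inspecting the runtime environment. The sketch below relies on the `GITHUB_ACTIONS` variable, which GitHub sets to `true` inside workflow runs; the function name and the exact decision rule are assumptions, and the project may consider additional settings.

```python
import os
from typing import Mapping, Optional


def choose_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "remote" on GitHub Actions, otherwise "local" (assumed rule)."""
    env = os.environ if env is None else env
    # GitHub Actions sets GITHUB_ACTIONS=true for every workflow run.
    if env.get("GITHUB_ACTIONS") == "true":
        return "remote"
    return "local"
```

Passing the environment as a parameter keeps the rule easy to test without touching the real process environment.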
### **Multi-Platform Deployment**
- **GitHub Actions**: Cloud automated operations (7-day check-in cycle + remote cloud storage)
- **Docker Deployment**: Supports multi-architecture containerized operation
- **Local Running**: Python environment direct execution
### **AI Analysis Push (v5.0.0 New)**
Use AI models to analyze push content in depth and automatically generate a trending-insights report.
- **Smart Analysis**: Automatically analyze trending topics, keyword popularity, cross-platform correlation, potential impact
- **Multi Provider**: Built on LiteLLM unified interface, supports 100+ AI providers (DeepSeek, OpenAI, Gemini, Anthropic, local Ollama, etc.), with automatic fallback model switching
- **Independent Analysis Mode**: AI analysis scope can differ from push content — push only new items (less noise), while AI analyzes the full day's news (complete trend picture)
- **Flexible Push**: Choose original content only, AI analysis only, or both
- **Custom Prompts**: Customize analysis perspective via `config/ai_analysis_prompt.txt`
> 💡 Detailed configuration tutorial: [Let AI help me analyze hot topics](#12-let-ai-help-me-analyze-hot-topics)
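
Fallback model switching follows a simple pattern: try the preferred model and move to the next one when a call fails. The sketch below shows that pattern with an injected `call_model` callable; it does not call LiteLLM, and the function and model names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def complete_with_fallback(
    models: Sequence[str],
    call_model: Callable[[str, str], str],
    prompt: str,
) -> str:
    """Try each model in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

Keeping the provider call behind a callable makes the retry order explicit and lets the fallback logic be tested without network access.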
### **Independent Display Section (v5.0.0 New)**
Provide complete trending display for specified platforms, unaffected by keyword filtering
- **Full Trending**: Specified platforms show complete trending list, for users who want to see full rankings
- **RSS Independent Display**: RSS source content can be fully displayed, not limited by keywords
- **AI Deep Analysis**: Independently enable AI trend analysis on full hotlists, without displaying in push
- **Flexible Configuration**: Support configuring display platforms, RSS sources, max count
> 💡 Detailed configuration tutorial: [Report Configuration - Independent Display](#7-report-configuration)
### **AI Smart Analysis (v3.0.0 New)**
AI conversational analysis system based on MCP (Model Context Protocol), enabling deep data mining with natural language.
> **💡 Usage Tip**: AI features require local news data support
> - Project includes test data for immediate feature experience
> - Recommend deploying the project yourself to get more real-time data
>
> See [AI Analysis](#-ai-analysis) for details
### **Web Deployment**
After running, the `index.html` generated in the root directory is the complete news report page.
> **Deployment**: Click **Use this template** to create your repository, then deploy to Cloudflare Pages or GitHub Pages.
>
> **💡 Tip**: Enable GitHub Pages for an online URL. Go to Settings → Pages to enable. [Preview Effect](https://sansan0.github.io/TrendRadar/)
>
> ⚠️ The GitHub Actions auto-storage feature has been discontinued (this approach caused excessive load on GitHub servers, affecting platform stability).
### **Reduce APP Dependencies**
Move from being captive to algorithmic recommendations to actively fetching the information you want.
**Target Users:** Investors, content creators, PR professionals, news-conscious general users
**Typical Scenarios:** Stock investment monitoring, brand sentiment tracking, industry trend watching, lifestyle news gathering
[Screenshots: Web Effect (Email Push) · Feishu Push Effect · AI Analysis Push Effect]
As shown above, each row is a configuration item:
- **Name**: Must use the fixed names listed in the expanded sections below (e.g., `WEWORK_WEBHOOK_URL`)
- **Secret (Value)**: Fill in the actual content obtained from the corresponding platform (e.g., Webhook URL, Token, etc.)
**Notes**:
- Uses the same Webhook address as WeWork bot
- Difference is message format: `text` for plain text, `markdown` for rich text (default)
- Plain text format will automatically remove all markdown syntax (bold, links, etc.)
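
Stripping Markdown for the plain-text format can be sketched with a few regular expressions. This is a rough illustration covering only links, bold, and headings; `strip_markdown` is a hypothetical name, and the way the project renders links in plain text is an assumption.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove a small subset of Markdown syntax (illustrative only)."""
    # [text](url) -> "text url"
    text = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"\1 \2", text)
    # **bold** or __bold__ -> bold
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Leading heading markers such as "# " or "## "
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    return text
```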
method=keyword uses frequency_words.txt;
method=ai uses ai_interests.txt plus the AI filter configuration. priority_sort_enabled takes effect only when method=ai.
config/ai_interests.txt; once a file name is filled in, it is looked up only under
config/custom/ai/.
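error_on_overlap raises an error as soon as time periods overlap; last_wins lets the period listed later in day_plans override the earlier one.
frequency_file is looked up under config/custom/keyword/,
interests_file is looked up under config/custom/ai/; leaving a field empty removes it and restores inheritance.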
    )
    - Email: HTML email (full web-page styling; supports # headings, ---, bold and italic)
    - ntfy: Markdown (tags are stripped automatically)
    - Bark: Markdown (automatically simplified to bold + links, suited to iOS push)
    - Slack: mrkdwn (automatically converts ** → *, ~~ → ~, [text](url) → <url|text>)
    - Generic Webhook: Markdown (supports custom templates)

    Tip: before sending, call get_channel_format_guide to get the detailed formatting
    strategy for the target channel, so the message renders as well as possible.

    Args:
        message: Message content in Markdown format (required)
        title: Message title, defaults to "TrendRadar 通知"
        channels: List of channels to send to; if omitted, sends to all configured channels
            Allowed values: feishu, dingtalk, wework, telegram, email, ntfy, bark, slack, generic_webhook

    Returns:
        JSON-formatted send result, including the send status of each channel

    Examples:
        - send_notification(message="**Test message**\\nThis is a test notification")
        - send_notification(message="Urgent notice", title="System alert", channels=["feishu", "dingtalk"])
    """
    tools = _get_tools()
    result = await asyncio.to_thread(
        tools['notification'].send_notification,
        message=message, title=title, channels=channels
    )
    return json.dumps(result, ensure_ascii=False, indent=2)


# ==================== Entry point ====================

def run_server(
    project_root: Optional[str] = None,
    transport: str = 'stdio',
    host: str = '0.0.0.0',
    port: int = 3333
):
    """
    Start the MCP server

    Args:
        project_root: Project root directory path
        transport: Transport mode, 'stdio' or 'http'
        host: Listen address in HTTP mode, default 0.0.0.0
        port: Listen port in HTTP mode, default 3333
    """
    # Initialize the tool instances
    _get_tools(project_root)

    # Print startup information
    print()
    print("=" * 60)
    print(" TrendRadar MCP Server - FastMCP 2.0")
    print("=" * 60)
    print(f" Transport mode: {transport.upper()}")

    if transport == 'stdio':
        print(" Protocol: MCP over stdio (standard input/output)")
        print(" Note: communicates with the MCP client over standard input/output")
    elif transport == 'http':
        print(" Protocol: MCP over HTTP (production)")
        print(f" Server listening on: {host}:{port}")

    if project_root:
        print(f" Project directory: {project_root}")
    else:
        print(" Project directory: current directory")

    print()
    print(" Registered tools:")
    print(" === Date parsing tools (recommended to call first) ===")
    print(" 0. resolve_date_range - Parse natural-language dates into a standard format")
    print()
    print(" === Basic data queries (P0 core) ===")
    print(" 1. get_latest_news - Get the latest news")
    print(" 2. get_news_by_date - Query news by date (supports natural language)")
    print(" 3. get_trending_topics - Get trending topics (supports automatic extraction)")
    print()
    print(" === RSS data queries ===")
    print(" 4. get_latest_rss - Get the latest RSS subscription data")
    print(" 5. search_rss - Search RSS data")
    print(" 6. get_rss_feeds_status - Get RSS feed status")
    print()
    print(" === Smart retrieval tools ===")
    print(" 7. search_news - Unified news search (keyword/fuzzy/entity)")
    print(" 8. find_related_news - Find related news (supports historical data)")
    print()
    print(" === Advanced data analysis ===")
    print(" 9. analyze_topic_trend - Unified topic trend analysis (popularity/lifecycle/viral/prediction)")
    print(" 10. analyze_data_insights - Unified data insight analysis (platform comparison/activity/keyword co-occurrence)")
    print(" 11. analyze_sentiment - Sentiment analysis")
    print(" 12. aggregate_news - Cross-platform news aggregation and deduplication")
    print(" 13. compare_periods - Period comparison (week-over-week/month-over-month)")
    print(" 14. generate_summary_report - Daily/weekly summary generation")
    print()
    print(" === Configuration and system management ===")
    print(" 15. get_current_config - Get the current system configuration")
    print(" 16. get_system_status - Get the system running status")
    print(" 17. check_version - Check for updates (compare local and remote versions)")
    print(" 18. trigger_crawl - Manually trigger a crawl task")
    print()
    print(" === Storage sync tools ===")
    print(" 19. sync_from_remote - Pull data from remote storage to local")
    print(" 20. get_storage_status - Get storage configuration and status")
    print(" 21. list_available_dates - List dates available locally/remotely")
    print()
    print(" === Article content reading ===")
    print(" 22. read_article - Read a single article (Markdown format)")
    print(" 23. read_articles_batch - Read multiple articles in batch (automatic rate limiting)")
    print()
    print(" === Notification tools ===")
    print(" 24. get_channel_format_guide - Get the channel formatting strategy guide (prompt)")
    print(" 25. get_notification_channels - Get the status of configured notification channels")
    print(" 26. send_notification - Send a message to notification channels (format adapted automatically)")
    print("=" * 60)
    print()

    # Run the server according to the transport mode
    if transport == 'stdio':
        mcp.run(transport='stdio')
    elif transport == 'http':
        # HTTP mode (recommended for production)
        mcp.run(
            transport='http',
            host=host,
            port=port,
            path='/mcp'  # HTTP endpoint path
        )
    else:
        raise ValueError(f"Unsupported transport mode: {transport}")


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(
        description='TrendRadar MCP Server - trending-news aggregation MCP tool server',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
For the detailed configuration tutorial, see: README-Cherry-Studio.md
        """
    )
    parser.add_argument(
        '--transport',
        choices=['stdio', 'http'],
        default='stdio',
        help='Transport mode: stdio (default) or http (production)'
    )
    parser.add_argument(
        '--host',
        default='0.0.0.0',
        help='Listen address in HTTP mode, default 0.0.0.0'
    )
    parser.add_argument(
        '--port',
        type=int,
        default=3333,
        help='Listen port in HTTP mode, default 3333'
    )
    parser.add_argument(
        '--project-root',
        help='Project root directory path'
    )

    args = parser.parse_args()

    run_server(
        project_root=args.project_root,
        transport=args.transport,
        host=args.host,
        port=args.port
    )


================================================
FILE: mcp_server/services/__init__.py
================================================
"""
Service layer module

Provides core services such as data access, caching, and parsing.
"""


================================================
FILE: mcp_server/services/cache_service.py
================================================
"""
Cache service

Implements a TTL cache to speed up data access.
"""

import hashlib
import json
import time
from typing import Any, Optional
from threading import Lock


def make_cache_key(namespace: str, **params) -> str:
    """
    Build a structured cache key

    Sorting and hashing the parameters ensures that the same parameter
    combination always produces the same key.

    Args:
        namespace: Cache namespace, e.g. "latest_news", "trending_topics"
        **params: Cache parameters

    Returns:
        Formatted cache key, e.g. "latest_news:a1b2c3d4"

    Examples:
        >>> make_cache_key("latest_news", platforms=["zhihu"], limit=50)
        'latest_news:8f14e45f'
        >>> make_cache_key("search", query="AI", mode="keyword")
        'search:3c6e0b8a'
    """
    if not params:
        return namespace

    # Normalize the parameters
    normalized_params = {}
    for k, v in params.items():
        if v is None:
            continue  # Skip None values
        elif isinstance(v, (list, tuple)):
            # Sort lists of strings, then serialize to a string
            normalized_params[k] = json.dumps(
                sorted(v) if all(isinstance(i, str) for i in v) else list(v),
                ensure_ascii=False
            )
        elif isinstance(v, dict):
            # Serialize dicts with sorted keys
            normalized_params[k] = json.dumps(v, sort_keys=True, ensure_ascii=False)
        else:
            normalized_params[k] = str(v)

    # Sort the parameters and build the hash
    sorted_params = sorted(normalized_params.items())
    param_str = "&".join(f"{k}={v}" for k, v in sorted_params)

    # Use MD5 to produce a short hash (first 8 hex characters)
    hash_value = hashlib.md5(param_str.encode('utf-8')).hexdigest()[:8]

    return f"{namespace}:{hash_value}"


class CacheService:
    """Cache service class"""

    def __init__(self):
        """Initialize the cache service"""
        self._cache = {}
        self._timestamps = {}
        self._lock = Lock()

    def get(self, key: str, ttl: int = 900) -> Optional[Any]:
        """
        Get cached data

        Args:
            key: Cache key
            ttl: Time to live in seconds, default 15 minutes

        Returns:
            The cached value, or None if it does not exist or has expired
        """
        with self._lock:
            if key in self._cache:
                # Check whether the entry has expired
                if time.time() - self._timestamps[key] < ttl:
                    return self._cache[key]
                else:
                    # Expired: delete the entry
                    del self._cache[key]
                    del self._timestamps[key]
            return None

    def set(self, key: str, value: Any) -> None:
        """
        Set cached data

        Args:
            key: Cache key
            value: Cache value
        """
        with self._lock:
            self._cache[key] = value
            self._timestamps[key] = time.time()

    def delete(self, key: str) -> bool:
        """
        Delete a cache entry

        Args:
            key: Cache key

        Returns:
            Whether the entry was deleted
        """
        with self._lock:
            if key in self._cache:
                del self._cache[key]
                del self._timestamps[key]
                return True
            return False

    def clear(self) -> None:
        """Clear all cache entries"""
        with self._lock:
            self._cache.clear()
            self._timestamps.clear()

    def cleanup_expired(self, ttl: int = 900) -> int:
        """
        Remove expired cache entries

        Args:
            ttl: Time to live in seconds

        Returns:
            Number of entries removed
        """
        with self._lock:
            current_time = time.time()
            expired_keys = [
                key for key, timestamp in self._timestamps.items()
                if current_time - timestamp >= ttl
            ]

            for key in expired_keys:
                del self._cache[key]
                del self._timestamps[key]

            return len(expired_keys)

    def get_stats(self) -> dict:
        """
        Get cache statistics

        Returns:
            Statistics dict
        """
        with self._lock:
            return {
                "total_entries": len(self._cache),
                "oldest_entry_age": (
                    time.time() - min(self._timestamps.values())
                    if self._timestamps else 0
                ),
                "newest_entry_age": (
                    time.time() - max(self._timestamps.values())
                    if self._timestamps else 0
                )
            }


# Global cache instance
_global_cache = None


def get_cache() -> CacheService:
    """
    Get the global cache instance

    Returns:
        The global cache service instance
    """
    global _global_cache
    if _global_cache is None:
        _global_cache = CacheService()
    return _global_cache


================================================
FILE: mcp_server/services/data_service.py
================================================
"""
Data access service

Provides a unified data query interface and encapsulates data access logic.
"""

import re
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

from .cache_service import get_cache
from .parser_service import ParserService
from ..utils.errors import DataNotFoundError


class DataService:
    """Data access service class"""

    # Chinese stop-word list (used in auto_extract mode)
    STOPWORDS = {
        '的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一',
        '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有',
        '看', '好', '自己', '这', '那', '来', '被', '与', '为', '对', '将', '从',
        '以', '及', '等', '但', '或', '而', '于', '中', '由', '可', '可以', '已',
        '已经', '还', '更', '最', '再', '因为', '所以', '如果', '虽然', '然而',
        '什么', '怎么', '如何', '哪', '哪些', '多少', '几', '这个', '那个',
        '他', '她', '它', '他们', '她们', '我们', '你们', '大家', '自己',
        '这样', '那样', '怎样', '这么', '那么', '多么', '非常', '特别',
        '应该', '可能', '能够', '需要', '必须', '一定', '肯定', '确实',
        '正在', '已经', '曾经', '将要', '即将', '刚刚', '马上', '立刻',
        '回应', '发布', '表示', '称', '曝', '官方', '最新', '重磅', '突发',
        '热搜', '刷屏', '引发', '关注', '网友', '评论', '转发', '点赞'
    }

    def __init__(self, project_root: Optional[str] = None):
        """
        Initialize the data service

        Args:
            project_root: Project root directory
        """
        self.parser = ParserService(project_root)
        self.cache = get_cache()

    def get_latest_news(
        self,
        platforms: Optional[List[str]] = None,
        limit: int = 50,
        include_url: bool = False
    ) -> List[Dict]:
        """
        Get the news from the most recent crawl

        Args:
            platforms: List of platform IDs; None means all platforms
            limit: Maximum number of items to return
            include_url: Whether to include URLs, default False (saves tokens)

        Returns:
            News list

        Raises:
            DataNotFoundError: Data does not exist
        """
        # Try the cache first
        cache_key = f"latest_news:{','.join(platforms or [])}:{limit}:{include_url}"
        cached = self.cache.get(cache_key, ttl=900)  # 15-minute cache
        if cached:
            return cached

        # Read today's data
        all_titles, id_to_name, timestamps = self.parser.read_all_titles_for_date(
            date=None,
            platform_ids=platforms
        )

        # Get the time of the latest file
        if timestamps:
            latest_timestamp = max(timestamps.values())
            fetch_time = datetime.fromtimestamp(latest_timestamp)
        else:
            fetch_time = datetime.now()

        # Convert to a news list
        news_list = []
        for platform_id, titles in all_titles.items():
            platform_name = id_to_name.get(platform_id, platform_id)

            for title, info in titles.items():
                # Take the first rank
                rank = info["ranks"][0] if info["ranks"] else 0

                news_item = {
                    "title": title,
                    "platform": platform_id,
                    "platform_name": platform_name,
                    "rank": rank,
                    "timestamp": fetch_time.strftime("%Y-%m-%d %H:%M:%S")
                }

                # Conditionally add the URL fields
                if include_url:
                    news_item["url"] = info.get("url", "")
                    news_item["mobileUrl"] = info.get("mobileUrl", "")

                news_list.append(news_item)

        # Sort by rank
        news_list.sort(key=lambda x: x["rank"])

        # Limit the number of results
        result = news_list[:limit]

        # Cache the result
        self.cache.set(cache_key, result)

        return result

    def get_news_by_date(
        self,
        target_date: datetime,
        platforms: Optional[List[str]] = None,
        limit: int = 50,
        include_url: bool = False
    ) -> List[Dict]:
        """
        Get news for a given date

        Args:
            target_date: Target date
            platforms: List of platform IDs; None means all platforms
            limit: Maximum number of items to return
            include_url: Whether to include URLs, default False (saves tokens)

        Returns:
            News list

        Raises:
            DataNotFoundError: Data does not exist

        Examples:
            >>> service = DataService()
            >>> news = service.get_news_by_date(
            ...     target_date=datetime(2025, 10, 10),
            ...     platforms=['zhihu'],
            ...     limit=20
            ... )
        """
        # Try the cache first
        date_str = target_date.strftime("%Y-%m-%d")
        cache_key = f"news_by_date:{date_str}:{','.join(platforms or [])}:{limit}:{include_url}"
        cached = self.cache.get(cache_key, ttl=900)  # 15-minute cache
        if cached:
            return cached

        # Read the data for the given date
        all_titles, id_to_name, timestamps = self.parser.read_all_titles_for_date(
            date=target_date,
            platform_ids=platforms
        )

        # Convert to a news list
        news_list = []
        for platform_id, titles in all_titles.items():
            platform_name = id_to_name.get(platform_id, platform_id)

            for title, info in titles.items():
                # Compute the average rank
                avg_rank = sum(info["ranks"]) / len(info["ranks"]) if info["ranks"] else 0

                news_item = {
                    "title": title,
                    "platform": platform_id,
                    "platform_name": platform_name,
                    "rank": info["ranks"][0] if info["ranks"] else 0,
                    "avg_rank": round(avg_rank, 2),
                    "count": len(info["ranks"]),
                    "date": date_str
                }

                # Conditionally add the URL fields
                if include_url:
                    news_item["url"] = info.get("url", "")
                    news_item["mobileUrl"] = info.get("mobileUrl", "")

                news_list.append(news_item)

        # Sort by rank
        news_list.sort(key=lambda x: x["rank"])

        # Limit the number of results
        result = news_list[:limit]

        # Cache the result (historical data is cached longer)
        self.cache.set(cache_key, result)

        return result

    def search_news_by_keyword(
        self,
        keyword: str,
        date_range: Optional[Tuple[datetime, datetime]] = None,
        platforms: Optional[List[str]] = None,
        limit: Optional[int] = None
    ) -> Dict:
        """
        Search news by keyword

        Args:
            keyword: Search keyword
            date_range: Date range (start_date, end_date)
            platforms: Platform filter list
            limit: Maximum number of items to return (optional)

        Returns:
            Search result dict

        Raises:
            DataNotFoundError: Data does not exist
        """
        # Determine the date range to search
        if date_range:
            start_date, end_date = date_range
        else:
            # Search today by default
            start_date = end_date = datetime.now()

        # Collect all matching news
        results = []
        platform_distribution = Counter()

        # Iterate over the date range
        current_date = start_date
        while current_date <= end_date:
            try:
                all_titles, id_to_name, _ = self.parser.read_all_titles_for_date(
                    date=current_date,
                    platform_ids=platforms
                )

                # Find titles that contain the keyword
                for platform_id, titles in all_titles.items():
                    platform_name = id_to_name.get(platform_id, platform_id)

                    for title, info in titles.items():
                        if keyword.lower() in title.lower():
                            # Compute the average rank
                            avg_rank = sum(info["ranks"]) / len(info["ranks"]) if info["ranks"] else 0

                            results.append({
                                "title": title,
                                "platform": platform_id,
                                "platform_name": platform_name,
                                "ranks": info["ranks"],
                                "count": len(info["ranks"]),
                                "avg_rank": round(avg_rank, 2),
                                "url": info.get("url", ""),
                                "mobileUrl": info.get("mobileUrl", ""),
                                "date": current_date.strftime("%Y-%m-%d")
                            })

                            platform_distribution[platform_id] += 1

            except DataNotFoundError:
                # No data for this date; continue with the next day
                pass

            # Next day
            current_date += timedelta(days=1)

        if not results:
            raise DataNotFoundError(
                f"No news found containing the keyword '{keyword}'",
                suggestion="Try another keyword or widen the date range"
            )

        # Compute statistics
        total_ranks = []
        for item in results:
            total_ranks.extend(item["ranks"])

        avg_rank = sum(total_ranks) / len(total_ranks) if total_ranks else 0

        # Limit the number of results (if specified)
        total_found = len(results)
        if limit is not None and limit > 0:
            results = results[:limit]

        return {
            "results": results,
            "total": len(results),
            "total_found": total_found,
            "statistics": {
                "platform_distribution": dict(platform_distribution),
                "avg_rank": round(avg_rank, 2),
                "keyword": keyword
            }
        }

    def _extract_words_from_title(self, title: str, min_length: int = 2) -> List[str]:
        """
        Extract meaningful words from a title (used in auto_extract mode)

        Args:
            title: News title
            min_length: Minimum word length

        Returns:
            List of keywords
        """
        # Remove URLs and special characters
        title = re.sub(r'http[s]?://\S+', '', title)
        title = re.sub(r'\[.*?\]', '', title)  # Remove bracketed content
        title = re.sub(r'[【】《》「」『』“”‘’・·•]', '', title)  # Remove Chinese punctuation

        # Tokenize with a regular expression (Chinese and English):
        # match runs of Chinese characters or English words
        words = re.findall(r'[\u4e00-\u9fff]{2,}|[a-zA-Z]{2,}[a-zA-Z0-9]*', title)

        # Filter out stop words and short words
        keywords = [
            word for word in words
            if word and len(word) >= min_length
            and word.lower() not in self.STOPWORDS
            and word not in self.STOPWORDS
        ]

        return keywords

    def get_trending_topics(
        self,
        top_n: int = 10,
        mode: str = "current",
        extract_mode: str = "keywords"
    ) -> Dict:
        """
        Get trending topic statistics

        Args:
            top_n: Return the top N topics
            mode: Time mode
                - "daily": statistics over the day's accumulated data
                - "current": statistics over the latest batch (default)
            extract_mode: Extraction mode
                - "keywords": count preset watch words (based on config/frequency_words.txt)
                - "auto_extract": automatically extract high-frequency words from news titles

        Returns:
            Topic frequency statistics dict

        Raises:
            DataNotFoundError: Data does not exist
        """
        # Try the cache first
        cache_key = f"trending_topics:{top_n}:{mode}:{extract_mode}"
        cached = self.cache.get(cache_key, ttl=900)  # 15-minute cache
        if cached:
            return cached

        # Read today's data
        all_titles, id_to_name, timestamps = self.parser.read_all_titles_for_date()

        if not all_titles:
            raise DataNotFoundError(
                "No news data found for today",
                suggestion="Make sure the crawler has run and produced data"
            )

        # Choose the titles to process according to mode
        if mode == "daily":
            titles_to_process = all_titles
        elif mode == "current":
            titles_to_process = all_titles  # Simplified implementation
        else:
            raise ValueError(f"Unsupported mode: {mode}. Supported modes: daily, current")

        # Count word frequency
        word_frequency = Counter()
        keyword_to_news = {}

        # Preload keyword data (avoids repeated calls inside the loop)
        if extract_mode == "keywords":
            from trendradar.core.frequency import _word_matches
            word_groups = self.parser.parse_frequency_words()

        # Iterate over the titles to process
        for platform_id, titles in titles_to_process.items():
            for title in titles.keys():
                if extract_mode == "keywords":
                    # Count against preset keywords (supports regex matching)
                    title_lower = title.lower()
                    for group in word_groups:
                        all_words = group.get("required", []) + group.get("normal", [])
                        # Check whether any word in the group matches
                        matched = any(_word_matches(word_config, title_lower) for word_config in all_words)

                        if matched:
                            # Use the group's display_name (group alias or joined line aliases)
                            display_key = group.get("display_name") or group.get("group_key", "")
                            word_frequency[display_key] += 1
                            if display_key not in keyword_to_news:
                                keyword_to_news[display_key] = []
                            keyword_to_news[display_key].append(title)
                            break  # Each title counts only toward the first matching group

                elif extract_mode == "auto_extract":
                    # Automatically extract keywords
                    extracted_words = self._extract_words_from_title(title)
                    for word in extracted_words:
                        word_frequency[word] += 1
                        if word not in keyword_to_news:
                            keyword_to_news[word] = []
                        keyword_to_news[word].append(title)

        # Get the top N keywords
        top_keywords = word_frequency.most_common(top_n)

        # Build the topic list
        topics = []
        for keyword, frequency in top_keywords:
            matched_news = keyword_to_news.get(keyword, [])

            topics.append({
                "keyword": keyword,
                "frequency": frequency,
                "matched_news": len(set(matched_news)),  # Number of news items after deduplication
                "trend": "stable",
                "weight_score": 0.0
            })

        # Build the result
        result = {
            "topics": topics,
            "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "mode": mode,
            "extract_mode": extract_mode,
            "total_keywords": len(word_frequency),
            "description": self._get_mode_description(mode, extract_mode)
        }

        # Cache the result
        self.cache.set(cache_key, result)

        return result

    def _get_mode_description(self, mode: str, extract_mode: str = "keywords") -> str:
        """Get the mode description"""
        mode_desc = {
            "daily": "Daily cumulative statistics",
            "current": "Latest batch statistics"
        }.get(mode, "Unknown time mode")

        extract_desc = {
            "keywords": "based on preset watch words",
            "auto_extract": "automatically extracted high-frequency words"
        }.get(extract_mode, "Unknown extraction mode")

        return f"{mode_desc} - {extract_desc}"

    def get_current_config(self, section: str = "all") -> Dict:
        """
        Get the current system configuration

        Args:
            section: Config section - all/crawler/push/keywords/weights

        Returns:
            Configuration dict

        Raises:
            FileParseError: Configuration file parse error
        """
        # Parse the configuration files
        config_data = self.parser.parse_yaml_config()
        word_groups = self.parser.parse_frequency_words()

        # Return the configuration that matches the section
        advanced = config_data.get("advanced", {})
        advanced_crawler = advanced.get("crawler", {})
        platforms_config = config_data.get("platforms", {})

        if section == "all" or section == "crawler":
            crawler_config = {
                "enable_crawler": platforms_config.get("enabled", True),
                "use_proxy": advanced_crawler.get("use_proxy", False),
                "request_interval": advanced_crawler.get("request_interval", 1),
                "retry_times": 3,
                "platforms": [p["id"] for p in platforms_config.get("sources", [])]
            }

        if section == "all" or section == "push":
            notification = config_data.get("notification", {})
            batch_size = advanced.get("batch_size", {})
            push_config = {
                "enable_notification": notification.get("enabled", True),
                "enabled_channels": [],
                "message_batch_size": batch_size.get("default", 4000),
                "push_window": {}  # Migrated to the scheduling system (schedule + timeline.yaml)
            }

            # Detect configured notification channels (merges config.yaml + .env)
            from trendradar.core.loader import _load_webhook_config
            webhook_config = _load_webhook_config(config_data)

            channel_checks = {
                "feishu": [webhook_config.get("FEISHU_WEBHOOK_URL")],
                "dingtalk": [webhook_config.get("DINGTALK_WEBHOOK_URL")],
                "wework": [webhook_config.get("WEWORK_WEBHOOK_URL")],
                "telegram": [webhook_config.get("TELEGRAM_BOT_TOKEN"), webhook_config.get("TELEGRAM_CHAT_ID")],
                "email": [webhook_config.get("EMAIL_FROM"), webhook_config.get("EMAIL_PASSWORD"), webhook_config.get("EMAIL_TO")],
                "ntfy": [webhook_config.get("NTFY_SERVER_URL"), webhook_config.get("NTFY_TOPIC")],
                "bark": [webhook_config.get("BARK_URL")],
                "slack": [webhook_config.get("SLACK_WEBHOOK_URL")],
                "generic_webhook": [webhook_config.get("GENERIC_WEBHOOK_URL")],
            }
            for ch_id, required_values in channel_checks.items():
                if all(required_values):
                    push_config["enabled_channels"].append(ch_id)

        if section == "all" or section == "keywords":
            keywords_config = {
                "word_groups": word_groups,
                "total_groups": len(word_groups)
            }

        if section == "all" or section == "weights":
            weight = advanced.get("weight", {})
            weights_config = {
                "rank_weight": weight.get("rank", 0.6),
                "frequency_weight": weight.get("frequency", 0.3),
                "hotness_weight": weight.get("hotness", 0.1)
            }

        # Assemble the result
        if section == "all":
            result = {
                "crawler": crawler_config,
                "push": push_config,
                "keywords": keywords_config,
                "weights": weights_config
            }
        elif section == "crawler":
            result = crawler_config
        elif section == "push":
            result = push_config
        elif section == "keywords":
            result = keywords_config
        elif section == "weights":
            result = weights_config
        else:
            result = {}

        return result

    def get_available_date_range(self, db_type: str = "news") -> Tuple[Optional[datetime], Optional[datetime]]:
        """
        Scan the output directory and return the date range actually available

        Args:
            db_type: Database type ("news" or "rss")

        Returns:
            (earliest date, latest date) tuple, or (None, None) if there is no data

        Examples:
            >>> service = DataService()
            >>> earliest, latest = service.get_available_date_range()
            >>> print(f"Available date range: {earliest} to {latest}")
        """
        return self.parser.get_available_date_range(db_type)

    def get_system_status(self) -> Dict:
        """
        Get the system running status

        Returns:
            System status dict
        """
        # Collect data statistics
        output_dir = self.parser.project_root / "output"

        total_storage = 0

        # Use parser
的方法获取日期范围
        oldest_record, latest_record = self.get_available_date_range(db_type="news")

        # Compute the total size of the output directory
        if output_dir.exists():
            for item in output_dir.rglob("*"):
                if item.is_file():
                    total_storage += item.stat().st_size

        # Read the version string
        version_file = self.parser.project_root / "version"
        version = "unknown"
        if version_file.exists():
            try:
                with open(version_file, "r", encoding="utf-8") as f:
                    version = f.read().strip()
            except (OSError, UnicodeDecodeError):
                # Unreadable version file: keep "unknown"
                pass

        return {
            "system": {
                "version": version,
                "project_root": str(self.parser.project_root)
            },
            "data": {
                "total_storage": f"{total_storage / 1024 / 1024:.2f} MB",
                "oldest_record": oldest_record.strftime("%Y-%m-%d") if oldest_record else None,
                "latest_record": latest_record.strftime("%Y-%m-%d") if latest_record else None,
            },
            "cache": self.cache.get_stats(),
            "health": "healthy"
        }

    # ========================================
    # RSS data query methods
    # ========================================

    def get_latest_rss(
        self,
        feeds: Optional[List[str]] = None,
        days: int = 1,
        limit: int = 50,
        include_summary: bool = False
    ) -> List[Dict]:
        """
        Get the latest RSS data (supports multi-day queries)

        Args:
            feeds: list of RSS feed IDs; None means all feeds
            days: fetch the most recent N days; default 1 (today only), maximum 30
            limit: maximum number of items to return
            include_summary: whether to include summaries; default False (saves tokens)

        Returns:
            List of RSS items (deduplicated by URL). Dates with no data are
            skipped, so the list may be empty.
        """
        days = min(max(days, 1), 30)  # clamp to 1-30 days

        cache_key = f"latest_rss:{','.join(feeds or [])}:{days}:{limit}:{include_summary}"
        cached = self.cache.get(cache_key, ttl=900)
        if cached:
            return cached

        rss_list = []
        seen_urls = set()  # URL deduplication across dates
        today = datetime.now()

        for i in range(days):
            target_date = today - timedelta(days=i)
            try:
                all_items, id_to_name, timestamps = self.parser.read_all_titles_for_date(
                    date=target_date,
                    platform_ids=feeds,
                    db_type="rss"
                )

                # Determine the fetch time
                if timestamps:
                    latest_timestamp = max(timestamps.values())
                    fetch_time = datetime.fromtimestamp(latest_timestamp)
                else:
                    fetch_time = target_date

                # Convert to a list
                for feed_id, items in all_items.items():
                    feed_name = id_to_name.get(feed_id, feed_id)
                    for title, info in items.items():
                        # Deduplicate by URL across dates
                        url =
info.get("url", "") if url and url in seen_urls: continue if url: seen_urls.add(url) rss_item = { "title": title, "feed_id": feed_id, "feed_name": feed_name, "url": url, "published_at": info.get("published_at", ""), "author": info.get("author", ""), "date": target_date.strftime("%Y-%m-%d"), "fetch_time": fetch_time.strftime("%Y-%m-%d %H:%M:%S") if isinstance(fetch_time, datetime) else target_date.strftime("%Y-%m-%d") } if include_summary: rss_item["summary"] = info.get("summary", "") rss_list.append(rss_item) except DataNotFoundError: continue # 按发布时间排序(最新的在前) rss_list.sort(key=lambda x: x.get("published_at", ""), reverse=True) # 限制返回数量 result = rss_list[:limit] # 缓存结果 self.cache.set(cache_key, result) return result def search_rss( self, keyword: str, feeds: Optional[List[str]] = None, days: int = 7, limit: int = 50, include_summary: bool = False ) -> List[Dict]: """ 搜索 RSS 数据(跨日期自动去重) Args: keyword: 搜索关键词 feeds: RSS 源 ID 列表,None 表示所有源 days: 搜索最近 N 天的数据 limit: 返回条数限制 include_summary: 是否包含摘要 Returns: 匹配的 RSS 条目列表(按 URL 去重) """ cache_key = f"search_rss:{keyword}:{','.join(feeds or [])}:{days}:{limit}:{include_summary}" cached = self.cache.get(cache_key, ttl=900) if cached: return cached results = [] seen_urls = set() # 用于 URL 去重 today = datetime.now() for i in range(days): target_date = today - timedelta(days=i) try: all_items, id_to_name, _ = self.parser.read_all_titles_for_date( date=target_date, platform_ids=feeds, db_type="rss" ) for feed_id, items in all_items.items(): feed_name = id_to_name.get(feed_id, feed_id) for title, info in items.items(): # 跨日期去重:如果 URL 已出现过则跳过 url = info.get("url", "") if url and url in seen_urls: continue if url: seen_urls.add(url) # 关键词匹配(标题或摘要) summary = info.get("summary", "") if keyword.lower() in title.lower() or keyword.lower() in summary.lower(): rss_item = { "title": title, "feed_id": feed_id, "feed_name": feed_name, "url": url, "published_at": info.get("published_at", ""), "author": info.get("author", ""), "date": 
target_date.strftime("%Y-%m-%d") } if include_summary: rss_item["summary"] = summary results.append(rss_item) except DataNotFoundError: continue # 按发布时间排序 results.sort(key=lambda x: x.get("published_at", ""), reverse=True) # 限制返回数量 result = results[:limit] # 缓存结果 self.cache.set(cache_key, result) return result def get_rss_feeds_status(self) -> Dict: """ 获取 RSS 源状态 Returns: RSS 源状态信息 """ cache_key = "rss_feeds_status" cached = self.cache.get(cache_key, ttl=900) if cached: return cached # 获取可用的 RSS 日期 available_dates = self.parser.get_available_dates(db_type="rss") # 获取今天的 RSS 数据统计 today_stats = {} try: all_items, id_to_name, _ = self.parser.read_all_titles_for_date( date=None, platform_ids=None, db_type="rss" ) for feed_id, items in all_items.items(): today_stats[feed_id] = { "name": id_to_name.get(feed_id, feed_id), "item_count": len(items) } except DataNotFoundError: pass result = { "available_dates": available_dates[:10], # 最近 10 天 "total_dates": len(available_dates), "today_feeds": today_stats, "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S") } self.cache.set(cache_key, result) return result ================================================ FILE: mcp_server/services/parser_service.py ================================================ """ 数据解析服务 v2.0.0: 仅支持 SQLite 数据库,移除 TXT 文件支持 新存储结构:output/{type}/{date}.db """ import re import sqlite3 from pathlib import Path from typing import Dict, List, Tuple, Optional from datetime import datetime import yaml from ..utils.errors import FileParseError, DataNotFoundError from .cache_service import get_cache class ParserService: """数据解析服务类""" def __init__(self, project_root: str = None): """ 初始化解析服务 Args: project_root: 项目根目录,默认为当前目录的父目录 """ if project_root is None: current_file = Path(__file__) self.project_root = current_file.parent.parent.parent else: self.project_root = Path(project_root) self.cache = get_cache() # frequency_words.txt mtime 缓存 self._freq_words_cache: Optional[List[Dict]] = None 
        self._freq_words_mtime: float = 0.0

    @staticmethod
    def clean_title(title: str) -> str:
        """Clean up title text"""
        title = re.sub(r'\s+', ' ', title)
        title = title.strip()
        return title

    def get_date_folder_name(self, date: Optional[datetime] = None) -> str:
        """
        Get the date string (ISO format)

        Args:
            date: datetime object; defaults to today

        Returns:
            Date string (YYYY-MM-DD)
        """
        if date is None:
            date = datetime.now()
        return date.strftime("%Y-%m-%d")

    def _get_db_path(self, date: Optional[datetime] = None, db_type: str = "news") -> Optional[Path]:
        """
        Get the database file path

        New layout: output/{type}/{date}.db

        Args:
            date: datetime object; defaults to today
            db_type: database type ("news" or "rss")

        Returns:
            Path to the database file, or None if it does not exist
        """
        date_str = self.get_date_folder_name(date)
        db_path = self.project_root / "output" / db_type / f"{date_str}.db"
        if db_path.exists():
            return db_path
        return None

    def _read_from_sqlite(
        self,
        date: Optional[datetime] = None,
        platform_ids: Optional[List[str]] = None,
        db_type: str = "news"
    ) -> Optional[Tuple[Dict, Dict, Dict]]:
        """
        Read data from the SQLite database

        Args:
            date: datetime object; defaults to today
            platform_ids: list of platform IDs; None means all platforms
            db_type: database type ("news" or "rss")

        Returns:
            (all_titles, id_to_name, all_timestamps) tuple, or None if the
            database does not exist or cannot be read
        """
        db_path = self._get_db_path(date, db_type)
        if db_path is None:
            return None

        all_titles = {}
        id_to_name = {}
        all_timestamps = {}

        # Initialise before the try block so the finally clause can test it
        conn = None
        try:
            conn = sqlite3.connect(str(db_path))
            conn.row_factory = sqlite3.Row
            cursor = conn.cursor()

            if db_type == "news":
                return self._read_news_from_sqlite(cursor, platform_ids, all_titles, id_to_name, all_timestamps)
            elif db_type == "rss":
                return self._read_rss_from_sqlite(cursor, platform_ids, all_titles, id_to_name, all_timestamps)

        except Exception as e:
            print(f"Warning: 从 SQLite 读取数据失败: {e}")
            return None
        finally:
            if conn is not None:
                conn.close()

    def _read_news_from_sqlite(
        self,
        cursor,
        platform_ids: Optional[List[str]],
        all_titles: Dict,
        id_to_name: Dict,
        all_timestamps: Dict
    ) -> Optional[Tuple[Dict, Dict, Dict]]:
        """Read data from the hot-list (news) database"""
        # Check that the table exists
        cursor.execute("""
            SELECT name FROM sqlite_master
            WHERE type='table' AND name='news_items'
        """)
        if not cursor.fetchone():
return None # 构建查询 if platform_ids: placeholders = ','.join(['?' for _ in platform_ids]) query = f""" SELECT n.id, n.platform_id, p.name as platform_name, n.title, n.rank, n.url, n.mobile_url, n.first_crawl_time, n.last_crawl_time, n.crawl_count FROM news_items n LEFT JOIN platforms p ON n.platform_id = p.id WHERE n.platform_id IN ({placeholders}) """ cursor.execute(query, platform_ids) else: cursor.execute(""" SELECT n.id, n.platform_id, p.name as platform_name, n.title, n.rank, n.url, n.mobile_url, n.first_crawl_time, n.last_crawl_time, n.crawl_count FROM news_items n LEFT JOIN platforms p ON n.platform_id = p.id """) rows = cursor.fetchall() # 收集所有 news_item_id 用于查询历史排名 news_ids = [row['id'] for row in rows] rank_history_map = {} if news_ids: placeholders = ",".join("?" * len(news_ids)) cursor.execute(f""" SELECT news_item_id, rank FROM rank_history WHERE news_item_id IN ({placeholders}) ORDER BY news_item_id, crawl_time """, news_ids) for rh_row in cursor.fetchall(): news_id = rh_row['news_item_id'] rank = rh_row['rank'] if news_id not in rank_history_map: rank_history_map[news_id] = [] rank_history_map[news_id].append(rank) for row in rows: news_id = row['id'] platform_id = row['platform_id'] platform_name = row['platform_name'] or platform_id title = row['title'] if platform_id not in id_to_name: id_to_name[platform_id] = platform_name if platform_id not in all_titles: all_titles[platform_id] = {} ranks = rank_history_map.get(news_id, [row['rank']]) all_titles[platform_id][title] = { "ranks": ranks, "url": row['url'] or "", "mobileUrl": row['mobile_url'] or "", "first_time": row['first_crawl_time'] or "", "last_time": row['last_crawl_time'] or "", "count": row['crawl_count'] or 1, } # 获取抓取时间作为 timestamps cursor.execute(""" SELECT crawl_time, created_at FROM crawl_records ORDER BY crawl_time """) for row in cursor.fetchall(): crawl_time = row['crawl_time'] created_at = row['created_at'] try: ts = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").timestamp() 
except (ValueError, TypeError): ts = datetime.now().timestamp() all_timestamps[f"{crawl_time}.db"] = ts if not all_titles: return None return (all_titles, id_to_name, all_timestamps) def _read_rss_from_sqlite( self, cursor, feed_ids: Optional[List[str]], all_items: Dict, id_to_name: Dict, all_timestamps: Dict ) -> Optional[Tuple[Dict, Dict, Dict]]: """从 RSS 数据库读取数据""" # 检查表是否存在 cursor.execute(""" SELECT name FROM sqlite_master WHERE type='table' AND name='rss_items' """) if not cursor.fetchone(): return None # 构建查询 if feed_ids: placeholders = ','.join(['?' for _ in feed_ids]) query = f""" SELECT i.id, i.feed_id, f.name as feed_name, i.title, i.url, i.published_at, i.summary, i.author, i.first_crawl_time, i.last_crawl_time, i.crawl_count FROM rss_items i LEFT JOIN rss_feeds f ON i.feed_id = f.id WHERE i.feed_id IN ({placeholders}) ORDER BY i.published_at DESC """ cursor.execute(query, feed_ids) else: cursor.execute(""" SELECT i.id, i.feed_id, f.name as feed_name, i.title, i.url, i.published_at, i.summary, i.author, i.first_crawl_time, i.last_crawl_time, i.crawl_count FROM rss_items i LEFT JOIN rss_feeds f ON i.feed_id = f.id ORDER BY i.published_at DESC """) rows = cursor.fetchall() for row in rows: feed_id = row['feed_id'] feed_name = row['feed_name'] or feed_id title = row['title'] if feed_id not in id_to_name: id_to_name[feed_id] = feed_name if feed_id not in all_items: all_items[feed_id] = {} all_items[feed_id][title] = { "url": row['url'] or "", "published_at": row['published_at'] or "", "summary": row['summary'] or "", "author": row['author'] or "", "first_time": row['first_crawl_time'] or "", "last_time": row['last_crawl_time'] or "", "count": row['crawl_count'] or 1, } # 获取抓取时间 cursor.execute(""" SELECT crawl_time, created_at FROM rss_crawl_records ORDER BY crawl_time """) for row in cursor.fetchall(): crawl_time = row['crawl_time'] created_at = row['created_at'] try: ts = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").timestamp() except (ValueError, 
TypeError):
                ts = datetime.now().timestamp()
            all_timestamps[f"{crawl_time}.db"] = ts

        if not all_items:
            return None

        return (all_items, id_to_name, all_timestamps)

    def read_all_titles_for_date(
        self,
        date: datetime = None,
        platform_ids: Optional[List[str]] = None,
        db_type: str = "news"
    ) -> Tuple[Dict, Dict, Dict]:
        """
        Read all data for the given date (cached)

        Args:
            date: datetime object; defaults to today
            platform_ids: list of platform/feed IDs; None means all
            db_type: database type ("news" or "rss")

        Returns:
            (all_titles, id_to_name, all_timestamps) tuple

        Raises:
            DataNotFoundError: no data exists for the date
        """
        date_str = self.get_date_folder_name(date)
        platform_key = ','.join(sorted(platform_ids)) if platform_ids else 'all'
        cache_key = f"read_all:{db_type}:{date_str}:{platform_key}"

        # 15 minutes; today's data and historical data share the same TTL
        ttl = 900

        cached = self.cache.get(cache_key, ttl=ttl)
        if cached:
            return cached

        result = self._read_from_sqlite(date, platform_ids, db_type)
        if result:
            self.cache.set(cache_key, result)
            return result

        raise DataNotFoundError(
            f"未找到 {date_str} 的 {db_type} 数据",
            suggestion="请先运行爬虫或检查日期是否正确"
        )

    def parse_yaml_config(self, config_path: str = None) -> dict:
        """
        Parse the YAML configuration file

        Args:
            config_path: path to the configuration file; defaults to config/config.yaml

        Returns:
            Configuration dictionary

        Raises:
            FileParseError: the configuration file is missing or cannot be parsed
        """
        if config_path is None:
            config_path = self.project_root / "config" / "config.yaml"
        else:
            config_path = Path(config_path)

        if not config_path.exists():
            raise FileParseError(str(config_path), "配置文件不存在")

        try:
            with open(config_path, "r", encoding="utf-8") as f:
                config_data = yaml.safe_load(f)
            return config_data
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            raise FileParseError(str(config_path), str(e)) from e

    def parse_frequency_words(self, words_file: str = None) -> List[Dict]:
        """
        Parse the keyword configuration file (with mtime caching)

        Re-parses only when frequency_words.txt has been modified, which avoids
        repeated IO inside loops.

        Reuses the parsing logic of trendradar.core.frequency, which supports:
        - comment lines starting with #
        - blank lines separating word groups
        - [group alias] as the first line of a group, naming the whole group
        - + prefix for required words, ! prefix for filter words, @ for a count limit
        - /pattern/ regular-expression syntax
        - => alias display-name syntax
        - [GLOBAL_FILTER] global filter section

        Display-name priority: group alias > joined line aliases > joined keywords

        Args:
            words_file: path to the keyword file; defaults to
config/frequency_words.txt Returns: 词组列表 Raises: FileParseError: 文件解析错误 """ import os from trendradar.core.frequency import load_frequency_words if words_file is None: words_file = str(self.project_root / "config" / "frequency_words.txt") else: words_file = str(words_file) try: current_mtime = os.path.getmtime(words_file) if self._freq_words_cache is not None and current_mtime == self._freq_words_mtime: return self._freq_words_cache word_groups, filter_words, global_filters = load_frequency_words(words_file) self._freq_words_cache = word_groups self._freq_words_mtime = current_mtime return word_groups except FileNotFoundError: return [] except Exception as e: raise FileParseError(words_file, str(e)) def get_available_dates(self, db_type: str = "news") -> List[str]: """ 获取可用的日期列表 Args: db_type: 数据库类型 ("news" 或 "rss") Returns: 日期字符串列表(YYYY-MM-DD 格式,降序排列) """ db_dir = self.project_root / "output" / db_type if not db_dir.exists(): return [] dates = [] for db_file in db_dir.glob("*.db"): date_match = re.match(r'(\d{4}-\d{2}-\d{2})\.db$', db_file.name) if date_match: dates.append(date_match.group(1)) return sorted(dates, reverse=True) def get_available_date_range(self, db_type: str = "news") -> Tuple[Optional[datetime], Optional[datetime]]: """ 获取可用的日期范围 Args: db_type: 数据库类型 ("news" 或 "rss") Returns: (最早日期, 最新日期) 元组,如果没有数据则返回 (None, None) """ dates = self.get_available_dates(db_type) if not dates: return (None, None) earliest = datetime.strptime(dates[-1], "%Y-%m-%d") latest = datetime.strptime(dates[0], "%Y-%m-%d") return (earliest, latest) ================================================ FILE: mcp_server/tools/__init__.py ================================================ """ MCP 工具模块 包含所有MCP工具的实现。 """ ================================================ FILE: mcp_server/tools/analytics.py ================================================ """ 高级数据分析工具 提供热度趋势分析、平台对比、关键词共现、情感分析等高级分析功能。 """ import os import re from collections import Counter, defaultdict from datetime import 
datetime, timedelta from typing import Dict, List, Optional, Union from difflib import SequenceMatcher import yaml from trendradar.core.analyzer import calculate_news_weight as _calculate_news_weight from ..services.data_service import DataService from ..utils.validators import ( validate_platforms, validate_limit, validate_keyword, validate_top_n, validate_date_range, validate_threshold ) from ..utils.errors import MCPError, InvalidParameterError, DataNotFoundError # 权重配置 mtime 缓存(避免重复读取同一配置文件) _weight_config_cache: Optional[Dict] = None _weight_config_mtime: float = 0.0 _weight_config_path: Optional[str] = None _WEIGHT_DEFAULT_CONFIG = { "RANK_WEIGHT": 0.6, "FREQUENCY_WEIGHT": 0.3, "HOTNESS_WEIGHT": 0.1, } def _get_weight_config() -> Dict: """ 从 config.yaml 读取权重配置(带 mtime 缓存) 仅当配置文件被修改时才重新读取,避免循环内重复 IO。 Returns: 权重配置字典,包含 RANK_WEIGHT, FREQUENCY_WEIGHT, HOTNESS_WEIGHT """ global _weight_config_cache, _weight_config_mtime, _weight_config_path try: # 首次调用时计算路径(之后复用) if _weight_config_path is None: current_dir = os.path.dirname(os.path.abspath(__file__)) _weight_config_path = os.path.normpath( os.path.join(current_dir, "..", "..", "config", "config.yaml") ) current_mtime = os.path.getmtime(_weight_config_path) # 文件未修改且缓存有效,直接返回 if _weight_config_cache is not None and current_mtime == _weight_config_mtime: return _weight_config_cache # 文件已修改或首次读取,重新解析 with open(_weight_config_path, 'r', encoding='utf-8') as f: config = yaml.safe_load(f) weight = config.get('advanced', {}).get('weight', {}) _weight_config_cache = { "RANK_WEIGHT": weight.get('rank', 0.6), "FREQUENCY_WEIGHT": weight.get('frequency', 0.3), "HOTNESS_WEIGHT": weight.get('hotness', 0.1), } _weight_config_mtime = current_mtime return _weight_config_cache except Exception: return _WEIGHT_DEFAULT_CONFIG def calculate_news_weight(news_data: Dict, rank_threshold: int = 5) -> float: """ 计算新闻权重(用于排序) 复用 trendradar.core.analyzer.calculate_news_weight 实现, 权重配置从 config.yaml 的 advanced.weight 读取。 Args: news_data: 
新闻数据字典,包含 ranks 和 count 字段 rank_threshold: 高排名阈值,默认5 Returns: 权重分数(0-100之间的浮点数) """ return _calculate_news_weight(news_data, rank_threshold, _get_weight_config()) class AnalyticsTools: """高级数据分析工具类""" def __init__(self, project_root: str = None): """ 初始化分析工具 Args: project_root: 项目根目录 """ self.data_service = DataService(project_root) def analyze_data_insights_unified( self, insight_type: str = "platform_compare", topic: Optional[str] = None, date_range: Optional[Union[Dict[str, str], str]] = None, min_frequency: int = 3, top_n: int = 20 ) -> Dict: """ 统一数据洞察分析工具 - 整合多种数据分析模式 Args: insight_type: 洞察类型,可选值: - "platform_compare": 平台对比分析(对比不同平台对话题的关注度) - "platform_activity": 平台活跃度统计(统计各平台发布频率和活跃时间) - "keyword_cooccur": 关键词共现分析(分析关键词同时出现的模式) topic: 话题关键词(可选,platform_compare模式适用) date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} min_frequency: 最小共现频次(keyword_cooccur模式),默认3 top_n: 返回TOP N结果(keyword_cooccur模式),默认20 Returns: 数据洞察分析结果字典 Examples: - analyze_data_insights_unified(insight_type="platform_compare", topic="人工智能") - analyze_data_insights_unified(insight_type="platform_activity", date_range={...}) - analyze_data_insights_unified(insight_type="keyword_cooccur", min_frequency=5) """ try: # 参数验证 if insight_type not in ["platform_compare", "platform_activity", "keyword_cooccur"]: raise InvalidParameterError( f"无效的洞察类型: {insight_type}", suggestion="支持的类型: platform_compare, platform_activity, keyword_cooccur" ) # 根据洞察类型调用相应方法 if insight_type == "platform_compare": return self.compare_platforms( topic=topic, date_range=date_range ) elif insight_type == "platform_activity": return self.get_platform_activity_stats( date_range=date_range ) else: # keyword_cooccur return self.analyze_keyword_cooccurrence( min_frequency=min_frequency, top_n=top_n ) except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def analyze_topic_trend_unified( self, 
        topic: str,
        analysis_type: str = "trend",
        date_range: Optional[Union[Dict[str, str], str]] = None,
        granularity: str = "day",
        threshold: float = 3.0,
        time_window: int = 24,
        lookahead_hours: int = 6,
        confidence_threshold: float = 0.7
    ) -> Dict:
        """
        Unified topic trend analysis tool - combines several trend analysis modes

        Args:
            topic: topic keyword (required)
            analysis_type: analysis type, one of:
                - "trend": popularity trend analysis (track how a topic's popularity changes)
                - "lifecycle": lifecycle analysis (the full cycle from appearance to disappearance)
                - "viral": abnormal popularity detection (identify topics that suddenly go viral)
                - "predict": topic prediction (predict likely future hot topics)
            date_range: date range (trend and lifecycle modes), optional
                - **Format**: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}
                - **Default**: the last 7 days when not specified
            granularity: time granularity (trend mode), default "day"; only "day" is
                supported, because get_topic_trend_analysis rejects any other value
            threshold: popularity surge multiplier threshold (viral mode), default 3.0
            time_window: detection time window in hours (viral mode), default 24
            lookahead_hours: number of hours to predict ahead (predict mode), default 6
            confidence_threshold: confidence threshold (predict mode), default 0.7

        Returns:
            Trend analysis result dictionary

        Examples (assuming today is 2025-11-17):
            - User: "Analyze the AI trend over the last 7 days"
              → analyze_topic_trend_unified(topic="人工智能", analysis_type="trend", date_range={"start": "2025-11-11", "end": "2025-11-17"})
            - User: "Show Tesla's popularity this month"
              → analyze_topic_trend_unified(topic="特斯拉", analysis_type="lifecycle", date_range={"start": "2025-11-01", "end": "2025-11-17"})
            - analyze_topic_trend_unified(topic="比特币", analysis_type="viral", threshold=3.0)
            - analyze_topic_trend_unified(topic="ChatGPT", analysis_type="predict", lookahead_hours=6)
        """
        try:
            # Validate parameters
            topic = validate_keyword(topic)

            if analysis_type not in ["trend", "lifecycle", "viral", "predict"]:
                raise InvalidParameterError(
                    f"无效的分析类型: {analysis_type}",
                    suggestion="支持的类型: trend, lifecycle, viral, predict"
                )

            # Dispatch to the method that matches the analysis type
            if analysis_type == "trend":
                return self.get_topic_trend_analysis(
                    topic=topic,
                    date_range=date_range,
                    granularity=granularity
                )
            elif analysis_type == "lifecycle":
                return self.analyze_topic_lifecycle(
                    topic=topic,
                    date_range=date_range
                )
            elif analysis_type == "viral":
                # viral mode does not use the topic argument; it runs generic detection
                return self.detect_viral_topics(
                    threshold=threshold,
                    time_window=time_window
                )
            else:  # predict
                # predict mode does not use the topic argument; it runs generic prediction
                return self.predict_trending_topics(
lookahead_hours=lookahead_hours, confidence_threshold=confidence_threshold ) except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def get_topic_trend_analysis( self, topic: str, date_range: Optional[Union[Dict[str, str], str]] = None, granularity: str = "day" ) -> Dict: """ 热度趋势分析 - 追踪特定话题的热度变化趋势 Args: topic: 话题关键词 date_range: 日期范围(可选) - **格式**: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} - **默认**: 不指定时默认分析最近7天 granularity: 时间粒度,仅支持 day(天) Returns: 趋势分析结果字典 Examples: 用户询问示例: - "帮我分析一下'人工智能'这个话题最近一周的热度趋势" - "查看'比特币'过去一周的热度变化" - "看看'iPhone'最近7天的趋势如何" - "分析'特斯拉'最近一个月的热度趋势" - "查看'ChatGPT'2024年12月的趋势变化" 代码调用示例: >>> tools = AnalyticsTools() >>> # 分析7天趋势(假设今天是 2025-11-17) >>> result = tools.get_topic_trend_analysis( ... topic="人工智能", ... date_range={"start": "2025-11-11", "end": "2025-11-17"}, ... granularity="day" ... ) >>> # 分析历史月份趋势 >>> result = tools.get_topic_trend_analysis( ... topic="特斯拉", ... date_range={"start": "2024-12-01", "end": "2024-12-31"}, ... granularity="day" ... 
)
            >>> print(result['data'])
        """
        try:
            # Validate parameters
            topic = validate_keyword(topic)

            # Validate granularity (only "day" is supported)
            if granularity != "day":
                raise InvalidParameterError(
                    f"不支持的粒度参数: {granularity}",
                    suggestion="当前仅支持 'day' 粒度,因为底层数据按天聚合"
                )

            # Resolve the date range (defaults to the last 7 days)
            if date_range:
                date_range_tuple = validate_date_range(date_range)
                start_date, end_date = date_range_tuple
            else:
                # Default: the last 7 days, today included
                end_date = datetime.now()
                start_date = end_date - timedelta(days=6)

            # Collect trend data, one entry per day
            trend_data = []
            current_date = start_date

            while current_date <= end_date:
                try:
                    all_titles, _, _ = self.data_service.parser.read_all_titles_for_date(
                        date=current_date
                    )

                    # Count how many titles mention the topic on this day
                    count = 0
                    matched_titles = []

                    for _, titles in all_titles.items():
                        for title in titles.keys():
                            if topic.lower() in title.lower():
                                count += 1
                                matched_titles.append(title)

                    trend_data.append({
                        "date": current_date.strftime("%Y-%m-%d"),
                        "count": count,
                        "sample_titles": matched_titles[:3]  # keep only the first 3 samples
                    })

                except DataNotFoundError:
                    # No data for this day: record a zero count
                    trend_data.append({
                        "date": current_date.strftime("%Y-%m-%d"),
                        "count": 0,
                        "sample_titles": []
                    })

                # Advance by one day
                current_date += timedelta(days=1)

            # Compute trend metrics
            counts = [item["count"] for item in trend_data]
            total_days = (end_date - start_date).days + 1

            # Peak value and date (defined even for a single-day range)
            max_count = max(counts) if counts else 0
            peak_time = trend_data[counts.index(max_count)]["date"] if counts else None

            # Percentage change from the first non-zero count to the last count;
            # needs at least two data points
            change_rate = 0
            if len(counts) >= 2:
                first_non_zero = next((c for c in counts if c > 0), 0)
                last_count = counts[-1]
                if first_non_zero > 0:
                    change_rate = ((last_count - first_non_zero) / first_non_zero) * 100

            return {
                "success": True,
                "summary": {
                    "description": f"话题「{topic}」的热度趋势分析",
                    "topic": topic,
                    "date_range": {
                        "start": start_date.strftime("%Y-%m-%d"),
                        "end": end_date.strftime("%Y-%m-%d"),
                        "total_days": total_days
                    },
                    "granularity": granularity,
                    "total_mentions": sum(counts),
                    "average_mentions": round(sum(counts) /
len(counts), 2) if counts else 0, "peak_count": max_count, "peak_time": peak_time, "change_rate": round(change_rate, 2), "trend_direction": "上升" if change_rate > 10 else "下降" if change_rate < -10 else "稳定" }, "data": trend_data } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def compare_platforms( self, topic: Optional[str] = None, date_range: Optional[Union[Dict[str, str], str]] = None ) -> Dict: """ 平台对比分析 - 对比不同平台对同一话题的关注度 Args: topic: 话题关键词(可选,不指定则对比整体活跃度) date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} Returns: 平台对比分析结果 Examples: 用户询问示例: - "对比一下各个平台对'人工智能'话题的关注度" - "看看知乎和微博哪个平台更关注科技新闻" - "分析各平台今天的热点分布" 代码调用示例: >>> # 对比各平台(假设今天是 2025-11-17) >>> result = tools.compare_platforms( ... topic="人工智能", ... date_range={"start": "2025-11-08", "end": "2025-11-17"} ... ) >>> print(result['platform_stats']) """ try: # 参数验证 if topic: topic = validate_keyword(topic) date_range_tuple = validate_date_range(date_range) # 确定日期范围 if date_range_tuple: start_date, end_date = date_range_tuple else: start_date = end_date = datetime.now() # 收集各平台数据 platform_stats = defaultdict(lambda: { "total_news": 0, "topic_mentions": 0, "unique_titles": set(), "top_keywords": Counter() }) # 遍历日期范围 current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( date=current_date ) for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title in titles.keys(): platform_stats[platform_name]["total_news"] += 1 platform_stats[platform_name]["unique_titles"].add(title) # 如果指定了话题,统计包含话题的新闻 if topic and topic.lower() in title.lower(): platform_stats[platform_name]["topic_mentions"] += 1 # 提取关键词(简单分词) keywords = self._extract_keywords(title) platform_stats[platform_name]["top_keywords"].update(keywords) except DataNotFoundError: 
pass current_date += timedelta(days=1) # 转换为可序列化的格式 result_stats = {} for platform, stats in platform_stats.items(): coverage_rate = 0 if stats["total_news"] > 0: coverage_rate = (stats["topic_mentions"] / stats["total_news"]) * 100 result_stats[platform] = { "total_news": stats["total_news"], "topic_mentions": stats["topic_mentions"], "unique_titles": len(stats["unique_titles"]), "coverage_rate": round(coverage_rate, 2), "top_keywords": [ {"keyword": k, "count": v} for k, v in stats["top_keywords"].most_common(5) ] } # 找出各平台独有的热点 unique_topics = self._find_unique_topics(platform_stats) return { "success": True, "topic": topic, "date_range": { "start": start_date.strftime("%Y-%m-%d"), "end": end_date.strftime("%Y-%m-%d") }, "platform_stats": result_stats, "unique_topics": unique_topics, "total_platforms": len(result_stats) } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def analyze_keyword_cooccurrence( self, min_frequency: int = 3, top_n: int = 20 ) -> Dict: """ 关键词共现分析 - 分析哪些关键词经常同时出现 Args: min_frequency: 最小共现频次 top_n: 返回TOP N关键词对 Returns: 关键词共现分析结果 Examples: 用户询问示例: - "分析一下哪些关键词经常一起出现" - "看看'人工智能'经常和哪些词一起出现" - "找出今天新闻中的关键词关联" 代码调用示例: >>> tools = AnalyticsTools() >>> result = tools.analyze_keyword_cooccurrence( ... min_frequency=5, ... top_n=15 ... 
)
            >>> print(result['data'])
        """
        try:
            # Validate parameters
            min_frequency = validate_limit(min_frequency, default=3, max_limit=100)
            top_n = validate_top_n(top_n, default=20)

            # Read today's data
            all_titles, _, _ = self.data_service.parser.read_all_titles_for_date()

            # Keyword co-occurrence counts
            cooccurrence = Counter()
            keyword_titles = defaultdict(list)

            for platform_id, titles in all_titles.items():
                for title in titles.keys():
                    # Extract keywords
                    keywords = self._extract_keywords(title)

                    # Record the titles in which each keyword appears
                    for kw in keywords:
                        keyword_titles[kw].append(title)

                    # Count pairwise co-occurrences
                    if len(keywords) >= 2:
                        for i, kw1 in enumerate(keywords):
                            for kw2 in keywords[i+1:]:
                                # Sort the pair so (a, b) and (b, a) share one key
                                pair = tuple(sorted([kw1, kw2]))
                                cooccurrence[pair] += 1

            # Drop low-frequency pairs
            filtered_pairs = [
                (pair, count) for pair, count in cooccurrence.items()
                if count >= min_frequency
            ]

            # Sort and keep the top N
            top_pairs = sorted(filtered_pairs, key=lambda x: x[1], reverse=True)[:top_n]

            # Build the result
            result_pairs = []
            for (kw1, kw2), count in top_pairs:
                # Sample titles that contain both keywords
                titles_with_both = [
                    title for title in keyword_titles[kw1]
                    if kw2 in self._extract_keywords(title)
                ]

                result_pairs.append({
                    "keyword1": kw1,
                    "keyword2": kw2,
                    "cooccurrence_count": count,
                    "sample_titles": titles_with_both[:3]
                })

            return {
                "success": True,
                "summary": {
                    "description": "关键词共现分析结果",
                    "total": len(result_pairs),
                    "min_frequency": min_frequency,
                    "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                },
                "data": result_pairs
            }

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    def analyze_sentiment(
        self,
        topic: Optional[str] = None,
        platforms: Optional[List[str]] = None,
        date_range: Optional[Union[Dict[str, str], str]] = None,
        limit: int = 50,
        sort_by_weight: bool = True,
        include_url: bool = False
    ) -> Dict:
        """
        Sentiment analysis - build a structured prompt for AI sentiment analysis

        This tool collects news data and generates an optimized AI prompt that you can
        send to an AI model for in-depth sentiment analysis.

        Args:
            topic: Topic keyword (optional); only news containing it is analyzed
            platforms: Platform filter list (optional), e.g. ['zhihu', 'weibo']
            date_range: Date range (optional), format:
{"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} 不指定则默认查询今天的数据 limit: 返回新闻数量限制,默认50,最大100 sort_by_weight: 是否按权重排序,默认True(推荐) include_url: 是否包含URL链接,默认False(节省token) Returns: 包含 AI 提示词和新闻数据的结构化结果 Examples: 用户询问示例: - "分析一下今天新闻的情感倾向" - "看看'特斯拉'相关新闻是正面还是负面的" - "分析各平台对'人工智能'的情感态度" - "看看'特斯拉'相关新闻是正面还是负面的,请选择一周内的前10条新闻来分析" 代码调用示例: >>> tools = AnalyticsTools() >>> # 分析今天的特斯拉新闻,返回前10条 >>> result = tools.analyze_sentiment( ... topic="特斯拉", ... limit=10 ... ) >>> # 分析一周内的特斯拉新闻(假设今天是 2025-11-17) >>> result = tools.analyze_sentiment( ... topic="特斯拉", ... date_range={"start": "2025-11-11", "end": "2025-11-17"}, ... limit=10 ... ) >>> print(result['ai_prompt']) # 获取生成的提示词 """ try: # 参数验证 if topic: topic = validate_keyword(topic) platforms = validate_platforms(platforms) limit = validate_limit(limit, default=50) # 处理日期范围 if date_range: date_range_tuple = validate_date_range(date_range) start_date, end_date = date_range_tuple else: # 默认今天 start_date = end_date = datetime.now() # 收集新闻数据(支持多天) all_news_items = [] current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( date=current_date, platform_ids=platforms ) # 收集该日期的新闻 for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): # 如果指定了话题,只收集包含话题的标题 if topic and topic.lower() not in title.lower(): continue news_item = { "platform": platform_name, "title": title, "ranks": info.get("ranks", []), "count": len(info.get("ranks", [])), "date": current_date.strftime("%Y-%m-%d") } # 条件性添加 URL 字段 if include_url: news_item["url"] = info.get("url", "") news_item["mobileUrl"] = info.get("mobileUrl", "") all_news_items.append(news_item) except DataNotFoundError: # 该日期没有数据,继续下一天 pass # 下一天 current_date += timedelta(days=1) if not all_news_items: time_desc = "今天" if start_date == end_date else f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" raise DataNotFoundError( 
f"未找到相关新闻({time_desc})", suggestion="请尝试其他话题、日期范围或平台" ) # 去重(同一标题只保留一次) unique_news = {} for item in all_news_items: key = f"{item['platform']}::{item['title']}" if key not in unique_news: unique_news[key] = item else: # 合并 ranks(如果同一新闻在多天出现) existing = unique_news[key] existing["ranks"].extend(item["ranks"]) existing["count"] = len(existing["ranks"]) deduplicated_news = list(unique_news.values()) # 按权重排序(如果启用) if sort_by_weight: deduplicated_news.sort( key=lambda x: calculate_news_weight(x), reverse=True ) # 限制返回数量 selected_news = deduplicated_news[:limit] # 生成 AI 提示词 ai_prompt = self._create_sentiment_analysis_prompt( news_data=selected_news, topic=topic ) # 构建时间范围描述 if start_date == end_date: time_range_desc = start_date.strftime("%Y-%m-%d") else: time_range_desc = f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" result = { "success": True, "method": "ai_prompt_generation", "summary": { "description": "情感分析数据和AI提示词", "total_found": len(deduplicated_news), "returned": len(selected_news), "requested_limit": limit, "duplicates_removed": len(all_news_items) - len(deduplicated_news), "topic": topic, "time_range": time_range_desc, "platforms": list(set(item["platform"] for item in selected_news)), "sorted_by_weight": sort_by_weight }, "ai_prompt": ai_prompt, "data": selected_news, "usage_note": "请将 ai_prompt 字段的内容发送给 AI 进行情感分析" } # 如果返回数量少于请求数量,增加提示 if len(selected_news) < limit and len(deduplicated_news) >= limit: result["note"] = "返回数量少于请求数量是因为去重逻辑(同一标题在不同平台只保留一次)" elif len(deduplicated_news) < limit: result["note"] = f"在指定时间范围内仅找到 {len(deduplicated_news)} 条匹配的新闻" return result except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def _create_sentiment_analysis_prompt( self, news_data: List[Dict], topic: Optional[str] ) -> str: """ 创建情感分析的 AI 提示词 Args: news_data: 新闻数据列表(已排序和限制数量) topic: 话题关键词 Returns: 格式化的 AI 提示词 """ # 
按平台分组 platform_news = defaultdict(list) for item in news_data: platform_news[item["platform"]].append({ "title": item["title"], "date": item.get("date", "") }) # 构建提示词 prompt_parts = [] # 1. 任务说明 if topic: prompt_parts.append(f"请分析以下关于「{topic}」的新闻标题的情感倾向。") else: prompt_parts.append("请分析以下新闻标题的情感倾向。") prompt_parts.append("") prompt_parts.append("分析要求:") prompt_parts.append("1. 识别每条新闻的情感倾向(正面/负面/中性)") prompt_parts.append("2. 统计各情感类别的数量和百分比") prompt_parts.append("3. 分析不同平台的情感差异") prompt_parts.append("4. 总结整体情感趋势") prompt_parts.append("5. 列举典型的正面和负面新闻样本") prompt_parts.append("") # 2. 数据概览 prompt_parts.append(f"数据概览:") prompt_parts.append(f"- 总新闻数:{len(news_data)}") prompt_parts.append(f"- 覆盖平台:{len(platform_news)}") # 时间范围 dates = set(item.get("date", "") for item in news_data if item.get("date")) if dates: date_list = sorted(dates) if len(date_list) == 1: prompt_parts.append(f"- 时间范围:{date_list[0]}") else: prompt_parts.append(f"- 时间范围:{date_list[0]} 至 {date_list[-1]}") prompt_parts.append("") # 3. 按平台展示新闻 prompt_parts.append("新闻列表(按平台分类,已按重要性排序):") prompt_parts.append("") for platform, items in sorted(platform_news.items()): prompt_parts.append(f"【{platform}】({len(items)} 条)") for i, item in enumerate(items, 1): title = item["title"] date_str = f" [{item['date']}]" if item.get("date") else "" prompt_parts.append(f"{i}. {title}{date_str}") prompt_parts.append("") # 4. 
Output format instructions
        prompt_parts.append("请按以下格式输出分析结果:")
        prompt_parts.append("")
        prompt_parts.append("## 情感分布统计")
        prompt_parts.append("- 正面:XX条 (XX%)")
        prompt_parts.append("- 负面:XX条 (XX%)")
        prompt_parts.append("- 中性:XX条 (XX%)")
        prompt_parts.append("")
        prompt_parts.append("## 平台情感对比")
        prompt_parts.append("[各平台的情感倾向差异]")
        prompt_parts.append("")
        prompt_parts.append("## 整体情感趋势")
        prompt_parts.append("[总体分析和关键发现]")
        prompt_parts.append("")
        prompt_parts.append("## 典型样本")
        prompt_parts.append("正面新闻样本:")
        prompt_parts.append("[列举3-5条]")
        prompt_parts.append("")
        prompt_parts.append("负面新闻样本:")
        prompt_parts.append("[列举3-5条]")

        return "\n".join(prompt_parts)

    def find_similar_news(
        self,
        reference_title: str,
        threshold: float = 0.6,
        limit: int = 50,
        include_url: bool = False
    ) -> Dict:
        """
        Similar news lookup - find related news by title similarity

        Args:
            reference_title: Reference title
            threshold: Similarity threshold (between 0 and 1)
            limit: Maximum number of results, default 50
            include_url: Whether to include URL links, default False (saves tokens)

        Returns:
            List of similar news items

        Examples:
            Example user queries:
            - "Find news similar to '特斯拉降价'"
            - "Find similar reports about the iPhone launch"
            - "Check whether there are reports similar to this news item"

            Example code:
            >>> tools = AnalyticsTools()
            >>> result = tools.find_similar_news(
            ...     reference_title="特斯拉宣布降价",
            ...     threshold=0.6,
            ...     limit=10
            ... )
            >>> print(result['data'])
        """
        try:
            # Validate parameters
            reference_title = validate_keyword(reference_title)
            threshold = validate_threshold(threshold, default=0.6, min_value=0.0, max_value=1.0)
            limit = validate_limit(limit, default=50)

            # Read data
            all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date()

            # Collect titles that are similar enough to the reference title
            similar_items = []

            for platform_id, titles in all_titles.items():
                platform_name = id_to_name.get(platform_id, platform_id)

                for title, info in titles.items():
                    if title == reference_title:
                        continue

                    # Compute similarity against the reference title
                    similarity = self._calculate_similarity(reference_title, title)

                    if similarity >= threshold:
                        # Use .get() so entries without a "ranks" field do not raise KeyError
                        ranks = info.get("ranks", [])
                        news_item = {
                            "title": title,
                            "platform": platform_id,
                            "platform_name": platform_name,
                            "similarity": round(similarity, 3),
                            "rank": ranks[0] if ranks else 0
                        }

                        # Add the URL field only when requested
                        if include_url:
                            news_item["url"] = info.get("url", "")

                        similar_items.append(news_item)

            # Sort by similarity, highest first
            similar_items.sort(key=lambda x: x["similarity"], reverse=True)

            # Limit the number of results
            result_items = similar_items[:limit]

            if not result_items:
                raise DataNotFoundError(
                    f"未找到相似度超过 {threshold} 的新闻",
                    suggestion="请降低相似度阈值或尝试其他标题"
                )

            result = {
                "success": True,
                "summary": {
                    "description": "相似新闻搜索结果",
                    "total_found": len(similar_items),
                    "returned": len(result_items),
                    "requested_limit": limit,
                    "threshold": threshold,
                    "reference_title": reference_title
                },
                "data": result_items
            }

            if len(similar_items) < limit:
                result["note"] = f"相似度阈值 {threshold} 下仅找到 {len(similar_items)} 条相似新闻"

            return result

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    def search_by_entity(
        self,
        entity: str,
        entity_type: Optional[str] = None,
        limit: int = 50,
        sort_by_weight: bool = True
    ) -> Dict:
        """
        Entity search - search for news that mentions a specific person, place, or organization

        Args:
            entity: Entity name
            entity_type: Entity type (person/location/organization), optional
            limit: Maximum number of results, default 50, max 200
            sort_by_weight: Whether to sort by weight, default True

        Returns:
            List of news items related to the entity

        Examples:
            Example user queries:
            - "Search for news related to 马斯克"
            - "Find reports about the company 特斯拉 and return the top 20"
            - "See what news there is about 北京"

            Example code:
            >>> tools = AnalyticsTools()
            >>> result = tools.search_by_entity(
            ...     entity="马斯克",
            ...     entity_type="person",
            ...     limit=20
            ... )
            >>> print(result['data'])
        """
        try:
            # Validate parameters
            entity = validate_keyword(entity)
            limit = validate_limit(limit, default=50)

            if entity_type and entity_type not in ["person", "location", "organization"]:
                raise InvalidParameterError(
                    f"无效的实体类型: {entity_type}",
                    suggestion="支持的类型: person, location, organization"
                )

            # Read data
            all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date()

            # Find news whose title contains the entity
            related_news = []
            entity_context = Counter()  # Words that appear alongside the entity

            for platform_id, titles in all_titles.items():
                platform_name = id_to_name.get(platform_id, platform_id)

                for title, info in titles.items():
                    if entity in title:
                        url = info.get("url", "")
                        mobile_url = info.get("mobileUrl", "")
                        ranks = info.get("ranks", [])
                        count = len(ranks)

                        related_news.append({
                            "title": title,
                            "platform": platform_id,
                            "platform_name": platform_name,
                            "url": url,
                            "mobileUrl": mobile_url,
                            "ranks": ranks,
                            "count": count,
                            "rank": ranks[0] if ranks else 999
                        })

                        # Extract the keywords surrounding the entity
                        keywords = self._extract_keywords(title)
                        entity_context.update(keywords)

            if not related_news:
                raise DataNotFoundError(
                    f"未找到包含实体 '{entity}' 的新闻",
                    suggestion="请尝试其他实体名称"
                )

            # Remove the entity itself from its context keywords
            if entity in entity_context:
                del entity_context[entity]

            # Sort by weight (if enabled)
            if sort_by_weight:
                related_news.sort(
                    key=lambda x: calculate_news_weight(x),
                    reverse=True
                )
            else:
                # Otherwise sort by rank
                related_news.sort(key=lambda x: x["rank"])

            # Limit the number of results
            result_news = related_news[:limit]

            return {
                "success": True,
                "summary": {
                    "description": f"实体「{entity}」相关新闻",
                    "entity": entity,
                    "entity_type": entity_type or "auto",
                    "total_found": len(related_news),
                    "returned": len(result_news),
                    "sorted_by_weight": sort_by_weight
                },
                "data": result_news,
                "related_keywords": [
                    {"keyword": k, "count": v}
                    for k, v in entity_context.most_common(10)
                ]
            }

        except MCPError as e:
            return {
                "success": False,
                "error": 
e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def generate_summary_report( self, report_type: str = "daily", date_range: Optional[Union[Dict[str, str], str]] = None ) -> Dict: """ 每日/每周摘要生成器 - 自动生成热点摘要报告 Args: report_type: 报告类型(daily/weekly) date_range: 自定义日期范围(可选) Returns: Markdown格式的摘要报告 Examples: 用户询问示例: - "生成今天的新闻摘要报告" - "给我一份本周的热点总结" - "生成过去7天的新闻分析报告" 代码调用示例: >>> tools = AnalyticsTools() >>> result = tools.generate_summary_report( ... report_type="daily" ... ) >>> print(result['markdown_report']) """ try: # 参数验证 if report_type not in ["daily", "weekly"]: raise InvalidParameterError( f"无效的报告类型: {report_type}", suggestion="支持的类型: daily, weekly" ) # 确定日期范围 if date_range: date_range_tuple = validate_date_range(date_range) start_date, end_date = date_range_tuple else: if report_type == "daily": start_date = end_date = datetime.now() else: # weekly end_date = datetime.now() start_date = end_date - timedelta(days=6) # 收集数据 all_keywords = Counter() all_platforms_news = defaultdict(int) all_titles_list = [] current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( date=current_date ) for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) all_platforms_news[platform_name] += len(titles) for title in titles.keys(): all_titles_list.append({ "title": title, "platform": platform_name, "date": current_date.strftime("%Y-%m-%d") }) # 提取关键词 keywords = self._extract_keywords(title) all_keywords.update(keywords) except DataNotFoundError: pass current_date += timedelta(days=1) # 生成报告 report_title = f"{'每日' if report_type == 'daily' else '每周'}新闻热点摘要" date_str = f"{start_date.strftime('%Y-%m-%d')}" if report_type == "daily" else f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" # 构建Markdown报告 markdown = f"""# {report_title} **报告日期**: {date_str} **生成时间**: 
{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} --- ## 📊 数据概览 - **总新闻数**: {len(all_titles_list)} - **覆盖平台**: {len(all_platforms_news)} - **热门关键词数**: {len(all_keywords)} ## 🔥 TOP 10 热门话题 """ # 添加TOP 10关键词 for i, (keyword, count) in enumerate(all_keywords.most_common(10), 1): markdown += f"{i}. **{keyword}** - 出现 {count} 次\n" # 平台分析 markdown += "\n## 📱 平台活跃度\n\n" sorted_platforms = sorted(all_platforms_news.items(), key=lambda x: x[1], reverse=True) for platform, count in sorted_platforms: markdown += f"- **{platform}**: {count} 条新闻\n" # 趋势变化(如果是周报) if report_type == "weekly": markdown += "\n## 📈 趋势分析\n\n" markdown += "本周热度持续的话题(样本数据):\n\n" # 简单的趋势分析 top_keywords = [kw for kw, _ in all_keywords.most_common(5)] for keyword in top_keywords: markdown += f"- **{keyword}**: 持续热门\n" # 添加样本新闻(按权重选择,确保确定性) markdown += "\n## 📰 精选新闻样本\n\n" # 确定性选取:按标题的权重排序,取前5条 # 这样相同输入总是返回相同结果 if all_titles_list: # 计算每条新闻的权重分数(基于关键词出现次数) news_with_scores = [] for news in all_titles_list: # 简单权重:统计包含TOP关键词的次数 score = 0 title_lower = news['title'].lower() for keyword, count in all_keywords.most_common(10): if keyword.lower() in title_lower: score += count news_with_scores.append((news, score)) # 按权重降序排序,权重相同则按标题字母顺序(确保确定性) news_with_scores.sort(key=lambda x: (-x[1], x[0]['title'])) # 取前5条 sample_news = [item[0] for item in news_with_scores[:5]] for news in sample_news: markdown += f"- [{news['platform']}] {news['title']}\n" markdown += "\n---\n\n*本报告由 TrendRadar MCP 自动生成*\n" return { "success": True, "report_type": report_type, "date_range": { "start": start_date.strftime("%Y-%m-%d"), "end": end_date.strftime("%Y-%m-%d") }, "markdown_report": markdown, "statistics": { "total_news": len(all_titles_list), "platforms_count": len(all_platforms_news), "keywords_count": len(all_keywords), "top_keyword": all_keywords.most_common(1)[0] if all_keywords else None } } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": 
"INTERNAL_ERROR", "message": str(e) } } def get_platform_activity_stats( self, date_range: Optional[Union[Dict[str, str], str]] = None ) -> Dict: """ 平台活跃度统计 - 统计各平台的发布频率和活跃时间段 Args: date_range: 日期范围(可选) Returns: 平台活跃度统计结果 Examples: 用户询问示例: - "统计各平台今天的活跃度" - "看看哪个平台更新最频繁" - "分析各平台的发布时间规律" 代码调用示例: >>> # 查看各平台活跃度(假设今天是 2025-11-17) >>> result = tools.get_platform_activity_stats( ... date_range={"start": "2025-11-08", "end": "2025-11-17"} ... ) >>> print(result['platform_activity']) """ try: # 参数验证 date_range_tuple = validate_date_range(date_range) # 确定日期范围 if date_range_tuple: start_date, end_date = date_range_tuple else: start_date = end_date = datetime.now() # 统计各平台活跃度 platform_activity = defaultdict(lambda: { "total_updates": 0, "days_active": set(), "news_count": 0, "hourly_distribution": Counter() }) # 遍历日期范围 current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, timestamps = self.data_service.parser.read_all_titles_for_date( date=current_date ) for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) platform_activity[platform_name]["news_count"] += len(titles) platform_activity[platform_name]["days_active"].add(current_date.strftime("%Y-%m-%d")) # 统计更新次数(基于文件数量) platform_activity[platform_name]["total_updates"] += len(timestamps) # 统计时间分布(基于文件名中的时间) for filename in timestamps.keys(): # 解析文件名中的小时(格式:HHMM.txt) match = re.match(r'(\d{2})(\d{2})\.txt', filename) if match: hour = int(match.group(1)) platform_activity[platform_name]["hourly_distribution"][hour] += 1 except DataNotFoundError: pass current_date += timedelta(days=1) # 转换为可序列化的格式 result_activity = {} for platform, stats in platform_activity.items(): days_count = len(stats["days_active"]) avg_news_per_day = stats["news_count"] / days_count if days_count > 0 else 0 # 找出最活跃的时间段 most_active_hours = stats["hourly_distribution"].most_common(3) result_activity[platform] = { "total_updates": stats["total_updates"], "news_count": 
stats["news_count"], "days_active": days_count, "avg_news_per_day": round(avg_news_per_day, 2), "most_active_hours": [ {"hour": f"{hour:02d}:00", "count": count} for hour, count in most_active_hours ], "activity_score": round(stats["news_count"] / max(days_count, 1), 2) } # 按活跃度排序 sorted_platforms = sorted( result_activity.items(), key=lambda x: x[1]["activity_score"], reverse=True ) return { "success": True, "date_range": { "start": start_date.strftime("%Y-%m-%d"), "end": end_date.strftime("%Y-%m-%d") }, "platform_activity": dict(sorted_platforms), "most_active_platform": sorted_platforms[0][0] if sorted_platforms else None, "total_platforms": len(result_activity) } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def analyze_topic_lifecycle( self, topic: str, date_range: Optional[Union[Dict[str, str], str]] = None ) -> Dict: """ 话题生命周期分析 - 追踪话题从出现到消失的完整周期 Args: topic: 话题关键词 date_range: 日期范围(可选) - **格式**: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} - **默认**: 不指定时默认分析最近7天 Returns: 话题生命周期分析结果 Examples: 用户询问示例: - "分析'人工智能'这个话题的生命周期" - "看看'iPhone'话题是昙花一现还是持续热点" - "追踪'比特币'话题的热度变化" 代码调用示例: >>> # 分析话题生命周期(假设今天是 2025-11-17) >>> result = tools.analyze_topic_lifecycle( ... topic="人工智能", ... date_range={"start": "2025-10-19", "end": "2025-11-17"} ... 
)
            >>> print(result['analysis']['lifecycle_stage'])
        """
        try:
            # Validate parameters
            topic = validate_keyword(topic)

            # Resolve the date range (defaults to the last 7 days)
            if date_range:
                from ..utils.validators import validate_date_range
                date_range_tuple = validate_date_range(date_range)
                start_date, end_date = date_range_tuple
            else:
                # Default: the last 7 days
                end_date = datetime.now()
                start_date = end_date - timedelta(days=6)

            # Collect the topic's history
            lifecycle_data = []
            current_date = start_date
            while current_date <= end_date:
                try:
                    all_titles, _, _ = self.data_service.parser.read_all_titles_for_date(
                        date=current_date
                    )

                    # Count the day's mentions of the topic
                    count = 0
                    for _, titles in all_titles.items():
                        for title in titles.keys():
                            if topic.lower() in title.lower():
                                count += 1

                    lifecycle_data.append({
                        "date": current_date.strftime("%Y-%m-%d"),
                        "count": count
                    })

                except DataNotFoundError:
                    lifecycle_data.append({
                        "date": current_date.strftime("%Y-%m-%d"),
                        "count": 0
                    })

                current_date += timedelta(days=1)

            # Number of days analyzed
            total_days = (end_date - start_date).days + 1

            # Analyze the lifecycle stage
            counts = [item["count"] for item in lifecycle_data]

            if not any(counts):
                time_desc = f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}"
                raise DataNotFoundError(
                    f"在 {time_desc} 内未找到话题 '{topic}'",
                    suggestion="请尝试其他话题或扩大时间范围"
                )

            # Find the first and last appearance
            first_appearance = next((item["date"] for item in lifecycle_data if item["count"] > 0), None)
            last_appearance = next((item["date"] for item in reversed(lifecycle_data) if item["count"] > 0), None)

            # Find the peak
            max_count = max(counts)
            peak_index = counts.index(max_count)
            peak_date = lifecycle_data[peak_index]["date"]

            # Average daily mentions, over active days only (simple implementation)
            non_zero_counts = [c for c in counts if c > 0]
            avg_count = sum(non_zero_counts) / len(non_zero_counts) if non_zero_counts else 0

            # Determine the lifecycle stage
            recent_counts = counts[-3:]  # Last 3 days
            early_counts = counts[:3]  # First 3 days

            if sum(recent_counts) > sum(early_counts):
                lifecycle_stage = "上升期"
            elif sum(recent_counts) < sum(early_counts) * 0.5:
                lifecycle_stage = "衰退期"
            elif max_count in recent_counts:
                lifecycle_stage = "爆发期"
            else:
                lifecycle_stage = "稳定期"

            # Classify the topic: flash in the pan vs. sustained hot topic
            # NOTE: avg_count averages active days only, so with at most 2 active days
            # max_count cannot exceed avg_count * 2 and the first branch never fires as written.
            active_days = sum(1 for c in counts if c > 0)

            if active_days <= 2 and max_count > avg_count * 2:
                topic_type = "昙花一现"
            elif active_days >= total_days * 0.6:
                topic_type = "持续热点"
            else:
                topic_type = "周期性热点"

            return {
                "success": True,
                "topic": topic,
                "date_range": {
                    "start": start_date.strftime("%Y-%m-%d"),
                    "end": end_date.strftime("%Y-%m-%d"),
                    "total_days": total_days
                },
                "lifecycle_data": lifecycle_data,
                "analysis": {
                    "first_appearance": first_appearance,
                    "last_appearance": last_appearance,
                    "peak_date": peak_date,
                    "peak_count": max_count,
                    "active_days": active_days,
                    "avg_daily_mentions": round(avg_count, 2),
                    "lifecycle_stage": lifecycle_stage,
                    "topic_type": topic_type
                }
            }

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    def detect_viral_topics(
        self,
        threshold: float = 3.0,
        time_window: int = 24
    ) -> Dict:
        """
        Viral topic detection - automatically identify topics that suddenly surge in popularity

        Args:
            threshold: Growth-multiple threshold for a surge
            time_window: Detection time window (hours)

        Returns:
            List of viral topics

        Examples:
            Example user queries:
            - "Detect which topics have suddenly gone viral today"
            - "Check whether any news shows abnormal popularity"
            - "Warn about possible major events"

            Example code:
            >>> tools = AnalyticsTools()
            >>> result = tools.detect_viral_topics(
            ...     threshold=3.0,
            ...     time_window=24
            ... 
)
            >>> print(result['data'])
        """
        try:
            # Validate parameters
            threshold = validate_threshold(threshold, default=3.0, min_value=1.0, max_value=100.0)
            # NOTE: time_window is validated and echoed in the summary, but the
            # comparison below always uses yesterday's data as the baseline.
            time_window = validate_limit(time_window, default=24, max_limit=72)

            # Read the current data
            current_all_titles, _, _ = self.data_service.parser.read_all_titles_for_date()

            # Read yesterday's data as the baseline
            yesterday = datetime.now() - timedelta(days=1)
            try:
                previous_all_titles, _, _ = self.data_service.parser.read_all_titles_for_date(
                    date=yesterday
                )
            except DataNotFoundError:
                previous_all_titles = {}

            # Count current keyword frequencies
            current_keywords = Counter()
            current_keyword_titles = defaultdict(list)

            for _, titles in current_all_titles.items():
                for title in titles.keys():
                    keywords = self._extract_keywords(title)
                    current_keywords.update(keywords)
                    for kw in keywords:
                        current_keyword_titles[kw].append(title)

            # Count previous keyword frequencies
            previous_keywords = Counter()

            for _, titles in previous_all_titles.items():
                for title in titles.keys():
                    keywords = self._extract_keywords(title)
                    previous_keywords.update(keywords)

            # Detect abnormal growth
            viral_topics = []

            for keyword, current_count in current_keywords.items():
                previous_count = previous_keywords.get(keyword, 0)

                # Compute the growth multiple
                if previous_count == 0:
                    # Newly appearing topic
                    if current_count >= 5:  # Must appear at least 5 times to count as viral
                        growth_rate = float('inf')
                        is_viral = True
                    else:
                        continue
                else:
                    growth_rate = current_count / previous_count
                    is_viral = growth_rate >= threshold

                if is_viral:
                    viral_topics.append({
                        "keyword": keyword,
                        "current_count": current_count,
                        "previous_count": previous_count,
                        "growth_rate": round(growth_rate, 2) if growth_rate != float('inf') else "新话题",
                        "sample_titles": current_keyword_titles[keyword][:3],
                        "alert_level": "高" if growth_rate > threshold * 2 else "中"
                    })

            # Sort by growth rate (new topics are ranked by their current count)
            viral_topics.sort(
                key=lambda x: x["current_count"] if x["growth_rate"] == "新话题" else x["growth_rate"],
                reverse=True
            )

            if not viral_topics:
                return {
                    "success": True,
                    "summary": {
                        "description": "异常热度检测结果",
                        "total": 0,
                        "threshold": threshold,
                        "time_window": time_window
                    },
                    "data": [],
                    "message": f"未检测到热度增长超过 {threshold} 倍的话题"
                }

            return {
                "success": True,
                "summary": {
                    "description": "异常热度检测结果",
                    "total": len(viral_topics),
                    "threshold": threshold,
                    "time_window": time_window,
                    "detection_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                },
                "data": viral_topics
            }

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    def predict_trending_topics(
        self,
        lookahead_hours: int = 6,
        confidence_threshold: float = 0.7
    ) -> Dict:
        """
        Topic prediction - predict likely upcoming hot topics from historical data

        Args:
            lookahead_hours: How many hours ahead to predict
            confidence_threshold: Confidence threshold

        Returns:
            List of predicted high-potential topics

        Examples:
            Example user queries:
            - "Predict likely hot topics for the next 6 hours"
            - "Which topics might take off"
            - "Spot high-potential topics early"

            Example code:
            >>> tools = AnalyticsTools()
            >>> result = tools.predict_trending_topics(
            ...     lookahead_hours=6,
            ...     confidence_threshold=0.7
            ... )
            >>> print(result['data'])
        """
        try:
            # Validate parameters
            lookahead_hours = validate_limit(lookahead_hours, default=6, max_limit=48)
            confidence_threshold = validate_threshold(
                confidence_threshold,
                default=0.7,
                min_value=0.0,
                max_value=1.0,
                param_name="confidence_threshold"
            )

            # Collect the previous 3 days of data for prediction
            keyword_trends = defaultdict(list)

            for days_ago in range(3, 0, -1):
                date = datetime.now() - timedelta(days=days_ago)

                try:
                    all_titles, _, _ = self.data_service.parser.read_all_titles_for_date(
                        date=date
                    )

                    # Count keywords
                    keywords_count = Counter()
                    for _, titles in all_titles.items():
                        for title in titles.keys():
                            keywords = self._extract_keywords(title)
                            keywords_count.update(keywords)

                    # Record each keyword's history
                    for keyword, count in keywords_count.items():
                        keyword_trends[keyword].append(count)

                except DataNotFoundError:
                    pass

            # Add today's data
            try:
                all_titles, _, _ = self.data_service.parser.read_all_titles_for_date()

                keywords_count = Counter()
                keyword_titles = defaultdict(list)

                for _, titles in all_titles.items():
                    for title in titles.keys():
                        keywords = self._extract_keywords(title)
                        keywords_count.update(keywords)
                        for kw in keywords:
                            keyword_titles[kw].append(title)

                for 
keyword, count in keywords_count.items(): keyword_trends[keyword].append(count) except DataNotFoundError: raise DataNotFoundError( "未找到今天的数据", suggestion="请等待爬虫任务完成" ) # 预测潜力话题 predicted_topics = [] for keyword, trend_data in keyword_trends.items(): if len(trend_data) < 2: continue # 简单的线性趋势预测 # 计算增长率 recent_value = trend_data[-1] previous_value = trend_data[-2] if len(trend_data) >= 2 else 0 if previous_value == 0: if recent_value >= 3: growth_rate = 1.0 else: continue else: growth_rate = (recent_value - previous_value) / previous_value # 判断是否是上升趋势 if growth_rate > 0.3: # 增长超过30% # 计算置信度(基于趋势的稳定性) if len(trend_data) >= 3: # 检查是否连续增长 is_consistent = all( trend_data[i] <= trend_data[i+1] for i in range(len(trend_data)-1) ) confidence = 0.9 if is_consistent else 0.7 else: confidence = 0.6 if confidence >= confidence_threshold: predicted_topics.append({ "keyword": keyword, "current_count": recent_value, "growth_rate": round(growth_rate * 100, 2), "confidence": round(confidence, 2), "trend_data": trend_data, "prediction": "上升趋势,可能成为热点", "sample_titles": keyword_titles.get(keyword, [])[:3] }) # 按置信度和增长率排序 predicted_topics.sort( key=lambda x: (x["confidence"], x["growth_rate"]), reverse=True ) return { "success": True, "summary": { "description": "热点话题预测结果", "total": len(predicted_topics), "returned": min(20, len(predicted_topics)), "lookahead_hours": lookahead_hours, "confidence_threshold": confidence_threshold, "prediction_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S") }, "data": predicted_topics[:20], # 返回TOP 20 "note": "预测基于历史趋势,实际结果可能有偏差" } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } # ==================== 辅助方法 ==================== def _extract_keywords(self, title: str, min_length: int = 2) -> List[str]: """ 从标题中提取关键词(简单实现) Args: title: 标题文本 min_length: 最小关键词长度 Returns: 关键词列表 """ # 移除URL和特殊字符 title = re.sub(r'http[s]?://\S+', 
'', title)
        title = re.sub(r'[^\w\s]', ' ', title)

        # Simple tokenization (split on whitespace and common separators)
        words = re.split(r'[\s,。!?、]+', title)

        # Filter out stopwords and short words
        stopwords = {'的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这'}

        keywords = [
            word.strip() for word in words
            if word.strip() and len(word.strip()) >= min_length and word.strip() not in stopwords
        ]

        return keywords

    def _calculate_similarity(self, text1: str, text2: str) -> float:
        """
        Compute the similarity of two texts

        Args:
            text1: First text
            text2: Second text

        Returns:
            Similarity score (between 0 and 1)
        """
        # Use SequenceMatcher to compute the similarity ratio
        return SequenceMatcher(None, text1, text2).ratio()

    def _find_unique_topics(self, platform_stats: Dict) -> Dict[str, List[str]]:
        """
        Find the hot topics that are unique to each platform

        Args:
            platform_stats: Per-platform statistics

        Returns:
            Dict mapping each platform to its unique topics
        """
        unique_topics = {}

        # Collect each platform's top keywords
        platform_keywords = {}
        for platform, stats in platform_stats.items():
            top_keywords = set([kw for kw, _ in stats["top_keywords"].most_common(10)])
            platform_keywords[platform] = top_keywords

        # Find the keywords unique to each platform
        for platform, keywords in platform_keywords.items():
            # Gather the keywords of all other platforms
            other_keywords = set()
            for other_platform, other_kws in platform_keywords.items():
                if other_platform != platform:
                    other_keywords.update(other_kws)

            # Keep only the keywords no other platform has
            unique = keywords - other_keywords
            if unique:
                unique_topics[platform] = list(unique)[:5]  # At most 5

        return unique_topics

    # ==================== Cross-platform aggregation tools ====================

    def aggregate_news(
        self,
        date_range: Optional[Union[Dict[str, str], str]] = None,
        platforms: Optional[List[str]] = None,
        similarity_threshold: float = 0.7,
        limit: int = 50,
        include_url: bool = False
    ) -> Dict:
        """
        Cross-platform news aggregation - deduplicate and merge similar news

        Merges reports of the same event from different platforms into a single aggregated item,
        showing its coverage on each platform and its combined popularity.

        Args:
            date_range: Date range (optional)
                - Not specified: query today
                - {\"start\": \"YYYY-MM-DD\", \"end\": \"YYYY-MM-DD\"}: a date range
            platforms: Platform filter list, e.g. ['zhihu', 'weibo']
            similarity_threshold: Similarity threshold, between 0 and 1, default 0.7
            limit: Number of aggregated news items to return, default 50
            include_url: Whether to include URL links, default False

        Returns:
            Aggregation result dict, containing:
            - summary: Overview of the aggregation run
            - data: List of aggregated news items
            - statistics: Aggregation statistics
        """
        try:
            # 
参数验证 platforms = validate_platforms(platforms) similarity_threshold = validate_threshold( similarity_threshold, default=0.7, min_value=0.3, max_value=1.0 ) limit = validate_limit(limit, default=50) # 处理日期范围 if date_range: date_range_tuple = validate_date_range(date_range) start_date, end_date = date_range_tuple else: start_date = end_date = datetime.now() # 收集所有新闻 all_news = [] current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( date=current_date, platform_ids=platforms ) for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": current_date.strftime("%Y-%m-%d"), "ranks": info.get("ranks", []), "count": len(info.get("ranks", [])), "rank": info["ranks"][0] if info["ranks"] else 999 } if include_url: news_item["url"] = info.get("url", "") news_item["mobileUrl"] = info.get("mobileUrl", "") # 计算权重 news_item["weight"] = calculate_news_weight(news_item) all_news.append(news_item) except DataNotFoundError: pass current_date += timedelta(days=1) if not all_news: return { "success": True, "summary": { "description": "跨平台新闻聚合结果", "total": 0, "returned": 0 }, "data": [], "message": "未找到新闻数据" } # 执行聚合 aggregated = self._aggregate_similar_news( all_news, similarity_threshold, include_url ) # 按综合权重排序 aggregated.sort(key=lambda x: x["aggregate_weight"], reverse=True) # 限制返回数量 results = aggregated[:limit] # 统计信息 total_original = len(all_news) total_aggregated = len(aggregated) dedup_rate = 1 - (total_aggregated / total_original) if total_original > 0 else 0 platform_coverage = Counter() for item in aggregated: for p in item["platforms"]: platform_coverage[p] += 1 return { "success": True, "summary": { "description": "跨平台新闻聚合结果", "original_count": total_original, "aggregated_count": total_aggregated, "returned": 
len(results), "deduplication_rate": f"{dedup_rate * 100:.1f}%", "similarity_threshold": similarity_threshold, "date_range": { "start": start_date.strftime("%Y-%m-%d"), "end": end_date.strftime("%Y-%m-%d") } }, "data": results, "statistics": { "platform_coverage": dict(platform_coverage), "multi_platform_news": len([a for a in aggregated if len(a["platforms"]) > 1]), "single_platform_news": len([a for a in aggregated if len(a["platforms"]) == 1]) } } except MCPError as e: return {"success": False, "error": e.to_dict()} except Exception as e: return {"success": False, "error": {"code": "INTERNAL_ERROR", "message": str(e)}} def _aggregate_similar_news( self, news_list: List[Dict], threshold: float, include_url: bool ) -> List[Dict]: """ 对新闻列表进行相似度聚合 使用双层过滤策略:先用 Jaccard 快速粗筛,再用 SequenceMatcher 精确计算 Args: news_list: 新闻列表 threshold: 相似度阈值 include_url: 是否包含URL Returns: 聚合后的新闻列表 """ if not news_list: return [] # 预计算字符集合用于快速过滤 prepared_news = [] for news in news_list: char_set = set(news["title"]) prepared_news.append({ "data": news, "char_set": char_set, "set_len": len(char_set) }) # 按权重排序 sorted_items = sorted(prepared_news, key=lambda x: x["data"].get("weight", 0), reverse=True) aggregated = [] used_indices = set() PRE_FILTER_RATIO = 0.5 # 粗筛阈值系数 for i, item in enumerate(sorted_items): if i in used_indices: continue news = item["data"] base_set = item["char_set"] base_len = item["set_len"] group = { "representative_title": news["title"], "platforms": [news["platform_name"]], "platform_ids": [news["platform"]], "dates": [news["date"]], "best_rank": news["rank"], "total_count": news["count"], "aggregate_weight": news.get("weight", 0), "sources": [{ "platform": news["platform_name"], "rank": news["rank"], "date": news["date"] }] } if include_url and news.get("url"): group["urls"] = [{ "platform": news["platform_name"], "url": news.get("url", ""), "mobileUrl": news.get("mobileUrl", "") }] used_indices.add(i) # 查找相似新闻 for j in range(i + 1, len(sorted_items)): if j in 
used_indices: continue compare_item = sorted_items[j] compare_set = compare_item["char_set"] compare_len = compare_item["set_len"] # 快速粗筛:长度检查 if base_len == 0 or compare_len == 0: continue # 快速粗筛:长度比例检查 if min(base_len, compare_len) / max(base_len, compare_len) < (threshold * PRE_FILTER_RATIO): continue # 快速粗筛:Jaccard 相似度 intersection = len(base_set & compare_set) union = len(base_set | compare_set) jaccard_sim = intersection / union if union > 0 else 0 if jaccard_sim < (threshold * PRE_FILTER_RATIO): continue # 精确计算:SequenceMatcher other_news = compare_item["data"] real_similarity = self._calculate_similarity(news["title"], other_news["title"]) if real_similarity >= threshold: # 合并到当前组 if other_news["platform_name"] not in group["platforms"]: group["platforms"].append(other_news["platform_name"]) group["platform_ids"].append(other_news["platform"]) if other_news["date"] not in group["dates"]: group["dates"].append(other_news["date"]) group["best_rank"] = min(group["best_rank"], other_news["rank"]) group["total_count"] += other_news["count"] group["aggregate_weight"] += other_news.get("weight", 0) * 0.5 # 额外权重 group["sources"].append({ "platform": other_news["platform_name"], "rank": other_news["rank"], "date": other_news["date"] }) if include_url and other_news.get("url"): if "urls" not in group: group["urls"] = [] group["urls"].append({ "platform": other_news["platform_name"], "url": other_news.get("url", ""), "mobileUrl": other_news.get("mobileUrl", "") }) used_indices.add(j) # 添加聚合信息 group["platform_count"] = len(group["platforms"]) group["is_cross_platform"] = len(group["platforms"]) > 1 aggregated.append(group) return aggregated # ==================== 时期对比分析工具 ==================== def compare_periods( self, period1: Union[Dict[str, str], str], period2: Union[Dict[str, str], str], topic: Optional[str] = None, compare_type: str = "overview", platforms: Optional[List[str]] = None, top_n: int = 10 ) -> Dict: """ 时期对比分析 - 比较两个时间段的新闻数据 支持多种对比维度:热度对比、话题变化、平台活跃度等。 
        Args:
            period1: First time period
                - {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}: date range
                - "today", "yesterday", "this_week", "last_week", "this_month", "last_month": presets
            period2: Second time period (same format as period1)
            topic: Optional topic keyword (focuses the comparison on one topic)
            compare_type: Comparison type
                - "overview": overall overview (default)
                - "topic_shift": topic shift analysis
                - "platform_activity": platform activity comparison
            platforms: Platform filter list
            top_n: Number of top results to return, default 10

        Returns:
            Comparison result dictionary
        """
        try:
            # Validate parameters
            platforms = validate_platforms(platforms)
            top_n = validate_top_n(top_n, default=10)

            if compare_type not in ["overview", "topic_shift", "platform_activity"]:
                raise InvalidParameterError(
                    f"Unsupported comparison type: {compare_type}",
                    suggestion="Supported types: overview, topic_shift, platform_activity"
                )

            # Parse both periods into (start, end) date tuples
            date_range1 = self._parse_period(period1)
            date_range2 = self._parse_period(period2)

            if not date_range1 or not date_range2:
                raise InvalidParameterError(
                    "Invalid period format",
                    suggestion="Use {'start': 'YYYY-MM-DD', 'end': 'YYYY-MM-DD'} or a preset such as 'last_week'"
                )

            # Collect the data for both periods
            data1 = self._collect_period_data(date_range1, platforms, topic)
            data2 = self._collect_period_data(date_range2, platforms, topic)

            # Run the analysis that matches the comparison type
            if compare_type == "overview":
                analysis_result = self._compare_overview(data1, data2, date_range1, date_range2, top_n)
            elif compare_type == "topic_shift":
                analysis_result = self._compare_topic_shift(data1, data2, date_range1, date_range2, top_n)
            else:  # platform_activity
                analysis_result = self._compare_platform_activity(data1, data2, date_range1, date_range2)

            result = {
                "success": True,
                "summary": {
                    "description": f"Period comparison analysis ({compare_type})",
                    "compare_type": compare_type,
                    "periods": {
                        "period1": {
                            "start": date_range1[0].strftime("%Y-%m-%d"),
                            "end": date_range1[1].strftime("%Y-%m-%d")
                        },
                        "period2": {
                            "start": date_range2[0].strftime("%Y-%m-%d"),
                            "end": date_range2[1].strftime("%Y-%m-%d")
                        }
                    }
                },
                "data": analysis_result
            }

            if topic:
                result["summary"]["topic_filter"] = topic

            return result

        except MCPError as e:
            return {"success": False, "error": e.to_dict()}
        except Exception as e:
            return {"success": False, "error": {"code": "INTERNAL_ERROR",
"message": str(e)}} def _parse_period(self, period: Union[Dict[str, str], str]) -> Optional[tuple]: """解析时间段为日期范围元组""" today = datetime.now() if isinstance(period, str): if period == "today": return (today, today) elif period == "yesterday": yesterday = today - timedelta(days=1) return (yesterday, yesterday) elif period == "last_week": return (today - timedelta(days=7), today - timedelta(days=1)) elif period == "this_week": # 本周一到今天 days_since_monday = today.weekday() monday = today - timedelta(days=days_since_monday) return (monday, today) elif period == "last_month": return (today - timedelta(days=30), today - timedelta(days=1)) elif period == "this_month": first_of_month = today.replace(day=1) return (first_of_month, today) else: return None elif isinstance(period, dict): try: start = datetime.strptime(period["start"], "%Y-%m-%d") end = datetime.strptime(period["end"], "%Y-%m-%d") return (start, end) except (KeyError, ValueError): return None return None def _collect_period_data( self, date_range: tuple, platforms: Optional[List[str]], topic: Optional[str] ) -> Dict: """收集指定时期的新闻数据""" start_date, end_date = date_range all_news = [] all_keywords = Counter() platform_stats = Counter() current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( date=current_date, platform_ids=platforms ) for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): # 如果指定了话题,过滤不相关的新闻 if topic and topic.lower() not in title.lower(): continue news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": current_date.strftime("%Y-%m-%d"), "ranks": info.get("ranks", []), "rank": info["ranks"][0] if info["ranks"] else 999 } news_item["weight"] = calculate_news_weight(news_item) all_news.append(news_item) # 统计平台 platform_stats[platform_name] += 1 # 提取关键词 keywords = self._extract_keywords(title) 
all_keywords.update(keywords) except DataNotFoundError: pass current_date += timedelta(days=1) return { "news": all_news, "news_count": len(all_news), "keywords": all_keywords, "platform_stats": platform_stats, "date_range": date_range } def _compare_overview( self, data1: Dict, data2: Dict, range1: tuple, range2: tuple, top_n: int ) -> Dict: """总体概览对比""" # 计算变化 count_change = data2["news_count"] - data1["news_count"] count_change_pct = (count_change / data1["news_count"] * 100) if data1["news_count"] > 0 else 0 # TOP 关键词对比 top_kw1 = [kw for kw, _ in data1["keywords"].most_common(top_n)] top_kw2 = [kw for kw, _ in data2["keywords"].most_common(top_n)] new_keywords = [kw for kw in top_kw2 if kw not in top_kw1] disappeared_keywords = [kw for kw in top_kw1 if kw not in top_kw2] persistent_keywords = [kw for kw in top_kw1 if kw in top_kw2] # TOP 新闻对比 top_news1 = sorted(data1["news"], key=lambda x: x.get("weight", 0), reverse=True)[:top_n] top_news2 = sorted(data2["news"], key=lambda x: x.get("weight", 0), reverse=True)[:top_n] return { "overview": { "period1_count": data1["news_count"], "period2_count": data2["news_count"], "count_change": count_change, "count_change_percent": f"{count_change_pct:+.1f}%" }, "keyword_analysis": { "new_keywords": new_keywords[:5], "disappeared_keywords": disappeared_keywords[:5], "persistent_keywords": persistent_keywords[:5] }, "top_news": { "period1": [{"title": n["title"], "platform": n["platform_name"]} for n in top_news1], "period2": [{"title": n["title"], "platform": n["platform_name"]} for n in top_news2] } } def _compare_topic_shift( self, data1: Dict, data2: Dict, range1: tuple, range2: tuple, top_n: int ) -> Dict: """话题变化分析""" kw1 = data1["keywords"] kw2 = data2["keywords"] # 计算热度变化 all_keywords = set(kw1.keys()) | set(kw2.keys()) keyword_changes = [] for kw in all_keywords: count1 = kw1.get(kw, 0) count2 = kw2.get(kw, 0) change = count2 - count1 if count1 > 0: change_pct = (change / count1) * 100 elif count2 > 0: change_pct = 
100 # 新出现 else: change_pct = 0 keyword_changes.append({ "keyword": kw, "period1_count": count1, "period2_count": count2, "change": change, "change_percent": round(change_pct, 1) }) # 按变化幅度排序 rising = sorted([k for k in keyword_changes if k["change"] > 0], key=lambda x: x["change"], reverse=True)[:top_n] falling = sorted([k for k in keyword_changes if k["change"] < 0], key=lambda x: x["change"])[:top_n] new_topics = [k for k in keyword_changes if k["period1_count"] == 0 and k["period2_count"] > 0][:top_n] return { "rising_topics": rising, "falling_topics": falling, "new_topics": new_topics, "total_keywords": { "period1": len(kw1), "period2": len(kw2) } } def _compare_platform_activity( self, data1: Dict, data2: Dict, range1: tuple, range2: tuple ) -> Dict: """平台活跃度对比""" ps1 = data1["platform_stats"] ps2 = data2["platform_stats"] all_platforms = set(ps1.keys()) | set(ps2.keys()) platform_changes = [] for platform in all_platforms: count1 = ps1.get(platform, 0) count2 = ps2.get(platform, 0) change = count2 - count1 if count1 > 0: change_pct = (change / count1) * 100 elif count2 > 0: change_pct = 100 else: change_pct = 0 platform_changes.append({ "platform": platform, "period1_count": count1, "period2_count": count2, "change": change, "change_percent": round(change_pct, 1) }) # 按变化排序 platform_changes.sort(key=lambda x: x["change"], reverse=True) return { "platform_comparison": platform_changes, "most_active_growth": platform_changes[0] if platform_changes else None, "least_active_growth": platform_changes[-1] if platform_changes else None, "total_activity": { "period1": sum(ps1.values()), "period2": sum(ps2.values()) } } ================================================ FILE: mcp_server/tools/article_reader.py ================================================ """ 文章内容读取工具 通过 Jina AI Reader API 将 URL 转换为 LLM 友好的 Markdown 格式。 支持单篇和批量读取,内置速率限制和并发控制。 """ import time from typing import Dict, List import requests from ..utils.errors import MCPError, InvalidParameterError # 
Jina Reader 配置 JINA_READER_BASE = "https://r.jina.ai" DEFAULT_TIMEOUT = 30 # 秒 MAX_BATCH_SIZE = 5 # 单次批量最大篇数 BATCH_INTERVAL = 5.0 # 批量请求间隔(秒) class ArticleReaderTools: """文章内容读取工具类""" def __init__(self, project_root: str = None, jina_api_key: str = None): """ 初始化文章读取工具 Args: project_root: 项目根目录 jina_api_key: Jina API Key(可选,有 Key 可提升速率限制) """ self.project_root = project_root self.jina_api_key = jina_api_key self._last_request_time = 0.0 def _build_headers(self) -> Dict[str, str]: """构建请求头""" headers = { "Accept": "text/markdown", "X-Return-Format": "markdown", "X-No-Cache": "true", } if self.jina_api_key: headers["Authorization"] = f"Bearer {self.jina_api_key}" return headers def _throttle(self): """速率控制:确保请求间隔 5 秒""" now = time.time() elapsed = now - self._last_request_time if elapsed < BATCH_INTERVAL: time.sleep(BATCH_INTERVAL - elapsed) self._last_request_time = time.time() def read_article( self, url: str, timeout: int = DEFAULT_TIMEOUT ) -> Dict: """ 读取单篇文章内容(Markdown 格式) Args: url: 文章链接 timeout: 请求超时时间(秒),默认 30 Returns: 文章内容字典 """ try: if not url or not url.startswith(("http://", "https://")): raise InvalidParameterError( f"无效的 URL: {url}", suggestion="URL 必须以 http:// 或 https:// 开头" ) self._throttle() response = requests.get( f"{JINA_READER_BASE}/{url}", headers=self._build_headers(), timeout=timeout ) if response.status_code == 200: return { "success": True, "data": { "url": url, "content": response.text, "format": "markdown", "content_length": len(response.text) } } elif response.status_code == 429: return { "success": False, "error": { "code": "RATE_LIMITED", "message": "Jina Reader 速率限制,请稍后重试", "suggestion": "免费限制: 100 RPM / 2 并发,可配置 API Key 提升限额" } } else: return { "success": False, "error": { "code": "FETCH_FAILED", "message": f"HTTP {response.status_code}: {response.reason}", "url": url } } except requests.Timeout: return { "success": False, "error": { "code": "TIMEOUT", "message": f"请求超时({timeout}秒)", "url": url, "suggestion": "可尝试增加 timeout 参数" } } 
except MCPError as e: return {"success": False, "error": e.to_dict()} except Exception as e: return { "success": False, "error": { "code": "REQUEST_ERROR", "message": str(e), "url": url } } def read_articles_batch( self, urls: List[str], timeout: int = DEFAULT_TIMEOUT ) -> Dict: """ 批量读取多篇文章内容(最多 5 篇,间隔 5 秒) Args: urls: 文章链接列表 timeout: 每篇的请求超时时间(秒) Returns: 批量读取结果 """ try: if not urls: raise InvalidParameterError( "URL 列表不能为空", suggestion="请提供至少一个 URL" ) # 限制最多 5 篇 actual_urls = urls[:MAX_BATCH_SIZE] skipped = len(urls) - len(actual_urls) results = [] succeeded = 0 failed = 0 for i, url in enumerate(actual_urls): result = self.read_article(url=url, timeout=timeout) results.append({ "index": i + 1, "url": url, "success": result["success"], "data": result.get("data"), "error": result.get("error") }) if result["success"]: succeeded += 1 else: failed += 1 return { "success": True, "summary": { "description": "批量文章读取结果", "requested": len(urls), "processed": len(actual_urls), "succeeded": succeeded, "failed": failed, "skipped": skipped, "interval_seconds": BATCH_INTERVAL, }, "articles": results, "note": f"已跳过 {skipped} 篇(单次上限 {MAX_BATCH_SIZE} 篇)" if skipped > 0 else None } except MCPError as e: return {"success": False, "error": e.to_dict()} except Exception as e: return { "success": False, "error": { "code": "BATCH_ERROR", "message": str(e) } } ================================================ FILE: mcp_server/tools/config_mgmt.py ================================================ """ 配置管理工具 实现配置查询和管理功能。 """ from typing import Dict, Optional, Any, TypedDict from ..services.data_service import DataService from ..utils.validators import validate_config_section from ..utils.errors import MCPError class ErrorInfo(TypedDict, total=False): """错误信息结构""" code: str message: str suggestion: str class ConfigResult(TypedDict): """配置查询结果 - success 字段必需,其他字段可选""" success: bool config: Optional[Dict[str, Any]] section: Optional[str] error: Optional[ErrorInfo] class ConfigManagementTools: 
"""配置管理工具类""" def __init__(self, project_root: str = None): """ 初始化配置管理工具 Args: project_root: 项目根目录 """ self.data_service = DataService(project_root) def get_current_config(self, section: Optional[str] = None) -> ConfigResult: """ 获取当前系统配置 Args: section: 配置节 - all/crawler/push/keywords/weights,默认all Returns: 配置字典 Example: >>> tools = ConfigManagementTools() >>> result = tools.get_current_config(section="crawler") >>> print(result['crawler']['platforms']) """ try: # 参数验证 section = validate_config_section(section) # 获取配置 config = self.data_service.get_current_config(section=section) return ConfigResult( success=True, config=config, section=section, error=None ) except MCPError as e: return ConfigResult( success=False, config=None, section=None, error=e.to_dict() ) except Exception as e: return ConfigResult( success=False, config=None, section=None, error={"code": "INTERNAL_ERROR", "message": str(e), "suggestion": "请查看服务日志获取详细信息"} ) ================================================ FILE: mcp_server/tools/data_query.py ================================================ """ 数据查询工具 实现P0核心的数据查询工具。 """ from typing import Dict, List, Optional, Union from ..services.data_service import DataService from ..utils.validators import ( validate_platforms, validate_limit, validate_keyword, validate_date_range, validate_top_n, validate_mode, validate_date_query, normalize_date_range ) from ..utils.errors import MCPError class DataQueryTools: """数据查询工具类""" def __init__(self, project_root: str = None): """ 初始化数据查询工具 Args: project_root: 项目根目录 """ self.data_service = DataService(project_root) def get_latest_news( self, platforms: Optional[List[str]] = None, limit: Optional[int] = None, include_url: bool = False ) -> Dict: """ 获取最新一批爬取的新闻数据 Args: platforms: 平台ID列表,如 ['zhihu', 'weibo'] limit: 返回条数限制,默认20 include_url: 是否包含URL链接,默认False(节省token) Returns: 新闻列表字典 Example: >>> tools = DataQueryTools() >>> result = tools.get_latest_news(platforms=['zhihu'], limit=10) >>> print(result['total']) 10 
""" try: # 参数验证 platforms = validate_platforms(platforms) limit = validate_limit(limit, default=50) # 获取数据 news_list = self.data_service.get_latest_news( platforms=platforms, limit=limit, include_url=include_url ) return { "success": True, "summary": { "description": "最新一批爬取的新闻数据", "total": len(news_list), "returned": len(news_list), "platforms": platforms or "全部平台" }, "data": news_list } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def search_news_by_keyword( self, keyword: str, date_range: Optional[Union[Dict, str]] = None, platforms: Optional[List[str]] = None, limit: Optional[int] = None ) -> Dict: """ 按关键词搜索历史新闻 Args: keyword: 搜索关键词(必需) date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} platforms: 平台过滤列表 limit: 返回条数限制(可选,默认返回所有) Returns: 搜索结果字典 Example (假设今天是 2025-11-17): >>> tools = DataQueryTools() >>> result = tools.search_news_by_keyword( ... keyword="人工智能", ... date_range={"start": "2025-11-08", "end": "2025-11-17"}, ... limit=50 ... 
) >>> print(result['total']) """ try: # 参数验证 keyword = validate_keyword(keyword) date_range_tuple = validate_date_range(date_range) platforms = validate_platforms(platforms) if limit is not None: limit = validate_limit(limit, default=100) # 搜索数据 search_result = self.data_service.search_news_by_keyword( keyword=keyword, date_range=date_range_tuple, platforms=platforms, limit=limit ) return { **search_result, "success": True } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def get_trending_topics( self, top_n: Optional[int] = None, mode: Optional[str] = None, extract_mode: Optional[str] = None ) -> Dict: """ 获取热点话题统计 Args: top_n: 返回TOP N话题,默认10 mode: 时间模式 - "daily": 当日累计数据统计 - "current": 最新一批数据统计(默认) extract_mode: 提取模式 - "keywords": 统计预设关注词(基于 config/frequency_words.txt,默认) - "auto_extract": 自动从新闻标题提取高频词 Returns: 话题频率统计字典 Example: >>> tools = DataQueryTools() >>> # 使用预设关注词 >>> result = tools.get_trending_topics(top_n=5, mode="current") >>> # 自动提取高频词 >>> result = tools.get_trending_topics(top_n=10, extract_mode="auto_extract") """ try: # 参数验证 top_n = validate_top_n(top_n, default=10) valid_modes = ["daily", "current"] mode = validate_mode(mode, valid_modes, default="current") # 验证 extract_mode if extract_mode is None: extract_mode = "keywords" elif extract_mode not in ["keywords", "auto_extract"]: return { "success": False, "error": { "code": "INVALID_PARAMETER", "message": f"不支持的提取模式: {extract_mode}", "suggestion": "支持的模式: keywords, auto_extract" } } # 获取趋势话题 trending_result = self.data_service.get_trending_topics( top_n=top_n, mode=mode, extract_mode=extract_mode ) return { **trending_result, "success": True } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def get_news_by_date( self, date_range: 
Optional[Union[Dict[str, str], str]] = None,
        platforms: Optional[List[str]] = None,
        limit: Optional[int] = None,
        include_url: bool = False
    ) -> Dict:
        """
        Query news by date, with support for natural-language dates.

        Args:
            date_range: Date range (optional, defaults to "今天", i.e. today). Supported forms:
                - Range object: {"start": "2025-01-01", "end": "2025-01-07"}
                  (only the start date is queried)
                - Relative date: 今天 (today), 昨天 (yesterday), 前天 (the day before yesterday), 3天前 (3 days ago)
                - Single-day string: 2025-10-10
            platforms: List of platform IDs, e.g. ['zhihu', 'weibo']
            limit: Maximum number of items to return, default 50
            include_url: Whether to include URLs, default False (saves tokens)

        Returns:
            Dictionary containing the news list

        Example:
            >>> tools = DataQueryTools()
            >>> # No date given, so today is queried
            >>> result = tools.get_news_by_date(platforms=['zhihu'], limit=20)
            >>> # Explicit date
            >>> result = tools.get_news_by_date(
            ...     date_range="昨天",
            ...     platforms=['zhihu'],
            ...     limit=20
            ... )
            >>> print(result['summary']['total'])
            20
        """
        try:
            # Validate parameters; default to today
            if date_range is None:
                date_range = "今天"

            # Normalize date_range (handles values that arrive as serialized JSON strings)
            date_range = normalize_date_range(date_range)

            # date_range may be a string or a range object
            if isinstance(date_range, dict):
                # Range object: use its start date
                date_str = date_range.get('start', '今天')
            else:
                date_str = date_range
            target_date = validate_date_query(date_str)
            platforms = validate_platforms(platforms)
            limit = validate_limit(limit, default=50)

            # Fetch the data
            news_list = self.data_service.get_news_by_date(
                target_date=target_date,
                platforms=platforms,
                limit=limit,
                include_url=include_url
            )

            return {
                "success": True,
                "summary": {
                    "description": f"News queried by date ({target_date.strftime('%Y-%m-%d')})",
                    "total": len(news_list),
                    "returned": len(news_list),
                    "date": target_date.strftime("%Y-%m-%d"),
                    "date_range": date_range,
                    "platforms": platforms or "all platforms"
                },
                "data": news_list
            }

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    # ========================================
    # RSS data query methods
    # ========================================

    def get_latest_rss(
        self,
        feeds: Optional[List[str]] = None,
        days: int = 1,
        limit: Optional[int] = None,
        include_summary: bool = False
    ) -> Dict:
        """
        Get the latest RSS data (supports multi-day queries).

        Args:
            feeds: List of RSS feed IDs, e.g.
['hacker-news', '36kr'] days: 获取最近 N 天的数据,默认 1(仅今天),最大 30 天 limit: 返回条数限制,默认50 include_summary: 是否包含摘要,默认False(节省token) Returns: RSS 条目列表字典 """ try: limit = validate_limit(limit, default=50) rss_list = self.data_service.get_latest_rss( feeds=feeds, days=days, limit=limit, include_summary=include_summary ) return { "success": True, "summary": { "description": f"最近 {days} 天的 RSS 订阅数据" if days > 1 else "最新的 RSS 订阅数据", "total": len(rss_list), "returned": len(rss_list), "days": days, "feeds": feeds or "全部订阅源" }, "data": rss_list } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def search_rss( self, keyword: str, feeds: Optional[List[str]] = None, days: int = 7, limit: Optional[int] = None, include_summary: bool = False ) -> Dict: """ 搜索 RSS 数据 Args: keyword: 搜索关键词 feeds: RSS 源 ID 列表 days: 搜索最近 N 天的数据,默认 7 天 limit: 返回条数限制,默认50 include_summary: 是否包含摘要 Returns: 匹配的 RSS 条目列表 """ try: keyword = validate_keyword(keyword) limit = validate_limit(limit, default=50) if days < 1 or days > 30: days = 7 rss_list = self.data_service.search_rss( keyword=keyword, feeds=feeds, days=days, limit=limit, include_summary=include_summary ) return { "success": True, "summary": { "description": f"RSS 搜索结果(关键词: {keyword})", "total": len(rss_list), "returned": len(rss_list), "keyword": keyword, "feeds": feeds or "全部订阅源", "days": days }, "data": rss_list } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def get_rss_feeds_status(self) -> Dict: """ 获取 RSS 源状态 Returns: RSS 源状态信息 """ try: status = self.data_service.get_rss_feeds_status() return { **status, "success": True } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", 
"message": str(e) } } ================================================ FILE: mcp_server/tools/notification.py ================================================ # coding=utf-8 """ 通知推送工具 支持向已配置的通知渠道发送消息,自动检测 config.yaml 和 .env 中的渠道配置。 接受 markdown 格式内容,内部按各渠道要求自动转换格式后发送。 """ import json import os import re import smtplib import time from datetime import datetime from email.header import Header from email.mime.multipart import MIMEMultipart from email.mime.text import MIMEText from email.utils import formataddr, formatdate, make_msgid from pathlib import Path from typing import Any, Dict, List, Optional from urllib.parse import urlparse import requests import yaml from trendradar.core.loader import _load_webhook_config, _load_notification_config from trendradar.notification.batch import ( truncate_to_bytes, get_batch_header, get_max_batch_header_size, add_batch_headers, ) from trendradar.notification.formatters import strip_markdown from trendradar.notification.senders import SMTP_CONFIGS from ..utils.errors import MCPError, InvalidParameterError # ==================== 渠道启用判断规则 ==================== # 每个渠道需要哪些配置项都非空才算"已配置" # 注意:NTFY_SERVER_URL 在 loader 中有默认值 "https://ntfy.sh",不作为判断依据 _CHANNEL_REQUIREMENTS = { "feishu": ["FEISHU_WEBHOOK_URL"], "dingtalk": ["DINGTALK_WEBHOOK_URL"], "wework": ["WEWORK_WEBHOOK_URL"], "telegram": ["TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID"], "email": ["EMAIL_FROM", "EMAIL_PASSWORD", "EMAIL_TO"], "ntfy": ["NTFY_TOPIC"], "bark": ["BARK_URL"], "slack": ["SLACK_WEBHOOK_URL"], "generic_webhook": ["GENERIC_WEBHOOK_URL"], } # 渠道显示名称 _CHANNEL_NAMES = { "feishu": "飞书", "dingtalk": "钉钉", "wework": "企业微信", "telegram": "Telegram", "email": "邮件", "ntfy": "ntfy", "bark": "Bark", "slack": "Slack", "generic_webhook": "通用 Webhook", } # ==================== 批次处理配置 ==================== # 各渠道最大批次字节数的默认值 # 运行时从 config.yaml → advanced.batch_size 读取覆盖 _CHANNEL_BATCH_SIZES_DEFAULT = { "feishu": 30000, # config.yaml: advanced.batch_size.feishu "dingtalk": 20000, # 
config.yaml: advanced.batch_size.dingtalk
    "wework": 4000,         # config.yaml: advanced.batch_size.default
    "telegram": 4000,       # config.yaml: advanced.batch_size.default
    "email": 0,             # Email has no byte limit, so it is not split into batches
    "ntfy": 3800,           # Strict 4KB limit (default from the ntfy code)
    "bark": 4000,           # config.yaml: advanced.batch_size.bark
    "slack": 4000,          # config.yaml: advanced.batch_size.slack
    "generic_webhook": 4000,
}

# Channels that show the newest message first; their batches must be sent in reverse order
_REVERSE_BATCH_CHANNELS = {"ntfy", "bark"}

# Default interval between batches (seconds); read at runtime from config.yaml → advanced.batch_send_interval
_BATCH_INTERVAL_DEFAULT = 3.0


# ==================== Batch processing ====================
# truncate_to_bytes, get_batch_header, get_max_batch_header_size and
# add_batch_headers are reused from trendradar.notification.batch

def _split_text_into_batches(text: str, max_bytes: int) -> List[str]:
    """Split text into batches by byte limit, preferring paragraph boundaries (double newline).

    Splitting strategy (following the atomicity guarantee in trendradar's splitter.py):
    1. Split by paragraph (double newline \\n\\n) first
    2. If a paragraph still exceeds the limit, split it by line (\\n)
    3. If a single line still exceeds the limit, truncate it safely with truncate_to_bytes

    Args:
        text: Text already converted to the target channel's format
        max_bytes: Maximum bytes per batch (batch header reserve already deducted)

    Returns:
        List of text batches
    """
    if max_bytes <= 0 or len(text.encode("utf-8")) <= max_bytes:
        return [text]

    # Split by paragraph
    paragraphs = text.split("\n\n")
    batches = []
    current = ""

    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            # The paragraph does not fit; flush what has been accumulated so far
            if current:
                batches.append(current)
                current = ""
            # Check whether the paragraph alone exceeds the limit
            if len(para.encode("utf-8")) <= max_bytes:
                current = para
            else:
                # The paragraph itself is too large; split it by line
                lines = para.split("\n")
                for line in lines:
                    candidate = f"{current}\n{line}" if current else line
                    if len(candidate.encode("utf-8")) <= max_bytes:
                        current = candidate
                    else:
                        if current:
                            batches.append(current)
                            current = ""
                        # A single line is too large; truncate repeatedly until it is consumed
                        if len(line.encode("utf-8")) > max_bytes:
                            remaining = line
                            while remaining:
                                chunk = truncate_to_bytes(remaining, max_bytes)
                                if not chunk:
                                    break
                                batches.append(chunk)
                                # Drop the part that was just emitted
                                remaining = remaining[len(chunk):]
                        else:
                            current = line

    if current:
        batches.append(current)

    return batches if batches else [text]


def 
_format_for_channel(message: str, channel_id: str) -> str:
    """Adapt generic Markdown and convert it to the target channel's format.

    Single entry point: adapt first (strip unsupported syntax), then convert
    (Markdown→HTML/mrkdwn, etc.). The returned text can be used directly for
    byte-based splitting and sending.

    Args:
        message: Original Markdown text
        channel_id: Target channel ID

    Returns:
        Text in the target channel's format
    """
    if channel_id == "feishu":
        return _adapt_markdown_for_feishu(message)
    elif channel_id == "dingtalk":
        return _adapt_markdown_for_dingtalk(message)
    elif channel_id == "wework":
        return _adapt_markdown_for_wework(message)
    elif channel_id == "telegram":
        return _markdown_to_telegram_html(message)
    elif channel_id == "ntfy":
        return _adapt_markdown_for_ntfy(message)
    elif channel_id == "bark":
        return _adapt_markdown_for_bark(message)
    elif channel_id == "slack":
        return _convert_markdown_to_slack(message)
    else:
        # email, generic_webhook: keep the original Markdown
        return message


def _prepare_batches(message: str, channel_id: str, batch_sizes: Optional[Dict] = None) -> List[str]:
    """Full batching pipeline: format adaptation → byte splitting → batch headers.

    Args:
        message: Original Markdown text
        channel_id: Target channel ID
        batch_sizes: Per-channel batch sizes (from config.yaml); None uses the defaults

    Returns:
        Batches ready to send (headers added, reverse order already applied)
    """
    sizes = batch_sizes or _CHANNEL_BATCH_SIZES_DEFAULT
    max_bytes = sizes.get(channel_id, sizes.get("default", 4000))
    if max_bytes <= 0:
        # No byte limit (e.g. email): return the original text
        return [message]

    formatted = _format_for_channel(message, channel_id)

    # Split after reserving room for the batch header
    header_reserve = get_max_batch_header_size(channel_id)
    batches = _split_text_into_batches(formatted, max_bytes - header_reserve)

    # Add batch headers (skipped when there is only one batch)
    batches = add_batch_headers(batches, channel_id, max_bytes)

    # ntfy/Bark are sent in reverse order (their clients show the newest first)
    if channel_id in _REVERSE_BATCH_CHANNELS and len(batches) > 1:
        batches = list(reversed(batches))

    return batches


CHANNEL_FORMAT_GUIDES = {
    "feishu": {
        "name": "Feishu",
        "format": "Markdown (card message)",
        "max_length": "about 29000 bytes",
        "supported": [
            "**bold**",
            "[link text](URL)",
            "<font color='red'>colored text</font>",
            "--- (divider)",
            "line breaks to separate paragraphs",
        ],
        "unsupported": [
            "# heading syntax (not rendered as a heading)",
            "> blockquote",
            "tables / embedded images",
        ],
        "prompt": (
            "Feishu card Markdown formatting strategy:\n"
            "1. Use **bold** for subheadings and key terms\n"
            "2. Use <font color='red'>red</font> for urgent/important content\n"
            "3. Use <font color='grey'>grey</font> for auxiliary information (time, source)\n"
            "4. 
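Use <font color='orange'>orange</font> for warnings\n"
            "5. Use <font color='green'>green</font> for positive/success information\n"
            "6. Use [text](URL) to add clickable links\n"
            "7. Use --- to separate topic areas\n"
            "8. Do not use # heading syntax (not rendered inside cards)\n"
            "9. Do not use > quote syntax\n"
            "10. Use line breaks + bold to simulate hierarchy"
        ),
    },
    "dingtalk": {
        "name": "DingTalk",
        "format": "Markdown",
        "max_length": "about 20000 bytes",
        "supported": [
            "### level-3 heading / #### level-4 heading",
            "**bold**",
            "[link text](URL)",
            "> blockquote",
            "--- (divider)",
            "- unordered list / 1. ordered list",
        ],
        "unsupported": [
            "# level-1 heading / ## level-2 heading (may not render)",
            "<font> colored text",
            "~~strikethrough~~",
            "tables / embedded images",
        ],
        "prompt": (
            "DingTalk Markdown formatting strategy:\n"
            "1. Use ### or #### for section headings (not # or ##)\n"
            "2. Use **bold** to highlight keywords and figures\n"
            "3. Use > blockquotes for notes or supplementary remarks\n"
            "4. Use --- to separate topic areas\n"
            "5. Use [text](URL) to add clickable links\n"
            "6. Use ordered lists (1. 2. 3.) to organize key points\n"
            "7. Do not use <font> color tags (not supported by DingTalk)\n"
            "8. Do not use strikethrough syntax\n"
            "9. Leave a blank line between headings and body text for readability"
        ),
    },
    "wework": {
        "name": "WeCom",
        "format": "Markdown (group bot) / plain text (personal WeChat)",
        "max_length": "about 4000 bytes",
        "supported": [
            "**bold**",
            "[link text](URL)",
            "> blockquote (first line only)",
        ],
        "unsupported": [
            "# heading syntax",
            "--- (horizontal rule)",
            "<font> colored text",
            "~~strikethrough~~",
            "tables / embedded images / ordered lists",
        ],
        "prompt": (
            "WeCom Markdown formatting strategy:\n"
            "1. Use **bold** for subheadings and key terms\n"
            "2. Use [text](URL) to add clickable links\n"
            "3. Use > blockquotes for notes (first line only)\n"
            "4. Keep content concise; the limit is 4KB\n"
            "5. Do not use # heading syntax (not rendered)\n"
            "6. Do not use --- (not rendered); separate areas with several line breaks\n"
            "7. Do not use <font> color tags\n"
            "8. Do not use strikethrough or ordered lists\n"
            "9. Use line breaks + bold to simulate hierarchy\n"
            "10. In personal WeChat mode all formatting is stripped to plain text"
        ),
    },
    "telegram": {
        "name": "Telegram",
        "format": "HTML (converted automatically from Markdown)",
        "max_length": "about 4096 characters",
        "supported": [
            "<b>bold</b> (converted from **bold**)",
            "<i>italic</i> (converted from *italic*)",
            "<s>strikethrough</s> (converted from ~~strikethrough~~)",
            "<code>inline code</code> (converted from `code`)",
            "<a href='URL'>link</a> (converted from [text](URL))",
            "<blockquote>quote</blockquote> (converted from > quote)",
        ],
        "unsupported": [
            "# heading syntax (the # prefix is stripped automatically)",
            "--- (divider, stripped automatically)",
            "<font> colored text (stripped automatically)",
            "tables / embedded images",
        ],
        "prompt": (
            "Telegram HTML formatting strategy (input is still Markdown, converted to HTML automatically):\n"
            "1. Use **bold** to highlight keywords (becomes <b>)\n"
            "2. Use *italic* for auxiliary information (becomes <i>)\n"
            "3. Use `code` for data values/times (becomes <code>)\n"
            "4. Use [text](URL) to add links (becomes <a href>)\n"
            "5. Start a line with > for a blockquote (becomes <blockquote>)\n"
            "6. Do not use # headings (Telegram has no heading style; the # is simply stripped)\n"
            "7. Do not use --- dividers (stripped); separate with blank lines\n"
            "8. Do not use <font> color tags (stripped)\n"
            "9. Content is limited to 4096 characters; keep it concise\n"
            "10. 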
链接默认禁用预览,适合信息密集型消息"
        ),
    },
    "email": {
        "name": "邮件",
        "format": "HTML(完整网页,从 Markdown 转换)",
        "max_length": "无硬限制",
        "supported": [
            "# / ## / ### 标题(转为 <h1>/<h2>/<h3>)",
            "**粗体** / *斜体* / ~~删除线~~",
            "[链接文本](URL)",
            "`行内代码`",
            "---(水平分割线)",
        ],
        "unsupported": [
            "<font> 彩色文本(转义显示)",
            "复杂表格",
        ],
        "prompt": (
            "邮件 HTML 格式化策略(输入为 Markdown,自动转换为带样式 HTML):\n"
            "1. 用 # / ## / ### 创建清晰的标题层级\n"
            "2. 用 **粗体** 和 *斜体* 增强可读性\n"
            "3. 用 [文本](URL) 添加链接(蓝色可点击)\n"
            "4. 用 --- 分割不同章节\n"
            "5. 用 `代码` 标记技术术语或数据\n"
            "6. 可以写较长内容,邮件无严格长度限制\n"
            "7. 邮件主题自动追加日期时间\n"
            "8. 自动附带纯文本备用版本"
        ),
    },
    "ntfy": {
        "name": "ntfy",
        "format": "Markdown(原生支持)",
        "max_length": "约 3800 字节(单条 4KB 限制)",
        "supported": [
            "**粗体** / *斜体*",
            "[链接文本](URL)",
            "> 引用块",
            "`行内代码`",
            "- 列表",
        ],
        "unsupported": [
            "# 标题语法(渲染取决于客户端)",
            "<font> 彩色文本",
            "---(渲染取决于客户端)",
            "表格",
        ],
        "prompt": (
            "ntfy Markdown 格式化策略:\n"
            "1. 用 **粗体** 突出关键词\n"
            "2. 用 [文本](URL) 添加可点击链接\n"
            "3. 用 > 引用块展示备注\n"
            "4. 用 `代码` 标记数据值\n"
            "5. 内容要精炼,受 4KB 限制\n"
            "6. 不要用 <font> 颜色标签(无效)\n"
            "7. 不要依赖 # 标题和 --- 分割线\n"
            "8. 用空行和粗体组织信息层级"
        ),
    },
    "bark": {
        "name": "Bark",
        "format": "Markdown(iOS 推送)",
        "max_length": "约 3600 字节(APNs 4KB 限制)",
        "supported": [
            "**粗体**",
            "[链接文本](URL)",
            "基础文本格式",
        ],
        "unsupported": [
            "# 标题语法",
            "<font> 彩色文本",
            "---(分割线)",
            "> 引用块",
            "复杂嵌套格式",
        ],
        "prompt": (
            "Bark 格式化策略(iOS 推送通知):\n"
            "1. 内容要极度精简,移动端阅读场景\n"
            "2. 用 **粗体** 标记核心信息\n"
            "3. 用 [文本](URL) 添加链接\n"
            "4. 不要用标题/颜色/引用等复杂格式\n"
            "5. 受 APNs 4KB 限制,控制内容长度\n"
            "6. 层级结构靠缩进和换行实现\n"
            "7. 适合简短通知和摘要,不适合长文"
        ),
    },
    "slack": {
        "name": "Slack",
        "format": "mrkdwn(Slack 专有格式,自动从 Markdown 转换)",
        "max_length": "约 4000 字节",
        "supported": [
            "*粗体*(从 **粗体** 转换)",
            "_斜体_",
            "~删除线~(从 ~~删除线~~ 转换)",
            "<URL|文本>(从 [文本](URL) 转换)",
            "`行内代码`",
            "```代码块```",
            "> 引用块",
        ],
        "unsupported": [
            "# 标题语法(剥离为粗体)",
            "<font> 彩色文本",
            "--- 分割线(渲染不稳定)",
            "表格",
        ],
        "prompt": (
            "Slack mrkdwn 格式化策略(输入为 Markdown,自动转换为 mrkdwn):\n"
            "1. 用 **粗体** 突出关键词(转为 *粗体*)\n"
            "2. 用 ~~删除线~~ 标记过时信息(转为 ~删除线~)\n"
            "3. 用 [文本](URL) 添加链接(转为 <URL|文本>)\n"
            "4. 用 > 引用块展示备注\n"
            "5. 用 `代码` 标记数据值\n"
            "6. 不要用 # 标题(Slack 无标题样式)\n"
            "7. 不要用 <font> 颜色标签\n"
            "8. 用空行和粗体组织信息层级"
        ),
    },
    "generic_webhook": {
        "name": "通用 Webhook",
        "format": "Markdown(或自定义模板)",
        "max_length": "约 4000 字节",
        "supported": ["标准 Markdown 语法"],
        "unsupported": ["取决于接收端"],
        "prompt": (
            "通用 Webhook 格式化策略:\n"
            "1. 使用标准 Markdown 格式\n"
            "2. 避免使用特殊平台专有语法\n"
            "3. 如配置了自定义模板,内容会填充到 {content} 占位符"
        ),
    },
}


# ==================== Channel Markdown adaptation ====================

def _adapt_markdown_for_feishu(text: str) -> str:
    """Adapt generic Markdown to Feishu card Markdown.

    Feishu cards support: **bold**, [link](url), <font> colored text, ---
    Not supported: # headings, > blockquotes
    """
    # Convert # headings to bold (Feishu cards do not render heading syntax)
    text = re.sub(r'^#{1,6}\s+(.+)$', r'**\1**', text, flags=re.MULTILINE)

    # Strip the blockquote prefix (not supported by Feishu)
    text = re.sub(r'^>\s*', '', text, flags=re.MULTILINE)

    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def _adapt_markdown_for_dingtalk(text: str) -> str:
    """Adapt generic Markdown to DingTalk Markdown.

    DingTalk supports: ### and #### headings, **bold**, [link](url), > quotes, ---
    Not supported: # and ## headings, <font> colored text, ~~strikethrough~~
    """
    # Strip <font> tags (not supported by DingTalk), keeping their content
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)

    # Demote # and ## headings to ### (DingTalk only supports ### and ####)
    text = re.sub(r'^##\s+(.+)$', r'### \1', text, flags=re.MULTILINE)
    text = re.sub(r'^#\s+(.+)$', r'### \1', text, flags=re.MULTILINE)

    # Strip strikethrough syntax (not supported by DingTalk)
    text = re.sub(r'~~(.+?)~~', r'\1', text)

    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def _adapt_markdown_for_wework(text: str) -> str:
    """Adapt generic Markdown to WeCom (WeWork) Markdown.

    WeCom supports: **bold**, [link](url), > quotes (limited)
    Not supported: # headings, ---, <font>, ~~strikethrough~~, ordered lists
    """
    # Strip <font> tags, keeping their content
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)

    # Convert # headings to bold (WeCom does not render heading syntax)
    text = re.sub(r'^#{1,6}\s+(.+)$', r'**\1**', text, flags=re.MULTILINE)

    # Replace --- rules with extra newlines (WeCom does not render horizontal rules)
    text = re.sub(r'^[\-\*]{3,}\s*$', '\n\n', text, flags=re.MULTILINE)

    # Strip strikethrough syntax (not supported by WeCom)
    text = re.sub(r'~~(.+?)~~', r'\1', text)

    # Collapse runs of blank lines (keep at most two)
    text = re.sub(r'\n{4,}', '\n\n\n', text)

    return text.strip()


def _adapt_markdown_for_ntfy(text: str) -> str:
    """Adapt generic Markdown for ntfy.

    ntfy supports: **bold**, *italic*, [link](url), > quotes, `code`
    Unreliable: # headings, ---, <font>
    """
    # Strip <font> tags (not supported by ntfy), keeping their content
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)

    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def _adapt_markdown_for_bark(text: str) -> str:
    """Adapt generic Markdown for Bark (iOS push).

    Bark supports: **bold**, [link](url), basic text
    Not supported: # headings, <font>, ---, > quotes, complex nesting
    """
    # Strip <font> tags, keeping their content
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)

    # Convert # headings to bold
    text = re.sub(r'^#{1,6}\s+(.+)$', r'**\1**', text, flags=re.MULTILINE)

    # Replace --- rules with a newline
    text = re.sub(r'^[\-\*]{3,}\s*$', '\n', text, flags=re.MULTILINE)

    # Strip the blockquote prefix
    text = re.sub(r'^>\s*', '', text, flags=re.MULTILINE)

    # Strip strikethrough syntax
    text = re.sub(r'~~(.+?)~~', r'\1', text)

    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


# ==================== Format conversion ====================

def _markdown_to_telegram_html(text: str) -> str:
    """
    Convert Markdown to the HTML subset supported by Telegram.

    Tags supported by Telegram: <b>, <i>, <s>, <code>, <a href="url">text</a>, <blockquote>
    """
    # Preprocess: strip <font> tags (not supported by Telegram), keeping their content
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)

    lines = text.split('\n')
    result_lines = []
    in_blockquote = False

    for line in lines:
        # Convert heading markers (# ## ###) to bold
        header_match = re.match(r'^(#{1,6})\s+(.+)$', line)
        if header_match:
            line = f'**{header_match.group(2)}**'

        # Remove horizontal rules, closing any open blockquote first
        if re.match(r'^[\-\*]{3,}\s*$', line):
            if in_blockquote:
                result_lines.append('</blockquote>')
                in_blockquote = False
            line = ''

        # Handle blockquotes: > text → <blockquote>text</blockquote>
        quote_match = re.match(r'^>\s*(.*)$', line)
        if quote_match:
            if not in_blockquote:
                result_lines.append('<blockquote>')
                in_blockquote = True
            result_lines.append(quote_match.group(1))
            continue
        elif in_blockquote:
            result_lines.append('</blockquote>')
            in_blockquote = False

        result_lines.append(line)

    if in_blockquote:
        result_lines.append('</blockquote>')

    text = '\n'.join(result_lines)

    # Escape HTML entities (before the markup substitutions, but after the blockquote tags are in place).
    # Process in segments so the already generated HTML tags are preserved.
    parts = re.split(r'(</?blockquote>)', text)
    escaped_parts = []
    for part in parts:
        if part in ('<blockquote>', '</blockquote>'):
            escaped_parts.append(part)
        else:
            part = part.replace('&', '&amp;')
            part = part.replace('<', '&lt;')
            part = part.replace('>', '&gt;')
            escaped_parts.append(part)
    text = ''.join(escaped_parts)

    # Convert links: [text](url) → <a href="url">text</a>
    text = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<a href="\2">\1</a>', text)

    # Convert bold: **text** → <b>text</b>
    text = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', text)

    # Convert italic: *text* → <i>text</i>
    text = re.sub(r'\*(.+?)\*', r'<i>\1</i>', text)

    # Convert strikethrough: ~~text~~ → <s>text</s>
    text = re.sub(r'~~(.+?)~~', r'<s>\1</s>', text)

    # Convert inline code: `code` → <code>code</code>
    text = re.sub(r'`(.+?)`', r'<code>\1</code>', text)

    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def _convert_markdown_to_slack(text: str) -> str:
    """Convert Markdown to Slack mrkdwn (enhanced version).

    How Slack mrkdwn differs from standard Markdown:
    - Bold: *text* (not **text**)
    - Strikethrough: ~text~ (not ~~text~~)
    - Links: <url|text> (not [text](url))
    - No heading syntax
    """
    # Strip <font> tags, keeping their content
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)

    # Convert # headings to bold (Slack has no heading style)
    text = re.sub(r'^#{1,6}\s+(.+)$', r'**\1**', text, flags=re.MULTILINE)

    # Remove --- rules (Slack renders them inconsistently)
    text = re.sub(r'^[\-\*]{3,}\s*$', '', text, flags=re.MULTILINE)

    # Convert links: [text](url) → <url|text>
    text = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<\2|\1>', text)

    # Convert strikethrough: ~~text~~ → ~text~
    text = re.sub(r'~~(.+?)~~', r'~\1~', text)

    # Convert bold: **text** → *text* (must run after the strikethrough step)
    text = re.sub(r'\*\*([^*]+)\*\*', r'*\1*', text)

    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def _markdown_to_simple_html(text: str) -> str:
    """
    Convert Markdown to simple HTML (used for email).
    """
    html = text

    # Escape HTML entities
    html = html.replace('&', '&amp;')
    html = html.replace('<', '&lt;')
    html = html.replace('>', '&gt;')

    # Links
    html = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<a href="\2">\1</a>', html)

    # Headings: ### → <h3>, ## → <h2>, # → <h1>
    html = re.sub(r'^### (.+)$', r'<h3>\1</h3>', html, flags=re.MULTILINE)
    html = re.sub(r'^## (.+)$', r'<h2>\1</h2>', html, flags=re.MULTILINE)
    html = re.sub(r'^# (.+)$', r'<h1>\1</h1>', html, flags=re.MULTILINE)

    # Bold
    html = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', html)

    # Italic
    html = re.sub(r'\*(.+?)\*', r'<i>\1</i>', html)

    # Strikethrough
    html = re.sub(r'~~(.+?)~~', r'<s>\1</s>', html)

    # Inline code
    html = re.sub(r'`(.+?)`', r'<code>\1</code>', html)

    # Horizontal rules
    html = re.sub(r'^[\-\*]{3,}\s*$', '<hr>', html, flags=re.MULTILINE)

    # Line breaks
    html = html.replace('\n', '<br>
\n') return f"""TrendRadar 通知 {html}""" # ==================== 各渠道发送器 ==================== def _send_feishu(webhook_url: str, content: str, title: str) -> Dict: """飞书发送(纯文本消息,与 trendradar send_to_feishu 一致) 飞书 webhook 使用 msg_type: "text",所有信息整合到 content.text 中。 """ payload = { "msg_type": "text", "content": { "text": content, }, } try: resp = requests.post(webhook_url, json=payload, timeout=30) data = resp.json() ok = resp.status_code == 200 and (data.get("code") == 0 or data.get("StatusCode") == 0) detail = "" if not ok: detail = data.get("msg") or data.get("StatusMessage", "") return {"success": ok, "detail": detail} except Exception as e: return {"success": False, "detail": str(e)} def _send_dingtalk(webhook_url: str, content: str, title: str) -> Dict: """钉钉发送(接收已适配的 Markdown)""" payload = { "msgtype": "markdown", "markdown": {"title": title, "text": content} } try: resp = requests.post(webhook_url, json=payload, timeout=30) data = resp.json() ok = resp.status_code == 200 and data.get("errcode") == 0 return {"success": ok, "detail": data.get("errmsg", "") if not ok else ""} except Exception as e: return {"success": False, "detail": str(e)} def _send_wework(webhook_url: str, content: str, title: str, msg_type: str = "markdown") -> Dict: """企业微信发送(接收已适配的 Markdown,text 模式自动剥离格式)""" if msg_type == "text": payload = {"msgtype": "text", "text": {"content": strip_markdown(content)}} else: payload = {"msgtype": "markdown", "markdown": {"content": content}} try: resp = requests.post(webhook_url, json=payload, timeout=30) data = resp.json() ok = resp.status_code == 200 and data.get("errcode") == 0 return {"success": ok, "detail": data.get("errmsg", "") if not ok else ""} except Exception as e: return {"success": False, "detail": str(e)} def _send_telegram(bot_token: str, chat_id: str, content: str, title: str) -> Dict: """Telegram 发送(接收已转换的 HTML)""" url = f"https://api.telegram.org/bot{bot_token}/sendMessage" payload = { "chat_id": chat_id, "text": content, "parse_mode": 
"HTML", "disable_web_page_preview": True, } try: resp = requests.post(url, json=payload, timeout=30) data = resp.json() ok = resp.status_code == 200 and data.get("ok") return {"success": ok, "detail": data.get("description", "") if not ok else ""} except Exception as e: return {"success": False, "detail": str(e)} def _send_email( from_email: str, password: str, to_email: str, message: str, title: str, smtp_server: str = "", smtp_port: str = "" ) -> Dict: """邮件发送(HTML 格式)""" try: domain = from_email.split("@")[-1].lower() html_content = _markdown_to_simple_html(message) # SMTP 配置 if smtp_server and smtp_port: server_host = smtp_server port = int(smtp_port) use_tls = port != 465 elif domain in SMTP_CONFIGS: cfg = SMTP_CONFIGS[domain] server_host = cfg["server"] port = cfg["port"] use_tls = cfg["encryption"] == "TLS" else: server_host = f"smtp.{domain}" port = 587 use_tls = True msg = MIMEMultipart("alternative") msg["From"] = formataddr(("TrendRadar", from_email)) recipients = [addr.strip() for addr in to_email.split(",")] msg["To"] = ", ".join(recipients) now = datetime.now() msg["Subject"] = Header(f"{title} - {now.strftime('%m月%d日 %H:%M')}", "utf-8") msg["MIME-Version"] = "1.0" msg["Date"] = formatdate(localtime=True) msg["Message-ID"] = make_msgid() # 纯文本备选 msg.attach(MIMEText(strip_markdown(message), "plain", "utf-8")) # HTML 主体 msg.attach(MIMEText(html_content, "html", "utf-8")) if use_tls: server = smtplib.SMTP(server_host, port, timeout=30) server.ehlo() server.starttls() server.ehlo() else: server = smtplib.SMTP_SSL(server_host, port, timeout=30) server.ehlo() server.login(from_email, password) server.send_message(msg) server.quit() return {"success": True, "detail": ""} except Exception as e: return {"success": False, "detail": str(e)} def _send_ntfy(server_url: str, topic: str, content: str, title: str, token: str = "") -> Dict: """ntfy 发送(接收已适配的 Markdown,与 trendradar send_to_ntfy 一致) 注意:Title 使用 ASCII 字符避免 HTTP header 编码问题。 支持 429 速率限制重试。 """ base_url = 
server_url.rstrip("/") if not base_url.startswith(("http://", "https://")): base_url = f"https://{base_url}" url = f"{base_url}/{topic}" headers = { "Content-Type": "text/plain; charset=utf-8", "Markdown": "yes", "Title": "TrendRadar Notification", # ASCII,避免 HTTP header 编码问题 "Priority": "default", "Tags": "news", } if token: headers["Authorization"] = f"Bearer {token}" try: resp = requests.post(url, data=content.encode("utf-8"), headers=headers, timeout=30) if resp.status_code == 200: return {"success": True, "detail": ""} elif resp.status_code == 429: # 速率限制,等待后重试一次(与 trendradar 一致) time.sleep(10) retry_resp = requests.post(url, data=content.encode("utf-8"), headers=headers, timeout=30) ok = retry_resp.status_code == 200 return {"success": ok, "detail": "" if ok else f"retry status={retry_resp.status_code}"} elif resp.status_code == 413: return {"success": False, "detail": f"消息过大被拒绝 ({len(content.encode('utf-8'))} bytes)"} else: return {"success": False, "detail": f"status={resp.status_code}"} except Exception as e: return {"success": False, "detail": str(e)} def _send_bark(bark_url: str, content: str, title: str) -> Dict: """Bark 发送(接收已适配的 Markdown,iOS 推送)""" parsed = urlparse(bark_url) device_key = parsed.path.strip('/').split('/')[0] if parsed.path else None if not device_key: return {"success": False, "detail": f"无法从 URL 提取 device_key: {bark_url}"} api_endpoint = f"{parsed.scheme}://{parsed.netloc}/push" payload = { "title": title, "markdown": content, "device_key": device_key, "sound": "default", "group": "TrendRadar", "action": "none", } try: resp = requests.post(api_endpoint, json=payload, timeout=30) data = resp.json() ok = resp.status_code == 200 and data.get("code") == 200 return {"success": ok, "detail": data.get("message", "") if not ok else ""} except Exception as e: return {"success": False, "detail": str(e)} def _send_slack(webhook_url: str, content: str, title: str) -> Dict: """Slack 发送(接收已转换的 mrkdwn)""" payload = {"text": content} try: resp = 
requests.post(webhook_url, json=payload, timeout=30) ok = resp.status_code == 200 and resp.text == "ok" return {"success": ok, "detail": "" if ok else resp.text} except Exception as e: return {"success": False, "detail": str(e)} def _send_generic_webhook( webhook_url: str, message: str, title: str, payload_template: str = "" ) -> Dict: """通用 Webhook 发送(Markdown 格式,支持自定义模板)""" try: if payload_template: json_content = json.dumps(message)[1:-1] json_title = json.dumps(title)[1:-1] payload_str = payload_template.replace("{content}", json_content).replace("{title}", json_title) try: payload = json.loads(payload_str) except json.JSONDecodeError: payload = {"title": title, "content": message} else: payload = {"title": title, "content": message} resp = requests.post( webhook_url, headers={"Content-Type": "application/json"}, json=payload, timeout=30, ) ok = 200 <= resp.status_code < 300 return {"success": ok, "detail": "" if ok else f"status={resp.status_code}"} except Exception as e: return {"success": False, "detail": str(e)} # ==================== 工具类 ==================== class NotificationTools: """通知推送工具类""" def __init__(self, project_root: str = None): if project_root: self.project_root = Path(project_root) else: current_file = Path(__file__) self.project_root = current_file.parent.parent.parent def _load_merged_config(self) -> Dict[str, Any]: """ 加载合并后的通知配置(config.yaml + .env) Returns: 包含 webhook 配置和通知参数的合并字典 """ config_path = self.project_root / "config" / "config.yaml" if config_path.exists(): with open(config_path, "r", encoding="utf-8") as f: config_data = yaml.safe_load(f) else: config_data = {} webhook_config = _load_webhook_config(config_data) notification_config = _load_notification_config(config_data) return {**webhook_config, **notification_config} def _detect_config_source(self, env_key: str, yaml_value: str) -> str: """检测配置项来源:env / yaml / 未配置""" env_val = os.environ.get(env_key, "").strip() if env_val: return "env" elif yaml_value: return "yaml" return 
"" def get_channel_format_guide(self, channel: Optional[str] = None) -> Dict: """ 获取渠道格式化策略指南 返回各渠道支持的 Markdown 特性、限制和最佳格式化提示词, 供 LLM 在生成推送内容时参考,确保内容样式贴合目标渠道。 Args: channel: 指定渠道 ID,None 返回所有渠道的策略 Returns: 格式化策略字典 """ if channel: if channel not in CHANNEL_FORMAT_GUIDES: valid = list(CHANNEL_FORMAT_GUIDES.keys()) return { "success": False, "error": { "code": "INVALID_CHANNEL", "message": f"无效的渠道: {channel}", "suggestion": f"支持的渠道: {valid}", }, } guide = CHANNEL_FORMAT_GUIDES[channel] return { "success": True, "channel": channel, "guide": guide, } else: return { "success": True, "summary": f"共 {len(CHANNEL_FORMAT_GUIDES)} 个渠道的格式化策略", "guides": CHANNEL_FORMAT_GUIDES, } def get_notification_channels(self) -> Dict: """ 获取所有通知渠道的配置状态 检测 config.yaml 和 .env 环境变量,返回每个渠道是否已配置。 Returns: 渠道状态字典 """ try: config = self._load_merged_config() enabled = config.get("ENABLE_NOTIFICATION", True) # 从 yaml 直接读取(用于判断来源) config_path = self.project_root / "config" / "config.yaml" yaml_channels = {} if config_path.exists(): with open(config_path, "r", encoding="utf-8") as f: raw = yaml.safe_load(f) or {} yaml_channels = raw.get("notification", {}).get("channels", {}) channels = [] env_key_map = { "FEISHU_WEBHOOK_URL": ("feishu", "webhook_url"), "DINGTALK_WEBHOOK_URL": ("dingtalk", "webhook_url"), "WEWORK_WEBHOOK_URL": ("wework", "webhook_url"), "TELEGRAM_BOT_TOKEN": ("telegram", "bot_token"), "TELEGRAM_CHAT_ID": ("telegram", "chat_id"), "EMAIL_FROM": ("email", "from"), "EMAIL_PASSWORD": ("email", "password"), "EMAIL_TO": ("email", "to"), "NTFY_SERVER_URL": ("ntfy", "server_url"), "NTFY_TOPIC": ("ntfy", "topic"), "BARK_URL": ("bark", "url"), "SLACK_WEBHOOK_URL": ("slack", "webhook_url"), "GENERIC_WEBHOOK_URL": ("generic_webhook", "webhook_url"), } for channel_id, required_keys in _CHANNEL_REQUIREMENTS.items(): is_configured = all(config.get(k) for k in required_keys) # 判断来源 sources = set() for key in required_keys: ch_name, field = env_key_map.get(key, ("", "")) yaml_val = 
yaml_channels.get(ch_name, {}).get(field, "") src = self._detect_config_source(key, yaml_val) if src: sources.add(src) channels.append({ "id": channel_id, "name": _CHANNEL_NAMES.get(channel_id, channel_id), "configured": is_configured, "source": list(sources) if sources else [], }) configured_count = sum(1 for ch in channels if ch["configured"]) return { "success": True, "notification_enabled": enabled, "summary": f"{configured_count}/{len(channels)} 个渠道已配置", "channels": channels, } except Exception as e: return { "success": False, "error": {"code": "INTERNAL_ERROR", "message": str(e)}, } def send_notification( self, message: str, title: str = "TrendRadar 通知", channels: Optional[List[str]] = None, ) -> Dict: """ 向已配置的通知渠道发送消息 接受 markdown 格式内容,内部自动转换为各渠道要求的格式。 Args: message: markdown 格式的消息内容 title: 消息标题 channels: 指定发送的渠道列表,None 表示发送到所有已配置渠道 可选值: feishu, dingtalk, wework, telegram, email, ntfy, bark, slack, generic_webhook Returns: 发送结果字典 """ if not message or not message.strip(): return { "success": False, "error": {"code": "EMPTY_MESSAGE", "message": "消息内容不能为空"}, } try: config = self._load_merged_config() if not config.get("ENABLE_NOTIFICATION", True): return { "success": False, "error": {"code": "NOTIFICATION_DISABLED", "message": "通知功能已禁用(notification.enabled = false)"}, } # 确定目标渠道 all_channel_ids = list(_CHANNEL_REQUIREMENTS.keys()) if channels: # 验证渠道名称 invalid = [ch for ch in channels if ch not in all_channel_ids] if invalid: raise InvalidParameterError( f"无效的渠道: {invalid}", suggestion=f"支持的渠道: {all_channel_ids}" ) target_channels = channels else: # 发送到所有已配置渠道 target_channels = [ ch_id for ch_id, keys in _CHANNEL_REQUIREMENTS.items() if all(config.get(k) for k in keys) ] if not target_channels: return { "success": False, "error": { "code": "NO_CHANNELS", "message": "没有已配置的目标渠道", "suggestion": "请在 config.yaml 或 .env 中配置至少一个通知渠道", }, } # 逐渠道发送 results = {} for ch_id in target_channels: required_keys = _CHANNEL_REQUIREMENTS[ch_id] if not all(config.get(k) for k 
in required_keys): results[ch_id] = {"success": False, "detail": "渠道未配置"} continue result = self._dispatch_to_channel(ch_id, config, message, title) results[ch_id] = result success_count = sum(1 for r in results.values() if r["success"]) total = len(results) return { "success": success_count > 0, "summary": f"{success_count}/{total} 个渠道发送成功", "results": { ch_id: { "name": _CHANNEL_NAMES.get(ch_id, ch_id), **r, } for ch_id, r in results.items() }, } except MCPError as e: return {"success": False, "error": e.to_dict()} except Exception as e: return { "success": False, "error": {"code": "INTERNAL_ERROR", "message": str(e)}, } def _dispatch_to_channel( self, channel_id: str, config: Dict, message: str, title: str ) -> Dict: """分发消息到指定渠道(格式适配 → 字节分批 → 多账号 × 逐批发送) 从 config.yaml → advanced.batch_size / batch_send_interval 读取配置。 """ # 从 config 读取批次配置(与 trendradar 一致) batch_sizes = self._get_batch_sizes() batch_interval = self._get_batch_interval() # Email 无字节限制,不走分批管线 if channel_id == "email": return _send_email( config["EMAIL_FROM"], config["EMAIL_PASSWORD"], config["EMAIL_TO"], message, title, config.get("EMAIL_SMTP_SERVER", ""), config.get("EMAIL_SMTP_PORT", ""), ) # 统一分批管线:格式适配 → 字节分割 → 添加批次头部 → (可选)反序 batches = _prepare_batches(message, channel_id, batch_sizes) # 按渠道路由发送 if channel_id == "feishu": return self._send_batched_multi_account( config["FEISHU_WEBHOOK_URL"], batches, channel_id, lambda url, content: _send_feishu(url, content, title), batch_interval, ) elif channel_id == "dingtalk": return self._send_batched_multi_account( config["DINGTALK_WEBHOOK_URL"], batches, channel_id, lambda url, content: _send_dingtalk(url, content, title), batch_interval, ) elif channel_id == "wework": msg_type = config.get("WEWORK_MSG_TYPE", "markdown") return self._send_batched_multi_account( config["WEWORK_WEBHOOK_URL"], batches, channel_id, lambda url, content: _send_wework(url, content, title, msg_type), batch_interval, ) elif channel_id == "telegram": return 
self._send_batched_telegram( config, batches, title, batch_interval, ) elif channel_id == "ntfy": return self._send_batched_ntfy( config, batches, title, batch_interval, ) elif channel_id == "bark": return self._send_batched_multi_account( config["BARK_URL"], batches, channel_id, lambda url, content: _send_bark(url, content, title), batch_interval, ) elif channel_id == "slack": return self._send_batched_multi_account( config["SLACK_WEBHOOK_URL"], batches, channel_id, lambda url, content: _send_slack(url, content, title), batch_interval, ) elif channel_id == "generic_webhook": template = config.get("GENERIC_WEBHOOK_TEMPLATE", "") return self._send_batched_multi_account( config["GENERIC_WEBHOOK_URL"], batches, channel_id, lambda url, content: _send_generic_webhook(url, content, title, template), batch_interval, ) else: return {"success": False, "detail": f"未知渠道: {channel_id}"} def _get_batch_sizes(self) -> Dict: """从 config.yaml 读取 advanced.batch_size,合并到默认值""" try: config_path = self.project_root / "config" / "config.yaml" if config_path.exists(): with open(config_path, "r", encoding="utf-8") as f: raw = yaml.safe_load(f) or {} advanced = raw.get("advanced", {}) cfg_sizes = advanced.get("batch_size", {}) # 从 config 构建渠道映射 sizes = dict(_CHANNEL_BATCH_SIZES_DEFAULT) default_size = cfg_sizes.get("default", 4000) for ch_id in sizes: if ch_id in cfg_sizes: sizes[ch_id] = cfg_sizes[ch_id] elif ch_id not in ("email", "ntfy") and sizes[ch_id] == 4000: # 使用 config 中的 default sizes[ch_id] = default_size return sizes except Exception: pass return dict(_CHANNEL_BATCH_SIZES_DEFAULT) def _get_batch_interval(self) -> float: """从 config.yaml 读取 advanced.batch_send_interval""" try: config_path = self.project_root / "config" / "config.yaml" if config_path.exists(): with open(config_path, "r", encoding="utf-8") as f: raw = yaml.safe_load(f) or {} return float(raw.get("advanced", {}).get("batch_send_interval", _BATCH_INTERVAL_DEFAULT)) except Exception: pass return 
_BATCH_INTERVAL_DEFAULT def _send_batched_multi_account( self, urls_str: str, batches: List[str], channel_id: str, send_func, batch_interval: float = _BATCH_INTERVAL_DEFAULT, ) -> Dict: """多账号 × 逐批发送(; 分隔的 URL)""" urls = [u.strip() for u in urls_str.split(";") if u.strip()] if not urls: return {"success": False, "detail": "URL 为空"} any_ok = False details = [] for url in urls: for i, batch in enumerate(batches): r = send_func(url, batch) if r["success"]: any_ok = True elif r["detail"]: details.append(r["detail"]) # 批次间间隔 if i < len(batches) - 1: time.sleep(batch_interval) return { "success": any_ok, "detail": "; ".join(details) if details else "", "batches": len(batches), } def _send_batched_telegram( self, config: Dict, batches: List[str], title: str, batch_interval: float = _BATCH_INTERVAL_DEFAULT, ) -> Dict: """Telegram 多账号 × 逐批发送(token/chat_id 配对)""" tokens = config["TELEGRAM_BOT_TOKEN"].split(";") chat_ids = config["TELEGRAM_CHAT_ID"].split(";") if len(tokens) != len(chat_ids): return {"success": False, "detail": "bot_token 和 chat_id 数量不一致"} any_ok = False details = [] for token, cid in zip(tokens, chat_ids): token, cid = token.strip(), cid.strip() if not (token and cid): continue for i, batch in enumerate(batches): r = _send_telegram(token, cid, batch, title) if r["success"]: any_ok = True elif r["detail"]: details.append(r["detail"]) if i < len(batches) - 1: time.sleep(batch_interval) return { "success": any_ok, "detail": "; ".join(details) if details else "", "batches": len(batches), } def _send_batched_ntfy( self, config: Dict, batches: List[str], title: str, batch_interval: float = _BATCH_INTERVAL_DEFAULT, ) -> Dict: """ntfy 多账号 × 逐批发送(server/topic/token 配对,含速率限制处理)""" servers = config["NTFY_SERVER_URL"].split(";") topics = config["NTFY_TOPIC"].split(";") tokens_str = config.get("NTFY_TOKEN", "") tokens = tokens_str.split(";") if tokens_str else [""] if len(servers) != len(topics): return {"success": False, "detail": "server_url 和 topic 数量不一致"} any_ok = 
False details = [] for i, (srv, topic) in enumerate(zip(servers, topics)): srv, topic = srv.strip(), topic.strip() tk = tokens[i].strip() if i < len(tokens) else "" if not (srv and topic): continue # ntfy.sh 公共服务器用 2s 间隔(与 trendradar 一致) interval = 2.0 if "ntfy.sh" in srv else batch_interval for j, batch in enumerate(batches): r = _send_ntfy(srv, topic, batch, title, tk) if r["success"]: any_ok = True elif r["detail"]: details.append(r["detail"]) if j < len(batches) - 1: time.sleep(interval) return { "success": any_ok, "detail": "; ".join(details) if details else "", "batches": len(batches), } ================================================ FILE: mcp_server/tools/search_tools.py ================================================ """ 智能新闻检索工具 提供模糊搜索、链接查询、历史相关新闻检索等高级搜索功能。 """ import re from collections import Counter from datetime import datetime, timedelta from difflib import SequenceMatcher from typing import Dict, List, Optional, Tuple, Union from ..services.data_service import DataService from ..utils.validators import validate_keyword, validate_limit, validate_threshold, normalize_date_range from ..utils.errors import MCPError, InvalidParameterError, DataNotFoundError class SearchTools: """智能新闻检索工具类""" def __init__(self, project_root: str = None): """ 初始化智能检索工具 Args: project_root: 项目根目录 """ self.data_service = DataService(project_root) def search_news_unified( self, query: str, search_mode: str = "keyword", date_range: Optional[Union[Dict[str, str], str]] = None, platforms: Optional[List[str]] = None, limit: int = 50, sort_by: str = "relevance", threshold: float = 0.6, include_url: bool = False, include_rss: bool = False, rss_limit: int = 20 ) -> Dict: """ 统一新闻搜索工具 - 整合多种搜索模式,支持同时搜索热榜和RSS Args: query: 查询内容(必需)- 关键词、内容片段或实体名称 search_mode: 搜索模式,可选值: - "keyword": 精确关键词匹配(默认) - "fuzzy": 模糊内容匹配(使用相似度算法) - "entity": 实体名称搜索(自动按权重排序) date_range: 日期范围(可选) - **格式**: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} - **示例**: {"start": "2025-01-01", "end": "2025-01-07"} - 
**默认**: 不指定时默认查询今天 - **注意**: start和end可以相同(表示单日查询) platforms: 平台过滤列表,如 ['zhihu', 'weibo'] limit: 热榜返回条数限制,默认50 sort_by: 排序方式,可选值: - "relevance": 按相关度排序(默认) - "weight": 按新闻权重排序 - "date": 按日期排序 threshold: 相似度阈值(仅fuzzy模式有效),0-1之间,默认0.6 include_url: 是否包含URL链接,默认False(节省token) include_rss: 是否同时搜索RSS数据,默认False rss_limit: RSS返回条数限制,默认20 Returns: 搜索结果字典,包含匹配的新闻列表(热榜和RSS分开展示) Examples: - search_news_unified(query="人工智能", search_mode="keyword") - search_news_unified(query="特斯拉降价", search_mode="fuzzy", threshold=0.4) - search_news_unified(query="马斯克", search_mode="entity", limit=20) - search_news_unified(query="AI", include_rss=True) # 同时搜索热榜和RSS - search_news_unified(query="iPhone 16", date_range={"start": "2025-01-01", "end": "2025-01-07"}) """ try: # 参数验证 query = validate_keyword(query) if search_mode not in ["keyword", "fuzzy", "entity"]: raise InvalidParameterError( f"无效的搜索模式: {search_mode}", suggestion="支持的模式: keyword, fuzzy, entity" ) if sort_by not in ["relevance", "weight", "date"]: raise InvalidParameterError( f"无效的排序方式: {sort_by}", suggestion="支持的排序: relevance, weight, date" ) limit = validate_limit(limit, default=50) threshold = validate_threshold(threshold, default=0.6, min_value=0.0, max_value=1.0) # 处理日期范围 if date_range: from ..utils.validators import validate_date_range date_range_tuple = validate_date_range(date_range) start_date, end_date = date_range_tuple else: # 不指定日期时,使用最新可用数据日期(而非 datetime.now()) earliest, latest = self.data_service.get_available_date_range() if latest is None: # 没有任何可用数据 return { "success": False, "error": { "code": "NO_DATA_AVAILABLE", "message": "output 目录下没有可用的新闻数据", "suggestion": "请先运行爬虫生成数据,或检查 output 目录" } } # 使用最新可用日期 start_date = end_date = latest # 收集所有匹配的新闻 all_matches = [] current_date = start_date while current_date <= end_date: try: all_titles, id_to_name, timestamps = self.data_service.parser.read_all_titles_for_date( date=current_date, platform_ids=platforms ) # 根据搜索模式执行不同的搜索逻辑 if search_mode == "keyword": matches = 
self._search_by_keyword_mode( query, all_titles, id_to_name, current_date, include_url ) elif search_mode == "fuzzy": matches = self._search_by_fuzzy_mode( query, all_titles, id_to_name, current_date, threshold, include_url ) else: # entity matches = self._search_by_entity_mode( query, all_titles, id_to_name, current_date, include_url ) all_matches.extend(matches) except DataNotFoundError: # 该日期没有数据,继续下一天 pass current_date += timedelta(days=1) if not all_matches: # 获取可用日期范围用于错误提示 earliest, latest = self.data_service.get_available_date_range() # 判断时间范围描述 if start_date.date() == datetime.now().date() and start_date == end_date: time_desc = "今天" elif start_date == end_date: time_desc = start_date.strftime("%Y-%m-%d") else: time_desc = f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" # 构建错误消息 if earliest and latest: available_desc = f"{earliest.strftime('%Y-%m-%d')} 至 {latest.strftime('%Y-%m-%d')}" message = f"未找到匹配的新闻(查询范围: {time_desc},可用数据: {available_desc})" else: message = f"未找到匹配的新闻({time_desc})" result = { "success": True, "results": [], "total": 0, "query": query, "search_mode": search_mode, "time_range": time_desc, "message": message } return result # 统一排序逻辑 if sort_by == "relevance": all_matches.sort(key=lambda x: x.get("similarity_score", 1.0), reverse=True) elif sort_by == "weight": from .analytics import calculate_news_weight all_matches.sort(key=lambda x: calculate_news_weight(x), reverse=True) elif sort_by == "date": all_matches.sort(key=lambda x: x.get("date", ""), reverse=True) # 限制返回数量 results = all_matches[:limit] # 构建时间范围描述(正确判断是否为今天) if start_date.date() == datetime.now().date() and start_date == end_date: time_range_desc = "今天" elif start_date == end_date: time_range_desc = start_date.strftime("%Y-%m-%d") else: time_range_desc = f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" result = { "success": True, "summary": { "description": f"新闻搜索结果({search_mode}模式)", "total_found": len(all_matches), "returned": 
len(results), "requested_limit": limit, "search_mode": search_mode, "query": query, "platforms": platforms or "所有平台", "time_range": time_range_desc, "sort_by": sort_by }, "data": results } if search_mode == "fuzzy": result["summary"]["threshold"] = threshold if len(all_matches) < limit: result["note"] = f"模糊搜索模式下,相似度阈值 {threshold} 仅匹配到 {len(all_matches)} 条结果" # 如果启用 RSS 搜索,同时搜索 RSS 数据 if include_rss: rss_results = self._search_rss_by_keyword( query=query, start_date=start_date, end_date=end_date, limit=rss_limit, include_url=include_url ) result["rss"] = rss_results["items"] result["rss_total"] = rss_results["total"] result["summary"]["include_rss"] = True result["summary"]["rss_found"] = rss_results["total"] result["summary"]["rss_returned"] = len(rss_results["items"]) return result except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def _search_by_keyword_mode( self, query: str, all_titles: Dict, id_to_name: Dict, current_date: datetime, include_url: bool ) -> List[Dict]: """ 关键词搜索模式(精确匹配) Args: query: 搜索关键词 all_titles: 所有标题字典 id_to_name: 平台ID到名称映射 current_date: 当前日期 Returns: 匹配的新闻列表 """ matches = [] query_lower = query.lower() for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): # 精确包含判断 if query_lower in title.lower(): news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": current_date.strftime("%Y-%m-%d"), "similarity_score": 1.0, # 精确匹配,相似度为1 "ranks": info.get("ranks", []), "count": len(info.get("ranks", [])), "rank": info["ranks"][0] if info["ranks"] else 999 } # 条件性添加 URL 字段 if include_url: news_item["url"] = info.get("url", "") news_item["mobileUrl"] = info.get("mobileUrl", "") matches.append(news_item) return matches def _search_by_fuzzy_mode( self, query: str, all_titles: Dict, id_to_name: Dict, 
current_date: datetime, threshold: float, include_url: bool ) -> List[Dict]: """ 模糊搜索模式(使用相似度算法) Args: query: 搜索内容 all_titles: 所有标题字典 id_to_name: 平台ID到名称映射 current_date: 当前日期 threshold: 相似度阈值 Returns: 匹配的新闻列表 """ matches = [] for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): # 模糊匹配 is_match, similarity = self._fuzzy_match(query, title, threshold) if is_match: news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": current_date.strftime("%Y-%m-%d"), "similarity_score": round(similarity, 4), "ranks": info.get("ranks", []), "count": len(info.get("ranks", [])), "rank": info["ranks"][0] if info["ranks"] else 999 } # 条件性添加 URL 字段 if include_url: news_item["url"] = info.get("url", "") news_item["mobileUrl"] = info.get("mobileUrl", "") matches.append(news_item) return matches def _search_by_entity_mode( self, query: str, all_titles: Dict, id_to_name: Dict, current_date: datetime, include_url: bool ) -> List[Dict]: """ 实体搜索模式(自动按权重排序) Args: query: 实体名称 all_titles: 所有标题字典 id_to_name: 平台ID到名称映射 current_date: 当前日期 Returns: 匹配的新闻列表 """ matches = [] for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): # 实体搜索:精确包含实体名称 if query in title: news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": current_date.strftime("%Y-%m-%d"), "similarity_score": 1.0, "ranks": info.get("ranks", []), "count": len(info.get("ranks", [])), "rank": info["ranks"][0] if info["ranks"] else 999 } # 条件性添加 URL 字段 if include_url: news_item["url"] = info.get("url", "") news_item["mobileUrl"] = info.get("mobileUrl", "") matches.append(news_item) return matches def _calculate_similarity(self, text1: str, text2: str) -> float: """ 计算两个文本的相似度 Args: text1: 文本1 text2: 文本2 Returns: 相似度分数 (0-1之间) """ # 使用 difflib.SequenceMatcher 计算序列相似度 return SequenceMatcher(None, 
text1.lower(), text2.lower()).ratio() def _fuzzy_match(self, query: str, text: str, threshold: float = 0.3) -> Tuple[bool, float]: """ 模糊匹配函数 Args: query: 查询文本 text: 待匹配文本 threshold: 匹配阈值 Returns: (是否匹配, 相似度分数) """ # 直接包含判断 if query.lower() in text.lower(): return True, 1.0 # 计算整体相似度 similarity = self._calculate_similarity(query, text) if similarity >= threshold: return True, similarity # 分词后的部分匹配 query_words = set(self._extract_keywords(query)) text_words = set(self._extract_keywords(text)) if not query_words or not text_words: return False, 0.0 # 计算关键词重合度 common_words = query_words & text_words keyword_overlap = len(common_words) / len(query_words) if keyword_overlap >= 0.5: # 50%的关键词重合 return True, keyword_overlap return False, similarity def _extract_keywords(self, text: str, min_length: int = 2) -> List[str]: """ 从文本中提取关键词 Args: text: 输入文本 min_length: 最小词长 Returns: 关键词列表 """ # 移除URL和特殊字符 text = re.sub(r'http[s]?://\S+', '', text) text = re.sub(r'\[.*?\]', '', text) # 移除方括号内容 # 使用正则表达式分词(中文和英文) words = re.findall(r'[\w]+', text) # 过滤短词 keywords = [word for word in words if word and len(word) >= min_length] return keywords def _calculate_keyword_overlap(self, keywords1: List[str], keywords2: List[str]) -> float: """ 计算两个关键词列表的重合度 Args: keywords1: 关键词列表1 keywords2: 关键词列表2 Returns: 重合度分数 (0-1之间) """ if not keywords1 or not keywords2: return 0.0 set1 = set(keywords1) set2 = set(keywords2) # Jaccard 相似度 intersection = len(set1 & set2) union = len(set1 | set2) if union == 0: return 0.0 return intersection / union def _jaccard_similarity(self, list1: List[str], list2: List[str]) -> float: """ 计算两个列表的 Jaccard 相似度 Args: list1: 列表1 list2: 列表2 Returns: Jaccard 相似度 (0-1之间) """ if not list1 or not list2: return 0.0 set1 = set(list1) set2 = set(list2) intersection = len(set1 & set2) union = len(set1 | set2) if union == 0: return 0.0 return intersection / union def search_related_news_history( self, reference_title: str, time_preset: str = "yesterday", start_date: 
Optional[datetime] = None,
        end_date: Optional[datetime] = None,
        threshold: float = 0.4,
        limit: int = 50,
        include_url: bool = False
    ) -> Dict:
        """
        Search historical data for news related to a given news item.

        Args:
            reference_title: Reference news title or content
            time_preset: Time-range preset, one of:
                - "yesterday": yesterday
                - "last_week": the previous 7 days
                - "last_month": the previous 30 days
                - "custom": custom date range (requires start_date and end_date)
            start_date: Custom start date (only used when time_preset="custom")
            end_date: Custom end date (only used when time_preset="custom")
            threshold: Similarity threshold (0-1), default 0.4
            limit: Maximum number of results, default 50
            include_url: Whether to include URLs, default False (saves tokens)

        Returns:
            Search result dict containing the list of related news

        Example:
            >>> tools = SearchTools()
            >>> result = tools.search_related_news_history(
            ...     reference_title="人工智能技术突破",
            ...     time_preset="last_week",
            ...     threshold=0.4,
            ...     limit=50
            ... )
            >>> # Matches are returned under 'data'; the key is absent when nothing matched
            >>> for news in result.get('data', []):
            ...     print(f"{news['date']}: {news['title']} (similarity: {news['similarity_score']})")
        """
        try:
            # Validate parameters
            reference_title = validate_keyword(reference_title)
            threshold = validate_threshold(threshold, default=0.4, min_value=0.0, max_value=1.0)
            limit = validate_limit(limit, default=50)

            # Determine the date range to query
            today = datetime.now()

            if time_preset == "yesterday":
                search_start = today - timedelta(days=1)
                search_end = today - timedelta(days=1)
            elif time_preset == "last_week":
                search_start = today - timedelta(days=7)
                search_end = today - timedelta(days=1)
            elif time_preset == "last_month":
                search_start = today - timedelta(days=30)
                search_end = today - timedelta(days=1)
            elif time_preset == "custom":
                if not start_date or not end_date:
                    raise InvalidParameterError(
                        "自定义时间范围需要提供 start_date 和 end_date",
                        suggestion="请提供 start_date 和 end_date 参数"
                    )
                search_start = start_date
                search_end = end_date
            else:
                raise InvalidParameterError(
                    f"不支持的时间范围: {time_preset}",
                    suggestion="请使用 'yesterday', 'last_week', 'last_month' 或 'custom'"
                )

            # Extract keywords from the reference text
            reference_keywords = self._extract_keywords(reference_title)

            if not reference_keywords:
                raise InvalidParameterError(
                    "无法从参考文本中提取关键词",
                    suggestion="请提供更详细的文本内容"
                )

            # Collect all related news
            all_related_news = []
            current_date =
search_start while current_date <= search_end: try: # 读取该日期的数据 all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date(current_date) # 搜索相关新闻 for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): # 计算标题相似度 title_similarity = self._calculate_similarity(reference_title, title) # 提取标题关键词 title_keywords = self._extract_keywords(title) # 计算关键词重合度 keyword_overlap = self._calculate_keyword_overlap( reference_keywords, title_keywords ) # 综合相似度 (70% 关键词重合 + 30% 文本相似度) combined_score = keyword_overlap * 0.7 + title_similarity * 0.3 if combined_score >= threshold: news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": current_date.strftime("%Y-%m-%d"), "similarity_score": round(combined_score, 4), "keyword_overlap": round(keyword_overlap, 4), "text_similarity": round(title_similarity, 4), "common_keywords": list(set(reference_keywords) & set(title_keywords)), "rank": info["ranks"][0] if info["ranks"] else 0 } # 条件性添加 URL 字段 if include_url: news_item["url"] = info.get("url", "") news_item["mobileUrl"] = info.get("mobileUrl", "") all_related_news.append(news_item) except DataNotFoundError: # 该日期没有数据,继续下一天 pass except Exception as e: # 记录错误但继续处理其他日期 print(f"Warning: 处理日期 {current_date.strftime('%Y-%m-%d')} 时出错: {e}") # 移动到下一天 current_date += timedelta(days=1) if not all_related_news: return { "success": True, "results": [], "total": 0, "query": reference_title, "time_preset": time_preset, "date_range": { "start": search_start.strftime("%Y-%m-%d"), "end": search_end.strftime("%Y-%m-%d") }, "message": "未找到相关新闻" } # 按相似度排序 all_related_news.sort(key=lambda x: x["similarity_score"], reverse=True) # 限制返回数量 results = all_related_news[:limit] # 统计信息 platform_distribution = Counter([news["platform"] for news in all_related_news]) date_distribution = Counter([news["date"] for news in all_related_news]) result = { "success": True, "summary": { 
"description": "历史相关新闻搜索结果", "total_found": len(all_related_news), "returned": len(results), "requested_limit": limit, "threshold": threshold, "reference_title": reference_title, "reference_keywords": reference_keywords, "time_preset": time_preset, "date_range": { "start": search_start.strftime("%Y-%m-%d"), "end": search_end.strftime("%Y-%m-%d") } }, "data": results, "statistics": { "platform_distribution": dict(platform_distribution), "date_distribution": dict(date_distribution), "avg_similarity": round( sum([news["similarity_score"] for news in all_related_news]) / len(all_related_news), 4 ) if all_related_news else 0.0 } } if len(all_related_news) < limit: result["note"] = f"相关性阈值 {threshold} 下仅找到 {len(all_related_news)} 条相关新闻" return result except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def find_related_news_unified( self, reference_title: str, date_range: Optional[Union[Dict[str, str], str]] = None, threshold: float = 0.5, limit: int = 50, include_url: bool = False ) -> Dict: """ 统一的相关新闻查找工具 - 整合相似新闻和历史相关搜索 Args: reference_title: 参考新闻标题 date_range: 日期范围(可选) - 不指定: 只查询今天的数据 - {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}: 查询指定日期范围 - "today": 今天 - "yesterday": 昨天 - "last_week": 最近7天 - "last_month": 最近30天 threshold: 相似度阈值,0-1之间,默认0.5 limit: 返回条数限制,默认50 include_url: 是否包含URL链接,默认False Returns: 相关新闻列表,按相似度排序 """ try: # 参数验证 reference_title = validate_keyword(reference_title) threshold = validate_threshold(threshold, default=0.5, min_value=0.0, max_value=1.0) limit = validate_limit(limit, default=50) # 确定日期范围 today = datetime.now() # 规范化 date_range(处理 JSON 字符串序列化问题) date_range = normalize_date_range(date_range) if date_range is None or date_range == "today": # 只查询今天 search_dates = [today] elif isinstance(date_range, str): # 预设时间范围 if date_range == "yesterday": search_dates = [today - timedelta(days=1)] elif date_range == "last_week": 
search_dates = [today - timedelta(days=i) for i in range(7)] elif date_range == "last_month": search_dates = [today - timedelta(days=i) for i in range(30)] else: # 单日字符串格式 try: single_date = datetime.strptime(date_range, "%Y-%m-%d") search_dates = [single_date] except ValueError: search_dates = [today] elif isinstance(date_range, dict): # 日期范围对象 start_str = date_range.get("start") end_str = date_range.get("end") if start_str and end_str: start_date = datetime.strptime(start_str, "%Y-%m-%d") end_date = datetime.strptime(end_str, "%Y-%m-%d") search_dates = [] current = start_date while current <= end_date: search_dates.append(current) current += timedelta(days=1) else: search_dates = [today] else: search_dates = [today] # 提取参考标题的关键词 reference_keywords = self._extract_keywords(reference_title) # 收集所有相关新闻 all_related_news = [] for search_date in search_dates: try: all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date(search_date) for platform_id, titles in all_titles.items(): platform_name = id_to_name.get(platform_id, platform_id) for title, info in titles.items(): if title == reference_title: continue # 计算相似度(使用混合算法) text_similarity = self._calculate_similarity(reference_title, title) # 如果有关键词,也计算关键词重合度 if reference_keywords: title_keywords = self._extract_keywords(title) keyword_similarity = self._jaccard_similarity(reference_keywords, title_keywords) # 混合相似度:70% 文本 + 30% 关键词 similarity = 0.7 * text_similarity + 0.3 * keyword_similarity else: similarity = text_similarity if similarity >= threshold: news_item = { "title": title, "platform": platform_id, "platform_name": platform_name, "date": search_date.strftime("%Y-%m-%d"), "similarity": round(similarity, 3), "rank": info["ranks"][0] if info["ranks"] else 0 } if include_url: news_item["url"] = info.get("url", "") all_related_news.append(news_item) except Exception: # 某天数据读取失败,跳过 continue # 按相似度排序 all_related_news.sort(key=lambda x: x["similarity"], reverse=True) # 限制数量 results = 
all_related_news[:limit] # 统计信息 from collections import Counter platform_dist = Counter([n["platform_name"] for n in all_related_news]) date_dist = Counter([n["date"] for n in all_related_news]) return { "success": True, "summary": { "description": "相关新闻搜索结果", "total_found": len(all_related_news), "returned": len(results), "reference_title": reference_title, "threshold": threshold, "date_range": { "start": min(search_dates).strftime("%Y-%m-%d"), "end": max(search_dates).strftime("%Y-%m-%d") } if search_dates else None }, "data": results, "statistics": { "platform_distribution": dict(platform_dist), "date_distribution": dict(date_dist) } } except MCPError as e: return {"success": False, "error": e.to_dict()} except Exception as e: return {"success": False, "error": {"code": "INTERNAL_ERROR", "message": str(e)}} def _search_rss_by_keyword( self, query: str, start_date: datetime, end_date: datetime, limit: int = 20, include_url: bool = False ) -> Dict: """ 在 RSS 数据中搜索关键词 Args: query: 搜索关键词 start_date: 开始日期 end_date: 结束日期 limit: 返回条数限制 include_url: 是否包含 URL Returns: RSS 搜索结果字典 """ all_rss_matches = [] query_lower = query.lower() current_date = start_date while current_date <= end_date: try: # 读取该日期的 RSS 数据 all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( date=current_date, platform_ids=None, db_type="rss" ) for feed_id, items in all_titles.items(): feed_name = id_to_name.get(feed_id, feed_id) for title, info in items.items(): # 关键词匹配(标题或摘要) title_match = query_lower in title.lower() summary = info.get("summary", "") summary_match = query_lower in summary.lower() if summary else False if title_match or summary_match: rss_item = { "title": title, "feed_id": feed_id, "feed_name": feed_name, "date": current_date.strftime("%Y-%m-%d"), "published_at": info.get("published_at", ""), "author": info.get("author", ""), "match_in": "title" if title_match else "summary" } if include_url: rss_item["url"] = info.get("url", "") 
all_rss_matches.append(rss_item) except DataNotFoundError: # 该日期没有 RSS 数据,继续下一天 pass except Exception: # 其他错误,跳过 pass current_date += timedelta(days=1) # 按发布时间排序(最新的在前) all_rss_matches.sort(key=lambda x: x.get("published_at", ""), reverse=True) return { "items": all_rss_matches[:limit], "total": len(all_rss_matches) } ================================================ FILE: mcp_server/tools/storage_sync.py ================================================ # coding=utf-8 """ 存储同步工具 实现从远程存储拉取数据到本地、获取存储状态、列出可用日期等功能。 """ import os import re from pathlib import Path from datetime import datetime, timedelta from typing import Dict, List, Optional import yaml from ..utils.errors import MCPError class StorageSyncTools: """存储同步工具类""" def __init__(self, project_root: str = None): """ 初始化存储同步工具 Args: project_root: 项目根目录 """ if project_root: self.project_root = Path(project_root) else: current_file = Path(__file__) self.project_root = current_file.parent.parent.parent self._config = None self._remote_backend = None def _load_config(self) -> dict: """加载配置文件""" if self._config is None: config_path = self.project_root / "config" / "config.yaml" if config_path.exists(): with open(config_path, "r", encoding="utf-8") as f: self._config = yaml.safe_load(f) else: self._config = {} return self._config def _get_storage_config(self) -> dict: """获取存储配置""" config = self._load_config() return config.get("storage", {}) def _get_remote_config(self) -> dict: """ 获取远程存储配置(合并配置文件和环境变量) """ storage_config = self._get_storage_config() remote_config = storage_config.get("remote", {}) return { "endpoint_url": remote_config.get("endpoint_url") or os.environ.get("S3_ENDPOINT_URL", ""), "bucket_name": remote_config.get("bucket_name") or os.environ.get("S3_BUCKET_NAME", ""), "access_key_id": remote_config.get("access_key_id") or os.environ.get("S3_ACCESS_KEY_ID", ""), "secret_access_key": remote_config.get("secret_access_key") or os.environ.get("S3_SECRET_ACCESS_KEY", ""), "region": 
remote_config.get("region") or os.environ.get("S3_REGION", ""), } def _has_remote_config(self) -> bool: """检查是否有有效的远程存储配置""" config = self._get_remote_config() return bool( config.get("bucket_name") and config.get("access_key_id") and config.get("secret_access_key") and config.get("endpoint_url") ) def _get_remote_backend(self): """获取远程存储后端实例""" if self._remote_backend is not None: return self._remote_backend if not self._has_remote_config(): return None try: from trendradar.storage.remote import RemoteStorageBackend remote_config = self._get_remote_config() config = self._load_config() timezone = config.get("app", {}).get("timezone", "Asia/Shanghai") self._remote_backend = RemoteStorageBackend( bucket_name=remote_config["bucket_name"], access_key_id=remote_config["access_key_id"], secret_access_key=remote_config["secret_access_key"], endpoint_url=remote_config["endpoint_url"], region=remote_config.get("region", ""), timezone=timezone, ) return self._remote_backend except ImportError: print("[存储同步] 远程存储后端需要安装 boto3: pip install boto3") return None except Exception as e: print(f"[存储同步] 创建远程后端失败: {e}") return None def _get_local_data_dir(self) -> Path: """获取本地数据目录""" storage_config = self._get_storage_config() local_config = storage_config.get("local", {}) data_dir = local_config.get("data_dir", "output") return self.project_root / data_dir def _parse_date_folder_name(self, folder_name: str) -> Optional[datetime]: """ 解析日期文件夹名称(兼容中文和 ISO 格式) 支持两种格式: - 中文格式:YYYY年MM月DD日 - ISO 格式:YYYY-MM-DD """ # 尝试 ISO 格式 iso_match = re.match(r'(\d{4})-(\d{2})-(\d{2})', folder_name) if iso_match: try: return datetime( int(iso_match.group(1)), int(iso_match.group(2)), int(iso_match.group(3)) ) except ValueError: pass # 尝试中文格式 chinese_match = re.match(r'(\d{4})年(\d{2})月(\d{2})日', folder_name) if chinese_match: try: return datetime( int(chinese_match.group(1)), int(chinese_match.group(2)), int(chinese_match.group(3)) ) except ValueError: pass return None def _get_local_dates(self, 
db_type: str = "news") -> List[str]: """ 获取本地可用的日期列表 存储结构: output/{db_type}/{date}.db 例如: output/news/2025-12-30.db, output/rss/2025-12-30.db Args: db_type: 数据库类型 ("news" 或 "rss"),默认 "news" Returns: 日期列表(按时间倒序) """ local_dir = self._get_local_data_dir() dates = set() if not local_dir.exists(): return [] # 扫描 output/{db_type}/{date}.db 文件 type_dir = local_dir / db_type if type_dir.exists(): for item in type_dir.iterdir(): if item.is_file() and item.suffix == ".db": # 从文件名解析日期 (2025-12-30.db -> 2025-12-30) date_str = item.stem # 去除 .db 后缀 folder_date = self._parse_date_folder_name(date_str) if folder_date: dates.add(folder_date.strftime("%Y-%m-%d")) return sorted(list(dates), reverse=True) def _get_all_local_dates(self) -> Dict[str, List[str]]: """ 获取所有本地可用的日期列表(包括 news 和 rss) Returns: { "news": ["2025-12-30", ...], "rss": ["2025-12-30", ...], "all": ["2025-12-30", ...] # 合并去重 } """ news_dates = set(self._get_local_dates("news")) rss_dates = set(self._get_local_dates("rss")) all_dates = news_dates | rss_dates return { "news": sorted(list(news_dates), reverse=True), "rss": sorted(list(rss_dates), reverse=True), "all": sorted(list(all_dates), reverse=True) } def _calculate_dir_size(self, path: Path) -> int: """计算目录大小(字节)""" total_size = 0 if path.exists(): for item in path.rglob("*"): if item.is_file(): total_size += item.stat().st_size return total_size def sync_from_remote(self, days: int = 7) -> Dict: """ 从远程存储拉取数据到本地 Args: days: 拉取最近 N 天的数据,默认 7 天 Returns: 同步结果字典 """ try: # 检查远程配置 if not self._has_remote_config(): return { "success": False, "error": { "code": "REMOTE_NOT_CONFIGURED", "message": "未配置远程存储", "suggestion": "请在 config/config.yaml 中配置 storage.remote 或设置环境变量" } } # 获取远程后端 remote_backend = self._get_remote_backend() if remote_backend is None: return { "success": False, "error": { "code": "REMOTE_BACKEND_FAILED", "message": "无法创建远程存储后端", "suggestion": "请检查远程存储配置和 boto3 是否已安装" } } # 获取本地数据目录 local_dir = self._get_local_data_dir() 
            local_dir.mkdir(parents=True, exist_ok=True)

            # List the dates available remotely
            remote_dates = remote_backend.list_remote_dates()

            # List the dates already present locally
            local_dates = set(self._get_local_dates())

            # Work out which dates to pull (the most recent N days)
            from trendradar.utils.time import get_configured_time
            config = self._load_config()
            timezone = config.get("app", {}).get("timezone", "Asia/Shanghai")
            now = get_configured_time(timezone)

            target_dates = []
            for i in range(days):
                date = now - timedelta(days=i)
                date_str = date.strftime("%Y-%m-%d")
                if date_str in remote_dates:
                    target_dates.append(date_str)

            # Pull the data
            synced_dates = []
            skipped_dates = []
            failed_dates = []

            for date_str in target_dates:
                # Skip dates that already exist locally
                if date_str in local_dates:
                    skipped_dates.append(date_str)
                    continue

                # Pull a single date
                try:
                    # Mirror the remote key layout (news/{date}.db) so that
                    # _get_local_dates(), which scans {data_dir}/news/*.db,
                    # finds the file on the next run
                    local_db_path = local_dir / "news" / f"{date_str}.db"
                    remote_key = f"news/{date_str}.db"

                    local_db_path.parent.mkdir(parents=True, exist_ok=True)

                    remote_backend.s3_client.download_file(
                        remote_backend.bucket_name,
                        remote_key,
                        str(local_db_path)
                    )

                    synced_dates.append(date_str)
                    print(f"[存储同步] 已拉取: {date_str}")
                except Exception as e:
                    failed_dates.append({"date": date_str, "error": str(e)})
                    print(f"[存储同步] 拉取失败 ({date_str}): {e}")

            return {
                "success": True,
                "summary": {
                    "description": "远程存储同步结果",
                    "synced_files": len(synced_dates),
                    "skipped_count": len(skipped_dates),
                    "failed_count": len(failed_dates)
                },
                "data": {
                    "synced_dates": synced_dates,
                    "skipped_dates": skipped_dates,
                    "failed_dates": failed_dates
                },
                "message": f"成功同步 {len(synced_dates)} 天数据" + (
                    f",跳过 {len(skipped_dates)} 天(本地已存在)" if skipped_dates else ""
                ) + (
                    f",失败 {len(failed_dates)} 天" if failed_dates else ""
                )
            }

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    def get_storage_status(self) -> Dict:
        """
        Get the storage configuration and status.

        Returns:
            Storage status dict
        """
        try:
            storage_config = self._get_storage_config()
            config = self._load_config()

            # Local storage status
            local_config =
storage_config.get("local", {}) local_dir = self._get_local_data_dir() local_size = self._calculate_dir_size(local_dir) # 获取分类的日期列表 all_dates = self._get_all_local_dates() news_dates = all_dates["news"] rss_dates = all_dates["rss"] combined_dates = all_dates["all"] local_status = { "data_dir": local_config.get("data_dir", "output"), "retention_days": local_config.get("retention_days", 0), "total_size": f"{local_size / 1024 / 1024:.2f} MB", "total_size_bytes": local_size, "date_count": len(combined_dates), "earliest_date": combined_dates[-1] if combined_dates else None, "latest_date": combined_dates[0] if combined_dates else None, "news": { "date_count": len(news_dates), "dates": news_dates[:10], # 最近 10 天 }, "rss": { "date_count": len(rss_dates), "dates": rss_dates[:10], # 最近 10 天 }, } # 远程存储状态 remote_config = storage_config.get("remote", {}) has_remote = self._has_remote_config() remote_status = { "configured": has_remote, "retention_days": remote_config.get("retention_days", 0), } if has_remote: merged_config = self._get_remote_config() # 脱敏显示 endpoint = merged_config.get("endpoint_url", "") bucket = merged_config.get("bucket_name", "") remote_status["endpoint_url"] = endpoint remote_status["bucket_name"] = bucket # 尝试获取远程日期列表 remote_backend = self._get_remote_backend() if remote_backend: try: remote_dates = remote_backend.list_remote_dates() remote_status["date_count"] = len(remote_dates) remote_status["earliest_date"] = remote_dates[-1] if remote_dates else None remote_status["latest_date"] = remote_dates[0] if remote_dates else None except Exception as e: remote_status["error"] = str(e) # 拉取配置状态 pull_config = storage_config.get("pull", {}) pull_status = { "enabled": pull_config.get("enabled", False), "days": pull_config.get("days", 7), } return { "success": True, "summary": { "description": "存储配置和状态信息", "backend": storage_config.get("backend", "auto") }, "data": { "local": local_status, "remote": remote_status, "pull": pull_status } } except MCPError as e: 
return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def list_available_dates(self, source: str = "both") -> Dict: """ 列出可用的日期范围 Args: source: 数据来源 - "local": 仅本地 - "remote": 仅远程 - "both": 两者都列出(默认) Returns: 日期列表字典 """ try: data_result = {} summary_info = { "description": "可用日期列表", "source": source } # 本地日期 if source in ("local", "both"): all_dates = self._get_all_local_dates() news_dates = all_dates["news"] rss_dates = all_dates["rss"] combined_dates = all_dates["all"] data_result["local"] = { "dates": combined_dates, "count": len(combined_dates), "earliest": combined_dates[-1] if combined_dates else None, "latest": combined_dates[0] if combined_dates else None, "news": { "dates": news_dates, "count": len(news_dates), }, "rss": { "dates": rss_dates, "count": len(rss_dates), }, } # 远程日期 if source in ("remote", "both"): if not self._has_remote_config(): data_result["remote"] = { "configured": False, "dates": [], "count": 0, "earliest": None, "latest": None, "error": "未配置远程存储" } else: remote_backend = self._get_remote_backend() if remote_backend: try: remote_dates = remote_backend.list_remote_dates() data_result["remote"] = { "configured": True, "dates": remote_dates, "count": len(remote_dates), "earliest": remote_dates[-1] if remote_dates else None, "latest": remote_dates[0] if remote_dates else None, } except Exception as e: data_result["remote"] = { "configured": True, "dates": [], "count": 0, "earliest": None, "latest": None, "error": str(e) } else: data_result["remote"] = { "configured": True, "dates": [], "count": 0, "earliest": None, "latest": None, "error": "无法创建远程存储后端" } # 如果同时查询两者,计算差异 if source == "both" and "local" in data_result and "remote" in data_result: local_set = set(data_result["local"]["dates"]) remote_set = set(data_result["remote"].get("dates", [])) data_result["comparison"] = { "only_local": sorted(list(local_set - remote_set), 
reverse=True), "only_remote": sorted(list(remote_set - local_set), reverse=True), "both": sorted(list(local_set & remote_set), reverse=True), } return { "success": True, "summary": summary_info, "data": data_result } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } ================================================ FILE: mcp_server/tools/system.py ================================================ """ 系统管理工具 实现系统状态查询和爬虫触发功能。 """ from pathlib import Path from typing import Dict, List, Optional from ..services.data_service import DataService from ..utils.validators import validate_platforms from ..utils.errors import MCPError, CrawlTaskError class SystemManagementTools: """系统管理工具类""" def __init__(self, project_root: str = None): """ 初始化系统管理工具 Args: project_root: 项目根目录 """ self.data_service = DataService(project_root) if project_root: self.project_root = Path(project_root) else: # 获取项目根目录 current_file = Path(__file__) self.project_root = current_file.parent.parent.parent def get_system_status(self) -> Dict: """ 获取系统运行状态和健康检查信息 Returns: 系统状态字典 Example: >>> tools = SystemManagementTools() >>> result = tools.get_system_status() >>> print(result['system']['version']) """ try: # 获取系统状态 status = self.data_service.get_system_status() return { "success": True, "summary": { "description": "系统运行状态和健康检查信息" }, "data": status } except MCPError as e: return { "success": False, "error": e.to_dict() } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } def trigger_crawl(self, platforms: Optional[List[str]] = None, save_to_local: bool = False, include_url: bool = False) -> Dict: """ 手动触发一次临时爬取任务(可选持久化) Args: platforms: 指定平台列表,为空则爬取所有平台 save_to_local: 是否保存到本地 output 目录,默认 False include_url: 是否包含URL链接,默认False(节省token) Returns: 爬取结果字典,包含新闻数据和保存路径(如果保存) Example: >>> tools = SystemManagementTools() >>> # 
临时爬取,不保存 >>> result = tools.trigger_crawl(platforms=['zhihu', 'weibo']) >>> print(result['data']) >>> # 爬取并保存到本地 >>> result = tools.trigger_crawl(platforms=['zhihu'], save_to_local=True) >>> print(result['saved_files']) """ try: import time import yaml from trendradar.crawler.fetcher import DataFetcher from trendradar.storage.local import LocalStorageBackend from trendradar.storage.base import convert_crawl_results_to_news_data from trendradar.utils.time import get_configured_time, format_date_folder, format_time_filename from ..services.cache_service import get_cache # 参数验证 platforms = validate_platforms(platforms) # 加载配置文件 config_path = self.project_root / "config" / "config.yaml" if not config_path.exists(): raise CrawlTaskError( "配置文件不存在", suggestion=f"请确保配置文件存在: {config_path}" ) # 读取配置 with open(config_path, "r", encoding="utf-8") as f: config_data = yaml.safe_load(f) # 获取平台配置(嵌套结构:{enabled: bool, sources: [...]}) platforms_config = config_data.get("platforms", {}) if not platforms_config.get("enabled", True): raise CrawlTaskError( "热榜平台已禁用", suggestion="请检查 config/config.yaml 中的 platforms.enabled 配置" ) all_platforms = platforms_config.get("sources", []) if not all_platforms: raise CrawlTaskError( "配置文件中没有平台配置", suggestion="请检查 config/config.yaml 中的 platforms.sources 配置" ) # 过滤平台 if platforms: target_platforms = [p for p in all_platforms if p["id"] in platforms] if not target_platforms: raise CrawlTaskError( f"指定的平台不存在: {platforms}", suggestion=f"可用平台: {[p['id'] for p in all_platforms]}" ) else: target_platforms = all_platforms # 构建平台ID列表 ids = [] for platform in target_platforms: if "name" in platform: ids.append((platform["id"], platform["name"])) else: ids.append(platform["id"]) print(f"开始临时爬取,平台: {[p.get('name', p['id']) for p in target_platforms]}") # 初始化数据获取器 advanced = config_data.get("advanced", {}) crawler_config = advanced.get("crawler", {}) proxy_url = None if crawler_config.get("use_proxy"): proxy_url = crawler_config.get("default_proxy") fetcher = 
DataFetcher(proxy_url=proxy_url) request_interval = crawler_config.get("request_interval", 100) # 执行爬取 results, id_to_name, failed_ids = fetcher.crawl_websites( ids_list=ids, request_interval=request_interval ) # 获取当前时间(统一使用 trendradar 的时间工具) # 从配置中读取时区,默认为 Asia/Shanghai timezone = config_data.get("app", {}).get("timezone", "Asia/Shanghai") current_time = get_configured_time(timezone) crawl_date = format_date_folder(None, timezone) crawl_time_str = format_time_filename(timezone) # 转换为标准数据模型 news_data = convert_crawl_results_to_news_data( results=results, id_to_name=id_to_name, failed_ids=failed_ids, crawl_time=crawl_time_str, crawl_date=crawl_date ) # 初始化存储后端 storage = LocalStorageBackend( data_dir=str(self.project_root / "output"), enable_txt=True, enable_html=True, timezone=timezone ) # 尝试持久化数据 save_success = False save_error_msg = "" saved_files = {} try: # 1. 保存到 SQLite (核心持久化) if storage.save_news_data(news_data): save_success = True # 2. 如果请求保存到本地,生成 TXT/HTML 快照 if save_to_local: # 保存 TXT txt_path = storage.save_txt_snapshot(news_data) if txt_path: saved_files["txt"] = txt_path # 保存 HTML (使用简化版生成器) html_content = self._generate_simple_html(results, id_to_name, failed_ids, current_time) html_filename = f"{crawl_time_str}.html" html_path = storage.save_html_report(html_content, html_filename) if html_path: saved_files["html"] = html_path except Exception as e: # 捕获所有保存错误(特别是 Docker 只读卷导致的 PermissionError) print(f"[System] 数据保存失败: {e}") save_success = False save_error_msg = str(e) # 3. 
            # Clear the cache so the next query returns fresh data.
            # Do this even if saving failed: the in-memory data may have been
            # updated some other way, or may be temporary.
            get_cache().clear()
            print("[System] 缓存已清除")

            # Build the response
            news_response_data = []
            for platform_id, titles_data in results.items():
                platform_name = id_to_name.get(platform_id, platform_id)
                for title, info in titles_data.items():
                    news_item = {
                        "platform_id": platform_id,
                        "platform_name": platform_name,
                        "title": title,
                        "ranks": info.get("ranks", [])
                    }
                    if include_url:
                        news_item["url"] = info.get("url", "")
                        news_item["mobile_url"] = info.get("mobileUrl", "")
                    news_response_data.append(news_item)

            result = {
                "success": True,
                "summary": {
                    "description": "爬取任务执行结果",
                    "task_id": f"crawl_{int(time.time())}",
                    "status": "completed",
                    "crawl_time": current_time.strftime("%Y-%m-%d %H:%M:%S"),
                    "total_news": len(news_response_data),
                    "platforms": list(results.keys()),
                    "failed_platforms": failed_ids,
                    "saved_to_local": save_success and save_to_local
                },
                "data": news_response_data
            }

            if save_success:
                if save_to_local:
                    result["saved_files"] = saved_files
                    result["note"] = "数据已保存到 SQLite 数据库及 output 文件夹"
                else:
                    result["note"] = "数据已保存到 SQLite 数据库 (仅内存中返回结果,未生成TXT快照)"
            else:
                # Tell the user explicitly that saving failed
                result["saved_to_local"] = False
                result["save_error"] = save_error_msg
                if "Read-only file system" in save_error_msg or "Permission denied" in save_error_msg:
                    result["note"] = "爬取成功,但无法写入数据库(Docker只读模式)。数据仅在本次返回中有效。"
                else:
                    result["note"] = f"爬取成功但保存失败: {save_error_msg}"

            # Release resources
            storage.cleanup()

            return result

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            import traceback
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e),
                    "traceback": traceback.format_exc()
                }
            }

    def _generate_simple_html(self, results: Dict, id_to_name: Dict, failed_ids: List, now) -> str:
        """Generate a simplified HTML report."""
        html = """MCP 爬取结果 """
        return html

    def _html_escape(self, text: str) -> str:
        """Escape text for safe inclusion in HTML."""
        if not isinstance(text, str):
            text = str(text)
        # "&" must be replaced first so the entities produced below are not escaped again
        return (
            text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
            .replace("'", "&#x27;")
        )

    def check_version(self, proxy_url: Optional[str] = None) -> Dict:
        """
        Check for version updates.

        Checks both components, TrendRadar and MCP Server, for updates.
        The remote version URLs are read from config.yaml:
        - version_check_url: TrendRadar version
        - mcp_version_check_url: MCP Server version

        Args:
            proxy_url: Optional proxy URL used to reach the remote version files

        Returns:
            Version check result dict. `success` reports whether the check ran;
            `data` contains:
            - trendradar: TrendRadar version check result
            - mcp: MCP Server version check result
            - any_update: whether any component needs an update

        Example:
            >>> tools = SystemManagementTools()
            >>> result = tools.check_version()
            >>> print(result['data']['any_update'])
        """
        import yaml
        import requests

        def parse_version(version_str: str):
            """Parse a version string into a tuple; malformed input yields (0, 0, 0)."""
            try:
                parts = version_str.strip().split(".")
                if len(parts) != 3:
                    raise ValueError("版本号格式不正确")
                return int(parts[0]), int(parts[1]), int(parts[2])
            except Exception:
                return 0, 0, 0

        def check_single_version(
            name: str,
            local_version: str,
            remote_url: str,
            proxies: Optional[Dict],
            headers: Dict
        ) -> Dict:
            """Check the version of a single component."""
            try:
                response = requests.get(
                    remote_url, proxies=proxies, headers=headers, timeout=10
                )
                response.raise_for_status()

                remote_version = response.text.strip()
                local_tuple = parse_version(local_version)
                remote_tuple = parse_version(remote_version)

                need_update = local_tuple < remote_tuple

                if need_update:
                    message = f"发现新版本 {remote_version},当前版本 {local_version},建议更新"
                elif local_tuple > remote_tuple:
                    message = f"当前版本 {local_version} 高于远程版本 {remote_version}(可能是开发版本)"
                else:
                    message = f"当前版本 {local_version} 已是最新版本"

                return {
                    "success": True,
                    "name": name,
                    "current_version": local_version,
                    "remote_version": remote_version,
                    "need_update": need_update,
                    "current_parsed": list(local_tuple),
                    "remote_parsed": list(remote_tuple),
                    "message": message
                }
            except requests.exceptions.Timeout:
                return {
                    "success": False,
                    "name": name,
                    "current_version": local_version,
                    "error": "获取远程版本超时"
                }
            except requests.exceptions.RequestException as e:
                return {
                    "success": False,
                    "name": name,
                    "current_version": local_version,
                    "error": f"网络请求失败: {str(e)}"
                }
            except Exception as e:
                return {
"success": False, "name": name, "current_version": local_version, "error": str(e) } try: # 导入本地版本 from trendradar import __version__ as trendradar_version from mcp_server import __version__ as mcp_version # 从配置文件获取远程版本 URL config_path = self.project_root / "config" / "config.yaml" if not config_path.exists(): return { "success": False, "error": { "code": "CONFIG_NOT_FOUND", "message": f"配置文件不存在: {config_path}" } } with open(config_path, "r", encoding="utf-8") as f: config_data = yaml.safe_load(f) advanced_config = config_data.get("advanced", {}) trendradar_url = advanced_config.get( "version_check_url", "https://raw.githubusercontent.com/sansan0/TrendRadar/refs/heads/master/version" ) mcp_url = advanced_config.get( "mcp_version_check_url", "https://raw.githubusercontent.com/sansan0/TrendRadar/refs/heads/master/version_mcp" ) # 配置代理 proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 请求头 headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", "Accept": "text/plain, */*", "Cache-Control": "no-cache", } # 检查两个版本 trendradar_result = check_single_version( "TrendRadar", trendradar_version, trendradar_url, proxies, headers ) mcp_result = check_single_version( "MCP Server", mcp_version, mcp_url, proxies, headers ) # 判断是否有任何更新 any_update = ( (trendradar_result.get("success") and trendradar_result.get("need_update", False)) or (mcp_result.get("success") and mcp_result.get("need_update", False)) ) return { "success": True, "summary": { "description": "版本检查结果(TrendRadar + MCP Server)", "any_update": any_update }, "data": { "trendradar": trendradar_result, "mcp": mcp_result, "any_update": any_update } } except ImportError as e: return { "success": False, "error": { "code": "IMPORT_ERROR", "message": f"无法导入版本信息: {str(e)}" } } except Exception as e: return { "success": False, "error": { "code": "INTERNAL_ERROR", "message": str(e) } } ================================================ FILE: mcp_server/utils/__init__.py 
================================================
"""
Utilities package.

Provides helpers such as parameter validation and error handling.
"""


================================================
FILE: mcp_server/utils/date_parser.py
================================================
"""
Date parsing utilities.

Parses a variety of natural-language date formats, both relative and absolute.
"""

import re
from datetime import datetime, timedelta
from typing import Tuple, Dict, Optional

from .errors import InvalidParameterError


class DateParser:
    """Date parser."""

    # Chinese relative dates, mapped to "days ago"
    CN_DATE_MAPPING = {
        "今天": 0,
        "昨天": 1,
        "前天": 2,
        "大前天": 3,
    }

    # English relative dates, mapped to "days ago"
    EN_DATE_MAPPING = {
        "today": 0,
        "yesterday": 1,
    }

    # Date range expressions (used by resolve_date_range_expression)
    RANGE_EXPRESSIONS = {
        # Chinese expressions
        "今天": "today",
        "昨天": "yesterday",
        "本周": "this_week",
        "这周": "this_week",
        "当前周": "this_week",
        "上周": "last_week",
        "本月": "this_month",
        "这个月": "this_month",
        "当前月": "this_month",
        "上月": "last_month",
        "上个月": "last_month",
        "最近3天": "last_3_days",
        "近3天": "last_3_days",
        "最近7天": "last_7_days",
        "近7天": "last_7_days",
        "最近一周": "last_7_days",
        "过去一周": "last_7_days",
        "最近14天": "last_14_days",
        "近14天": "last_14_days",
        "最近两周": "last_14_days",
        "过去两周": "last_14_days",
        "最近30天": "last_30_days",
        "近30天": "last_30_days",
        "最近一个月": "last_30_days",
        "过去一个月": "last_30_days",
        # English expressions
        "today": "today",
        "yesterday": "yesterday",
        "this week": "this_week",
        "current week": "this_week",
        "last week": "last_week",
        "this month": "this_month",
        "current month": "this_month",
        "last month": "last_month",
        "last 3 days": "last_3_days",
        "past 3 days": "last_3_days",
        "last 7 days": "last_7_days",
        "past 7 days": "last_7_days",
        "past week": "last_7_days",
        "last 14 days": "last_14_days",
        "past 14 days": "last_14_days",
        "last 30 days": "last_30_days",
        "past 30 days": "last_30_days",
        "past month": "last_30_days",
    }

    # Weekday mappings (0 = Monday, 6 = Sunday)
    WEEKDAY_CN = {
        "一": 0, "二": 1, "三": 2, "四": 3,
        "五": 4, "六": 5, "日": 6, "天": 6
    }

    WEEKDAY_EN = {
        "monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
        "friday": 4, "saturday": 5, "sunday": 6
    }

    @staticmethod
    def parse_date_query(date_query: str) -> datetime:
        """
        Parse a date query string.

        Supported formats:
        - Relative dates (Chinese): 今天, 昨天, 前天, 大前天, N天前
        - Relative dates (English): today, yesterday, N days ago
        - Weekdays (Chinese): 上周一, 上周二, 本周三
        - Weekdays (English): last monday, this friday
        - Absolute dates: 2025-10-10, 10月10日, 2025年10月10日

        Args:
            date_query: Date query string

        Returns:
            datetime object

        Raises:
            InvalidParameterError: The date format is not recognized

        Examples:
            >>> DateParser.parse_date_query("今天")
            datetime(2025, 10, 11)
            >>> DateParser.parse_date_query("昨天")
            datetime(2025, 10, 10)
            >>> DateParser.parse_date_query("3天前")
            datetime(2025, 10, 8)
            >>> DateParser.parse_date_query("2025-10-10")
            datetime(2025, 10, 10)
        """
        if not date_query or not isinstance(date_query, str):
            raise InvalidParameterError(
                "Date query string must not be empty",
                suggestion="Provide a valid date query, e.g. 今天, 昨天, 2025-10-10"
            )

        date_query = date_query.strip().lower()

        # 1. Common Chinese relative dates
        if date_query in DateParser.CN_DATE_MAPPING:
            days_ago = DateParser.CN_DATE_MAPPING[date_query]
            return datetime.now() - timedelta(days=days_ago)

        # 2. Common English relative dates
        if date_query in DateParser.EN_DATE_MAPPING:
            days_ago = DateParser.EN_DATE_MAPPING[date_query]
            return datetime.now() - timedelta(days=days_ago)

        # 3. "N天前" or "N days ago"
        cn_days_ago_match = re.match(r'(\d+)\s*天前', date_query)
        if cn_days_ago_match:
            days = int(cn_days_ago_match.group(1))
            if days > 365:
                raise InvalidParameterError(
                    f"Day count too large: {days} days",
                    suggestion="Use a relative date of at most 365 days, or an absolute date"
                )
            return datetime.now() - timedelta(days=days)

        en_days_ago_match = re.match(r'(\d+)\s*days?\s+ago', date_query)
        if en_days_ago_match:
            days = int(en_days_ago_match.group(1))
            if days > 365:
                raise InvalidParameterError(
                    f"Day count too large: {days} days",
                    suggestion="Use a relative date of at most 365 days, or an absolute date"
                )
            return datetime.now() - timedelta(days=days)

        # 4. Weekdays (Chinese): 上周一, 本周三
        cn_weekday_match = re.match(r'(上|本)周([一二三四五六日天])', date_query)
        if cn_weekday_match:
            week_type = cn_weekday_match.group(1)  # 上 (last) or 本 (this)
            weekday_str = cn_weekday_match.group(2)
            target_weekday = DateParser.WEEKDAY_CN[weekday_str]
            return DateParser._get_date_by_weekday(target_weekday, week_type == "上")

        # 5. Weekdays (English): last monday, this friday
        en_weekday_match = re.match(r'(last|this)\s+(monday|tuesday|wednesday|thursday|friday|saturday|sunday)', date_query)
        if en_weekday_match:
            week_type = en_weekday_match.group(1)  # last or this
            weekday_str = en_weekday_match.group(2)
            target_weekday = DateParser.WEEKDAY_EN[weekday_str]
            return DateParser._get_date_by_weekday(target_weekday, week_type == "last")

        # 6. Absolute date: YYYY-MM-DD
        iso_date_match = re.match(r'(\d{4})-(\d{1,2})-(\d{1,2})', date_query)
        if iso_date_match:
            year = int(iso_date_match.group(1))
            month = int(iso_date_match.group(2))
            day = int(iso_date_match.group(3))
            try:
                return datetime(year, month, day)
            except ValueError as e:
                raise InvalidParameterError(
                    f"Invalid date: {date_query}",
                    suggestion=f"Bad date value: {str(e)}"
                )

        # 7. Chinese date: MM月DD日 or YYYY年MM月DD日
        cn_date_match = re.match(r'(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日', date_query)
        if cn_date_match:
            year_str = cn_date_match.group(1)
            month = int(cn_date_match.group(2))
            day = int(cn_date_match.group(3))

            # Without a year, use the current year
            if year_str:
                year = int(year_str)
            else:
                year = datetime.now().year
                # A month later than the current month means last year
                current_month = datetime.now().month
                if month > current_month:
                    year -= 1

            try:
                return datetime(year, month, day)
            except ValueError as e:
                raise InvalidParameterError(
                    f"Invalid date: {date_query}",
                    suggestion=f"Bad date value: {str(e)}"
                )

        # 8. Slash format: YYYY/MM/DD or MM/DD
        slash_date_match = re.match(r'(?:(\d{4})/)?(\d{1,2})/(\d{1,2})', date_query)
        if slash_date_match:
            year_str = slash_date_match.group(1)
            month = int(slash_date_match.group(2))
            day = int(slash_date_match.group(3))

            if year_str:
                year = int(year_str)
            else:
                year = datetime.now().year
                current_month = datetime.now().month
                if month > current_month:
                    year -= 1

            try:
                return datetime(year, month, day)
            except ValueError as e:
                raise InvalidParameterError(
                    f"Invalid date: {date_query}",
                    suggestion=f"Bad date value: {str(e)}"
                )

        # No format matched
        raise InvalidParameterError(
            f"Unrecognized date format: {date_query}",
            suggestion=(
                "Supported formats:\n"
                "- Relative dates: 今天, 昨天, 前天, 3天前, today, yesterday, 3 days ago\n"
                "- Weekdays: 上周一, 本周三, last monday, this friday\n"
                "- Absolute dates: 2025-10-10, 10月10日, 2025年10月10日"
            )
        )

    @staticmethod
    def _get_date_by_weekday(target_weekday: int, is_last_week: bool) -> datetime:
        """
        Get the date for a given weekday.

        Args:
            target_weekday: Target weekday (0 = Monday, 6 = Sunday)
            is_last_week: Whether the weekday belongs to last week

        Returns:
            datetime object
        """
        today = datetime.now()
        current_weekday = today.weekday()

        # Compute the offset in days
        if is_last_week:
            # A day in last week
            days_diff = current_weekday - target_weekday + 7
        else:
            # A day in this week
            days_diff = current_weekday - target_weekday
            if days_diff < 0:
                days_diff += 7

        return today - timedelta(days=days_diff)

    @staticmethod
    def format_date_folder(date: datetime) -> str:
        """
        Format a date as a folder name.

        Args:
            date: datetime object

        Returns:
            Folder name in YYYY-MM-DD format

        Examples:
            >>> DateParser.format_date_folder(datetime(2025, 10, 11))
            '2025-10-11'
        """
        return date.strftime("%Y-%m-%d")

    @staticmethod
    def validate_date_not_future(date: datetime) -> None:
        """
        Validate that a date is not in the future.

        Args:
            date: Date to validate

        Raises:
            InvalidParameterError: The date is in the future
        """
        if date.date() > datetime.now().date():
            raise InvalidParameterError(
                f"Cannot query a future date: {date.strftime('%Y-%m-%d')}",
                suggestion="Use today or a past date"
            )

    @staticmethod
    def validate_date_not_too_old(date: datetime, max_days: int = 365) -> None:
        """
        Validate that a date is not too far in the past.

        Args:
            date: Date to validate
            max_days: Maximum age in days

        Raises:
            InvalidParameterError: The date is too old
        """
        days_ago = (datetime.now().date() - date.date()).days
        if days_ago > max_days:
            raise InvalidParameterError(
                f"Date is too old: {date.strftime('%Y-%m-%d')} ({days_ago} days ago)",
                suggestion=f"Query data from within the last {max_days} days"
            )

    @staticmethod
    def resolve_date_range_expression(expression: str) -> Dict:
        """
        Resolve a natural-language date expression into a standard date range.

        This method is designed for the MCP tools: expressions are resolved on
        the server so that AI models do not compute dates themselves and
        produce inconsistent results.

        Args:
            expression: Natural-language date expression. Supported:
                - Single day: "今天", "昨天", "today", "yesterday"
                - This/last week: "本周", "上周", "this week", "last week"
                - This/last month: "本月", "上月", "this month", "last month"
                - Last N days: "最近7天", "最近30天", "last 7 days", "last 30 days"
                - Dynamic N days: "最近5天", "last 10 days"

        Returns:
            Result dict:
            {
                "success": True,
                "expression": "本周",
                "normalized": "this_week",
                "date_range": {
                    "start": "2025-11-24",
                    "end": "2025-11-26"
                },
                "current_date": "2025-11-26",
                "description": "This week (Monday to Sunday, 11-24 to 11-26)"
            }

        Raises:
            InvalidParameterError: The date expression is not recognized

        Examples:
            >>> DateParser.resolve_date_range_expression("本周")
            {"success": True, "date_range": {"start": "2025-11-24", "end": "2025-11-26"}, ...}
            >>> DateParser.resolve_date_range_expression("最近7天")
            {"success": True, "date_range": {"start": "2025-11-20", "end": "2025-11-26"}, ...}
        """
        if not expression or not isinstance(expression, str):
            raise InvalidParameterError(
                "Date expression must not be empty",
                suggestion="Provide a valid date expression, e.g. 本周, 最近7天, last week"
            )

        expression_lower = expression.strip().lower()
        today = datetime.now()
        today_str = today.strftime("%Y-%m-%d")

        # 1. Try the predefined expressions
        normalized = DateParser.RANGE_EXPRESSIONS.get(expression_lower)

        # 2. Try the dynamic "最近N天" / "last N days" patterns
        if not normalized:
            # Chinese: 最近N天
            cn_match = re.match(r'最近(\d+)天', expression_lower)
            if cn_match:
                days = int(cn_match.group(1))
                normalized = f"last_{days}_days"

            # English: last N days
            en_match = re.match(r'(?:last|past)\s+(\d+)\s+days?', expression_lower)
            if en_match:
                days = int(en_match.group(1))
                normalized = f"last_{days}_days"

        if not normalized:
            # List the supported expressions
            supported_cn = ["今天", "昨天", "本周", "上周", "本月", "上月",
                            "最近7天", "最近30天", "最近N天"]
            supported_en = ["today", "yesterday", "this week", "last week",
                            "this month", "last month", "last 7 days", "last N days"]
            raise InvalidParameterError(
                f"Unrecognized date expression: {expression}",
                suggestion=f"Supported expressions:\nChinese: {', '.join(supported_cn)}\nEnglish: {', '.join(supported_en)}"
            )

        # 3. Compute the date range for the normalized type
        start_date, end_date, description = DateParser._calculate_date_range(
            normalized, today
        )

        return {
            "success": True,
            "expression": expression,
            "normalized": normalized,
            "date_range": {
                "start": start_date.strftime("%Y-%m-%d"),
                "end": end_date.strftime("%Y-%m-%d")
            },
            "current_date": today_str,
            "description": description
        }

    @staticmethod
    def _calculate_date_range(
        normalized: str,
        today: datetime
    ) -> Tuple[datetime, datetime, str]:
        """
        Compute the actual date range for a normalized date type.

        Args:
            normalized: Normalized date type
            today: Current date

        Returns:
            (start_date, end_date, description) tuple
        """
        # Single-day types
        if normalized == "today":
            return today, today, "Today"

        if normalized == "yesterday":
            yesterday = today - timedelta(days=1)
            return yesterday, yesterday, "Yesterday"

        # This week (Monday to Sunday)
        if normalized == "this_week":
            # Find this week's Monday
            weekday = today.weekday()  # 0 = Monday, 6 = Sunday
            start = today - timedelta(days=weekday)
            end = start + timedelta(days=6)
            # If the week is not over yet, end must not go past today
            if end > today:
                end = today
            return start, end, f"This week (Monday to Sunday, {start.strftime('%m-%d')} to {end.strftime('%m-%d')})"

        # Last week (last Monday to last Sunday)
        if normalized == "last_week":
            weekday = today.weekday()
            # This week's Monday
            this_monday = today - timedelta(days=weekday)
            # Last week's Monday
            start = this_monday - timedelta(days=7)
            end = start + timedelta(days=6)
            return start, end, f"Last week ({start.strftime('%m-%d')} to {end.strftime('%m-%d')})"

        # This month (the 1st through today)
        if normalized == "this_month":
            start = today.replace(day=1)
            return start, today, f"This month ({start.strftime('%m-%d')} to {today.strftime('%m-%d')})"

        # Last month (the 1st through the last day of last month)
        if normalized == "last_month":
            # Last day of last month = 1st of this month minus one day
            first_of_this_month = today.replace(day=1)
            end = first_of_this_month - timedelta(days=1)
            start = end.replace(day=1)
            return start, end, f"Last month ({start.strftime('%Y-%m-%d')} to {end.strftime('%Y-%m-%d')})"

        # Last N days (last_N_days format)
        match = re.match(r'last_(\d+)_days', normalized)
        if match:
            days = int(match.group(1))
            start = today - timedelta(days=days - 1)  # Includes today, hence days - 1
            return start, today, f"Last {days} days ({start.strftime('%m-%d')} to {today.strftime('%m-%d')})"

        # Fallback: return today
        return today, today, "Today (default)"

    @staticmethod
    def get_supported_expressions() -> Dict[str, list]:
        """
        Get the list of supported date expressions.

        Returns:
            Expressions grouped by category (single day, week, month,
            last N days, dynamic day count)
        """
        return {
            "单日": ["今天", "昨天", "today", "yesterday"],
            "周": ["本周", "上周", "this week", "last week"],
            "月": ["本月", "上月", "this month", "last month"],
            "最近N天": ["最近3天", "最近7天", "最近14天", "最近30天",
                     "last 3 days", "last 7 days", "last 14 days", "last 30 days"],
            "动态天数": ["最近N天", "last N days"]
        }


================================================
FILE: mcp_server/utils/errors.py
================================================
"""
Custom error classes.

Defines every custom exception type used by the MCP Server.
"""

from typing import Optional, List, Callable


# ==================== Lazy loading of the supported platform list ====================

_get_supported_platforms: Optional[Callable[[], List[str]]] = None


def _load_supported_platforms() -> List[str]:
    """Lazily load the list of supported platforms."""
    global _get_supported_platforms
    if _get_supported_platforms is None:
        try:
            from .validators import get_supported_platforms
            _get_supported_platforms = get_supported_platforms
        except ImportError:
            # Fallback: return an empty list
            return []
    return _get_supported_platforms()


class MCPError(Exception):
    """Base class for MCP tool errors."""

    def __init__(self, message: str, code: str = "MCP_ERROR", suggestion: Optional[str] = None):
        super().__init__(message)
        self.code = code
        self.message = message
        self.suggestion = suggestion

    def to_dict(self) -> dict:
        """Convert to a dict."""
        error_dict = {
            "code": self.code,
            "message": self.message
        }
        if self.suggestion:
            error_dict["suggestion"] = self.suggestion
        return error_dict


class DataNotFoundError(MCPError):
    """Data not found."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        super().__init__(
            message=message,
            code="DATA_NOT_FOUND",
            suggestion=suggestion or "Check the date range or wait for the crawl task to finish"
        )


class InvalidParameterError(MCPError):
    """Invalid parameter."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        super().__init__(
            message=message,
            code="INVALID_PARAMETER",
            suggestion=suggestion or "Check that the parameter format is correct"
        )


class ConfigurationError(MCPError):
    """Configuration error."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        super().__init__(
            message=message,
            code="CONFIGURATION_ERROR",
            suggestion=suggestion or "Check that the config file is correct"
        )


class PlatformNotSupportedError(MCPError):
    """Platform not supported."""

    def __init__(self, platform: str):
        supported = _load_supported_platforms()
        suggestion = f"Supported platforms: {', '.join(supported)}" if supported else "Check the platform configuration in config/config.yaml"
        super().__init__(
            message=f"Platform '{platform}' is not supported",
            code="PLATFORM_NOT_SUPPORTED",
            suggestion=suggestion
        )


class CrawlTaskError(MCPError):
    """Crawl task error."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        super().__init__(
            message=message,
            code="CRAWL_TASK_ERROR",
            suggestion=suggestion or "Retry later or check the logs"
        )


class FileParseError(MCPError):
    """File parse error."""

    def __init__(self, file_path: str, reason: str):
        super().__init__(
            message=f"Failed to parse file {file_path}: {reason}",
            code="FILE_PARSE_ERROR",
            suggestion="Check that the file format is correct"
        )


================================================
FILE: mcp_server/utils/validators.py
================================================
"""
Parameter validation utilities.

Provides unified parameter validation.
Handles MCP clients that serialize parameters as strings.
"""

from datetime import datetime
from typing import List, Optional, Union
import os
import json
import yaml
import ast

from .errors import InvalidParameterError
from .date_parser import DateParser


# ==================== Helpers: handling string-serialized parameters ====================

def _parse_string_to_list(value: str) -> List[str]:
    """
    Parse a string into a list.

    Supported formats:
    - JSON array: '["zhihu", "weibo"]'
    - Python list string: "['zhihu', 'weibo']"
    - Comma-separated: "zhihu, weibo" or "zhihu,weibo"

    Args:
        value: String value

    Returns:
        Parsed list

    Raises:
        InvalidParameterError: Parsing failed
    """
    value = value.strip()
    if not value:
        return []

    # Try JSON: '["zhihu", "weibo"]'
    try:
        parsed = json.loads(value)
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
        # Not a list: fall through to the other strategies
    except json.JSONDecodeError:
        pass

    # Try a Python literal: "['zhihu', 'weibo']"
    try:
        parsed = ast.literal_eval(value)
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
        if isinstance(parsed, str):
            # A single string: wrap it in a list
            return [parsed]
    except (ValueError, SyntaxError):
        pass

    # Try comma-separated: "zhihu, weibo" or "zhihu,weibo"
    if ',' in value:
        items = [item.strip() for item in value.split(',')]
        return [item for item in items if item]

    # Single value
    return [value]


def _parse_string_to_int(value: str, param_name: str = "parameter") -> int:
    """
    Parse a string into an integer.

    Args:
        value: String value
        param_name: Parameter name (used in error messages)

    Returns:
        Parsed integer

    Raises:
        InvalidParameterError: Parsing failed
    """
    value = value.strip()
    try:
        # Try a direct conversion
        return int(value)
    except ValueError:
        pass

    # Try parsing as a float, then truncating
    try:
        return int(float(value))
    except (ValueError, OverflowError):
        # OverflowError covers inputs such as "inf"
        raise InvalidParameterError(
            f"{param_name} must be an integer; could not parse: {value}",
            suggestion="Provide a valid integer, e.g. 10, 50, 100"
        )


def _parse_string_to_float(value: str, param_name: str = "parameter") -> float:
    """
    Parse a string into a float.

    Args:
        value: String value
        param_name: Parameter name (used in error messages)

    Returns:
        Parsed float

    Raises:
        InvalidParameterError: Parsing failed
    """
    value = value.strip()
    try:
        return float(value)
    except ValueError:
        raise InvalidParameterError(
            f"{param_name} must be a number; could not parse: {value}",
            suggestion="Provide a valid number, e.g. 0.6, 3.0"
        )


def _parse_string_to_bool(value: str) -> bool:
    """
    Parse a string into a boolean.

    Args:
        value: String value

    Returns:
        Parsed boolean
    """
    value = value.strip().lower()
    if value in ('true', '1', 'yes', 'on'):
        return True
    elif value in ('false', '0', 'no', 'off', ''):
        return False
    else:
        # Any other non-empty string counts as True
        return bool(value)


# Platform list cache keyed on mtime (avoids re-reading config.yaml on every MCP call)
_platforms_cache: Optional[List[str]] = None
_platforms_config_mtime: float = 0.0
_platforms_config_path: Optional[str] = None


def get_supported_platforms() -> List[str]:
    """
    Load the supported platform list from config.yaml (with an mtime cache).

    config.yaml is re-read only when it has been modified, which avoids
    repeated IO on every MCP call.

    Returns:
        List of platform IDs

    Note:
        - Returns an empty list if reading fails, which lets every platform through (fallback strategy)
        - The platform list comes from the platforms section of config/config.yaml
    """
    global _platforms_cache, _platforms_config_mtime, _platforms_config_path

    try:
        if _platforms_config_path is None:
            current_dir = os.path.dirname(os.path.abspath(__file__))
            _platforms_config_path = os.path.normpath(
                os.path.join(current_dir, "..", "..", "config", "config.yaml")
            )

        current_mtime = os.path.getmtime(_platforms_config_path)
        if _platforms_cache is not None and current_mtime == _platforms_config_mtime:
            return _platforms_cache

        with open(_platforms_config_path, 'r', encoding='utf-8') as f:
            config = yaml.safe_load(f)
        platforms_config = config.get('platforms', {})
        sources = platforms_config.get('sources', [])
        _platforms_cache = [p['id'] for p in sources if 'id' in p]
        _platforms_config_mtime = current_mtime
        return _platforms_cache
    except Exception as e:
        print(f"Warning: failed to load platform config: {e}")
        return []


def validate_platforms(platforms: Optional[Union[List[str], str]]) -> List[str]:
    """
    Validate the platform list.

    Args:
        platforms: List of platform IDs or a string; None means every platform
                   configured in config.yaml. Supported formats:
                   - None: use the default platforms
                   - ["zhihu", "weibo"]: JSON array
                   - '["zhihu", "weibo"]': JSON array string
                   - "['zhihu', 'weibo']": Python list string
                   - "zhihu, weibo": comma-separated string
                   - "zhihu": single platform string

    Returns:
        Validated platform list

    Raises:
        InvalidParameterError: A platform is not supported

    Note:
        - With platforms=None, returns the platform list configured in config.yaml
        - Each platform ID is checked against the platforms section of config.yaml
        - If the config fails to load, every platform is let through (fallback strategy)
    """
    supported_platforms = get_supported_platforms()

    if platforms is None:
        # Return the platform list from the config file (the user's defaults)
        return supported_platforms if supported_platforms else []

    # Accept list input in string form (some MCP clients serialize JSON arrays as strings)
    if isinstance(platforms, str):
platforms = _parse_string_to_list(platforms) if not platforms: # 空字符串或解析后为空,使用默认平台 return supported_platforms if supported_platforms else [] if not isinstance(platforms, list): raise InvalidParameterError("platforms 参数必须是列表类型") if not platforms: # 空列表时,返回配置文件中的平台列表 return supported_platforms if supported_platforms else [] # 如果配置加载失败(supported_platforms为空),允许所有平台通过 if not supported_platforms: print("警告:平台配置未加载,跳过平台验证") return platforms # 验证每个平台是否在配置中 invalid_platforms = [p for p in platforms if p not in supported_platforms] if invalid_platforms: raise InvalidParameterError( f"不支持的平台: {', '.join(invalid_platforms)}", suggestion=f"支持的平台(来自config.yaml): {', '.join(supported_platforms)}" ) return platforms def validate_limit(limit: Optional[Union[int, str]], default: int = 20, max_limit: int = 1000) -> int: """ 验证数量限制参数 Args: limit: 限制数量(整数或字符串) default: 默认值 max_limit: 最大限制 Returns: 验证后的限制值 Raises: InvalidParameterError: 参数无效 """ if limit is None: return default # 支持字符串形式的整数(某些 MCP 客户端会将数字序列化为字符串) if isinstance(limit, str): limit = _parse_string_to_int(limit, "limit") if not isinstance(limit, int): raise InvalidParameterError("limit 参数必须是整数类型") if limit <= 0: raise InvalidParameterError("limit 必须大于0") if limit > max_limit: raise InvalidParameterError( f"limit 不能超过 {max_limit}", suggestion=f"请使用分页或降低limit值" ) return limit def validate_date(date_str: str) -> datetime: """ 验证日期格式 Args: date_str: 日期字符串 (YYYY-MM-DD) Returns: datetime对象 Raises: InvalidParameterError: 日期格式错误 """ try: return datetime.strptime(date_str, "%Y-%m-%d") except ValueError: raise InvalidParameterError( f"日期格式错误: {date_str}", suggestion="请使用 YYYY-MM-DD 格式,例如: 2025-10-11" ) def normalize_date_range(date_range: Optional[Union[dict, str]]) -> Optional[Union[dict, str]]: """ 规范化 date_range 参数 某些 MCP 客户端(特别是 HTTP 方式)会将 JSON 对象序列化为字符串传入。 此函数尝试将 JSON 字符串解析为 dict,如果不是 JSON 格式则保持原样。 Args: date_range: 日期范围,可能是: - dict: {"start": "2025-01-01", "end": "2025-01-07"} - JSON 字符串: '{"start": "2025-01-01", "end": 
"2025-01-07"}' - 普通字符串: "今天", "昨天", "2025-01-01" - None Returns: 规范化后的 date_range(dict 或普通字符串) Examples: >>> normalize_date_range('{"start":"2025-01-01","end":"2025-01-07"}') {"start": "2025-01-01", "end": "2025-01-07"} >>> normalize_date_range("今天") "今天" >>> normalize_date_range({"start": "2025-01-01", "end": "2025-01-07"}) {"start": "2025-01-01", "end": "2025-01-07"} """ if date_range is None: return None # 如果已经是 dict,直接返回 if isinstance(date_range, dict): return date_range # 如果是字符串,尝试解析为 JSON if isinstance(date_range, str): # 检查是否看起来像 JSON 对象 stripped = date_range.strip() if stripped.startswith('{') and stripped.endswith('}'): try: parsed = json.loads(stripped) if isinstance(parsed, dict): return parsed except json.JSONDecodeError: pass # 解析失败,当作普通字符串处理 return date_range def validate_date_range(date_range: Optional[Union[dict, str]]) -> Optional[tuple]: """ 验证日期范围 Args: date_range: 日期范围,支持多种格式: - dict: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} - JSON 字符串: '{"start": "2025-01-01", "end": "2025-01-07"}' - 单日字符串: "2025-01-01"(自动转为同一天的范围) - 自然语言: "今天", "昨天", "本周", "最近7天" 等 Returns: (start_date, end_date) 元组,或 None Raises: InvalidParameterError: 日期范围无效 """ if date_range is None: return None # 支持字符串形式的输入 if isinstance(date_range, str): stripped = date_range.strip() # 1. 检查是否是 JSON 对象格式 if stripped.startswith('{') and stripped.endswith('}'): try: date_range = json.loads(stripped) except json.JSONDecodeError as e: raise InvalidParameterError( f"date_range JSON 解析失败: {e}", suggestion='请使用正确的JSON格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}' ) # 2. 检查是否是单日字符串格式 YYYY-MM-DD elif len(stripped) == 10 and stripped[4] == '-' and stripped[7] == '-': try: single_date = datetime.strptime(stripped, "%Y-%m-%d") return (single_date, single_date) except ValueError: raise InvalidParameterError( f"日期格式错误: {stripped}", suggestion="请使用 YYYY-MM-DD 格式,例如: 2025-10-11" ) # 3. 
尝试自然语言解析 else: try: result = DateParser.resolve_date_range_expression(stripped) if result.get("success"): dr = result["date_range"] start_date = datetime.strptime(dr["start"], "%Y-%m-%d") end_date = datetime.strptime(dr["end"], "%Y-%m-%d") return (start_date, end_date) else: raise InvalidParameterError( f"无法识别的日期表达式: {stripped}", suggestion="支持格式: YYYY-MM-DD, {\"start\": \"...\", \"end\": \"...\"}, 或自然语言(今天、本周、最近7天等)" ) except InvalidParameterError: raise except Exception: raise InvalidParameterError( f"日期解析失败: {stripped}", suggestion="支持格式: YYYY-MM-DD, {\"start\": \"...\", \"end\": \"...\"}, 或自然语言(今天、本周、最近7天等)" ) if not isinstance(date_range, dict): raise InvalidParameterError( "date_range 必须是字典类型、日期字符串或有效的JSON字符串", suggestion='例如: {"start": "2025-10-01", "end": "2025-10-11"} 或 "2025-10-01"' ) start_str = date_range.get("start") end_str = date_range.get("end") if not start_str or not end_str: raise InvalidParameterError( "date_range 必须包含 start 和 end 字段", suggestion='例如: {"start": "2025-10-01", "end": "2025-10-11"}' ) start_date = validate_date(start_str) end_date = validate_date(end_str) if start_date > end_date: raise InvalidParameterError( "开始日期不能晚于结束日期", suggestion=f"start: {start_str}, end: {end_str}" ) # 检查日期是否在未来 today = datetime.now().date() if start_date.date() > today or end_date.date() > today: # 获取可用日期范围提示 try: from ..services.data_service import DataService data_service = DataService() earliest, latest = data_service.get_available_date_range() if earliest and latest: available_range = f"{earliest.strftime('%Y-%m-%d')} 至 {latest.strftime('%Y-%m-%d')}" else: available_range = "无可用数据" except Exception: available_range = "未知(请检查 output 目录)" future_dates = [] if start_date.date() > today: future_dates.append(start_str) if end_date.date() > today and end_str != start_str: future_dates.append(end_str) raise InvalidParameterError( f"不允许查询未来日期: {', '.join(future_dates)}(当前日期: {today.strftime('%Y-%m-%d')})", suggestion=f"当前可用数据范围: {available_range}" ) return 
(start_date, end_date) def validate_keyword(keyword: str) -> str: """ 验证关键词 Args: keyword: 搜索关键词 Returns: 处理后的关键词 Raises: InvalidParameterError: 关键词无效 """ if not keyword: raise InvalidParameterError("keyword 不能为空") if not isinstance(keyword, str): raise InvalidParameterError("keyword 必须是字符串类型") keyword = keyword.strip() if not keyword: raise InvalidParameterError("keyword 不能为空白字符") if len(keyword) > 100: raise InvalidParameterError( "keyword 长度不能超过100个字符", suggestion="请使用更简洁的关键词" ) return keyword def validate_top_n(top_n: Optional[Union[int, str]], default: int = 10) -> int: """ 验证TOP N参数 Args: top_n: TOP N数量(整数或字符串) default: 默认值 Returns: 验证后的值 Raises: InvalidParameterError: 参数无效 """ return validate_limit(top_n, default=default, max_limit=100) def validate_mode(mode: Optional[str], valid_modes: List[str], default: str) -> str: """ 验证模式参数 Args: mode: 模式字符串 valid_modes: 有效模式列表 default: 默认模式 Returns: 验证后的模式 Raises: InvalidParameterError: 模式无效 """ if mode is None: return default if not isinstance(mode, str): raise InvalidParameterError("mode 必须是字符串类型") if mode not in valid_modes: raise InvalidParameterError( f"无效的模式: {mode}", suggestion=f"支持的模式: {', '.join(valid_modes)}" ) return mode def validate_config_section(section: Optional[str]) -> str: """ 验证配置节参数 Args: section: 配置节名称 Returns: 验证后的配置节 Raises: InvalidParameterError: 配置节无效 """ valid_sections = ["all", "crawler", "push", "keywords", "weights"] return validate_mode(section, valid_sections, "all") def validate_threshold( threshold: Optional[Union[float, int, str]], default: float = 0.6, min_value: float = 0.0, max_value: float = 1.0, param_name: str = "threshold" ) -> float: """ 验证阈值参数(浮点数) Args: threshold: 阈值(浮点数、整数或字符串) default: 默认值 min_value: 最小值 max_value: 最大值 param_name: 参数名(用于错误消息) Returns: 验证后的阈值 Raises: InvalidParameterError: 参数无效 """ if threshold is None: return default # 支持字符串形式的数字(某些 MCP 客户端会将数字序列化为字符串) if isinstance(threshold, str): threshold = _parse_string_to_float(threshold, param_name) # 整数转浮点数 if 
isinstance(threshold, int): threshold = float(threshold) if not isinstance(threshold, float): raise InvalidParameterError( f"{param_name} 必须是数字类型", suggestion=f"请提供 {min_value} 到 {max_value} 之间的数字" ) if threshold < min_value or threshold > max_value: raise InvalidParameterError( f"{param_name} 必须在 {min_value} 到 {max_value} 之间,当前值: {threshold}", suggestion=f"推荐值: {default}" ) return threshold def validate_date_query( date_query: str, allow_future: bool = False, max_days_ago: int = 365 ) -> datetime: """ 验证并解析日期查询字符串 Args: date_query: 日期查询字符串 allow_future: 是否允许未来日期 max_days_ago: 允许查询的最大天数 Returns: 解析后的datetime对象 Raises: InvalidParameterError: 日期查询无效 Examples: >>> validate_date_query("昨天") datetime(2025, 10, 10) >>> validate_date_query("2025-10-10") datetime(2025, 10, 10) """ if not date_query: raise InvalidParameterError( "日期查询字符串不能为空", suggestion="请提供日期查询,如:今天、昨天、2025-10-10" ) # 使用DateParser解析日期 parsed_date = DateParser.parse_date_query(date_query) # 验证日期不在未来 if not allow_future: DateParser.validate_date_not_future(parsed_date) # 验证日期不太久远 DateParser.validate_date_not_too_old(parsed_date, max_days=max_days_ago) return parsed_date ================================================ FILE: pyproject.toml ================================================ [project] name = "trendradar" version = "6.5.0" description = "TrendRadar - 热点新闻聚合与分析工具" requires-python = ">=3.10" dependencies = [ "requests>=2.32.5,<3.0.0", "pytz>=2025.2,<2026.0", "PyYAML>=6.0.3,<7.0.0", "fastmcp>=2.12.0,<2.14.0", "websockets>=13.0,<14.0", "feedparser>=6.0.0,<7.0.0", "boto3>=1.35.0,<2.0.0", "litellm>=1.57.0,<2.0.0", "json-repair>=0.58.3,<1.0.0", "tenacity==8.5.0" ] [project.scripts] trendradar = "trendradar.__main__:main" trendradar-mcp = "mcp_server.server:run_server" [dependency-groups] dev = [] [build-system] requires = ["hatchling"] build-backend = "hatchling.build" [tool.hatch.build.targets.wheel] packages = ["trendradar", "mcp_server"] ================================================ FILE: 
requirements.txt
================================================
requests>=2.32.5,<3.0.0
pytz>=2025.2,<2026.0
PyYAML>=6.0.3,<7.0.0
fastmcp>=2.12.0,<2.14.0
websockets>=13.0,<14.0
boto3>=1.35.0,<2.0.0
feedparser>=6.0.0,<7.0.0
litellm>=1.57.0,<2.0.0
json-repair>=0.58.3,<1.0.0
tenacity==8.5.0


================================================
FILE: setup-mac.sh
================================================
#!/bin/bash

# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
BOLD='\033[1m'
NC='\033[0m' # No Color

echo -e "${BOLD}╔════════════════════════════════════════╗${NC}"
echo -e "${BOLD}║  TrendRadar MCP One-Click Setup (Mac)  ║${NC}"
echo -e "${BOLD}╚════════════════════════════════════════╝${NC}"
echo ""

# Resolve the project root directory
PROJECT_ROOT="$(cd "$(dirname "$0")" && pwd)"

echo -e "📍 Project directory: ${BLUE}${PROJECT_ROOT}${NC}"
echo ""

# Check whether UV is installed
if ! command -v uv &> /dev/null; then
    echo -e "${YELLOW}[1/3] 🔧 UV is not installed, installing automatically...${NC}"
    echo "Note: UV is a fast Python package manager and only needs to be installed once"
    echo ""
    curl -LsSf https://astral.sh/uv/install.sh | sh

    echo ""
    echo "Refreshing the PATH environment variable..."
    echo ""

    # Add UV to PATH. Current installers place uv in ~/.local/bin;
    # older releases used ~/.cargo/bin, so both are checked.
    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"

    # Verify that UV is actually available
    if ! command -v uv &> /dev/null; then
        echo -e "${RED}❌ [Error] UV installation failed${NC}"
        echo ""
        echo "Possible causes:"
        echo "  1. Network problem, the install script could not be downloaded"
        echo "  2. Insufficient permissions on the install path"
        echo "  3. The install script failed while running"
        echo ""
        echo "Solutions:"
        echo "  1. Check that your network connection works"
        echo "  2. Install manually: https://docs.astral.sh/uv/getting-started/installation/"
        echo "  3. Or run: curl -LsSf https://astral.sh/uv/install.sh | sh"
        exit 1
    fi

    echo -e "${GREEN}✅ [Success] UV installed${NC}"
    echo -e "${YELLOW}⚠️ Please re-run this script to continue${NC}"
    exit 0
else
    echo -e "${GREEN}[1/3] ✅ UV is installed${NC}"
    uv --version
fi

echo ""
echo "[2/3] 📦 Installing project dependencies..."
echo "Note: this may take 1-2 minutes, please be patient"
echo ""

# Create the virtual environment and install dependencies
uv sync

if [ $? -ne 0 ]; then
    echo ""
    echo -e "${RED}❌ [Error] Dependency installation failed${NC}"
    echo "Please check your network connection and try again"
    exit 1
fi

echo ""
echo -e "${GREEN}[3/3] ✅ Checking the configuration file...${NC}"
echo ""

# Check the configuration file
if [ ! 
-f "config/config.yaml" ]; then echo -e "${YELLOW}⚠️ [警告] 未找到配置文件: config/config.yaml${NC}" echo "请确保配置文件存在" echo "" fi # 添加执行权限 chmod +x start-http.sh 2>/dev/null || true # 获取 UV 路径 UV_PATH=$(which uv) echo "" echo -e "${BOLD}╔════════════════════════════════════════╗${NC}" echo -e "${BOLD}║ 部署完成! ║${NC}" echo -e "${BOLD}╚════════════════════════════════════════╝${NC}" echo "" echo "📋 下一步操作:" echo "" echo " 1️⃣ 打开 Cherry Studio" echo " 2️⃣ 进入 设置 > MCP Servers > 添加服务器" echo " 3️⃣ 填入以下配置:" echo "" echo " 名称: TrendRadar" echo " 描述: 新闻热点聚合工具" echo " 类型: STDIO" echo -e " 命令: ${BLUE}${UV_PATH}${NC}" echo " 参数(每个占一行):" echo -e " ${BLUE}--directory${NC}" echo -e " ${BLUE}${PROJECT_ROOT}${NC}" echo -e " ${BLUE}run${NC}" echo -e " ${BLUE}python${NC}" echo -e " ${BLUE}-m${NC}" echo -e " ${BLUE}mcp_server.server${NC}" echo "" echo " 4️⃣ 保存并启用 MCP 开关" echo "" echo "📖 详细教程请查看: README-Cherry-Studio.md,本窗口别关,待会儿用于填入参数" echo "" ================================================ FILE: setup-windows-en.bat ================================================ @echo off setlocal enabledelayedexpansion echo ========================================== echo TrendRadar MCP Setup (Windows) echo ========================================== echo: REM Fix: Use script location instead of current working directory set "PROJECT_ROOT=%~dp0" REM Remove trailing backslash if "%PROJECT_ROOT:~-1%"=="\" set "PROJECT_ROOT=%PROJECT_ROOT:~0,-1%" echo Project Directory: %PROJECT_ROOT% echo: REM Change to project directory cd /d "%PROJECT_ROOT%" if %errorlevel% neq 0 ( echo [ERROR] Cannot access project directory pause exit /b 1 ) REM Validate project structure echo [0/4] Validating project structure... if not exist "pyproject.toml" ( echo [ERROR] pyproject.toml not found in: %PROJECT_ROOT% echo: echo This should not happen! Please check: echo 1. Is setup-windows.bat in the project root? echo 2. Was the project properly cloned/downloaded? 
    echo:
    echo Files in current directory:
    dir /b
    echo:
    pause
    exit /b 1
)
echo [OK] pyproject.toml found
echo:

REM Check Python
echo [1/4] Checking Python...
python --version >nul 2>&1
if %errorlevel% neq 0 (
    echo [ERROR] Python not detected. Please install Python 3.10+
    echo Download: https://www.python.org/downloads/
    pause
    exit /b 1
)
for /f "tokens=*" %%i in ('python --version') do echo [OK] %%i
echo:

REM Check UV
REM Nested checks below use delayed expansion, !errorlevel!, because the
REM percent form is expanded once, when the whole block is parsed.
echo [2/4] Checking UV...
where uv >nul 2>&1
if %errorlevel% neq 0 (
    echo UV not installed, installing automatically...
    echo:
    echo Trying installation method 1: PowerShell...
    powershell -ExecutionPolicy Bypass -Command "try { irm https://astral.sh/uv/install.ps1 | iex; exit 0 } catch { Write-Host 'PowerShell method failed'; exit 1 }"
    if !errorlevel! neq 0 (
        echo:
        echo Method 1 failed. Trying method 2: pip...
        python -m pip install --upgrade uv
        if !errorlevel! neq 0 (
            echo:
            echo [ERROR] Automatic installation failed
            echo:
            echo Please install UV manually using one of these methods:
            echo:
            echo Method 1 - pip:
            echo   python -m pip install uv
            echo:
            echo Method 2 - pipx:
            echo   pip install pipx
            echo   pipx install uv
            echo:
            echo Method 3 - Manual download:
            echo   Visit: https://docs.astral.sh/uv/getting-started/installation/
            echo:
            pause
            exit /b 1
        )
    )
    echo:
    echo [SUCCESS] UV installed successfully!
    echo:
    echo [IMPORTANT] Please restart your terminal:
    echo   1. Close this window
    echo   2. Open a new Command Prompt
    echo   3. Navigate to: %PROJECT_ROOT%
    echo   4. Run: setup-windows-en.bat
    echo:
    pause
    exit /b 0
) else (
    for /f "tokens=*" %%i in ('uv --version') do echo [OK] %%i
)
echo:
echo [3/4] Installing dependencies...
echo Working directory: %PROJECT_ROOT%
echo:

REM Ensure we're in the project directory
cd /d "%PROJECT_ROOT%"
uv sync
if %errorlevel% neq 0 (
    echo:
    echo [ERROR] Dependency installation failed
    echo:
    echo Troubleshooting steps:
    echo   1. Check your internet connection
    echo   2. Verify Python version ^>= 3.10: python --version
    echo   3. 
Try with verbose output: uv sync --verbose echo 4. Check if pyproject.toml is valid echo: echo Project directory: %PROJECT_ROOT% echo: pause exit /b 1 ) echo: echo [OK] Dependencies installed successfully echo: echo [4/4] Checking configuration file... if not exist "config\config.yaml" ( echo [WARNING] config\config.yaml not found if exist "config\config.example.yaml" ( echo: echo To create your configuration: echo 1. Copy: copy config\config.example.yaml config\config.yaml echo 2. Edit: notepad config\config.yaml echo 3. Add your API keys ) echo: ) else ( echo [OK] config\config.yaml exists ) echo: REM Get UV path for /f "tokens=*" %%i in ('where uv 2^>nul') do set "UV_PATH=%%i" if not defined UV_PATH ( set "UV_PATH=uv" ) echo: echo ========================================== echo Setup Complete! echo ========================================== echo: echo MCP Server Configuration for Claude Desktop: echo: echo Command: %UV_PATH% echo Working Directory: %PROJECT_ROOT% echo: echo Arguments (one per line): echo --directory echo %PROJECT_ROOT% echo run echo python echo -m echo mcp_server.server echo: echo Configuration guide: README-Cherry-Studio.md echo: echo: pause ================================================ FILE: setup-windows.bat ================================================ @echo off chcp 65001 >nul setlocal enabledelayedexpansion echo ========================================== echo TrendRadar MCP 一键部署 (Windows) echo ========================================== echo. REM 修复:使用脚本所在目录,而不是当前工作目录 set "PROJECT_ROOT=%~dp0" REM 移除末尾的反斜杠 if "%PROJECT_ROOT:~-1%"=="\" set "PROJECT_ROOT=%PROJECT_ROOT:~0,-1%" echo 📍 项目目录: %PROJECT_ROOT% echo. REM 切换到项目目录 cd /d "%PROJECT_ROOT%" if %errorlevel% neq 0 ( echo ❌ 无法访问项目目录 pause exit /b 1 ) REM 验证项目结构 echo [0/4] 🔍 验证项目结构... if not exist "pyproject.toml" ( echo ❌ 未找到 pyproject.toml 文件: %PROJECT_ROOT% echo. echo 请检查: echo 1. setup-windows.bat 是否在项目根目录? echo 2. 项目文件是否完整? echo. echo 当前目录内容: dir /b echo. 
    pause
    exit /b 1
)
echo ✅ pyproject.toml found
echo.

REM Check Python
echo [1/4] 🐍 Checking Python...
python --version >nul 2>&1
if %errorlevel% neq 0 (
    echo ❌ Python not detected, please install Python 3.10+ first
    echo Download: https://www.python.org/downloads/
    pause
    exit /b 1
)
for /f "tokens=*" %%i in ('python --version') do echo ✅ %%i
echo.

REM Check UV
REM Nested checks below use delayed expansion, !errorlevel!, because the
REM percent form is expanded once, when the whole block is parsed.
echo [2/4] 🔧 Checking UV...
where uv >nul 2>&1
if %errorlevel% neq 0 (
    echo UV is not installed, installing automatically...
    echo.
    echo Trying method 1: PowerShell install...
    powershell -ExecutionPolicy Bypass -Command "try { irm https://astral.sh/uv/install.ps1 | iex; exit 0 } catch { Write-Host 'PowerShell install failed'; exit 1 }"
    if !errorlevel! neq 0 (
        echo.
        echo Method 1 failed, trying method 2: pip install...
        python -m pip install --upgrade uv
        if !errorlevel! neq 0 (
            echo.
            echo ❌ Automatic installation failed
            echo.
            echo Please install UV manually using one of these methods:
            echo.
            echo Method 1 - pip:
            echo   python -m pip install uv
            echo.
            echo Method 2 - pipx:
            echo   pip install pipx
            echo   pipx install uv
            echo.
            echo Method 3 - Manual download:
            echo   Visit: https://docs.astral.sh/uv/getting-started/installation/
            echo.
            pause
            exit /b 1
        )
    )
    echo.
    echo ✅ UV installed successfully
    echo.
    echo ⚠️ Important: please follow these steps:
    echo   1. Close this window
    echo   2. Reopen Command Prompt or PowerShell
    echo   3. Return to the project directory: %PROJECT_ROOT%
    echo   4. Re-run this script: setup-windows.bat
    echo.
    pause
    exit /b 0
) else (
    for /f "tokens=*" %%i in ('uv --version') do echo ✅ %%i
)
echo.
echo [3/4] 📦 Installing project dependencies...
echo Working directory: %PROJECT_ROOT%
echo.

REM Make sure we run from the project directory
cd /d "%PROJECT_ROOT%"
uv sync
if %errorlevel% neq 0 (
    echo.
    echo ❌ Dependency installation failed
    echo.
    echo Possible causes:
    echo   1. Network connection problem
    echo   2. Incompatible Python version, requires ^>= 3.10
    echo   3. Malformed pyproject.toml
    echo.
    echo Troubleshooting:
    echo   - Check your network connection
    echo   - Verify the Python version: python --version
    echo   - Try verbose output: uv sync --verbose
    echo.
    echo Project directory: %PROJECT_ROOT%
    echo.
    pause
    exit /b 1
)
echo.
echo ✅ Dependencies installed successfully
echo.
echo [4/4] ⚙️ Checking the configuration file...
if not exist "config\config.yaml" (
    echo ⚠️ Configuration file not found: config\config.yaml
    if exist "config\config.example.yaml" (
        echo.
        echo To create the configuration file:
        echo   1. Copy: copy config\config.example.yaml config\config.yaml
        echo   2. Edit: notepad config\config.yaml
        echo   3. Fill in your API keys
    )
    echo.
) else (
    echo ✅ config\config.yaml exists
)
echo.
REM 获取 UV 路径 for /f "tokens=*" %%i in ('where uv 2^>nul') do set "UV_PATH=%%i" if not defined UV_PATH ( set "UV_PATH=uv" ) echo. echo ========================================== echo 部署完成! echo ========================================== echo. echo 📋 MCP 服务器配置信息(用于 Claude Desktop): echo. echo 命令: %UV_PATH% echo 工作目录: %PROJECT_ROOT% echo. echo 参数(逐行填入): echo --directory echo %PROJECT_ROOT% echo run echo python echo -m echo mcp_server.server echo. echo 📖 详细教程: README-Cherry-Studio.md echo. echo. pause ================================================ FILE: start-http.bat ================================================ @echo off chcp 65001 >nul echo ============================================================ echo TrendRadar MCP Server (HTTP 模式) echo ============================================================ echo. REM 检查虚拟环境 if not exist ".venv\Scripts\python.exe" ( echo ❌ [错误] 虚拟环境未找到 echo 请先运行 setup-windows.bat 或 setup-windows-en.bat 进行部署 echo. pause exit /b 1 ) echo [模式] HTTP (适合远程访问) echo [地址] http://localhost:3333/mcp echo [提示] 按 Ctrl+C 停止服务 echo. uv run python -m mcp_server.server --transport http --host 0.0.0.0 --port 3333 pause ================================================ FILE: start-http.sh ================================================ #!/bin/bash echo "╔════════════════════════════════════════╗" echo "║ TrendRadar MCP Server (HTTP 模式) ║" echo "╚════════════════════════════════════════╝" echo "" # 检查虚拟环境 if [ ! 
-d ".venv" ]; then echo "❌ [错误] 虚拟环境未找到" echo "请先运行 ./setup-mac.sh 进行部署" echo "" exit 1 fi echo "[模式] HTTP (适合远程访问)" echo "[地址] http://localhost:3333/mcp" echo "[提示] 按 Ctrl+C 停止服务" echo "" uv run python -m mcp_server.server --transport http --host 0.0.0.0 --port 3333 ================================================ FILE: trendradar/__init__.py ================================================ # coding=utf-8 """ TrendRadar - 热点新闻聚合与分析工具 使用方式: python -m trendradar # 模块执行 trendradar # 安装后执行 """ from trendradar.context import AppContext __version__ = "6.5.0" __all__ = ["AppContext", "__version__"] ================================================ FILE: trendradar/__main__.py ================================================ # coding=utf-8 """ TrendRadar 主程序 热点新闻聚合与分析工具 支持: python -m trendradar """ import argparse import copy import json import os import re import sys import webbrowser from datetime import datetime, timezone from pathlib import Path from typing import Dict, List, Tuple, Optional import requests from trendradar.context import AppContext from trendradar import __version__ from trendradar.core import load_config, parse_multi_account_config, validate_paired_configs from trendradar.core.analyzer import convert_keyword_stats_to_platform_stats from trendradar.crawler import DataFetcher from trendradar.storage import convert_crawl_results_to_news_data from trendradar.utils.time import DEFAULT_TIMEZONE, is_within_days, calculate_days_old from trendradar.ai import AIAnalyzer, AIAnalysisResult from trendradar.core.scheduler import ResolvedSchedule def _parse_version(version_str: str) -> Tuple[int, int, int]: """解析版本号字符串为元组""" try: parts = version_str.strip().split(".") if len(parts) >= 3: return int(parts[0]), int(parts[1]), int(parts[2]) return 0, 0, 0 except: return 0, 0, 0 def _compare_version(local: str, remote: str) -> str: """比较版本号,返回状态文字""" local_tuple = _parse_version(local) remote_tuple = _parse_version(remote) if local_tuple < remote_tuple: return "⚠️ 需要更新" 
elif local_tuple > remote_tuple: return "🔮 超前版本" else: return "✅ 已是最新" def _fetch_remote_version(version_url: str, proxy_url: Optional[str] = None) -> Optional[str]: """获取远程版本号""" try: proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", "Accept": "text/plain, */*", "Cache-Control": "no-cache", } response = requests.get(version_url, proxies=proxies, headers=headers, timeout=10) response.raise_for_status() return response.text.strip() except Exception as e: print(f"[版本检查] 获取远程版本失败: {e}") return None def _parse_config_versions(content: str) -> Dict[str, str]: """解析配置文件版本内容为字典""" versions = {} try: if not content: return versions for line in content.splitlines(): line = line.strip() if not line or "=" not in line: continue name, version = line.split("=", 1) versions[name.strip()] = version.strip() except Exception as e: print(f"[版本检查] 解析配置版本失败: {e}") return versions def check_all_versions( version_url: str, configs_version_url: Optional[str] = None, proxy_url: Optional[str] = None ) -> Tuple[bool, Optional[str]]: """ 统一版本检查:程序版本 + 配置文件版本 Args: version_url: 远程程序版本检查 URL configs_version_url: 远程配置文件版本检查 URL (返回格式: filename=version) proxy_url: 代理 URL Returns: (need_update, remote_version): 程序是否需要更新及远程版本号 """ # 获取远程版本 remote_version = _fetch_remote_version(version_url, proxy_url) # 获取远程配置版本(如果有提供 URL) remote_config_versions = {} if configs_version_url: content = _fetch_remote_version(configs_version_url, proxy_url) if content: remote_config_versions = _parse_config_versions(content) print("=" * 60) print("版本检查") print("=" * 60) if remote_version: print(f"远程程序版本: {remote_version}") else: print("远程程序版本: 获取失败") if configs_version_url: if remote_config_versions: print(f"远程配置清单: 获取成功 ({len(remote_config_versions)} 个文件)") else: print("远程配置清单: 获取失败或为空") print("-" * 60) program_status = _compare_version(__version__, remote_version) if remote_version else "(无法比较)" 
print(f" 主程序版本: {__version__} {program_status}") config_files = [ Path("config/config.yaml"), Path("config/timeline.yaml"), Path("config/frequency_words.txt"), Path("config/ai_interests.txt"), Path("config/ai_analysis_prompt.txt"), Path("config/ai_translation_prompt.txt"), ] version_pattern = re.compile(r"Version:\s*(\d+\.\d+\.\d+)", re.IGNORECASE) for config_file in config_files: if not config_file.exists(): print(f" {config_file.name}: 文件不存在") continue try: with open(config_file, "r", encoding="utf-8") as f: local_version = None for i, line in enumerate(f): if i >= 20: break match = version_pattern.search(line) if match: local_version = match.group(1) break # 获取该文件的远程版本 target_remote_version = remote_config_versions.get(config_file.name) if local_version: if target_remote_version: status = _compare_version(local_version, target_remote_version) print(f" {config_file.name}: {local_version} {status}") else: print(f" {config_file.name}: {local_version} (未找到远程版本)") else: print(f" {config_file.name}: 未找到本地版本号") except Exception as e: print(f" {config_file.name}: 读取失败 - {e}") print("=" * 60) # 返回程序版本的更新状态 if remote_version: need_update = _parse_version(__version__) < _parse_version(remote_version) return need_update, remote_version if need_update else None return False, None # === 主分析器 === class NewsAnalyzer: """新闻分析器""" # 模式策略定义 MODE_STRATEGIES = { "incremental": { "mode_name": "增量模式", "description": "增量模式(只关注新增新闻,无新增时不推送)", "report_type": "增量分析", "should_send_notification": True, }, "current": { "mode_name": "当前榜单模式", "description": "当前榜单模式(当前榜单匹配新闻 + 新增新闻区域 + 按时推送)", "report_type": "当前榜单", "should_send_notification": True, }, "daily": { "mode_name": "全天汇总模式", "description": "全天汇总模式(所有匹配新闻 + 新增新闻区域 + 按时推送)", "report_type": "全天汇总", "should_send_notification": True, }, } def __init__(self, config: Optional[Dict] = None): # 使用传入的配置或加载新配置 if config is None: print("正在加载配置...") config = load_config() print(f"TrendRadar v{__version__} 配置加载完成") print(f"监控平台数量: 
{len(config['PLATFORMS'])}") print(f"时区: {config.get('TIMEZONE', DEFAULT_TIMEZONE)}") # 创建应用上下文 self.ctx = AppContext(config) self.request_interval = self.ctx.config["REQUEST_INTERVAL"] self.report_mode = self.ctx.config["REPORT_MODE"] self.frequency_file = None self.filter_method = None # None=使用全局配置 ctx.filter_method self.interests_file = None # None=使用全局配置 ai_filter.interests_file self.rank_threshold = self.ctx.rank_threshold self.is_github_actions = os.environ.get("GITHUB_ACTIONS") == "true" self.is_docker_container = self._detect_docker_environment() self.update_info = None self.proxy_url = None self._setup_proxy() self.data_fetcher = DataFetcher(self.proxy_url) # 初始化存储管理器(使用 AppContext) self._init_storage_manager() # 注意:update_info 由 main() 函数设置,避免重复请求远程版本 def _init_storage_manager(self) -> None: """初始化存储管理器(使用 AppContext)""" # 获取数据保留天数(支持环境变量覆盖) env_retention = os.environ.get("STORAGE_RETENTION_DAYS", "").strip() if env_retention: # 环境变量覆盖配置 self.ctx.config["STORAGE"]["RETENTION_DAYS"] = int(env_retention) self.storage_manager = self.ctx.get_storage_manager() print(f"存储后端: {self.storage_manager.backend_name}") retention_days = self.ctx.config.get("STORAGE", {}).get("RETENTION_DAYS", 0) if retention_days > 0: print(f"数据保留天数: {retention_days} 天") def _detect_docker_environment(self) -> bool: """检测是否运行在 Docker 容器中""" try: if os.environ.get("DOCKER_CONTAINER") == "true": return True if os.path.exists("/.dockerenv"): return True return False except Exception: return False def _should_open_browser(self) -> bool: """判断是否应该打开浏览器""" return not self.is_github_actions and not self.is_docker_container def _setup_proxy(self) -> None: """设置代理配置""" if not self.is_github_actions and self.ctx.config["USE_PROXY"]: self.proxy_url = self.ctx.config["DEFAULT_PROXY"] print("本地环境,使用代理") elif not self.is_github_actions and not self.ctx.config["USE_PROXY"]: print("本地环境,未启用代理") else: print("GitHub Actions环境,不使用代理") def _set_update_info_from_config(self) -> None: 
"""从已缓存的远程版本设置更新信息(不再重复请求)""" try: version_url = self.ctx.config.get("VERSION_CHECK_URL", "") if not version_url: return remote_version = _fetch_remote_version(version_url, self.proxy_url) if remote_version: need_update = _parse_version(__version__) < _parse_version(remote_version) if need_update: self.update_info = { "current_version": __version__, "remote_version": remote_version, } except Exception as e: print(f"版本检查出错: {e}") def _get_mode_strategy(self) -> Dict: """获取当前模式的策略配置""" return self.MODE_STRATEGIES.get(self.report_mode, self.MODE_STRATEGIES["daily"]) def _has_notification_configured(self) -> bool: """检查是否配置了任何通知渠道""" cfg = self.ctx.config return any( [ cfg["FEISHU_WEBHOOK_URL"], cfg["DINGTALK_WEBHOOK_URL"], cfg["WEWORK_WEBHOOK_URL"], (cfg["TELEGRAM_BOT_TOKEN"] and cfg["TELEGRAM_CHAT_ID"]), ( cfg["EMAIL_FROM"] and cfg["EMAIL_PASSWORD"] and cfg["EMAIL_TO"] ), (cfg["NTFY_SERVER_URL"] and cfg["NTFY_TOPIC"]), cfg["BARK_URL"], cfg["SLACK_WEBHOOK_URL"], cfg["GENERIC_WEBHOOK_URL"], ] ) def _has_valid_content( self, stats: List[Dict], new_titles: Optional[Dict] = None ) -> bool: """检查是否有有效的新闻内容""" if self.report_mode == "incremental": # 增量模式:只要有匹配的新闻就推送 # count_word_frequency 已经确保只处理新增的新闻(包括当天第一次爬取的情况) has_matched_news = any(stat["count"] > 0 for stat in stats) return has_matched_news elif self.report_mode == "current": # current模式:只要stats有内容就说明有匹配的新闻 return any(stat["count"] > 0 for stat in stats) else: # 当日汇总模式下,检查是否有匹配的频率词新闻或新增新闻 has_matched_news = any(stat["count"] > 0 for stat in stats) has_new_news = bool( new_titles and any(len(titles) > 0 for titles in new_titles.values()) ) return has_matched_news or has_new_news def _prepare_ai_analysis_data( self, ai_mode: str, current_results: Optional[Dict] = None, current_id_to_name: Optional[Dict] = None, ) -> Tuple[List[Dict], Optional[Dict]]: """ 为 AI 分析准备指定模式的数据 Args: ai_mode: AI 分析模式 (daily/current/incremental) current_results: 当前抓取的结果(用于 incremental 模式) current_id_to_name: 当前的平台映射(用于 incremental 模式) Returns: 
Tuple[stats, id_to_name]: 统计数据和平台映射 """ try: word_groups, filter_words, global_filters = self.ctx.load_frequency_words(self.frequency_file) if ai_mode == "incremental": # incremental 模式:使用当前抓取的数据 if not current_results or not current_id_to_name: print("[AI] incremental 模式需要当前抓取数据,但未提供") return [], None # 准备当前时间信息 time_info = self.ctx.format_time() title_info = self._prepare_current_title_info(current_results, time_info) # 检测新增标题 new_titles = self.ctx.detect_new_titles(list(current_results.keys())) # 统计计算 stats, _ = self.ctx.count_frequency( current_results, word_groups, filter_words, current_id_to_name, title_info, new_titles, mode="incremental", global_filters=global_filters, quiet=True, ) # 如果是 platform 模式,转换数据结构 if self.ctx.display_mode == "platform" and stats: stats = convert_keyword_stats_to_platform_stats( stats, self.ctx.weight_config, self.ctx.rank_threshold, ) return stats, current_id_to_name elif ai_mode in ["daily", "current"]: # 加载历史数据 analysis_data = self._load_analysis_data(quiet=True) if not analysis_data: print(f"[AI] 无法加载历史数据用于 {ai_mode} 模式分析") return [], None ( all_results, id_to_name, title_info, new_titles, _, _, _, ) = analysis_data # 统计计算 stats, _ = self.ctx.count_frequency( all_results, word_groups, filter_words, id_to_name, title_info, new_titles, mode=ai_mode, global_filters=global_filters, quiet=True, ) # 如果是 platform 模式,转换数据结构 if self.ctx.display_mode == "platform" and stats: stats = convert_keyword_stats_to_platform_stats( stats, self.ctx.weight_config, self.ctx.rank_threshold, ) return stats, id_to_name else: print(f"[AI] 未知的 AI 模式: {ai_mode}") return [], None except Exception as e: print(f"[AI] 准备 {ai_mode} 模式数据时出错: {e}") if self.ctx.config.get("DEBUG", False): import traceback traceback.print_exc() return [], None def _run_ai_analysis( self, stats: List[Dict], rss_items: Optional[List[Dict]], mode: str, report_type: str, id_to_name: Optional[Dict], current_results: Optional[Dict] = None, schedule: ResolvedSchedule = None, 
standalone_data: Optional[Dict] = None, ) -> Optional[AIAnalysisResult]: """执行 AI 分析""" analysis_config = self.ctx.config.get("AI_ANALYSIS", {}) if not analysis_config.get("ENABLED", False): return None # 调度系统决策 if not schedule.analyze: print("[AI] 调度器: 当前时间段不执行 AI 分析") return None if schedule.once_analyze and schedule.period_key: scheduler = self.ctx.create_scheduler() date_str = self.ctx.format_date() if scheduler.already_executed(schedule.period_key, "analyze", date_str): print(f"[AI] 调度器: 时间段 {schedule.period_name or schedule.period_key} 今天已分析过,跳过") return None else: print(f"[AI] 调度器: 时间段 {schedule.period_name or schedule.period_key} 今天首次分析") print("[AI] 正在进行 AI 分析...") try: ai_config = self.ctx.config.get("AI", {}) debug_mode = self.ctx.config.get("DEBUG", False) analyzer = AIAnalyzer(ai_config, analysis_config, self.ctx.get_time, debug=debug_mode) # 确定 AI 分析使用的模式 ai_mode_config = analysis_config.get("MODE", "follow_report") if ai_mode_config == "follow_report": # 跟随推送报告模式 ai_mode = mode ai_stats = stats ai_id_to_name = id_to_name elif ai_mode_config in ["daily", "current", "incremental"]: # 使用独立配置的模式,需要重新准备数据 ai_mode = ai_mode_config if ai_mode != mode: print(f"[AI] 使用独立分析模式: {ai_mode} (推送模式: {mode})") print(f"[AI] 正在准备 {ai_mode} 模式的数据...") # 根据 AI 模式重新准备数据 ai_stats, ai_id_to_name = self._prepare_ai_analysis_data( ai_mode, current_results, id_to_name ) if not ai_stats: print(f"[AI] 警告: 无法准备 {ai_mode} 模式的数据,回退到推送模式数据") ai_stats = stats ai_id_to_name = id_to_name ai_mode = mode else: ai_stats = stats ai_id_to_name = id_to_name else: # 配置错误,回退到跟随模式 print(f"[AI] 警告: 无效的 ai_analysis.mode 配置 '{ai_mode_config}',使用推送模式 '{mode}'") ai_mode = mode ai_stats = stats ai_id_to_name = id_to_name # 提取平台列表 platforms = list(ai_id_to_name.values()) if ai_id_to_name else [] # 提取关键词列表 keywords = [s.get("word", "") for s in ai_stats if s.get("word")] if ai_stats else [] # 确定报告类型 if ai_mode != mode: # 根据 AI 模式确定报告类型 ai_report_type = { "daily": "当日汇总", "current": "当前榜单", 
                    "incremental": "增量更新"
                }.get(ai_mode, report_type)
            else:
                ai_report_type = report_type

            result = analyzer.analyze(
                stats=ai_stats,
                rss_stats=rss_items,
                report_mode=ai_mode,
                report_type=ai_report_type,
                platforms=platforms,
                keywords=keywords,
                standalone_data=standalone_data,
            )

            # Record the mode used for the AI analysis
            if result.success:
                result.ai_mode = ai_mode
                if result.error:
                    # Succeeded with a warning (e.g. JSON parsing failed but the raw text was used)
                    print(f"[AI] Analysis finished (with warning: {result.error})")
                else:
                    print("[AI] Analysis finished")

                # Record the AI analysis run for once-per-period scheduling
                if schedule.once_analyze and schedule.period_key:
                    scheduler = self.ctx.create_scheduler()
                    date_str = self.ctx.format_date()
                    scheduler.record_execution(schedule.period_key, "analyze", date_str)
            else:
                print(f"[AI] Analysis failed: {result.error}")

            return result
        except Exception as e:
            import traceback
            error_type = type(e).__name__
            error_msg = str(e)
            # Truncate overly long error messages
            if len(error_msg) > 200:
                error_msg = error_msg[:200] + "..."
            print(f"[AI] Analysis error ({error_type}): {error_msg}")
            # Write the full traceback to stderr (sys is imported at module level)
            print("[AI] Full traceback:", file=sys.stderr)
            traceback.print_exc(file=sys.stderr)
            return AIAnalysisResult(success=False, error=f"{error_type}: {error_msg}")

    def _load_analysis_data(
        self,
        quiet: bool = False,
    ) -> Optional[Tuple[Dict, Dict, Dict, Dict, List, List, List]]:
        """Unified data loading and preprocessing; filters historical data by the currently monitored platform list.

        Returns a 7-tuple: (all_results, id_to_name, title_info, new_titles,
        word_groups, filter_words, global_filters), or None on failure.
        """
        try:
            # Get the IDs of the currently configured monitored platforms
            current_platform_ids = self.ctx.platform_ids
            if not quiet:
                print(f"Currently monitored platforms: {current_platform_ids}")

            all_results, id_to_name, title_info = self.ctx.read_today_titles(
                current_platform_ids, quiet=quiet
            )

            if not all_results:
                print("No data found for today")
                return None

            total_titles = sum(len(titles) for titles in all_results.values())
            if not quiet:
                print(f"Read {total_titles} titles (filtered by currently monitored platforms)")

            new_titles = self.ctx.detect_new_titles(current_platform_ids, quiet=quiet)
            word_groups, filter_words, global_filters = self.ctx.load_frequency_words(self.frequency_file)

            return (
                all_results,
                id_to_name,
                title_info,
                new_titles,
                word_groups,
                filter_words,
                global_filters,
            )
        except Exception as e:
            print(f"Failed to load data: {e}")
            return None

    def 
_prepare_current_title_info(self, results: Dict, time_info: str) -> Dict: """从当前抓取结果构建标题信息""" title_info = {} for source_id, titles_data in results.items(): title_info[source_id] = {} for title, title_data in titles_data.items(): ranks = title_data.get("ranks", []) url = title_data.get("url", "") mobile_url = title_data.get("mobileUrl", "") title_info[source_id][title] = { "first_time": time_info, "last_time": time_info, "count": 1, "ranks": ranks, "url": url, "mobileUrl": mobile_url, } return title_info def _prepare_standalone_data( self, results: Dict, id_to_name: Dict, title_info: Optional[Dict] = None, rss_items: Optional[List[Dict]] = None, ) -> Optional[Dict]: """ 从原始数据中提取独立展示区数据 纯数据准备方法,不检查 display.regions.standalone 开关。 各消费者自行决定是否使用: - AI 分析:由 ai.include_standalone 控制 - 通知推送:由 display.regions.standalone 控制(在 dispatcher 层门控) - HTML 报告:始终包含(如果有数据) Args: results: 原始爬取结果 {platform_id: {title: title_data}} id_to_name: 平台 ID 到名称的映射 title_info: 标题元信息(含排名历史、时间等) rss_items: RSS 条目列表 Returns: 独立展示数据字典,如果未配置数据源返回 None """ display_config = self.ctx.config.get("DISPLAY", {}) standalone_config = display_config.get("STANDALONE", {}) platform_ids = standalone_config.get("PLATFORMS", []) rss_feed_ids = standalone_config.get("RSS_FEEDS", []) max_items = standalone_config.get("MAX_ITEMS", 20) if not platform_ids and not rss_feed_ids: return None standalone_data = { "platforms": [], "rss_feeds": [], } # 找出最新批次时间(类似 current 模式的过滤逻辑) latest_time = None if title_info: for source_titles in title_info.values(): for title_data in source_titles.values(): last_time = title_data.get("last_time", "") if last_time: if latest_time is None or last_time > latest_time: latest_time = last_time # 提取热榜平台数据 for platform_id in platform_ids: if platform_id not in results: continue platform_name = id_to_name.get(platform_id, platform_id) platform_titles = results[platform_id] items = [] for title, title_data in platform_titles.items(): # 获取元信息(如果有 title_info) meta = {} if title_info and platform_id 
in title_info and title in title_info[platform_id]: meta = title_info[platform_id][title] # 只保留当前在榜的话题(last_time 等于最新时间) if latest_time and meta: if meta.get("last_time") != latest_time: continue # 使用当前热榜的排名数据(title_data)进行排序 # title_data 包含的是爬虫返回的当前排名,用于保证独立展示区的顺序与热榜一致 current_ranks = title_data.get("ranks", []) current_rank = current_ranks[-1] if current_ranks else 0 # 用于显示的排名范围:合并历史排名和当前排名 historical_ranks = meta.get("ranks", []) if meta else [] # 合并去重,保持顺序 all_ranks = historical_ranks.copy() for rank in current_ranks: if rank not in all_ranks: all_ranks.append(rank) display_ranks = all_ranks if all_ranks else current_ranks item = { "title": title, "url": title_data.get("url", ""), "mobileUrl": title_data.get("mobileUrl", ""), "rank": current_rank, # 用于排序的当前排名 "ranks": display_ranks, # 用于显示的排名范围(历史+当前) "first_time": meta.get("first_time", ""), "last_time": meta.get("last_time", ""), "count": meta.get("count", 1), "rank_timeline": meta.get("rank_timeline", []), } items.append(item) # 按当前排名排序 items.sort(key=lambda x: x["rank"] if x["rank"] > 0 else 9999) # 限制条数 if max_items > 0: items = items[:max_items] if items: standalone_data["platforms"].append({ "id": platform_id, "name": platform_name, "items": items, }) # 提取 RSS 数据 if rss_items and rss_feed_ids: # 按 feed_id 分组 feed_items_map = {} for item in rss_items: feed_id = item.get("feed_id", "") if feed_id in rss_feed_ids: if feed_id not in feed_items_map: feed_items_map[feed_id] = { "name": item.get("feed_name", feed_id), "items": [], } feed_items_map[feed_id]["items"].append({ "title": item.get("title", ""), "url": item.get("url", ""), "published_at": item.get("published_at", ""), "author": item.get("author", ""), }) # 限制条数并添加到结果 for feed_id in rss_feed_ids: if feed_id in feed_items_map: feed_data = feed_items_map[feed_id] items = feed_data["items"] if max_items > 0: items = items[:max_items] if items: standalone_data["rss_feeds"].append({ "id": feed_id, "name": feed_data["name"], "items": items, }) # 如果没有任何数据,返回 
None if not standalone_data["platforms"] and not standalone_data["rss_feeds"]: return None return standalone_data def _run_analysis_pipeline( self, data_source: Dict, mode: str, title_info: Dict, new_titles: Dict, word_groups: List[Dict], filter_words: List[str], id_to_name: Dict, failed_ids: Optional[List] = None, global_filters: Optional[List[str]] = None, quiet: bool = False, rss_items: Optional[List[Dict]] = None, rss_new_items: Optional[List[Dict]] = None, standalone_data: Optional[Dict] = None, schedule: ResolvedSchedule = None, rss_new_urls: Optional[set] = None, ) -> Tuple[List[Dict], Optional[str], Optional[AIAnalysisResult], Optional[List[Dict]]]: """统一的分析流水线:数据处理 → 统计计算(关键词/AI筛选)→ AI分析 → HTML生成""" # 根据筛选策略选择数据处理方式 if self.filter_method == "ai": # === AI 筛选策略 === print("[筛选] 使用 AI 智能筛选策略") ai_filter_result = self.ctx.run_ai_filter(interests_file=self.interests_file) if ai_filter_result and ai_filter_result.success: print(f"[筛选] AI 筛选完成: {ai_filter_result.total_matched} 条匹配, {len(ai_filter_result.tags)} 个标签") # 转换为与关键词匹配相同的数据结构 stats, ai_rss_stats = self.ctx.convert_ai_filter_to_report_data( ai_filter_result, mode=mode, new_titles=new_titles, rss_new_urls=rss_new_urls, ) total_titles = sum(len(titles) for titles in data_source.values()) # AI 筛选的 RSS 结果替换关键词匹配的 RSS 结果 if ai_rss_stats: rss_items = ai_rss_stats else: # AI 筛选失败,回退到关键词匹配 error_msg = ai_filter_result.error if ai_filter_result else "未知错误" print(f"[筛选] AI 筛选失败: {error_msg},回退到关键词匹配") stats, total_titles = self.ctx.count_frequency( data_source, word_groups, filter_words, id_to_name, title_info, new_titles, mode=mode, global_filters=global_filters, quiet=quiet, ) else: # === 关键词匹配策略(默认)=== stats, total_titles = self.ctx.count_frequency( data_source, word_groups, filter_words, id_to_name, title_info, new_titles, mode=mode, global_filters=global_filters, quiet=quiet, ) # 如果是 platform 模式,转换数据结构 if self.ctx.display_mode == "platform" and stats: stats = convert_keyword_stats_to_platform_stats( stats, 
self.ctx.weight_config, self.ctx.rank_threshold, ) # AI 分析(如果启用,用于 HTML 报告) ai_result = None ai_config = self.ctx.config.get("AI_ANALYSIS", {}) if ai_config.get("ENABLED", False) and stats: # 获取模式策略来确定报告类型 mode_strategy = self._get_mode_strategy() report_type = mode_strategy["report_type"] ai_result = self._run_ai_analysis( stats, rss_items, mode, report_type, id_to_name, current_results=data_source, schedule=schedule, standalone_data=standalone_data ) # HTML生成(如果启用) html_file = None if self.ctx.config["STORAGE"]["FORMATS"]["HTML"]: html_file = self.ctx.generate_html( stats, total_titles, failed_ids=failed_ids, new_titles=new_titles, id_to_name=id_to_name, mode=mode, update_info=self.update_info if self.ctx.config["SHOW_VERSION_UPDATE"] else None, rss_items=rss_items, rss_new_items=rss_new_items, ai_analysis=ai_result, standalone_data=standalone_data, frequency_file=self.frequency_file, ) return stats, html_file, ai_result, rss_items def _send_notification_if_needed( self, stats: List[Dict], report_type: str, mode: str, failed_ids: Optional[List] = None, new_titles: Optional[Dict] = None, id_to_name: Optional[Dict] = None, html_file_path: Optional[str] = None, rss_items: Optional[List[Dict]] = None, rss_new_items: Optional[List[Dict]] = None, standalone_data: Optional[Dict] = None, ai_result: Optional[AIAnalysisResult] = None, current_results: Optional[Dict] = None, schedule: ResolvedSchedule = None, ) -> bool: """统一的通知发送逻辑,包含所有判断条件,支持热榜+RSS合并推送+AI分析+独立展示区""" has_notification = self._has_notification_configured() cfg = self.ctx.config # 检查是否有有效内容(热榜或RSS) has_news_content = self._has_valid_content(stats, new_titles) has_rss_content = bool(rss_items and len(rss_items) > 0) has_any_content = has_news_content or has_rss_content # 计算热榜匹配条数 news_count = sum(len(stat.get("titles", [])) for stat in stats) if stats else 0 rss_count = sum(stat.get("count", 0) for stat in rss_items) if rss_items else 0 if ( cfg["ENABLE_NOTIFICATION"] and has_notification and has_any_content 
): # 输出推送内容统计 content_parts = [] if news_count > 0: content_parts.append(f"热榜 {news_count} 条") if rss_count > 0: content_parts.append(f"RSS {rss_count} 条") total_count = news_count + rss_count print(f"[推送] 准备发送:{' + '.join(content_parts)},合计 {total_count} 条") # 调度系统决策 if not schedule.push: print("[推送] 调度器: 当前时间段不执行推送") return False if schedule.once_push and schedule.period_key: scheduler = self.ctx.create_scheduler() date_str = self.ctx.format_date() if scheduler.already_executed(schedule.period_key, "push", date_str): print(f"[推送] 调度器: 时间段 {schedule.period_name or schedule.period_key} 今天已推送过,跳过") return False else: print(f"[推送] 调度器: 时间段 {schedule.period_name or schedule.period_key} 今天首次推送") # AI 分析:优先使用传入的结果,避免重复分析 if ai_result is None: ai_config = cfg.get("AI_ANALYSIS", {}) if ai_config.get("ENABLED", False): ai_result = self._run_ai_analysis( stats, rss_items, mode, report_type, id_to_name, current_results=current_results, schedule=schedule ) # 准备报告数据 report_data = self.ctx.prepare_report(stats, failed_ids, new_titles, id_to_name, mode, frequency_file=self.frequency_file) # 是否发送版本更新信息 update_info_to_send = self.update_info if cfg["SHOW_VERSION_UPDATE"] else None # 使用 NotificationDispatcher 发送到所有渠道 dispatcher = self.ctx.create_notification_dispatcher() results = dispatcher.dispatch_all( report_data=report_data, report_type=report_type, update_info=update_info_to_send, proxy_url=self.proxy_url, mode=mode, html_file_path=html_file_path, rss_items=rss_items, rss_new_items=rss_new_items, ai_analysis=ai_result, standalone_data=standalone_data, ) if not results: print("未配置任何通知渠道,跳过通知发送") return False # 记录推送成功 if any(results.values()): if schedule.once_push and schedule.period_key: scheduler = self.ctx.create_scheduler() date_str = self.ctx.format_date() scheduler.record_execution(schedule.period_key, "push", date_str) return True elif cfg["ENABLE_NOTIFICATION"] and not has_notification: print("⚠️ 警告:通知功能已启用但未配置任何通知渠道,将跳过通知发送") elif not cfg["ENABLE_NOTIFICATION"]: 
            print(f"Skipping {report_type} notification: notifications are disabled")
        elif (
            cfg["ENABLE_NOTIFICATION"]
            and has_notification
            and not has_any_content
        ):
            mode_strategy = self._get_mode_strategy()
            if self.report_mode == "incremental":
                if not has_rss_content:
                    print("Skipping notification: no matching news or RSS items detected in incremental mode")
                else:
                    print("Skipping notification: no news matched the keywords in incremental mode")
            else:
                print(
                    f"Skipping notification: no matching news detected in {mode_strategy['mode_name']}"
                )

        return False

    def _initialize_and_check_config(self) -> None:
        """Common initialization and configuration check"""
        now = self.ctx.get_time()
        print(f"Current Beijing time: {now.strftime('%Y-%m-%d %H:%M:%S')}")

        if not self.ctx.config["ENABLE_CRAWLER"]:
            print("Crawler is disabled (ENABLE_CRAWLER=False); exiting")
            return

        has_notification = self._has_notification_configured()
        if not self.ctx.config["ENABLE_NOTIFICATION"]:
            print("Notifications are disabled (ENABLE_NOTIFICATION=False); only data will be crawled")
        elif not has_notification:
            print("No notification channel is configured; only data will be crawled, no notification will be sent")
        else:
            print("Notifications are enabled; notifications will be sent")

        mode_strategy = self._get_mode_strategy()
        print(f"Report mode: {self.report_mode}")
        print(f"Run mode: {mode_strategy['description']}")

    def _crawl_data(self) -> Tuple[Dict, Dict, List]:
        """Run the data crawl"""
        ids = []
        for platform in self.ctx.platforms:
            if "name" in platform:
                ids.append((platform["id"], platform["name"]))
            else:
                ids.append(platform["id"])

        print(
            f"Configured platforms: {[p.get('name', p['id']) for p in self.ctx.platforms]}"
        )
        print(f"Starting crawl, request interval {self.request_interval} ms")
        Path("output").mkdir(parents=True, exist_ok=True)

        results, id_to_name, failed_ids = self.data_fetcher.crawl_websites(
            ids, self.request_interval
        )

        # Convert to NewsData format and save to the storage backend
        crawl_time = self.ctx.format_time()
        crawl_date = self.ctx.format_date()
        news_data = convert_crawl_results_to_news_data(
            results, id_to_name, failed_ids, crawl_time, crawl_date
        )

        # Save to the storage backend (SQLite)
        if self.storage_manager.save_news_data(news_data):
            print(f"Data saved to storage backend: {self.storage_manager.backend_name}")

        # Save a TXT snapshot (if enabled)
        txt_file = self.storage_manager.save_txt_snapshot(news_data)
        if txt_file:
            print(f"TXT snapshot saved: {txt_file}")

        return results, id_to_name, failed_ids

    def _crawl_rss_data(self) -> 
Tuple[Optional[List[Dict]], Optional[List[Dict]], Optional[List[Dict]], set]: """ 执行 RSS 数据抓取 Returns: (rss_items, rss_new_items, raw_rss_items, rss_new_urls) 元组: - rss_items: 统计条目列表(按模式处理,用于统计区块) - rss_new_items: 新增条目列表(用于新增区块) - raw_rss_items: 原始 RSS 条目列表(用于独立展示区) - rss_new_urls: 原始新增 RSS 条目的 URL 集合(用于 AI 模式 is_new 检测) 如果未启用或失败返回 (None, None, None, set()) """ if not self.ctx.rss_enabled: return None, None, None, set() rss_feeds = self.ctx.rss_feeds if not rss_feeds: print("[RSS] 未配置任何 RSS 源") return None, None, None, set() try: from trendradar.crawler.rss import RSSFetcher, RSSFeedConfig # 构建 RSS 源配置 feeds = [] for feed_config in rss_feeds: # 读取并验证单个 feed 的 max_age_days(可选) max_age_days_raw = feed_config.get("max_age_days") max_age_days = None if max_age_days_raw is not None: try: max_age_days = int(max_age_days_raw) if max_age_days < 0: feed_id = feed_config.get("id", "unknown") print(f"[警告] RSS feed '{feed_id}' 的 max_age_days 为负数,将使用全局默认值") max_age_days = None except (ValueError, TypeError): feed_id = feed_config.get("id", "unknown") print(f"[警告] RSS feed '{feed_id}' 的 max_age_days 格式错误:{max_age_days_raw}") max_age_days = None feed = RSSFeedConfig( id=feed_config.get("id", ""), name=feed_config.get("name", ""), url=feed_config.get("url", ""), max_items=feed_config.get("max_items", 50), enabled=feed_config.get("enabled", True), max_age_days=max_age_days, # None=使用全局,0=禁用,>0=覆盖 ) if feed.id and feed.url and feed.enabled: feeds.append(feed) if not feeds: print("[RSS] 没有启用的 RSS 源") return None, None, None, set() # 创建抓取器 rss_config = self.ctx.rss_config # RSS 代理:优先使用 RSS 专属代理,否则使用爬虫默认代理 rss_proxy_url = rss_config.get("PROXY_URL", "") or self.proxy_url or "" # 获取配置的时区 timezone = self.ctx.config.get("TIMEZONE", DEFAULT_TIMEZONE) # 获取新鲜度过滤配置 freshness_config = rss_config.get("FRESHNESS_FILTER", {}) freshness_enabled = freshness_config.get("ENABLED", True) default_max_age_days = freshness_config.get("MAX_AGE_DAYS", 3) fetcher = RSSFetcher( feeds=feeds, 
request_interval=rss_config.get("REQUEST_INTERVAL", 2000), timeout=rss_config.get("TIMEOUT", 15), use_proxy=rss_config.get("USE_PROXY", False), proxy_url=rss_proxy_url, timezone=timezone, freshness_enabled=freshness_enabled, default_max_age_days=default_max_age_days, ) # 抓取数据 rss_data = fetcher.fetch_all() # 保存到存储后端 if self.storage_manager.save_rss_data(rss_data): print(f"[RSS] 数据已保存到存储后端") # 处理 RSS 数据(按模式过滤)并返回用于合并推送 return self._process_rss_data_by_mode(rss_data) else: print(f"[RSS] 数据保存失败") return None, None, None, set() except ImportError as e: print(f"[RSS] 缺少依赖: {e}") print("[RSS] 请安装 feedparser: pip install feedparser") return None, None, None, set() except Exception as e: print(f"[RSS] 抓取失败: {e}") return None, None, None, set() def _process_rss_data_by_mode(self, rss_data) -> Tuple[Optional[List[Dict]], Optional[List[Dict]], Optional[List[Dict]], set]: """ 按报告模式处理 RSS 数据,返回与热榜相同格式的统计结构 三种模式: - daily: 当日汇总,统计=当天所有条目,新增=本次新增条目 - current: 当前榜单,统计=当前榜单条目,新增=本次新增条目 - incremental: 增量模式,统计=新增条目,新增=无 Args: rss_data: 当前抓取的 RSSData 对象 Returns: (rss_stats, rss_new_stats, raw_rss_items, rss_new_urls) 元组: - rss_stats: RSS 关键词统计列表(与热榜 stats 格式一致) - rss_new_stats: RSS 新增关键词统计列表(与热榜 stats 格式一致) - raw_rss_items: 原始 RSS 条目列表(用于独立展示区) - rss_new_urls: 原始新增 RSS 条目的 URL 集合(未经关键词过滤,用于 AI 模式 is_new 检测) """ from trendradar.core.analyzer import count_rss_frequency # 从 display.regions.rss 统一控制 RSS 分析和展示 rss_display_enabled = self.ctx.config.get("DISPLAY", {}).get("REGIONS", {}).get("RSS", True) # 加载关键词配置 try: word_groups, filter_words, global_filters = self.ctx.load_frequency_words(self.frequency_file) except FileNotFoundError: word_groups, filter_words, global_filters = [], [], [] timezone = self.ctx.timezone max_news_per_keyword = self.ctx.config.get("MAX_NEWS_PER_KEYWORD", 0) sort_by_position_first = self.ctx.config.get("SORT_BY_POSITION_FIRST", False) rss_stats = None rss_new_stats = None raw_rss_items = None # 原始 RSS 条目列表(用于独立展示区) rss_new_urls = set() # 原始新增 RSS URLs(未经关键词过滤) # 
        # 1. Fetch the raw items first (used by the standalone display region; not affected by display.regions.rss)
        # Fetch the raw items according to the report mode
        if self.report_mode == "incremental":
            new_items_dict = self.storage_manager.detect_new_rss_items(rss_data)
            if new_items_dict:
                raw_rss_items = self._convert_rss_items_to_list(new_items_dict, rss_data.id_to_name)
        elif self.report_mode == "current":
            latest_data = self.storage_manager.get_latest_rss_data(rss_data.date)
            if latest_data:
                raw_rss_items = self._convert_rss_items_to_list(latest_data.items, latest_data.id_to_name)
        else:  # daily
            all_data = self.storage_manager.get_rss_data(rss_data.date)
            if all_data:
                raw_rss_items = self._convert_rss_items_to_list(all_data.items, all_data.id_to_name)

        # If RSS display is disabled, skip keyword analysis and return only the raw items for the standalone display region
        if not rss_display_enabled:
            return None, None, raw_rss_items, rss_new_urls

        # 2. Fetch the new items (used for statistics)
        new_items_dict = self.storage_manager.detect_new_rss_items(rss_data)
        new_items_list = None
        if new_items_dict:
            new_items_list = self._convert_rss_items_to_list(new_items_dict, rss_data.id_to_name)
            if new_items_list:
                print(f"[RSS] Detected {len(new_items_list)} new items")
                # Collect the raw new URLs (before keyword filtering; used for is_new detection in AI mode)
                rss_new_urls = {item["url"] for item in new_items_list if item.get("url")}

        # 3.
根据模式获取统计条目 if self.report_mode == "incremental": # 增量模式:统计条目就是新增条目 if not new_items_list: print("[RSS] 增量模式:没有新增 RSS 条目") return None, None, raw_rss_items, rss_new_urls rss_stats, total = count_rss_frequency( rss_items=new_items_list, word_groups=word_groups, filter_words=filter_words, global_filters=global_filters, new_items=new_items_list, # 增量模式所有都是新增 max_news_per_keyword=max_news_per_keyword, sort_by_position_first=sort_by_position_first, timezone=timezone, rank_threshold=self.rank_threshold, quiet=False, ) if not rss_stats: print("[RSS] 增量模式:关键词匹配后没有内容") # 即使关键词匹配为空,也返回原始条目用于独立展示区 return None, None, raw_rss_items, rss_new_urls elif self.report_mode == "current": # 当前榜单模式:统计=当前榜单所有条目 # raw_rss_items 已在前面获取 if not raw_rss_items: print("[RSS] 当前榜单模式:没有 RSS 数据") return None, None, None, rss_new_urls rss_stats, total = count_rss_frequency( rss_items=raw_rss_items, word_groups=word_groups, filter_words=filter_words, global_filters=global_filters, new_items=new_items_list, # 标记新增 max_news_per_keyword=max_news_per_keyword, sort_by_position_first=sort_by_position_first, timezone=timezone, rank_threshold=self.rank_threshold, quiet=False, ) if not rss_stats: print("[RSS] 当前榜单模式:关键词匹配后没有内容") # 即使关键词匹配为空,也返回原始条目用于独立展示区 return None, None, raw_rss_items, rss_new_urls # 生成新增统计 if new_items_list: rss_new_stats, _ = count_rss_frequency( rss_items=new_items_list, word_groups=word_groups, filter_words=filter_words, global_filters=global_filters, new_items=new_items_list, max_news_per_keyword=max_news_per_keyword, sort_by_position_first=sort_by_position_first, timezone=timezone, rank_threshold=self.rank_threshold, quiet=True, ) else: # daily 模式:统计=当天所有条目 # raw_rss_items 已在前面获取 if not raw_rss_items: print("[RSS] 当日汇总模式:没有 RSS 数据") return None, None, None, rss_new_urls rss_stats, total = count_rss_frequency( rss_items=raw_rss_items, word_groups=word_groups, filter_words=filter_words, global_filters=global_filters, new_items=new_items_list, # 标记新增 
max_news_per_keyword=max_news_per_keyword, sort_by_position_first=sort_by_position_first, timezone=timezone, rank_threshold=self.rank_threshold, quiet=False, ) if not rss_stats: print("[RSS] 当日汇总模式:关键词匹配后没有内容") # 即使关键词匹配为空,也返回原始条目用于独立展示区 return None, None, raw_rss_items, rss_new_urls # 生成新增统计 if new_items_list: rss_new_stats, _ = count_rss_frequency( rss_items=new_items_list, word_groups=word_groups, filter_words=filter_words, global_filters=global_filters, new_items=new_items_list, max_news_per_keyword=max_news_per_keyword, sort_by_position_first=sort_by_position_first, timezone=timezone, rank_threshold=self.rank_threshold, quiet=True, ) return rss_stats, rss_new_stats, raw_rss_items, rss_new_urls def _convert_rss_items_to_list(self, items_dict: Dict, id_to_name: Dict) -> List[Dict]: """将 RSS 条目字典转换为列表格式,并应用新鲜度过滤(用于推送)""" rss_items = [] filtered_count = 0 filtered_details = [] # 用于 DEBUG 模式下的详细日志 # 获取新鲜度过滤配置 rss_config = self.ctx.rss_config freshness_config = rss_config.get("FRESHNESS_FILTER", {}) freshness_enabled = freshness_config.get("ENABLED", True) default_max_age_days = freshness_config.get("MAX_AGE_DAYS", 3) timezone = self.ctx.config.get("TIMEZONE", DEFAULT_TIMEZONE) debug_mode = self.ctx.config.get("DEBUG", False) # 构建 feed_id -> max_age_days 的映射 feed_max_age_map = {} for feed_cfg in self.ctx.rss_feeds: feed_id = feed_cfg.get("id", "") max_age = feed_cfg.get("max_age_days") if max_age is not None: try: feed_max_age_map[feed_id] = int(max_age) except (ValueError, TypeError): pass for feed_id, items in items_dict.items(): # 确定此 feed 的 max_age_days max_days = feed_max_age_map.get(feed_id) if max_days is None: max_days = default_max_age_days for item in items: # 应用新鲜度过滤(仅在启用时) if freshness_enabled and max_days > 0: if item.published_at and not is_within_days(item.published_at, max_days, timezone): filtered_count += 1 # 记录详细信息用于 DEBUG 模式 if debug_mode: days_old = calculate_days_old(item.published_at, timezone) feed_name = id_to_name.get(feed_id, feed_id) 
filtered_details.append({ "title": item.title[:50] + "..." if len(item.title) > 50 else item.title, "feed": feed_name, "days_old": days_old, "max_days": max_days, }) continue # 跳过超过指定天数的文章 rss_items.append({ "title": item.title, "feed_id": feed_id, "feed_name": id_to_name.get(feed_id, feed_id), "url": item.url, "published_at": item.published_at, "summary": item.summary, "author": item.author, }) # 输出过滤统计 if filtered_count > 0: print(f"[RSS] 新鲜度过滤:跳过 {filtered_count} 篇超过指定天数的旧文章(仍保留在数据库中)") # DEBUG 模式下显示详细信息 if debug_mode and filtered_details: print(f"[RSS] 被过滤的文章详情(共 {len(filtered_details)} 篇):") for detail in filtered_details[:10]: # 最多显示 10 条 days_str = f"{detail['days_old']:.1f}" if detail['days_old'] else "未知" print(f" - [{days_str}天前] [{detail['feed']}] {detail['title']} (限制: {detail['max_days']}天)") if len(filtered_details) > 10: print(f" ... 还有 {len(filtered_details) - 10} 篇被过滤") return rss_items def _filter_rss_by_keywords(self, rss_items: List[Dict]) -> List[Dict]: """使用关键词文件过滤 RSS 条目""" try: word_groups, filter_words, global_filters = self.ctx.load_frequency_words(self.frequency_file) if word_groups or filter_words or global_filters: from trendradar.core.frequency import matches_word_groups filtered_items = [] for item in rss_items: title = item.get("title", "") if matches_word_groups(title, word_groups, filter_words, global_filters): filtered_items.append(item) original_count = len(rss_items) rss_items = filtered_items print(f"[RSS] 关键词过滤后剩余 {len(rss_items)}/{original_count} 条") if not rss_items: print("[RSS] 关键词过滤后没有匹配内容") return [] except FileNotFoundError: # 关键词文件不存在时跳过过滤 pass return rss_items def _generate_rss_html_report(self, rss_items: list, feeds_info: dict) -> str: """生成 RSS HTML 报告""" try: from trendradar.report.rss_html import render_rss_html_content from pathlib import Path html_content = render_rss_html_content( rss_items=rss_items, total_count=len(rss_items), feeds_info=feeds_info, get_time_func=self.ctx.get_time, ) # 保存 HTML 
文件(扁平化结构:output/html/日期/) date_folder = self.ctx.format_date() time_filename = self.ctx.format_time() output_dir = Path("output") / "html" / date_folder output_dir.mkdir(parents=True, exist_ok=True) file_path = output_dir / f"rss_{time_filename}.html" with open(file_path, "w", encoding="utf-8") as f: f.write(html_content) print(f"[RSS] HTML 报告已生成: {file_path}") return str(file_path) except Exception as e: print(f"[RSS] 生成 HTML 报告失败: {e}") return None def _execute_mode_strategy( self, mode_strategy: Dict, results: Dict, id_to_name: Dict, failed_ids: List, rss_items: Optional[List[Dict]] = None, rss_new_items: Optional[List[Dict]] = None, raw_rss_items: Optional[List[Dict]] = None, rss_new_urls: Optional[set] = None, ) -> Optional[str]: """执行模式特定逻辑,支持热榜+RSS合并推送 简化后的逻辑: - 每次运行都生成 HTML 报告(时间戳快照 + latest/{mode}.html + index.html) - 根据模式发送通知 """ # 调度系统 scheduler = self.ctx.create_scheduler() schedule = scheduler.resolve() # 使用 schedule 决定的 report_mode 覆盖全局配置 effective_mode = schedule.report_mode if effective_mode != self.report_mode: print(f"[调度] 报告模式覆盖: {self.report_mode} -> {effective_mode}") self.report_mode = effective_mode # 重新获取 mode_strategy,确保 report_type 与覆盖后的 report_mode 一致 mode_strategy = self._get_mode_strategy() # 使用 schedule 决定的 frequency_file 覆盖默认值 self.frequency_file = schedule.frequency_file # 使用 schedule 决定的筛选策略覆盖默认值 self.filter_method = schedule.filter_method or self.ctx.filter_method # 使用 schedule 决定的 AI 筛选兴趣文件覆盖默认值 self.interests_file = schedule.interests_file # 如果调度器说不采集,则直接跳过 if not schedule.collect: print("[调度] 当前时间段不执行数据采集,跳过分析流水线") return None # 获取当前监控平台ID列表 current_platform_ids = self.ctx.platform_ids new_titles = self.ctx.detect_new_titles(current_platform_ids) time_info = self.ctx.format_time() word_groups, filter_words, global_filters = self.ctx.load_frequency_words(self.frequency_file) html_file = None stats = [] ai_result = None title_info = None # current 模式需要使用完整的历史数据 if self.report_mode == "current": analysis_data = 
self._load_analysis_data() if analysis_data: ( all_results, historical_id_to_name, historical_title_info, historical_new_titles, _, _, _, ) = analysis_data print( f"current模式:使用过滤后的历史数据,包含平台:{list(all_results.keys())}" ) # 使用历史数据准备独立展示区数据(包含完整的 title_info) standalone_data = self._prepare_standalone_data( all_results, historical_id_to_name, historical_title_info, raw_rss_items ) stats, html_file, ai_result, rss_items = self._run_analysis_pipeline( all_results, self.report_mode, historical_title_info, historical_new_titles, word_groups, filter_words, historical_id_to_name, failed_ids=failed_ids, global_filters=global_filters, rss_items=rss_items, rss_new_items=rss_new_items, standalone_data=standalone_data, schedule=schedule, rss_new_urls=rss_new_urls, ) combined_id_to_name = {**historical_id_to_name, **id_to_name} new_titles = historical_new_titles id_to_name = combined_id_to_name title_info = historical_title_info results = all_results else: print("❌ 严重错误:无法读取刚保存的数据文件") raise RuntimeError("数据一致性检查失败:保存后立即读取失败") elif self.report_mode == "daily": # daily 模式:使用全天累计数据 analysis_data = self._load_analysis_data() if analysis_data: ( all_results, historical_id_to_name, historical_title_info, historical_new_titles, _, _, _, ) = analysis_data # 使用历史数据准备独立展示区数据(包含完整的 title_info) standalone_data = self._prepare_standalone_data( all_results, historical_id_to_name, historical_title_info, raw_rss_items ) stats, html_file, ai_result, rss_items = self._run_analysis_pipeline( all_results, self.report_mode, historical_title_info, historical_new_titles, word_groups, filter_words, historical_id_to_name, failed_ids=failed_ids, global_filters=global_filters, rss_items=rss_items, rss_new_items=rss_new_items, standalone_data=standalone_data, schedule=schedule, rss_new_urls=rss_new_urls, ) combined_id_to_name = {**historical_id_to_name, **id_to_name} new_titles = historical_new_titles id_to_name = combined_id_to_name title_info = historical_title_info results = all_results else: # 
                # Fall back to the current data when there is no historical data
                title_info = self._prepare_current_title_info(results, time_info)
                standalone_data = self._prepare_standalone_data(
                    results, id_to_name, title_info, raw_rss_items
                )
                stats, html_file, ai_result, rss_items = self._run_analysis_pipeline(
                    results,
                    self.report_mode,
                    title_info,
                    new_titles,
                    word_groups,
                    filter_words,
                    id_to_name,
                    failed_ids=failed_ids,
                    global_filters=global_filters,
                    rss_items=rss_items,
                    rss_new_items=rss_new_items,
                    standalone_data=standalone_data,
                    schedule=schedule,
                    rss_new_urls=rss_new_urls,
                )
        else:
            # incremental mode: use only the data fetched in this run
            title_info = self._prepare_current_title_info(results, time_info)
            standalone_data = self._prepare_standalone_data(
                results, id_to_name, title_info, raw_rss_items
            )
            stats, html_file, ai_result, rss_items = self._run_analysis_pipeline(
                results,
                self.report_mode,
                title_info,
                new_titles,
                word_groups,
                filter_words,
                id_to_name,
                failed_ids=failed_ids,
                global_filters=global_filters,
                rss_items=rss_items,
                rss_new_items=rss_new_items,
                standalone_data=standalone_data,
                schedule=schedule,
                rss_new_urls=rss_new_urls,
            )

        if html_file:
            print(f"HTML report generated: {html_file}")
            print(f"Latest report updated: output/html/latest/{self.report_mode}.html")

        # Send the notification
        if mode_strategy["should_send_notification"]:
            standalone_data = self._prepare_standalone_data(
                results, id_to_name, title_info, raw_rss_items
            )
            self._send_notification_if_needed(
                stats,
                mode_strategy["report_type"],
                self.report_mode,
                failed_ids=failed_ids,
                new_titles=new_titles,
                id_to_name=id_to_name,
                html_file_path=html_file,
                rss_items=rss_items,
                rss_new_items=rss_new_items,
                standalone_data=standalone_data,
                ai_result=ai_result,
                current_results=results,
                schedule=schedule,
            )

        # Open the browser (only outside container environments)
        if self._should_open_browser() and html_file:
            file_url = "file://" + str(Path(html_file).resolve())
            print(f"Opening HTML report: {file_url}")
            webbrowser.open(file_url)
        elif self.is_docker_container and html_file:
            print(f"HTML report generated (Docker environment): {html_file}")

        return html_file

    def run(self) -> None:
        """Run the analysis workflow"""
        try:
            self._initialize_and_check_config()

            mode_strategy = self._get_mode_strategy()

            # Fetch the hotlist data
            results, id_to_name, failed_ids = self._crawl_data()

            # Fetch the RSS data (if enabled); returns the stats items, new items, raw items, and the set of new URLs
            rss_items, rss_new_items, raw_rss_items, rss_new_urls = self._crawl_rss_data()

            # Run the mode strategy, passing the RSS data along for the combined push
            self._execute_mode_strategy(
                mode_strategy, results, id_to_name, failed_ids,
                rss_items=rss_items, rss_new_items=rss_new_items,
                raw_rss_items=raw_rss_items, rss_new_urls=rss_new_urls
            )

        except Exception as e:
            print(f"Analysis workflow failed: {e}")
            if self.ctx.config.get("DEBUG", False):
                raise
        finally:
            # Clean up resources (including expired-data cleanup and closing the database connection)
            self.ctx.cleanup()


def _record_doctor_result(results: List[Tuple[str, str, str]], status: str, item: str, detail: str) -> None:
    """Record and print a doctor check result"""
    icon_map = {
        "pass": "✅",
        "warn": "⚠️",
        "fail": "❌",
    }
    icon = icon_map.get(status, "•")
    results.append((status, item, detail))
    print(f"{icon} {item}: {detail}")


def _save_doctor_report(
    results: List[Tuple[str, str, str]],
    pass_count: int,
    warn_count: int,
    fail_count: int,
    config_path: Optional[str],
) -> None:
    """Save the doctor check report to a JSON file"""
    report = {
        "version": __version__,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "config_path": config_path or os.environ.get("CONFIG_PATH", "config/config.yaml"),
        "summary": {
            "pass": pass_count,
            "warn": warn_count,
            "fail": fail_count,
            "ok": fail_count == 0,
        },
        "checks": [
            {"status": status, "item": item, "detail": detail}
            for status, item, detail in results
        ],
    }

    try:
        output_dir = Path("output") / "meta"
        output_dir.mkdir(parents=True, exist_ok=True)
        output_path = output_dir / "doctor_report.json"
        output_path.write_text(
            json.dumps(report, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        print(f"Doctor report saved: {output_path}")
    except Exception as e:
        print(f"⚠️ Failed to save doctor report: {e}")


def _run_doctor(config_path: Optional[str] = None) -> bool:
    """Run the environment check"""
    print("=" * 60)
    print(f"TrendRadar v{__version__} environment check")
    print("=" * 60)

    results: List[Tuple[str, str, str]] = []
    config = None

    # 1) Python version check
    py_ok 
= sys.version_info >= (3, 10) py_version = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}" if py_ok: _record_doctor_result(results, "pass", "Python版本", f"{py_version} (满足 >= 3.10)") else: _record_doctor_result(results, "fail", "Python版本", f"{py_version} (不满足 >= 3.10)") # 2) 关键文件检查 if config_path is None: config_path = os.environ.get("CONFIG_PATH", "config/config.yaml") required_files = [ (config_path, "主配置文件"), ("config/frequency_words.txt", "关键词文件"), ] optional_files = [ ("config/timeline.yaml", "调度文件"), ] for path_str, desc in required_files: if Path(path_str).exists(): _record_doctor_result(results, "pass", desc, f"已找到: {path_str}") else: _record_doctor_result(results, "fail", desc, f"缺失: {path_str}") for path_str, desc in optional_files: if Path(path_str).exists(): _record_doctor_result(results, "pass", desc, f"已找到: {path_str}") else: _record_doctor_result(results, "warn", desc, f"未找到: {path_str}(将使用默认调度模板)") # 3) 配置加载检查 try: config = load_config(config_path) _record_doctor_result(results, "pass", "配置加载", f"加载成功: {config_path}") except Exception as e: _record_doctor_result(results, "fail", "配置加载", f"加载失败: {e}") # 后续检查依赖配置对象 if config: # 4) 调度配置检查 try: ctx = AppContext(config) schedule = ctx.create_scheduler().resolve() detail = f"调度解析成功(report_mode={schedule.report_mode}, ai_mode={schedule.ai_mode})" _record_doctor_result(results, "pass", "调度配置", detail) except Exception as e: _record_doctor_result(results, "fail", "调度配置", f"解析失败: {e}") # 5) AI 配置检查(按功能场景区分严重级别) ai_analysis_enabled = config.get("AI_ANALYSIS", {}).get("ENABLED", False) ai_translation_enabled = config.get("AI_TRANSLATION", {}).get("ENABLED", False) ai_filter_enabled = config.get("FILTER", {}).get("METHOD", "keyword") == "ai" ai_enabled = ai_analysis_enabled or ai_translation_enabled or ai_filter_enabled if ai_enabled: try: from trendradar.ai.client import AIClient valid, message = AIClient(config.get("AI", {})).validate_config() if valid: 
_record_doctor_result(results, "pass", "AI配置", f"模型: {config.get('AI', {}).get('MODEL', '')}") else: # AI 分析/翻译是硬依赖;AI 筛选缺失时会自动回退关键词匹配 if ai_analysis_enabled or ai_translation_enabled: _record_doctor_result(results, "fail", "AI配置", message) else: _record_doctor_result(results, "warn", "AI配置", f"{message}(AI 筛选将回退关键词模式)") except Exception as e: _record_doctor_result(results, "fail", "AI配置", f"校验异常: {e}") else: _record_doctor_result(results, "warn", "AI配置", "未启用 AI 功能,跳过校验") # 6) 存储配置检查 try: storage_cfg = config.get("STORAGE", {}) backend = storage_cfg.get("BACKEND", "auto") remote = storage_cfg.get("REMOTE", {}) missing_remote_keys = [ k for k in ("BUCKET_NAME", "ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "ENDPOINT_URL") if not remote.get(k) ] if backend == "remote" and missing_remote_keys: _record_doctor_result( results, "fail", "存储配置", f"remote 模式缺少配置: {', '.join(missing_remote_keys)}" ) elif backend == "auto" and os.environ.get("GITHUB_ACTIONS") == "true" and missing_remote_keys: _record_doctor_result( results, "warn", "存储配置", "GitHub Actions + auto 模式未完整配置远程存储,可能导致数据丢失" ) else: sm = AppContext(config).get_storage_manager() _record_doctor_result(results, "pass", "存储配置", f"当前后端: {sm.backend_name}") except Exception as e: _record_doctor_result(results, "fail", "存储配置", f"检查失败: {e}") # 7) 通知渠道配置检查 channel_details = [] channel_issues = [] max_accounts = config.get("MAX_ACCOUNTS_PER_CHANNEL", 3) # 普通单值/多值渠道 for key, name in [ ("FEISHU_WEBHOOK_URL", "飞书"), ("DINGTALK_WEBHOOK_URL", "钉钉"), ("WEWORK_WEBHOOK_URL", "企业微信"), ("BARK_URL", "Bark"), ("SLACK_WEBHOOK_URL", "Slack"), ("GENERIC_WEBHOOK_URL", "通用Webhook"), ]: values = parse_multi_account_config(config.get(key, "")) if values: channel_details.append(f"{name}({min(len(values), max_accounts)}个)") # Telegram 配对校验 tg_tokens = parse_multi_account_config(config.get("TELEGRAM_BOT_TOKEN", "")) tg_chats = parse_multi_account_config(config.get("TELEGRAM_CHAT_ID", "")) if tg_tokens or tg_chats: valid, count = validate_paired_configs( 
{"bot_token": tg_tokens, "chat_id": tg_chats}, "Telegram", required_keys=["bot_token", "chat_id"], ) if valid and count > 0: channel_details.append(f"Telegram({min(count, max_accounts)}个)") else: channel_issues.append("Telegram bot_token/chat_id 配置不完整或数量不一致") # ntfy 配对校验(token 可选) ntfy_server = config.get("NTFY_SERVER_URL", "") ntfy_topics = parse_multi_account_config(config.get("NTFY_TOPIC", "")) ntfy_tokens = parse_multi_account_config(config.get("NTFY_TOKEN", "")) if ntfy_server and ntfy_topics: if ntfy_tokens: valid, count = validate_paired_configs( {"topic": ntfy_topics, "token": ntfy_tokens}, "ntfy", ) if valid and count > 0: channel_details.append(f"ntfy({min(count, max_accounts)}个)") else: channel_issues.append("ntfy topic/token 数量不一致") else: channel_details.append(f"ntfy({min(len(ntfy_topics), max_accounts)}个)") # 邮件配置完整性 email_ready = all( [ config.get("EMAIL_FROM"), config.get("EMAIL_PASSWORD"), config.get("EMAIL_TO"), ] ) if email_ready: channel_details.append("邮件") elif any([config.get("EMAIL_FROM"), config.get("EMAIL_PASSWORD"), config.get("EMAIL_TO")]): channel_issues.append("邮件配置不完整(需要 from/password/to 同时配置)") if channel_issues and not channel_details: _record_doctor_result(results, "fail", "通知配置", ";".join(channel_issues)) elif channel_issues and channel_details: detail = f"可用渠道: {', '.join(channel_details)};问题: {';'.join(channel_issues)}" _record_doctor_result(results, "warn", "通知配置", detail) elif channel_details: _record_doctor_result(results, "pass", "通知配置", f"可用渠道: {', '.join(channel_details)}") else: _record_doctor_result(results, "warn", "通知配置", "未配置任何通知渠道") # 8) 输出目录可写检查 try: output_dir = Path("output") output_dir.mkdir(parents=True, exist_ok=True) probe_file = output_dir / ".doctor_write_probe" probe_file.write_text("ok", encoding="utf-8") probe_file.unlink(missing_ok=True) _record_doctor_result(results, "pass", "输出目录", f"可写: {output_dir}") except Exception as e: _record_doctor_result(results, "fail", "输出目录", f"不可写: {e}") pass_count = sum(1 
for status, _, _ in results if status == "pass")
    warn_count = sum(1 for status, _, _ in results if status == "warn")
    fail_count = sum(1 for status, _, _ in results if status == "fail")

    _save_doctor_report(results, pass_count, warn_count, fail_count, config_path)

    print("-" * 60)
    print(f"Check results: ✅ {pass_count} passed  ⚠️ {warn_count} warnings  ❌ {fail_count} failed")
    print("=" * 60)

    if fail_count == 0:
        print("Check passed.")
        return True

    print("Check failed; fix the failed items first.")
    return False


def _build_test_report_data(ctx: AppContext) -> Dict:
    """Build the report data used for the notification test"""
    now = ctx.get_time()
    time_display = now.strftime("%H:%M")
    title = f"TrendRadar notification test message ({now.strftime('%Y-%m-%d %H:%M:%S')})"

    return {
        "stats": [
            {
                "word": "Connectivity test",
                "count": 1,
                "titles": [
                    {
                        "title": title,
                        "source_name": "TrendRadar",
                        "url": "https://github.com/sansan0/TrendRadar",
                        "mobile_url": "",
                        "ranks": [1],
                        "rank_threshold": ctx.rank_threshold,
                        "count": 1,
                        "is_new": True,
                        "time_display": time_display,
                        "matched_keyword": "Connectivity test",
                    }
                ],
            }
        ],
        "failed_ids": [],
        "new_titles": [],
        "id_to_name": {},
    }


def _create_test_html_file(ctx: AppContext) -> Optional[str]:
    """Create the HTML file used for the email test"""
    try:
        now = ctx.get_time()
        output_dir = Path("output") / "html" / ctx.format_date()
        output_dir.mkdir(parents=True, exist_ok=True)
        html_path = output_dir / f"notification_test_{ctx.format_time()}.html"
        html_content = f"""<html>
<head><title>TrendRadar notification test</title></head>
<body>
<h2>TrendRadar notification connectivity test</h2>
<p>Test time: {now.strftime('%Y-%m-%d %H:%M:%S')} ({ctx.timezone})</p>
<p>This is a test message used to verify that the email channel is reachable.</p>
</body>
</html>
""" html_path.write_text(html_content, encoding="utf-8") return str(html_path) except Exception as e: print(f"[测试通知] 创建测试 HTML 失败: {e}") return None def _run_test_notification(config: Dict) -> bool: """发送测试通知到已配置渠道""" from trendradar.notification import NotificationDispatcher ctx = AppContext(config) try: # 检查是否配置了通知渠道 has_notification = any( [ config.get("FEISHU_WEBHOOK_URL"), config.get("DINGTALK_WEBHOOK_URL"), config.get("WEWORK_WEBHOOK_URL"), (config.get("TELEGRAM_BOT_TOKEN") and config.get("TELEGRAM_CHAT_ID")), (config.get("EMAIL_FROM") and config.get("EMAIL_PASSWORD") and config.get("EMAIL_TO")), (config.get("NTFY_SERVER_URL") and config.get("NTFY_TOPIC")), config.get("BARK_URL"), config.get("SLACK_WEBHOOK_URL"), config.get("GENERIC_WEBHOOK_URL"), ] ) if not has_notification: print("未检测到可用通知渠道,请先在 config.yaml 或环境变量中配置。") return False # 测试时固定展示区域,避免用户关闭 HOTLIST 导致测试内容为空 test_config = copy.deepcopy(config) test_display = test_config.setdefault("DISPLAY", {}) test_regions = test_display.setdefault("REGIONS", {}) test_regions.update( { "HOTLIST": True, "NEW_ITEMS": False, "RSS": False, "STANDALONE": False, "AI_ANALYSIS": False, } ) # 测试时禁用翻译,避免触发额外 AI 调用 if "AI_TRANSLATION" in test_config: test_config["AI_TRANSLATION"]["ENABLED"] = False proxy_url = test_config.get("DEFAULT_PROXY", "") if test_config.get("USE_PROXY") else None if proxy_url: print("[测试通知] 检测到代理配置,将使用代理发送") dispatcher = NotificationDispatcher( config=test_config, get_time_func=ctx.get_time, split_content_func=ctx.split_content, translator=None, ) report_data = _build_test_report_data(ctx) html_file_path = _create_test_html_file(ctx) print("=" * 60) print("通知连通性测试") print("=" * 60) results = dispatcher.dispatch_all( report_data=report_data, report_type="通知连通性测试", proxy_url=proxy_url, mode="daily", html_file_path=html_file_path, ) if not results: print("没有可测试的有效通知渠道(可能配置不完整)。") return False print("-" * 60) success_count = 0 for channel, ok in results.items(): if ok: success_count += 1 print(f"✅ 
{channel}: 测试成功") else: print(f"❌ {channel}: 测试失败") print("-" * 60) print(f"测试结果: {success_count}/{len(results)} 个渠道成功") return success_count > 0 finally: ctx.cleanup() def main(): """主程序入口""" # 解析命令行参数 parser = argparse.ArgumentParser( description="TrendRadar - 热点新闻聚合与分析工具", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=""" 调度状态命令: --show-schedule 显示当前调度状态(时间段、行为开关) 诊断命令: --doctor 运行环境与配置体检 --test-notification 发送测试通知到已配置渠道 示例: python -m trendradar # 正常运行 python -m trendradar --show-schedule # 查看当前调度状态 python -m trendradar --doctor # 运行一键体检 python -m trendradar --test-notification # 测试通知渠道连通性 """ ) parser.add_argument( "--show-schedule", action="store_true", help="显示当前调度状态" ) parser.add_argument( "--doctor", action="store_true", help="运行环境与配置体检" ) parser.add_argument( "--test-notification", action="store_true", help="发送测试通知到已配置渠道" ) args = parser.parse_args() debug_mode = False try: # 处理 doctor 命令(不依赖完整运行流程) if args.doctor: ok = _run_doctor() if not ok: raise SystemExit(1) return # 先加载配置 config = load_config() # 处理状态查看命令 if args.show_schedule: _handle_status_commands(config) return # 处理通知测试命令 if args.test_notification: ok = _run_test_notification(config) if not ok: raise SystemExit(1) return version_url = config.get("VERSION_CHECK_URL", "") configs_version_url = config.get("CONFIGS_VERSION_CHECK_URL", "") # 统一版本检查(程序版本 + 配置文件版本,只请求一次远程) need_update = False remote_version = None if version_url: need_update, remote_version = check_all_versions(version_url, configs_version_url) # 复用已加载的配置,避免重复加载 analyzer = NewsAnalyzer(config=config) # 设置更新信息(复用已获取的远程版本,不再重复请求) if analyzer.is_github_actions and need_update and remote_version: analyzer.update_info = { "current_version": __version__, "remote_version": remote_version, } # 获取 debug 配置 debug_mode = analyzer.ctx.config.get("DEBUG", False) analyzer.run() except FileNotFoundError as e: print(f"❌ 配置文件错误: {e}") print("\n请确保以下文件存在:") print(" • config/config.yaml") print(" • config/frequency_words.txt") 
        print("\n参考项目文档进行正确配置")
    except Exception as e:
        print(f"❌ 程序运行错误: {e}")
        if debug_mode:
            raise


def _handle_status_commands(config: Dict) -> None:
    """Handle the status command: show the current schedule status"""
    from trendradar.context import AppContext

    ctx = AppContext(config)

    print("=" * 60)
    print(f"TrendRadar v{__version__} 调度状态")
    print("=" * 60)

    try:
        scheduler = ctx.create_scheduler()
        schedule = scheduler.resolve()

        now = ctx.get_time()
        date_str = ctx.format_date()

        print(f"\n⏰ 当前时间: {now.strftime('%Y-%m-%d %H:%M:%S')} ({ctx.timezone})")
        print(f"📅 当前日期: {date_str}")

        print(f"\n📋 调度信息:")
        print(f" 日计划: {schedule.day_plan}")
        if schedule.period_key:
            print(f" 当前时间段: {schedule.period_name or schedule.period_key} ({schedule.period_key})")
        else:
            print(f" 当前时间段: 无(使用默认配置)")

        print(f"\n🔧 行为开关:")
        print(f" 采集数据: {'✅ 是' if schedule.collect else '❌ 否'}")
        print(f" AI 分析: {'✅ 是' if schedule.analyze else '❌ 否'}")
        print(f" 推送通知: {'✅ 是' if schedule.push else '❌ 否'}")
        print(f" 报告模式: {schedule.report_mode}")
        print(f" AI 模式: {schedule.ai_mode}")

        if schedule.period_key:
            print(f"\n🔁 一次性控制:")
            if schedule.once_analyze:
                already_analyzed = scheduler.already_executed(schedule.period_key, "analyze", date_str)
                print(f" AI 分析: 仅一次 {'(今日已执行 ⚠️)' if already_analyzed else '(今日未执行 ✅)'}")
            else:
                print(f" AI 分析: 不限次数")
            if schedule.once_push:
                already_pushed = scheduler.already_executed(schedule.period_key, "push", date_str)
                print(f" 推送通知: 仅一次 {'(今日已执行 ⚠️)' if already_pushed else '(今日未执行 ✅)'}")
            else:
                print(f" 推送通知: 不限次数")

    except Exception as e:
        print(f"\n❌ 获取调度状态失败: {e}")

    print("\n" + "=" * 60)

    # Clean up resources
    ctx.cleanup()


if __name__ == "__main__":
    main()


================================================
FILE: trendradar/ai/__init__.py
================================================
# coding=utf-8
"""
TrendRadar AI module

Provides in-depth analysis and translation of trending news by AI large language models
"""

from .analyzer import AIAnalyzer, AIAnalysisResult
from .filter import AIFilter, AIFilterResult
from .translator import AITranslator, TranslationResult, BatchTranslationResult
from .formatter import (
    get_ai_analysis_renderer,
    render_ai_analysis_markdown,
    render_ai_analysis_feishu,
    render_ai_analysis_dingtalk,
    render_ai_analysis_html,
    render_ai_analysis_html_rich,
    render_ai_analysis_plain,
)

__all__ = [
    # Analyzer
    "AIAnalyzer",
    "AIAnalysisResult",
    # Smart filtering
    "AIFilter",
    "AIFilterResult",
    # Translator
    "AITranslator",
    "TranslationResult",
    "BatchTranslationResult",
    # Formatting
    "get_ai_analysis_renderer",
    "render_ai_analysis_markdown",
    "render_ai_analysis_feishu",
    "render_ai_analysis_dingtalk",
    "render_ai_analysis_html",
    "render_ai_analysis_html_rich",
    "render_ai_analysis_plain",
]


================================================
FILE: trendradar/ai/analyzer.py
================================================
# coding=utf-8
"""
AI analyzer module

Calls an AI large language model to analyze trending news in depth.
Built on the unified LiteLLM interface; supports 100+ AI providers.
"""

import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional

from trendradar.ai.client import AIClient


@dataclass
class AIAnalysisResult:
    """AI analysis result"""
    # New-style 5 core sections
    core_trends: str = ""  # Core trends and public-opinion landscape
    sentiment_controversy: str = ""  # Sentiment direction and controversy
    signals: str = ""  # Anomalies and weak signals
    rss_insights: str = ""  # RSS deep insights
    outlook_strategy: str = ""  # Assessment and strategy recommendations
    standalone_summaries: Dict[str, str] = field(default_factory=dict)  # Standalone-section summaries {source ID: summary}

    # Basic metadata
    raw_response: str = ""  # Raw response
    success: bool = False  # Whether the analysis succeeded
    error: str = ""  # Error message

    # News count statistics
    total_news: int = 0  # Total number of news items (hotlist + RSS)
    analyzed_news: int = 0  # Number of news items actually analyzed
    max_news_limit: int = 0  # Configured analysis limit
    hotlist_count: int = 0  # Number of hotlist news items
    rss_count: int = 0  # Number of RSS news items
    ai_mode: str = ""  # Mode used for the AI analysis (daily/current/incremental)


class AIAnalyzer:
    """AI analyzer"""

    def __init__(
        self,
        ai_config: Dict[str, Any],
        analysis_config: Dict[str, Any],
        get_time_func: Callable,
        debug: bool = False,
    ):
        """
        Initialize the AI analyzer

        Args:
            ai_config: AI model configuration (LiteLLM format)
            analysis_config: AI analysis feature configuration (language, prompt_file, etc.)
            get_time_func: function that returns the current time
            debug: whether to enable debug mode
        """
        self.ai_config = ai_config
        self.analysis_config = analysis_config
        self.get_time_func = 
get_time_func self.debug = debug # 创建 AI 客户端(基于 LiteLLM) self.client = AIClient(ai_config) # 验证配置 valid, error = self.client.validate_config() if not valid: print(f"[AI] 配置警告: {error}") # 从分析配置获取功能参数 self.max_news = analysis_config.get("MAX_NEWS_FOR_ANALYSIS", 50) self.include_rss = analysis_config.get("INCLUDE_RSS", True) self.include_rank_timeline = analysis_config.get("INCLUDE_RANK_TIMELINE", False) self.include_standalone = analysis_config.get("INCLUDE_STANDALONE", False) self.language = analysis_config.get("LANGUAGE", "Chinese") # 加载提示词模板 self.system_prompt, self.user_prompt_template = self._load_prompt_template( analysis_config.get("PROMPT_FILE", "ai_analysis_prompt.txt") ) def _load_prompt_template(self, prompt_file: str) -> tuple: """加载提示词模板""" config_dir = Path(__file__).parent.parent.parent / "config" prompt_path = config_dir / prompt_file if not prompt_path.exists(): print(f"[AI] 提示词文件不存在: {prompt_path}") return "", "" content = prompt_path.read_text(encoding="utf-8") # 解析 [system] 和 [user] 部分 system_prompt = "" user_prompt = "" if "[system]" in content and "[user]" in content: parts = content.split("[user]") system_part = parts[0] user_part = parts[1] if len(parts) > 1 else "" # 提取 system 内容 if "[system]" in system_part: system_prompt = system_part.split("[system]")[1].strip() user_prompt = user_part.strip() else: # 整个文件作为 user prompt user_prompt = content return system_prompt, user_prompt def analyze( self, stats: List[Dict], rss_stats: Optional[List[Dict]] = None, report_mode: str = "daily", report_type: str = "当日汇总", platforms: Optional[List[str]] = None, keywords: Optional[List[str]] = None, standalone_data: Optional[Dict] = None, ) -> AIAnalysisResult: """ 执行 AI 分析 Args: stats: 热榜统计数据 rss_stats: RSS 统计数据 report_mode: 报告模式 report_type: 报告类型 platforms: 平台列表 keywords: 关键词列表 Returns: AIAnalysisResult: 分析结果 """ # 打印配置信息方便调试 model = self.ai_config.get("MODEL", "unknown") api_key = self.client.api_key or "" api_base = self.ai_config.get("API_BASE", "") 
masked_key = f"{api_key[:5]}******" if len(api_key) >= 5 else "******" model_display = model.replace("/", "/\u200b") if model else "unknown" print(f"[AI] 模型: {model_display}") print(f"[AI] Key : {masked_key}") if api_base: print(f"[AI] 接口: 存在自定义 API 端点") timeout = self.ai_config.get("TIMEOUT", 120) max_tokens = self.ai_config.get("MAX_TOKENS", 5000) print(f"[AI] 参数: timeout={timeout}, max_tokens={max_tokens}") if not self.client.api_key: return AIAnalysisResult( success=False, error="未配置 AI API Key,请在 config.yaml 或环境变量 AI_API_KEY 中设置" ) # 准备新闻内容并获取统计数据 news_content, rss_content, hotlist_total, rss_total, analyzed_count = self._prepare_news_content(stats, rss_stats) total_news = hotlist_total + rss_total if not news_content and not rss_content: return AIAnalysisResult( success=False, error="没有可分析的新闻内容", total_news=total_news, hotlist_count=hotlist_total, rss_count=rss_total, analyzed_news=0, max_news_limit=self.max_news ) # 构建提示词 current_time = self.get_time_func().strftime("%Y-%m-%d %H:%M:%S") # 提取关键词 if not keywords: keywords = [s.get("word", "") for s in stats if s.get("word")] if stats else [] # 使用安全的字符串替换,避免模板中其他花括号(如 JSON 示例)被误解析 user_prompt = self.user_prompt_template user_prompt = user_prompt.replace("{report_mode}", report_mode) user_prompt = user_prompt.replace("{report_type}", report_type) user_prompt = user_prompt.replace("{current_time}", current_time) user_prompt = user_prompt.replace("{news_count}", str(hotlist_total)) user_prompt = user_prompt.replace("{rss_count}", str(rss_total)) user_prompt = user_prompt.replace("{platforms}", ", ".join(platforms) if platforms else "多平台") user_prompt = user_prompt.replace("{keywords}", ", ".join(keywords[:20]) if keywords else "无") user_prompt = user_prompt.replace("{news_content}", news_content) user_prompt = user_prompt.replace("{rss_content}", rss_content) user_prompt = user_prompt.replace("{language}", self.language) # 构建独立展示区内容 standalone_content = "" if self.include_standalone and standalone_data: 
standalone_content = self._prepare_standalone_content(standalone_data) user_prompt = user_prompt.replace("{standalone_content}", standalone_content) if self.debug: print("\n" + "=" * 80) print("[AI 调试] 发送给 AI 的完整提示词") print("=" * 80) if self.system_prompt: print("\n--- System Prompt ---") print(self.system_prompt) print("\n--- User Prompt ---") print(user_prompt) print("=" * 80 + "\n") # 调用 AI API(使用 LiteLLM) try: response = self._call_ai(user_prompt) result = self._parse_response(response) # JSON 解析失败时的重试兜底(仅重试一次) if result.error and "JSON 解析错误" in result.error: print(f"[AI] JSON 解析失败,尝试让 AI 修复...") retry_result = self._retry_fix_json(response, result.error) if retry_result and retry_result.success and not retry_result.error: print("[AI] JSON 修复成功") retry_result.raw_response = response result = retry_result else: print("[AI] JSON 修复失败,使用原始文本兜底") # 如果配置未启用 RSS 分析,强制清空 AI 返回的 RSS 洞察 if not self.include_rss: result.rss_insights = "" # 如果配置未启用 standalone 分析,强制清空 if not self.include_standalone: result.standalone_summaries = {} # 填充统计数据 result.total_news = total_news result.hotlist_count = hotlist_total result.rss_count = rss_total result.analyzed_news = analyzed_count result.max_news_limit = self.max_news return result except Exception as e: error_type = type(e).__name__ error_msg = str(e) # 截断过长的错误消息 if len(error_msg) > 200: error_msg = error_msg[:200] + "..." 
friendly_msg = f"AI 分析失败 ({error_type}): {error_msg}" return AIAnalysisResult( success=False, error=friendly_msg ) def _prepare_news_content( self, stats: List[Dict], rss_stats: Optional[List[Dict]] = None, ) -> tuple: """ 准备新闻内容文本(增强版) 热榜新闻包含:来源、标题、排名范围、时间范围、出现次数 RSS 包含:来源、标题、发布时间 Returns: tuple: (news_content, rss_content, hotlist_total, rss_total, analyzed_count) """ news_lines = [] rss_lines = [] news_count = 0 rss_count = 0 # 计算总新闻数 hotlist_total = sum(len(s.get("titles", [])) for s in stats) if stats else 0 rss_total = sum(len(s.get("titles", [])) for s in rss_stats) if rss_stats else 0 # 热榜内容 if stats: for stat in stats: word = stat.get("word", "") titles = stat.get("titles", []) if word and titles: news_lines.append(f"\n**{word}** ({len(titles)}条)") for t in titles: if not isinstance(t, dict): continue title = t.get("title", "") if not title: continue # 来源 source = t.get("source_name", t.get("source", "")) # 构建行 if source: line = f"- [{source}] {title}" else: line = f"- {title}" # 始终显示简化格式:排名范围 + 时间范围 + 出现次数 ranks = t.get("ranks", []) if ranks: min_rank = min(ranks) max_rank = max(ranks) rank_str = f"{min_rank}" if min_rank == max_rank else f"{min_rank}-{max_rank}" else: rank_str = "-" first_time = t.get("first_time", "") last_time = t.get("last_time", "") time_str = self._format_time_range(first_time, last_time) appear_count = t.get("count", 1) line += f" | 排名:{rank_str} | 时间:{time_str} | 出现:{appear_count}次" # 开启完整时间线时,额外添加轨迹 if self.include_rank_timeline: rank_timeline = t.get("rank_timeline", []) timeline_str = self._format_rank_timeline(rank_timeline) line += f" | 轨迹:{timeline_str}" news_lines.append(line) news_count += 1 if news_count >= self.max_news: break if news_count >= self.max_news: break # RSS 内容(仅在启用时构建) if self.include_rss and rss_stats: remaining = self.max_news - news_count for stat in rss_stats: if rss_count >= remaining: break word = stat.get("word", "") titles = stat.get("titles", []) if word and titles: rss_lines.append(f"\n**{word}** 
({len(titles)}条)") for t in titles: if not isinstance(t, dict): continue title = t.get("title", "") if not title: continue # 来源 source = t.get("source_name", t.get("feed_name", "")) # 发布时间 time_display = t.get("time_display", "") # 构建行:[来源] 标题 | 发布时间 if source: line = f"- [{source}] {title}" else: line = f"- {title}" if time_display: line += f" | {time_display}" rss_lines.append(line) rss_count += 1 if rss_count >= remaining: break news_content = "\n".join(news_lines) if news_lines else "" rss_content = "\n".join(rss_lines) if rss_lines else "" total_count = news_count + rss_count return news_content, rss_content, hotlist_total, rss_total, total_count def _call_ai(self, user_prompt: str) -> str: """调用 AI API(使用 LiteLLM)""" messages = [] if self.system_prompt: messages.append({"role": "system", "content": self.system_prompt}) messages.append({"role": "user", "content": user_prompt}) return self.client.chat(messages) def _retry_fix_json(self, original_response: str, error_msg: str) -> Optional[AIAnalysisResult]: """ JSON 解析失败时,请求 AI 修复 JSON(仅重试一次) 使用轻量 prompt,不重复原始分析的 system prompt,节省 token。 Args: original_response: AI 原始响应(JSON 格式有误) error_msg: JSON 解析的错误信息 Returns: 修复后的分析结果,失败时返回 None """ messages = [ { "role": "system", "content": ( "你是一个 JSON 修复助手。用户会提供一段格式有误的 JSON 和错误信息," "你需要修复 JSON 格式错误并返回正确的 JSON。\n" "常见问题:字符串值内的双引号未转义、缺少逗号、字符串未正确闭合等。\n" "只返回纯 JSON,不要包含 markdown 代码块标记(如 ```json)或任何说明文字。" ), }, { "role": "user", "content": ( f"以下 JSON 解析失败:\n\n" f"错误:{error_msg}\n\n" f"原始内容:\n{original_response}\n\n" f"请修复以上 JSON 中的格式问题(如值中的双引号改用中文引号「」或转义 \\\"、" f"缺少逗号、不完整的字符串等),保持原始内容语义不变,只修复格式。" f"直接返回修复后的纯 JSON。" ), }, ] try: response = self.client.chat(messages) return self._parse_response(response) except Exception as e: print(f"[AI] 重试修复 JSON 异常: {type(e).__name__}: {e}") return None def _format_time_range(self, first_time: str, last_time: str) -> str: """格式化时间范围(简化显示,只保留时分)""" def extract_time(time_str: str) -> str: if not time_str: return "-" # 尝试提取 HH:MM 部分 if " " in 
time_str: parts = time_str.split(" ") if len(parts) >= 2: time_part = parts[1] if ":" in time_part: return time_part[:5] # HH:MM elif ":" in time_str: return time_str[:5] # 处理 HH-MM 格式 result = time_str[:5] if len(time_str) >= 5 else time_str if len(result) == 5 and result[2] == '-': result = result.replace('-', ':') return result first = extract_time(first_time) last = extract_time(last_time) if first == last or last == "-": return first return f"{first}~{last}" def _format_rank_timeline(self, rank_timeline: List[Dict]) -> str: """格式化排名时间线""" if not rank_timeline: return "-" parts = [] for item in rank_timeline: time_str = item.get("time", "") if len(time_str) == 5 and time_str[2] == '-': time_str = time_str.replace('-', ':') rank = item.get("rank") if rank is None: parts.append(f"0({time_str})") else: parts.append(f"{rank}({time_str})") return "→".join(parts) def _prepare_standalone_content(self, standalone_data: Dict) -> str: """ 将独立展示区数据转为文本,注入 AI 分析 prompt Args: standalone_data: 独立展示区数据 {"platforms": [...], "rss_feeds": [...]} Returns: 格式化的文本内容 """ lines = [] # 热榜平台 for platform in standalone_data.get("platforms", []): platform_id = platform.get("id", "") platform_name = platform.get("name", platform_id) items = platform.get("items", []) if not items: continue lines.append(f"### [{platform_name}]") for item in items: title = item.get("title", "") if not title: continue line = f"- {title}" # 排名信息 ranks = item.get("ranks", []) if ranks: min_rank = min(ranks) max_rank = max(ranks) rank_str = f"{min_rank}" if min_rank == max_rank else f"{min_rank}-{max_rank}" line += f" | 排名:{rank_str}" # 时间范围 first_time = item.get("first_time", "") last_time = item.get("last_time", "") if first_time: time_str = self._format_time_range(first_time, last_time) line += f" | 时间:{time_str}" # 出现次数 count = item.get("count", 1) if count > 1: line += f" | 出现:{count}次" # 排名轨迹(如果启用) if self.include_rank_timeline: rank_timeline = item.get("rank_timeline", []) if rank_timeline: timeline_str = 
self._format_rank_timeline(rank_timeline) line += f" | 轨迹:{timeline_str}" lines.append(line) lines.append("") # RSS 源 for feed in standalone_data.get("rss_feeds", []): feed_id = feed.get("id", "") feed_name = feed.get("name", feed_id) items = feed.get("items", []) if not items: continue lines.append(f"### [{feed_name}]") for item in items: title = item.get("title", "") if not title: continue line = f"- {title}" published_at = item.get("published_at", "") if published_at: line += f" | {published_at}" lines.append(line) lines.append("") return "\n".join(lines) def _parse_response(self, response: str) -> AIAnalysisResult: """解析 AI 响应""" result = AIAnalysisResult(raw_response=response) if not response or not response.strip(): result.error = "AI 返回空响应" return result # 提取 JSON 文本(去掉 markdown 代码块标记) json_str = response if "```json" in response: parts = response.split("```json", 1) if len(parts) > 1: code_block = parts[1] end_idx = code_block.find("```") if end_idx != -1: json_str = code_block[:end_idx] else: json_str = code_block elif "```" in response: parts = response.split("```", 2) if len(parts) >= 2: json_str = parts[1] json_str = json_str.strip() if not json_str: result.error = "提取的 JSON 内容为空" result.core_trends = response[:500] + "..." 
if len(response) > 500 else response result.success = True return result # 第一步:标准 JSON 解析 data = None parse_error = None try: data = json.loads(json_str) except json.JSONDecodeError as e: parse_error = e # 第二步:json_repair 本地修复 if data is None: try: from json_repair import repair_json repaired = repair_json(json_str, return_objects=True) if isinstance(repaired, dict): data = repaired print("[AI] JSON 本地修复成功(json_repair)") except Exception: pass # 两步都失败,记录错误(后续由 analyze 方法的重试机制处理) if data is None: if parse_error: error_context = json_str[max(0, parse_error.pos - 30):parse_error.pos + 30] if json_str and parse_error.pos else "" result.error = f"JSON 解析错误 (位置 {parse_error.pos}): {parse_error.msg}" if error_context: result.error += f",上下文: ...{error_context}..." else: result.error = "JSON 解析失败" # 兜底:使用已提取的 json_str(不含 markdown 标记),避免推送中出现 ```json result.core_trends = json_str[:500] + "..." if len(json_str) > 500 else json_str result.success = True return result # 解析成功,提取字段 try: result.core_trends = data.get("core_trends", "") result.sentiment_controversy = data.get("sentiment_controversy", "") result.signals = data.get("signals", "") result.rss_insights = data.get("rss_insights", "") result.outlook_strategy = data.get("outlook_strategy", "") # 解析独立展示区概括 summaries = data.get("standalone_summaries", {}) if isinstance(summaries, dict): result.standalone_summaries = { str(k): str(v) for k, v in summaries.items() } result.success = True except (KeyError, TypeError, AttributeError) as e: result.error = f"字段提取错误: {type(e).__name__}: {e}" result.core_trends = json_str[:500] + "..." 
if len(json_str) > 500 else json_str
            result.success = True

        return result


================================================
FILE: trendradar/ai/client.py
================================================
# coding=utf-8
"""
AI client module

Unified AI model interface built on LiteLLM.
Supports 100+ AI providers (OpenAI, DeepSeek, Gemini, Claude, Chinese domestic models, etc.)
"""

import os
from typing import Any, Dict, List

from litellm import completion


class AIClient:
    """Unified AI client (built on LiteLLM)"""

    def __init__(self, config: Dict[str, Any]):
        """
        Initialize the AI client

        Args:
            config: AI configuration dict
                - MODEL: model identifier (format: provider/model_name)
                - API_KEY: API key
                - API_BASE: API base URL (optional)
                - TEMPERATURE: sampling temperature
                - MAX_TOKENS: maximum number of generated tokens
                - TIMEOUT: request timeout (seconds)
                - NUM_RETRIES: number of retries (optional)
                - FALLBACK_MODELS: list of fallback models (optional)
        """
        self.model = config.get("MODEL", "deepseek/deepseek-chat")
        self.api_key = config.get("API_KEY") or os.environ.get("AI_API_KEY", "")
        self.api_base = config.get("API_BASE", "")
        self.temperature = config.get("TEMPERATURE", 1.0)
        self.max_tokens = config.get("MAX_TOKENS", 5000)
        self.timeout = config.get("TIMEOUT", 120)
        self.num_retries = config.get("NUM_RETRIES", 2)
        self.fallback_models = config.get("FALLBACK_MODELS", [])

    def chat(
        self,
        messages: List[Dict[str, str]],
        **kwargs
    ) -> str:
        """
        Call the AI model for a chat completion

        Args:
            messages: list of messages, format: [{"role": "system/user/assistant", "content": "..."}]
            **kwargs: extra parameters; these override the default configuration

        Returns:
            str: the AI response content

        Raises:
            Exception: raised when the API call fails
        """
        # Build the request parameters
        params = {
            "model": self.model,
            "messages": messages,
            "temperature": kwargs.get("temperature", self.temperature),
            "timeout": kwargs.get("timeout", self.timeout),
            "num_retries": kwargs.get("num_retries", self.num_retries),
        }

        # Add the API key
        if self.api_key:
            params["api_key"] = self.api_key

        # Add the API base (if configured)
        if self.api_base:
            params["api_base"] = self.api_base

        # Add max_tokens (if configured and non-zero)
        max_tokens = kwargs.get("max_tokens", self.max_tokens)
        if max_tokens and max_tokens > 0:
            params["max_tokens"] = max_tokens

        # Add fallback models (if configured)
        if self.fallback_models:
            params["fallbacks"] = self.fallback_models

        # Merge any remaining extra parameters
        for key, value in kwargs.items():
            if key not in params:
                params[key] = value

        # Call LiteLLM
        response = completion(**params)

        # Extract the response content.
        # Some models/providers return a list (content blocks) rather than a str; normalize to str
        content = response.choices[0].message.content
        if isinstance(content, list):
            content = "\n".join(
                item.get("text", str(item)) if isinstance(item, dict) else str(item)
                for item in content
            )
        return content or ""

    def validate_config(self) -> tuple[bool, str]:
        """
        Validate the configuration

        Returns:
            tuple: (whether it is valid, error message)
        """
        if not self.model:
            return False, "未配置 AI 模型(model)"

        if not self.api_key:
            return False, "未配置 AI API Key,请在 config.yaml 或环境变量 AI_API_KEY 中设置"

        # Validate the model format (should contain provider/model)
        if "/" not in self.model:
            return False, f"模型格式错误: {self.model},应为 'provider/model' 格式(如 'deepseek/deepseek-chat')"

        return True, ""


================================================
FILE: trendradar/ai/filter.py
================================================
# coding=utf-8
"""
AI smart filtering module

Uses AI to classify news by tag:
1. Stage A: extract structured tags from the user's interest description
2. Stage B: batch-classify news titles against those tags
"""

import hashlib
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional

from trendradar.ai.client import AIClient


@dataclass
class AIFilterResult:
    """AI filtering result, passed to the report and notification modules"""
    tags: List[Dict] = field(default_factory=list)
    # [{"tag": str, "description": str, "count": int, "items": [
    #     {"title": str, "source_id": str, "source_name": str,
    #      "url": str, "mobile_url": str, "rank": int, "ranks": [...],
    #      "first_time": str, "last_time": str, "count": int,
    #      "relevance_score": float, "source_type": str}
    # ]}]
    total_matched: int = 0  # Total number of matched news items
    total_processed: int = 0  # Total number of processed news items
    success: bool = False
    error: str = ""


class AIFilter:
    """AI smart filter"""

    def __init__(
        self,
        ai_config: Dict[str, Any],
        filter_config: Dict[str, Any],
        get_time_func: Callable,
        debug: bool = False,
    ):
        self.client = AIClient(ai_config)
        self.filter_config = filter_config
        self.batch_size = filter_config.get("BATCH_SIZE", 200)
        self.get_time_func = 
get_time_func self.debug = debug # 加载提示词模板 self.classify_system, self.classify_user = self._load_prompt( filter_config.get("PROMPT_FILE", "ai_filter_prompt.txt") ) self.extract_system, self.extract_user = self._load_prompt( filter_config.get("EXTRACT_PROMPT_FILE", "ai_filter_extract_prompt.txt") ) self.update_tags_system, self.update_tags_user = self._load_prompt( filter_config.get("UPDATE_TAGS_PROMPT_FILE", "update_tags_prompt.txt") ) def _load_prompt(self, filename: str) -> tuple: """加载提示词文件,返回 (system_prompt, user_prompt_template)""" config_dir = Path(__file__).parent.parent.parent / "config" / "ai_filter" prompt_path = config_dir / filename if not prompt_path.exists(): print(f"[AI筛选] 提示词文件不存在: {prompt_path}") return "", "" content = prompt_path.read_text(encoding="utf-8") system_prompt = "" user_prompt = "" if "[system]" in content and "[user]" in content: parts = content.split("[user]") system_part = parts[0] user_part = parts[1] if len(parts) > 1 else "" if "[system]" in system_part: system_prompt = system_part.split("[system]")[1].strip() user_prompt = user_part.strip() else: user_prompt = content return system_prompt, user_prompt def compute_interests_hash(self, interests_content: str, filename: str = "ai_interests.txt") -> str: """计算兴趣描述的 hash,格式为 filename:md5""" # 去除前后空白和注释行,确保内容变化才改变 hash lines = [] for line in interests_content.strip().splitlines(): line = line.strip() if line and not line.startswith("#"): lines.append(line) normalized = "\n".join(lines) content_hash = hashlib.md5(normalized.encode("utf-8")).hexdigest() return f"{filename}:{content_hash}" def load_interests_content(self, interests_file: Optional[str] = None) -> Optional[str]: """加载兴趣描述文件内容 解析逻辑: - interests_file 为 None:使用默认 config/ai_interests.txt - interests_file 有值:仅查 config/custom/ai/{filename} 注意:调用方(context.py)已完成 config/timeline 的合并决策, 此处不再二次读取 filter_config,避免语义冲突。 """ config_dir = Path(__file__).parent.parent.parent / "config" configured_file = interests_file if configured_file: 
# 自定义兴趣文件:仅查 custom/ai 目录 filename = configured_file interests_path = config_dir / "custom" / "ai" / filename if not interests_path.exists(): print(f"[AI筛选] 自定义兴趣描述文件不存在: {filename}") print(f"[AI筛选] 已查找: {interests_path}") return None else: # 默认兴趣文件:固定使用 config/ai_interests.txt filename = "ai_interests.txt" interests_path = config_dir / filename if not interests_path.exists(): print(f"[AI筛选] 默认兴趣描述文件不存在: {filename}") print(f"[AI筛选] 已查找: {interests_path}") return None if not interests_path.exists(): print(f"[AI筛选] 兴趣描述文件不存在: {interests_path}") return None content = interests_path.read_text(encoding="utf-8").strip() if not content: print("[AI筛选] 兴趣描述文件为空") return None return content def extract_tags(self, interests_content: str) -> List[Dict]: """ 阶段 A:从兴趣描述中提取结构化标签 Args: interests_content: 用户的兴趣描述文本 Returns: [{"tag": str, "description": str}, ...] """ if not self.extract_user: print("[AI筛选] 标签提取提示词模板为空") return [] user_prompt = self.extract_user.replace("{interests_content}", interests_content) messages = [] if self.extract_system: messages.append({"role": "system", "content": self.extract_system}) messages.append({"role": "user", "content": user_prompt}) if self.debug: print(f"\n[AI筛选][DEBUG] === 标签提取 Prompt ===") for m in messages: print(f"[{m['role']}]\n{m['content']}") print(f"[AI筛选][DEBUG] === Prompt 结束 ===") try: response = self.client.chat(messages) if self.debug: print(f"\n[AI筛选][DEBUG] === 标签提取 AI 原始响应 ===") # 尝试格式化 JSON 便于阅读 self._print_formatted_json(response) print(f"[AI筛选][DEBUG] === 响应结束 ===") tags = self._parse_tags_response(response) print(f"[AI筛选] 提取到 {len(tags)} 个标签") for t in tags: print(f" {t['tag']}: {t.get('description', '')}") if self.debug: json_str = self._extract_json(response) if not json_str: print(f"[AI筛选][DEBUG] 无法从响应中提取 JSON") else: raw_data = json.loads(json_str) raw_tags = raw_data.get("tags", []) skipped = len(raw_tags) - len(tags) if skipped > 0: print(f"[AI筛选][DEBUG] 原始标签 {len(raw_tags)} 个,有效 {len(tags)} 个,跳过 {skipped} 个(缺少 tag 
字段或格式无效)") return tags except json.JSONDecodeError as e: print(f"[AI筛选] 标签提取失败: JSON 解析错误: {e}") if self.debug: print(f"[AI筛选][DEBUG] 尝试解析的 JSON 内容: {self._extract_json(response) if response else '(空响应)'}") return [] except Exception as e: print(f"[AI筛选] 标签提取失败: {type(e).__name__}: {e}") return [] def update_tags(self, old_tags: List[Dict], interests_content: str) -> Optional[Dict]: """ 阶段 A':AI 对比旧标签和新兴趣描述,给出更新方案 Args: old_tags: [{"tag": str, "description": str, "id": int}, ...] interests_content: 新的兴趣描述文本 Returns: {"keep": [{"tag": str, "description": str}], "add": [{"tag": str, "description": str}], "remove": [str], "change_ratio": float} 失败返回 None """ if not self.update_tags_user: print("[AI筛选] 标签更新提示词模板为空,回退到重新提取") return None # 构造旧标签 JSON old_tags_json = json.dumps( [{"tag": t["tag"], "description": t.get("description", "")} for t in old_tags], ensure_ascii=False, indent=2 ) user_prompt = self.update_tags_user.replace( "{old_tags_json}", old_tags_json ).replace( "{interests_content}", interests_content ) messages = [] if self.update_tags_system: messages.append({"role": "system", "content": self.update_tags_system}) messages.append({"role": "user", "content": user_prompt}) if self.debug: print(f"\n[AI筛选][DEBUG] === 标签更新 Prompt ===") for m in messages: print(f"[{m['role']}]\n{m['content']}") print(f"[AI筛选][DEBUG] === Prompt 结束 ===") try: response = self.client.chat(messages) if self.debug: print(f"\n[AI筛选][DEBUG] === 标签更新 AI 原始响应 ===") self._print_formatted_json(response) print(f"[AI筛选][DEBUG] === 响应结束 ===") result = self._parse_update_tags_response(response) if result is None: return None keep_count = len(result.get("keep", [])) add_count = len(result.get("add", [])) remove_count = len(result.get("remove", [])) ratio = result.get("change_ratio", 0) print(f"[AI筛选] AI 标签更新方案: 保留 {keep_count}, 新增 {add_count}, 移除 {remove_count}, change_ratio={ratio:.2f}") return result except Exception as e: print(f"[AI筛选] 标签更新失败: {type(e).__name__}: {e}") return None def 
_parse_update_tags_response(self, response: str) -> Optional[Dict]:
        """Parse the AI response for a tag update"""
        json_str = self._extract_json(response)
        if not json_str:
            print("[AI筛选] 无法从标签更新响应中提取 JSON")
            return None

        data = json.loads(json_str)

        # Validate the required fields
        keep = data.get("keep", [])
        add = data.get("add", [])
        remove = data.get("remove", [])
        change_ratio = float(data.get("change_ratio", 0))

        # Validate the keep/add format
        validated_keep = []
        for t in keep:
            if isinstance(t, dict) and "tag" in t:
                validated_keep.append({
                    "tag": str(t["tag"]).strip(),
                    "description": str(t.get("description", "")).strip(),
                })

        validated_add = []
        for t in add:
            if isinstance(t, dict) and "tag" in t:
                validated_add.append({
                    "tag": str(t["tag"]).strip(),
                    "description": str(t.get("description", "")).strip(),
                })

        validated_remove = [str(r).strip() for r in remove if r]

        # Clamp change_ratio to the range 0 to 1
        change_ratio = max(0.0, min(1.0, change_ratio))

        return {
            "keep": validated_keep,
            "add": validated_add,
            "remove": validated_remove,
            "change_ratio": change_ratio,
        }

    def _parse_tags_response(self, response: str) -> List[Dict]:
        """Parse the AI response for tag extraction"""
        json_str = self._extract_json(response)
        if not json_str:
            return []

        data = json.loads(json_str)
        tags_raw = data.get("tags", [])

        tags = []
        for t in tags_raw:
            if not isinstance(t, dict) or "tag" not in t:
                continue
            tags.append({
                "tag": str(t["tag"]).strip(),
                "description": str(t.get("description", "")).strip(),
            })

        return tags

    def classify_batch(
        self,
        titles: List[Dict],
        tags: List[Dict],
        interests_content: str = "",
    ) -> List[Dict]:
        """
        Stage B: classify a batch of news titles

        Args:
            titles: [{"id": news_item_id, "title": str, "source": str}]
            tags: [{"id": tag_id, "tag": str, "description": str}]
            interests_content: the user's interests description (including quality-filter requirements)

        Returns:
            [{"news_item_id": int, "tag_id": int, "relevance_score": float}, ...]
        """
        if not titles or not tags:
            return []

        if not self.classify_user:
            print("[AI筛选] 分类提示词模板为空")
            return []

        # Build the tag list text
        tags_list = "\n".join(
            f"{t['id']}.
{t['tag']}: {t.get('description', '')}"
            for t in tags
        )

        # Build the news list text
        news_list = "\n".join(
            f"{t['id']}. [{t.get('source', '')}] {t['title']}"
            for t in titles
        )

        # Fill in the template
        user_prompt = self.classify_user
        user_prompt = user_prompt.replace("{interests_content}", interests_content)
        user_prompt = user_prompt.replace("{tags_list}", tags_list)
        user_prompt = user_prompt.replace("{news_count}", str(len(titles)))
        user_prompt = user_prompt.replace("{news_list}", news_list)

        messages = []
        if self.classify_system:
            messages.append({"role": "system", "content": self.classify_system})
        messages.append({"role": "user", "content": user_prompt})

        if self.debug:
            print(f"\n[AI筛选][DEBUG] === 分类 Prompt (标题数={len(titles)}, 标签={len(tags)}) ===")
            for m in messages:
                role = m['role']
                content = m['content']
                # Truncate an overly long prompt: show only its head and tail lines
                lines = content.split('\n')
                # Messages longer than 30 lines (typically the news list) are truncated
                if len(lines) > 30:
                    # Show the first 15 lines + an omission notice + the last 10 lines
                    head = lines[:15]
                    tail = lines[-10:]
                    omitted = len(lines) - 25
                    truncated = '\n'.join(head) + f'\n... (省略 {omitted} 行) ...\n' + '\n'.join(tail)
                    print(f"[{role}]\n{truncated}")
                else:
                    print(f"[{role}]\n{content}")
            print(f"[AI筛选][DEBUG] === Prompt 结束 (长度: {sum(len(m['content']) for m in messages)} 字符) ===")

        try:
            response = self.client.chat(messages)
            return self._parse_classify_response(response, titles, tags)
        except Exception as e:
            print(f"[AI筛选] 分类请求失败: {type(e).__name__}: {e}")
            return []

    def _parse_classify_response(
        self,
        response: str,
        titles: List[Dict],
        tags: List[Dict],
    ) -> List[Dict]:
        """Parse the AI classification response

        Two JSON formats are supported:
        - New format (flat): [{"id": 1, "tag_id": 1, "score": 0.9}, ...]
        - Old format (nested): [{"id": 1, "tags": [{"tag_id": 1, "score": 0.9}]}, ...]

        Only the single highest-scoring tag is kept per news item, so the same item never appears under multiple tags.
        """
        json_str = self._extract_json(response)
        if not json_str:
            if self.debug:
                print(f"[AI筛选][DEBUG] 无法从分类响应中提取 JSON,原始响应前 500 字符: {(response or '')[:500]}")
            return []

        try:
            data = json.loads(json_str)
        except json.JSONDecodeError as e:
            if self.debug:
                print(f"[AI筛选][DEBUG] 分类响应 JSON 解析失败: {e}")
                print(f"[AI筛选][DEBUG] 提取的 JSON 文本前 500 字符: {json_str[:500]}")
            return []

        if not isinstance(data, list):
            if self.debug:
                print(f"[AI筛选][DEBUG] 分类响应顶层不是数组,实际类型: {type(data).__name__}")
            return []

        # Build the id lookups
        title_ids = {t["id"] for t in titles}
        title_map = {t["id"]: t["title"] for t in titles}
        tag_id_set = {t["id"] for t in tags}
        tag_name_map = {t["id"]: t["tag"] for t in tags}

        # Keep only the single highest-scoring tag per news item
        best_per_news: Dict[int, Dict] = {}  # news_id -> {"news_item_id": ..., "tag_id": ..., "relevance_score": ...}

        skipped_news_ids = 0
        skipped_tag_ids = 0
        skipped_empty = 0

        for item in data:
            if not isinstance(item, dict):
                continue
            news_id = item.get("id")
            if news_id not in title_ids:
                skipped_news_ids += 1
                continue

            # Collect all candidate tags for this news item
            candidates = []
            if "tag_id" in item:
                # New format (flat): {"id": 1, "tag_id": 1, "score": 0.9}
                candidates.append({"tag_id": item["tag_id"], "score": item.get("score", 0.5)})
            elif "tags" in item:
                # Old format (nested): {"id": 1, "tags": [{"tag_id": 1, "score": 0.9}]}
                matched_tags = item.get("tags", [])
                if isinstance(matched_tags, list):
                    if not matched_tags:
                        skipped_empty += 1
                        continue
                    candidates.extend(matched_tags)

            if not candidates:
                skipped_empty += 1
                continue

            # Pick the valid tag with the highest score
            best_tag_id = None
            best_score = -1.0
            for tag_match in candidates:
                if not isinstance(tag_match, dict):
                    continue
                tag_id = tag_match.get("tag_id")
                if tag_id not in tag_id_set:
                    skipped_tag_ids += 1
                    continue
                score = tag_match.get("score", 0.5)
                try:
                    score = float(score)
                    score = max(0.0, min(1.0, score))
                except (ValueError, TypeError):
                    score = 0.5
                if score > best_score:
                    best_score = score
                    best_tag_id = tag_id

            if best_tag_id is not None:
                # If the same news item is returned more than once, keep only the higher score
                existing = best_per_news.get(news_id)
                if existing is None or
best_score > existing["relevance_score"]:
                    best_per_news[news_id] = {
                        "news_item_id": news_id,
                        "tag_id": best_tag_id,
                        "relevance_score": best_score,
                    }

        results = list(best_per_news.values())

        if self.debug:
            ai_returned = len(data)
            print(f"[AI筛选][DEBUG] --- 分类解析结果 ---")
            print(f"[AI筛选][DEBUG] AI 返回 {ai_returned} 条, 有效 {len(results)} 条 (每条新闻仅保留最高分 tag)")
            if skipped_empty > 0:
                print(f"[AI筛选][DEBUG] 跳过空 tags: {skipped_empty} 条")
            if skipped_news_ids > 0:
                print(f"[AI筛选][DEBUG] !! 跳过无效 news_id: {skipped_news_ids} 条")
            if skipped_tag_ids > 0:
                print(f"[AI筛选][DEBUG] !! 跳过无效 tag_id: {skipped_tag_ids} 条")

            # Summarize by tag
            tag_summary: Dict[int, List[str]] = {}
            for r in results:
                tid = r["tag_id"]
                if tid not in tag_summary:
                    tag_summary[tid] = []
                tag_summary[tid].append(
                    f" [{r['news_item_id']}] {title_map.get(r['news_item_id'], '?')[:40]} (score={r['relevance_score']:.2f})"
                )
            for tid, items in tag_summary.items():
                tname = tag_name_map.get(tid, f"tag_{tid}")
                print(f"[AI筛选][DEBUG] 标签「{tname}」匹配 {len(items)} 条:")
                for line in items:
                    print(line)

        return results

    def _extract_json(self, response: str) -> Optional[str]:
        """Extract the JSON string from an AI response"""
        if not response or not response.strip():
            return None

        json_str = response.strip()

        if "```json" in json_str:
            parts = json_str.split("```json", 1)
            if len(parts) > 1:
                code_block = parts[1]
                end_idx = code_block.find("```")
                json_str = code_block[:end_idx] if end_idx != -1 else code_block
        elif "```" in json_str:
            parts = json_str.split("```", 2)
            if len(parts) >= 2:
                json_str = parts[1]

        json_str = json_str.strip()
        return json_str if json_str else None

    def _print_formatted_json(self, response: str) -> None:
        """Pretty-print the JSON in an AI response for easier debugging"""
        if not response:
            print("(空响应)")
            return

        json_str = self._extract_json(response)
        if json_str:
            try:
                data = json.loads(json_str)
                if isinstance(data, list):
                    # Array: print each element on a single line
                    lines = [json.dumps(item, ensure_ascii=False) for item in data]
                    print("[\n " + ",\n ".join(lines) + "\n]")
                else:
                    print(json.dumps(data, ensure_ascii=False, indent=2))
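                return
            except json.JSONDecodeError:
                pass

        # JSON parsing failed; print the raw response as-is
        print(response)


================================================
FILE: trendradar/ai/formatter.py
================================================
# coding=utf-8
"""
AI analysis result formatting module

Formats AI analysis results into the style of each push channel
"""

import html as html_lib
import re

from .analyzer import AIAnalysisResult


def _escape_html(text: str) -> str:
    """Escape HTML special characters to prevent XSS attacks"""
    return html_lib.escape(text) if text else ""


def _format_list_content(text: str) -> str:
    """
    Format list content so that every item number starts on a new line
    For example, convert "1. xxx 2. yyy" into:
    1. xxx
    2. yyy
    """
    if not text:
        return ""

    # Strip leading/trailing whitespace so a leading newline in the AI output does not render as a blank line
    text = text.strip()

    # 0. Merge an item number with the 【tag】 that immediately follows it (defensive handling)
    # Turn "1.\n【投资者】:" or "1. 【投资者】:" into "1. 投资者:"
    text = re.sub(r'(\d+\.)\s*【([^】]+)】([::]?)', r'\1 \2:', text)

    # 1. Normalize: make sure "1." is followed by a space
    result = re.sub(r'(\d+)\.([^ \d])', r'\1. \2', text)

    # 2. Force a line break: match "number." that is not preceded by a newline
    # (?!\d) excludes version numbers/decimals (such as 2.0, 3.5) so they are not mistaken for list numbers
    result = re.sub(r'(?<=[^\n])\s+(\d+\.)(?!\d)', r'\n\1', result)

    # 3. Handle "1.**bold**" (the prompt asks for no Markdown output, but handle it defensively)
    result = re.sub(r'(?<=[^\n])(\d+\.\*\*)', r'\n\1', result)

    # 4. Line break after Chinese punctuation (excluding version numbers/decimals)
    result = re.sub(r'([::;,。;,])\s*(\d+\.)(?!\d)', r'\1\n\2', result)

    # 5. Line break before sub-headings such as "XX方面:" and "XX领域:"
    # Only triggered after Chinese punctuation (period, comma, semicolon, etc.), so the "1. XX领域:" format stays intact
    result = re.sub(r'([。!?;,、])\s*([a-zA-Z0-9\u4e00-\u9fa5]+(方面|领域)[::])', r'\1\n\2', result)

    # 6. Handle the 【tag】 format
    # 6a. Ensure a blank line before a tag (except at the start of the text)
    result = re.sub(r'(?<=\S)\n*(【[^】]+】)', r'\n\n\1', result)
    # 6b. Rejoin a tag with a colon that was split onto the next line: 【tag】\n: → 【tag】:
    result = re.sub(r'(【[^】]+】)\n+([::])', r'\1\2', result)
    # 6c. After a tag (with optional colon), start a new line if non-whitespace, non-colon content follows directly
    # (?=[^\s::]) keeps regex backtracking from treating the colon as "content" and splitting 【tag】:
    result = re.sub(r'(【[^】]+】[::]?)[ \t]*(?=[^\s::])', r'\1\n', result)

    # 7. Add a visual blank line between list items (excluding version numbers/decimals)
    # Skip lines that follow a 【tag】 line (ending in 】) or a sub-heading line (ending in a colon), so no blank line appears between a heading and its first item
    result = re.sub(r'(?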
str:
    """Format the standalone-section summaries as plain text lines, with each source name on its own line"""
    if not summaries:
        return ""

    lines = []
    for source_name, summary in summaries.items():
        if summary:
            lines.append(f"[{source_name}]:\n{summary}")
    return "\n\n".join(lines)


def render_ai_analysis_markdown(result: AIAnalysisResult) -> str:
    """Render as generic Markdown (Telegram, WeCom, ntfy, Bark, Slack)"""
    if not result.success:
        return f"⚠️ AI 分析失败: {result.error}"

    lines = ["**✨ AI 热点分析**", ""]

    if result.core_trends:
        lines.extend(["**核心热点态势**", _format_list_content(result.core_trends), ""])

    if result.sentiment_controversy:
        lines.extend(
            ["**舆论风向争议**", _format_list_content(result.sentiment_controversy), ""]
        )

    if result.signals:
        lines.extend(["**异动与弱信号**", _format_list_content(result.signals), ""])

    if result.rss_insights:
        lines.extend(
            ["**RSS 深度洞察**", _format_list_content(result.rss_insights), ""]
        )

    if result.outlook_strategy:
        lines.extend(
            ["**研判策略建议**", _format_list_content(result.outlook_strategy), ""]
        )

    if result.standalone_summaries:
        summaries_text = _format_standalone_summaries(result.standalone_summaries)
        if summaries_text:
            lines.extend(["**独立源点速览**", summaries_text])

    return "\n".join(lines)


def render_ai_analysis_feishu(result: AIAnalysisResult) -> str:
    """Render as Feishu card Markdown"""
    if not result.success:
        return f"⚠️ AI 分析失败: {result.error}"

    lines = ["**✨ AI 热点分析**", ""]

    if result.core_trends:
        lines.extend(["**核心热点态势**", _format_list_content(result.core_trends), ""])

    if result.sentiment_controversy:
        lines.extend(
            ["**舆论风向争议**", _format_list_content(result.sentiment_controversy), ""]
        )

    if result.signals:
        lines.extend(["**异动与弱信号**", _format_list_content(result.signals), ""])

    if result.rss_insights:
        lines.extend(
            ["**RSS 深度洞察**", _format_list_content(result.rss_insights), ""]
        )

    if result.outlook_strategy:
        lines.extend(
            ["**研判策略建议**", _format_list_content(result.outlook_strategy), ""]
        )

    if result.standalone_summaries:
        summaries_text = _format_standalone_summaries(result.standalone_summaries)
        if summaries_text:
            lines.extend(["**独立源点速览**",
summaries_text])

    return "\n".join(lines)


def render_ai_analysis_dingtalk(result: AIAnalysisResult) -> str:
    """Render as DingTalk Markdown"""
    if not result.success:
        return f"⚠️ AI 分析失败: {result.error}"

    lines = ["### ✨ AI 热点分析", ""]

    if result.core_trends:
        lines.extend(
            ["#### 核心热点态势", _format_list_content(result.core_trends), ""]
        )

    if result.sentiment_controversy:
        lines.extend(
            [
                "#### 舆论风向争议",
                _format_list_content(result.sentiment_controversy),
                "",
            ]
        )

    if result.signals:
        lines.extend(["#### 异动与弱信号", _format_list_content(result.signals), ""])

    if result.rss_insights:
        lines.extend(
            ["#### RSS 深度洞察", _format_list_content(result.rss_insights), ""]
        )

    if result.outlook_strategy:
        lines.extend(
            ["#### 研判策略建议", _format_list_content(result.outlook_strategy), ""]
        )

    if result.standalone_summaries:
        summaries_text = _format_standalone_summaries(result.standalone_summaries)
        if summaries_text:
            lines.extend(["#### 独立源点速览", summaries_text])

    return "\n".join(lines)


def render_ai_analysis_html(result: AIAnalysisResult) -> str:
    """Render as HTML (email)"""
    if not result.success:
        return (
            f'⚠️ AI 分析失败: {_escape_html(result.error)}'
        )

    html_parts = ['', "")
    return "\n".join(html_parts)


def render_ai_analysis_plain(result: AIAnalysisResult) -> str:
    """Render as plain text"""
    if not result.success:
        return f"AI 分析失败: {result.error}"

    lines = ["【✨ AI 热点分析】", ""]

    if result.core_trends:
        lines.extend(["[核心热点态势]", _format_list_content(result.core_trends), ""])

    if result.sentiment_controversy:
        lines.extend(
            ["[舆论风向争议]", _format_list_content(result.sentiment_controversy), ""]
        )

    if result.signals:
        lines.extend(["[异动与弱信号]", _format_list_content(result.signals), ""])

    if result.rss_insights:
        lines.extend(["[RSS 深度洞察]", _format_list_content(result.rss_insights), ""])

    if result.outlook_strategy:
        lines.extend(["[研判策略建议]", _format_list_content(result.outlook_strategy), ""])

    if result.standalone_summaries:
        summaries_text = _format_standalone_summaries(result.standalone_summaries)
        if summaries_text:
            lines.extend(["[独立源点速览]",
summaries_text])

    return "\n".join(lines)


def get_ai_analysis_renderer(channel: str):
    """Return the render function for the given channel"""
    renderers = {
        "feishu": render_ai_analysis_feishu,
        "dingtalk": render_ai_analysis_dingtalk,
        "wework": render_ai_analysis_markdown,
        "telegram": render_ai_analysis_markdown,
        "email": render_ai_analysis_html_rich,  # Email uses the rich style, paired with the HTML report's CSS
        "ntfy": render_ai_analysis_markdown,
        "bark": render_ai_analysis_plain,
        "slack": render_ai_analysis_markdown,
    }
    return renderers.get(channel, render_ai_analysis_markdown)


def render_ai_analysis_html_rich(result: AIAnalysisResult) -> str:
    """Render as richly styled HTML (for the HTML report)"""
    if not result:
        return ""

    # Check whether the analysis succeeded
    if not result.success:
        error_msg = result.error or "未知错误"
        return f"""✨ AI 热点分析
"] if result.core_trends: content = _format_list_content(result.core_trends) content_html = _escape_html(content).replace("\n", "
") html_parts.extend( [ '', "", ] ) if result.sentiment_controversy: content = _format_list_content(result.sentiment_controversy) content_html = _escape_html(content).replace("\n", "核心热点态势
", f'{content_html}', "
") html_parts.extend( [ '', "", ] ) if result.signals: content = _format_list_content(result.signals) content_html = _escape_html(content).replace("\n", "舆论风向争议
", f'{content_html}', "
") html_parts.extend( [ '', "", ] ) if result.rss_insights: content = _format_list_content(result.rss_insights) content_html = _escape_html(content).replace("\n", "异动与弱信号
", f'{content_html}', "
") html_parts.extend( [ '', "", ] ) if result.outlook_strategy: content = _format_list_content(result.outlook_strategy) content_html = _escape_html(content).replace("\n", "RSS 深度洞察
", f'{content_html}', "
") html_parts.extend( [ '', "", ] ) if result.standalone_summaries: summaries_text = _format_standalone_summaries(result.standalone_summaries) if summaries_text: summaries_html = _escape_html(summaries_text).replace("\n", "研判策略建议
", f'{content_html}', "
") html_parts.extend( [ '', "", ] ) html_parts.append("独立源点速览
", f'{summaries_html}', """" ai_html = """⚠️ AI 分析失败: {_escape_html(str(error_msg))}""" return ai_html ================================================ FILE: trendradar/ai/translator.py ================================================ # coding=utf-8 """ AI 翻译器模块 对推送内容进行多语言翻译 基于 LiteLLM 统一接口,支持 100+ AI 提供商 """ from dataclasses import dataclass, field from pathlib import Path from typing import Any, Dict, List from trendradar.ai.client import AIClient @dataclass class TranslationResult: """翻译结果""" translated_text: str = "" # 翻译后的文本 original_text: str = "" # 原始文本 success: bool = False # 是否成功 error: str = "" # 错误信息 @dataclass class BatchTranslationResult: """批量翻译结果""" results: List[TranslationResult] = field(default_factory=list) success_count: int = 0 fail_count: int = 0 total_count: int = 0 prompt: str = "" # debug: 发送给 AI 的完整 prompt raw_response: str = "" # debug: AI 原始响应 parsed_count: int = 0 # debug: AI 响应解析出的条目数 class AITranslator: """AI 翻译器""" def __init__(self, translation_config: Dict[str, Any], ai_config: Dict[str, Any]): """ 初始化 AI 翻译器 Args: translation_config: AI 翻译配置 (AI_TRANSLATION) ai_config: AI 模型配置(LiteLLM 格式) """ self.translation_config = translation_config self.ai_config = ai_config # 翻译配置 self.enabled = translation_config.get("ENABLED", False) self.target_language = translation_config.get("LANGUAGE", "English") self.scope = translation_config.get("SCOPE", {"HOTLIST": True, "RSS": True, "STANDALONE": True}) # 创建 AI 客户端(基于 LiteLLM) self.client = AIClient(ai_config) # 加载提示词模板 self.system_prompt, self.user_prompt_template = self._load_prompt_template( translation_config.get("PROMPT_FILE", "ai_translation_prompt.txt") ) def _load_prompt_template(self, prompt_file: str) -> tuple: """加载提示词模板""" config_dir = Path(__file__).parent.parent.parent / "config" prompt_path = config_dir / prompt_file if not prompt_path.exists(): print(f"[翻译] 提示词文件不存在: {prompt_path}") return "", "" content = prompt_path.read_text(encoding="utf-8") # 解析 [system] 和 [user] 部分 
        system_prompt = ""
        user_prompt = ""

        if "[system]" in content and "[user]" in content:
            parts = content.split("[user]")
            system_part = parts[0]
            user_part = parts[1] if len(parts) > 1 else ""

            if "[system]" in system_part:
                system_prompt = system_part.split("[system]")[1].strip()
            user_prompt = user_part.strip()
        else:
            user_prompt = content

        return system_prompt, user_prompt

    def translate(self, text: str) -> TranslationResult:
        """
        Translate a single text

        Args:
            text: the text to translate

        Returns:
            TranslationResult: the translation result
        """
        result = TranslationResult(original_text=text)

        if not self.enabled:
            result.error = "翻译功能未启用"
            return result

        if not self.client.api_key:
            result.error = "未配置 AI API Key"
            return result

        if not text or not text.strip():
            result.translated_text = text
            result.success = True
            return result

        try:
            # Build the prompt
            user_prompt = self.user_prompt_template
            user_prompt = user_prompt.replace("{target_language}", self.target_language)
            user_prompt = user_prompt.replace("{content}", text)

            # Call the AI API
            response = self._call_ai(user_prompt)
            result.translated_text = response.strip()
            result.success = True

        except Exception as e:
            error_type = type(e).__name__
            error_msg = str(e)
            if len(error_msg) > 100:
                error_msg = error_msg[:100] + "..."
            result.error = f"翻译失败 ({error_type}): {error_msg}"

        return result

    def translate_batch(self, texts: List[str]) -> BatchTranslationResult:
        """
        Translate texts in a batch (a single API call)

        Args:
            texts: list of texts to translate

        Returns:
            BatchTranslationResult: the batch translation result
        """
        batch_result = BatchTranslationResult(total_count=len(texts))

        if not self.enabled:
            for text in texts:
                batch_result.results.append(TranslationResult(
                    original_text=text,
                    error="翻译功能未启用"
                ))
            batch_result.fail_count = len(texts)
            return batch_result

        if not self.client.api_key:
            for text in texts:
                batch_result.results.append(TranslationResult(
                    original_text=text,
                    error="未配置 AI API Key"
                ))
            batch_result.fail_count = len(texts)
            return batch_result

        if not texts:
            return batch_result

        # Filter out empty texts
        non_empty_indices = []
        non_empty_texts = []
        for i, text in enumerate(texts):
            if text and text.strip():
                non_empty_indices.append(i)
                non_empty_texts.append(text)

        # Initialize the result list
        for text in texts:
            batch_result.results.append(TranslationResult(original_text=text))

        # Empty texts are marked successful right away
        for i, text in enumerate(texts):
            if not text or not text.strip():
                batch_result.results[i].translated_text = text
                batch_result.results[i].success = True
                batch_result.success_count += 1

        if not non_empty_texts:
            return batch_result

        try:
            # Build the batch content (numbered format)
            batch_content = self._format_batch_content(non_empty_texts)

            # Build the prompt
            user_prompt = self.user_prompt_template
            user_prompt = user_prompt.replace("{target_language}", self.target_language)
            user_prompt = user_prompt.replace("{content}", batch_content)

            # Record debug info (the full system + user prompt)
            if self.system_prompt:
                batch_result.prompt = f"[system]\n{self.system_prompt}\n\n[user]\n{user_prompt}"
            else:
                batch_result.prompt = user_prompt

            # Call the AI API
            response = self._call_ai(user_prompt)

            # Record the raw AI response
            batch_result.raw_response = response

            # Parse the batch translation result
            translated_texts, raw_parsed_count = self._parse_batch_response(response, len(non_empty_texts))
            batch_result.parsed_count = raw_parsed_count

            # Fill in the results
            for idx, translated in zip(non_empty_indices, translated_texts):
                batch_result.results[idx].translated_text = translated
                batch_result.results[idx].success = True
                batch_result.success_count += 1

        except Exception as e:
            error_msg = f"批量翻译失败: {type(e).__name__}: {str(e)[:100]}"
            for idx in non_empty_indices:
                batch_result.results[idx].error = error_msg
            batch_result.fail_count = len(non_empty_indices)

        return batch_result

    def _format_batch_content(self, texts: List[str]) -> str:
        """Format the batch content to translate"""
        lines = []
        for i, text in enumerate(texts, 1):
            lines.append(f"[{i}] {text}")
        return "\n".join(lines)

    def _parse_batch_response(self, response: str, expected_count: int) -> tuple:
        """
        Parse a batch translation response

        Args:
            response: the AI response text
            expected_count: the expected number of translations

        Returns:
            tuple: (list of translated texts, number of entries originally parsed from the AI response)
        """
        results = []
        lines = response.strip().split("\n")

        current_idx = None
        current_text = []

        for line in lines:
            # Try to match the [number] format
            stripped = line.strip()
            if stripped.startswith("[") and "]" in stripped:
                bracket_end = stripped.index("]")
                try:
                    idx = int(stripped[1:bracket_end])
                    # Save the previous entry
                    if current_idx is not None:
                        results.append((current_idx, "\n".join(current_text).strip()))
                    current_idx = idx
                    current_text = [stripped[bracket_end + 1:].strip()]
                except ValueError:
                    if current_idx is not None:
                        current_text.append(line)
            else:
                if current_idx is not None:
                    current_text.append(line)

        # Save the last entry
        if current_idx is not None:
            results.append((current_idx, "\n".join(current_text).strip()))

        # Sort by index and extract the texts
        results.sort(key=lambda x: x[0])
        translated = [text for _, text in results]
        raw_parsed_count = len(translated)

        # If the parsed count does not match, fall back to a simple line split
        if len(translated) != expected_count:
            # Fallback: split by line (stripping the numbering)
            translated = []
            for line in lines:
                stripped = line.strip()
                if stripped.startswith("[") and "]" in stripped:
                    bracket_end = stripped.index("]")
                    translated.append(stripped[bracket_end + 1:].strip())
                elif stripped:
                    translated.append(stripped)
            raw_parsed_count = len(translated)

        # Make sure the expected number of entries is returned
        while len(translated) < expected_count:
            translated.append("")

        return translated[:expected_count], raw_parsed_count

    def
_call_ai(self, user_prompt: str) -> str:
        """Call the AI API (via LiteLLM)"""
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.append({"role": "user", "content": user_prompt})

        return self.client.chat(messages)


================================================
FILE: trendradar/context.py
================================================
# coding=utf-8
"""
Application context module

Provides a configuration context class that wraps every configuration-dependent operation, eliminating global state and wrapper functions.
"""

from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

from trendradar.utils.time import (
    DEFAULT_TIMEZONE,
    get_configured_time,
    format_date_folder,
    format_time_filename,
    get_current_time_display,
    convert_time_for_display,
    format_iso_time_friendly,
    is_within_days,
)
from trendradar.core import (
    load_frequency_words,
    matches_word_groups,
    read_all_today_titles,
    detect_latest_new_titles,
    count_word_frequency,
    Scheduler,
)
from trendradar.report import (
    prepare_report_data,
    generate_html_report,
    render_html_content,
)
from trendradar.notification import (
    render_feishu_content,
    render_dingtalk_content,
    split_content_into_batches,
    NotificationDispatcher,
)
from trendradar.ai import AITranslator
from trendradar.ai.filter import AIFilter, AIFilterResult
from trendradar.storage import get_storage_manager


class AppContext:
    """
    Application context class

    Wraps every configuration-dependent operation behind a single interface.
    Removes the dependency on a global CONFIG and improves testability.

    Usage example:
        config = load_config()
        ctx = AppContext(config)

        # Time operations
        now = ctx.get_time()
        date_folder = ctx.format_date()

        # Storage operations
        storage = ctx.get_storage_manager()

        # Report generation
        html = ctx.generate_html(stats, total_titles, ...)
""" def __init__(self, config: Dict[str, Any]): """ 初始化应用上下文 Args: config: 完整的配置字典 """ self.config = config self._storage_manager = None self._scheduler = None # === 配置访问 === @property def timezone(self) -> str: """获取配置的时区""" return self.config.get("TIMEZONE", DEFAULT_TIMEZONE) @property def rank_threshold(self) -> int: """获取排名阈值""" return self.config.get("RANK_THRESHOLD", 50) @property def weight_config(self) -> Dict: """获取权重配置""" return self.config.get("WEIGHT_CONFIG", {}) @property def platforms(self) -> List[Dict]: """获取平台配置列表""" return self.config.get("PLATFORMS", []) @property def platform_ids(self) -> List[str]: """获取平台ID列表""" return [p["id"] for p in self.platforms] @property def rss_config(self) -> Dict: """获取 RSS 配置""" return self.config.get("RSS", {}) @property def rss_enabled(self) -> bool: """RSS 是否启用""" return self.rss_config.get("ENABLED", False) @property def rss_feeds(self) -> List[Dict]: """获取 RSS 源列表""" return self.rss_config.get("FEEDS", []) @property def display_mode(self) -> str: """获取显示模式 (keyword | platform)""" return self.config.get("DISPLAY_MODE", "keyword") @property def show_new_section(self) -> bool: """是否显示新增热点区域""" return self.config.get("DISPLAY", {}).get("REGIONS", {}).get("NEW_ITEMS", True) @property def region_order(self) -> List[str]: """获取区域显示顺序""" default_order = ["hotlist", "rss", "new_items", "standalone", "ai_analysis"] return self.config.get("DISPLAY", {}).get("REGION_ORDER", default_order) @property def filter_method(self) -> str: """获取筛选策略: keyword | ai""" return self.config.get("FILTER", {}).get("METHOD", "keyword") @property def ai_priority_sort_enabled(self) -> bool: """AI 模式标签排序开关(与 keyword 的 sort_by_position_first 解耦)""" return self.config.get("FILTER", {}).get("PRIORITY_SORT_ENABLED", False) @property def ai_filter_config(self) -> Dict: """获取 AI 筛选配置""" return self.config.get("AI_FILTER", {}) @property def ai_filter_enabled(self) -> bool: """AI 筛选是否启用(基于 filter.method 判断)""" return self.filter_method == "ai" # === 
Time operations ===

    def get_time(self) -> datetime:
        """Get the current time in the configured timezone"""
        return get_configured_time(self.timezone)

    def format_date(self) -> str:
        """Format the date folder name (YYYY-MM-DD)"""
        return format_date_folder(timezone=self.timezone)

    def format_time(self) -> str:
        """Format the time file name (HH-MM)"""
        return format_time_filename(self.timezone)

    def get_time_display(self) -> str:
        """Get the time for display (HH:MM)"""
        return get_current_time_display(self.timezone)

    @staticmethod
    def convert_time_display(time_str: str) -> str:
        """Convert HH-MM to HH:MM"""
        return convert_time_for_display(time_str)

    # === Storage operations ===

    def get_storage_manager(self):
        """Get the storage manager (lazily initialized singleton)"""
        if self._storage_manager is None:
            storage_config = self.config.get("STORAGE", {})
            remote_config = storage_config.get("REMOTE", {})
            local_config = storage_config.get("LOCAL", {})
            pull_config = storage_config.get("PULL", {})

            self._storage_manager = get_storage_manager(
                backend_type=storage_config.get("BACKEND", "auto"),
                data_dir=local_config.get("DATA_DIR", "output"),
                enable_txt=storage_config.get("FORMATS", {}).get("TXT", True),
                enable_html=storage_config.get("FORMATS", {}).get("HTML", True),
                remote_config={
                    "bucket_name": remote_config.get("BUCKET_NAME", ""),
                    "access_key_id": remote_config.get("ACCESS_KEY_ID", ""),
                    "secret_access_key": remote_config.get("SECRET_ACCESS_KEY", ""),
                    "endpoint_url": remote_config.get("ENDPOINT_URL", ""),
                    "region": remote_config.get("REGION", ""),
                },
                local_retention_days=local_config.get("RETENTION_DAYS", 0),
                remote_retention_days=remote_config.get("RETENTION_DAYS", 0),
                pull_enabled=pull_config.get("ENABLED", False),
                pull_days=pull_config.get("DAYS", 7),
                timezone=self.timezone,
            )
        return self._storage_manager

    def get_output_path(self, subfolder: str, filename: str) -> str:
        """Get the output path (flat layout: output/type/date/filename)"""
        output_dir = Path("output") / subfolder / self.format_date()
        output_dir.mkdir(parents=True, exist_ok=True)
        return str(output_dir / filename)

    # === Data processing ===

    def read_today_titles(
        self, platform_ids: Optional[List[str]] = None, quiet: bool
= False
    ) -> Tuple[Dict, Dict, Dict]:
        """Read all of today's titles"""
        return read_all_today_titles(self.get_storage_manager(), platform_ids, quiet=quiet)

    def detect_new_titles(
        self, platform_ids: Optional[List[str]] = None, quiet: bool = False
    ) -> Dict:
        """Detect the titles newly added in the latest batch"""
        return detect_latest_new_titles(self.get_storage_manager(), platform_ids, quiet=quiet)

    def is_first_crawl(self) -> bool:
        """Check whether this is the first crawl of the day"""
        return self.get_storage_manager().is_first_crawl_today()

    # === Frequency word handling ===

    def load_frequency_words(
        self, frequency_file: Optional[str] = None
    ) -> Tuple[List[Dict], List[str], List[str]]:
        """Load the frequency word configuration"""
        return load_frequency_words(frequency_file)

    def matches_word_groups(
        self,
        title: str,
        word_groups: List[Dict],
        filter_words: List[str],
        global_filters: Optional[List[str]] = None,
    ) -> bool:
        """Check whether a title matches the word-group rules"""
        return matches_word_groups(title, word_groups, filter_words, global_filters)

    # === Statistics ===

    def count_frequency(
        self,
        results: Dict,
        word_groups: List[Dict],
        filter_words: List[str],
        id_to_name: Dict,
        title_info: Optional[Dict] = None,
        new_titles: Optional[Dict] = None,
        mode: str = "daily",
        global_filters: Optional[List[str]] = None,
        quiet: bool = False,
    ) -> Tuple[List[Dict], int]:
        """Count word frequencies"""
        return count_word_frequency(
            results=results,
            word_groups=word_groups,
            filter_words=filter_words,
            id_to_name=id_to_name,
            title_info=title_info,
            rank_threshold=self.rank_threshold,
            new_titles=new_titles,
            mode=mode,
            global_filters=global_filters,
            weight_config=self.weight_config,
            max_news_per_keyword=self.config.get("MAX_NEWS_PER_KEYWORD", 0),
            sort_by_position_first=self.config.get("SORT_BY_POSITION_FIRST", False),
            is_first_crawl_func=self.is_first_crawl,
            convert_time_func=self.convert_time_display,
            quiet=quiet,
        )

    # === Report generation ===

    def prepare_report(
        self,
        stats: List[Dict],
        failed_ids: Optional[List] = None,
        new_titles: Optional[Dict] = None,
        id_to_name: Optional[Dict] = None,
        mode: str = "daily",
        frequency_file: Optional[str] = None,
    ) -> Dict:
        """Prepare the report data"""
        return
prepare_report_data(
            stats=stats,
            failed_ids=failed_ids,
            new_titles=new_titles,
            id_to_name=id_to_name,
            mode=mode,
            rank_threshold=self.rank_threshold,
            matches_word_groups_func=self.matches_word_groups,
            load_frequency_words_func=lambda: self.load_frequency_words(frequency_file),
            show_new_section=self.show_new_section,
        )

    def generate_html(
        self,
        stats: List[Dict],
        total_titles: int,
        failed_ids: Optional[List] = None,
        new_titles: Optional[Dict] = None,
        id_to_name: Optional[Dict] = None,
        mode: str = "daily",
        update_info: Optional[Dict] = None,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[Any] = None,
        standalone_data: Optional[Dict] = None,
        frequency_file: Optional[str] = None,
    ) -> str:
        """Generate the HTML report"""
        return generate_html_report(
            stats=stats,
            total_titles=total_titles,
            failed_ids=failed_ids,
            new_titles=new_titles,
            id_to_name=id_to_name,
            mode=mode,
            update_info=update_info,
            rank_threshold=self.rank_threshold,
            output_dir="output",
            date_folder=self.format_date(),
            time_filename=self.format_time(),
            render_html_func=lambda *args, **kwargs: self.render_html(*args, rss_items=rss_items, rss_new_items=rss_new_items, ai_analysis=ai_analysis, standalone_data=standalone_data, **kwargs),
            matches_word_groups_func=self.matches_word_groups,
            load_frequency_words_func=lambda: self.load_frequency_words(frequency_file),
        )

    def render_html(
        self,
        report_data: Dict,
        total_titles: int,
        mode: str = "daily",
        update_info: Optional[Dict] = None,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[Any] = None,
        standalone_data: Optional[Dict] = None,
    ) -> str:
        """Render the HTML content"""
        return render_html_content(
            report_data=report_data,
            total_titles=total_titles,
            mode=mode,
            update_info=update_info,
            region_order=self.region_order,
            get_time_func=self.get_time,
            rss_items=rss_items,
            rss_new_items=rss_new_items,
            display_mode=self.display_mode,
            ai_analysis=ai_analysis,
            show_new_section=self.show_new_section,
            standalone_data=standalone_data,
        )

    # === Notification content rendering ===

    def render_feishu(
        self,
        report_data: Dict,
        update_info: Optional[Dict] = None,
        mode: str = "daily",
    ) -> str:
        """Render the Feishu content."""
        return render_feishu_content(
            report_data=report_data,
            update_info=update_info,
            mode=mode,
            separator=self.config.get("FEISHU_MESSAGE_SEPARATOR", "---"),
            region_order=self.region_order,
            get_time_func=self.get_time,
            show_new_section=self.show_new_section,
        )

    def render_dingtalk(
        self,
        report_data: Dict,
        update_info: Optional[Dict] = None,
        mode: str = "daily",
    ) -> str:
        """Render the DingTalk content."""
        return render_dingtalk_content(
            report_data=report_data,
            update_info=update_info,
            mode=mode,
            region_order=self.region_order,
            get_time_func=self.get_time,
            show_new_section=self.show_new_section,
        )

    def split_content(
        self,
        report_data: Dict,
        format_type: str,
        update_info: Optional[Dict] = None,
        max_bytes: Optional[int] = None,
        mode: str = "daily",
        rss_items: Optional[list] = None,
        rss_new_items: Optional[list] = None,
        ai_content: Optional[str] = None,
        standalone_data: Optional[Dict] = None,
        ai_stats: Optional[Dict] = None,
        report_type: str = "热点分析报告",
    ) -> List[str]:
        """Split message content into batches (hotlist + RSS merging, AI analysis, standalone region).

        Args:
            report_data: Report data
            format_type: Format type
            update_info: Update information
            max_bytes: Maximum number of bytes
            mode: Report mode
            rss_items: List of RSS statistics entries
            rss_new_items: List of new RSS entries
            ai_content: AI analysis content (an already rendered string)
            standalone_data: Data for the standalone display region
            ai_stats: AI analysis statistics
            report_type: Report type

        Returns:
            List of message contents after batching
        """
        return split_content_into_batches(
            report_data=report_data,
            format_type=format_type,
            update_info=update_info,
            max_bytes=max_bytes,
            mode=mode,
            batch_sizes={
                "dingtalk": self.config.get("DINGTALK_BATCH_SIZE", 20000),
                "feishu": self.config.get("FEISHU_BATCH_SIZE", 29000),
                "default": self.config.get("MESSAGE_BATCH_SIZE", 4000),
            },
            feishu_separator=self.config.get("FEISHU_MESSAGE_SEPARATOR", "---"),
            region_order=self.region_order,
            get_time_func=self.get_time,
            rss_items=rss_items,
            rss_new_items=rss_new_items,
            timezone=self.config.get("TIMEZONE", DEFAULT_TIMEZONE),
            display_mode=self.display_mode,
            ai_content=ai_content,
            standalone_data=standalone_data,
            rank_threshold=self.rank_threshold,
            ai_stats=ai_stats,
            report_type=report_type,
            show_new_section=self.show_new_section,
        )

    # === Notification sending ===

    def create_notification_dispatcher(self) -> NotificationDispatcher:
        """Create the notification dispatcher."""
        # Create the translator (if enabled)
        translator = None
        trans_config = self.config.get("AI_TRANSLATION", {})
        if trans_config.get("ENABLED", False):
            ai_config = self.config.get("AI", {})
            translator = AITranslator(trans_config, ai_config)

        return NotificationDispatcher(
            config=self.config,
            get_time_func=self.get_time,
            split_content_func=self.split_content,
            translator=translator,
        )

    def create_scheduler(self) -> Scheduler:
        """
        Create the scheduler (lazily initialized singleton).

        Built from the schedule section of config.yaml plus timeline.yaml.
        """
        if self._scheduler is None:
            schedule_config = self.config.get("SCHEDULE", {})
            timeline_data = self.config.get("_TIMELINE_DATA", {})
            self._scheduler = Scheduler(
                schedule_config=schedule_config,
                timeline_data=timeline_data,
                storage_backend=self.get_storage_manager(),
                get_time_func=self.get_time,
                fallback_report_mode=self.config.get("REPORT_MODE", "current"),
            )
        return self._scheduler

    # === AI smart filtering ===

    @staticmethod
    def _with_ordered_priorities(tags: List[Dict], start_priority: int = 1) -> List[Dict]:
        """Assign priorities following the current list order (smaller value = higher priority)."""
        normalized: List[Dict] = []
        priority = start_priority
        for tag_data in tags:
            if not isinstance(tag_data, dict):
                continue
            tag_name = str(tag_data.get("tag", "")).strip()
            if not tag_name:
                continue
            item = dict(tag_data)
            item["tag"] = tag_name
            item["priority"] = priority
            normalized.append(item)
            priority += 1
        return normalized

    def run_ai_filter(self, interests_file: Optional[str] = None) -> Optional[AIFilterResult]:
        """
        Run the complete AI smart filtering workflow.

        Args:
            interests_file: Name of the interests description file (located in
                config/custom/ai/); None = use the default config/ai_interests.txt

        1. Read the interests description file and compute its hash
        2. Compare against the prompt_hash in the database to decide whether to re-extract tags
        3. Collect the news items awaiting classification (deduplicated)
        4. Call the AI classifier in groups of batch_size
        5. Save the results
        6.
           Query the active results and return them grouped by tag

        Returns:
            AIFilterResult, or None (when disabled or on error)
        """
        if not self.ai_filter_enabled:
            return None

        filter_config = self.ai_filter_config
        ai_config = self.config.get("AI", {})
        debug = self.config.get("DEBUG", False)

        # Create the AIFilter instance
        ai_filter = AIFilter(ai_config, filter_config, self.get_time, debug)

        # Determine which interests file is actually used
        # None = use the default config/ai_interests.txt; a file name = config/custom/ai/{name}
        configured_interests = interests_file or filter_config.get("INTERESTS_FILE")
        effective_interests_file = configured_interests or "ai_interests.txt"

        if debug:
            print(f"[AI筛选][DEBUG] === 配置信息 ===")
            print(f"[AI筛选][DEBUG] 存储后端: {self.get_storage_manager().backend_name}")
            print(f"[AI筛选][DEBUG] batch_size={filter_config.get('BATCH_SIZE', 200)}, "
                  f"batch_interval={filter_config.get('BATCH_INTERVAL', 5)}")
            print(f"[AI筛选][DEBUG] interests_file={effective_interests_file}")
            print(f"[AI筛选][DEBUG] prompt_file={filter_config.get('PROMPT_FILE', 'prompt.txt')}")
            print(f"[AI筛选][DEBUG] extract_prompt_file={filter_config.get('EXTRACT_PROMPT_FILE', 'extract_prompt.txt')}")

        # 1. Read the interests description
        # Pass configured_interests (possibly None) to load_interests_content so that it can
        # tell the default file (config/ai_interests.txt) from a custom file (config/custom/ai/)
        interests_content = ai_filter.load_interests_content(configured_interests)
        if not interests_content:
            return AIFilterResult(success=False, error="兴趣描述文件为空或不存在")

        current_hash = ai_filter.compute_interests_hash(interests_content, effective_interests_file)
        storage = self.get_storage_manager()

        if debug:
            print(f"[AI筛选][DEBUG] 兴趣描述 hash: {current_hash}")
            print(f"[AI筛选][DEBUG] 兴趣描述内容 ({len(interests_content)} 字符):\n{interests_content}")

        # 2. Enter batch mode (remote backends defer uploading until all writes are finished)
        storage.begin_batch()

        # 3.
        # Check whether the interests prompt has changed
        stored_hash = storage.get_latest_prompt_hash(interests_file=effective_interests_file)

        if debug:
            print(f"[AI筛选][DEBUG] 数据库存储 hash: {stored_hash}")
            print(f"[AI筛选][DEBUG] hash 对比: stored={stored_hash} vs current={current_hash} → {'匹配' if stored_hash == current_hash else '不匹配'}")

        if stored_hash != current_hash:
            new_version = storage.get_latest_ai_filter_tag_version() + 1
            threshold = filter_config.get("RECLASSIFY_THRESHOLD", 0.6)

            if stored_hash is None:
                # First run: extract and save the full tag set directly
                print(f"[AI筛选] 首次运行 ({effective_interests_file}),提取标签...")
                tags_data = ai_filter.extract_tags(interests_content)
                if not tags_data:
                    storage.end_batch()
                    return AIFilterResult(success=False, error="标签提取失败")
                tags_data = self._with_ordered_priorities(tags_data, start_priority=1)
                saved_count = storage.save_ai_filter_tags(tags_data, new_version, current_hash, interests_file=effective_interests_file)
                print(f"[AI筛选] 已保存 {saved_count} 个标签 (版本 {new_version})")
            else:
                # The interests description changed: let the AI compare the old tags with
                # the new interests and propose an update plan
                old_tags = storage.get_active_ai_filter_tags(interests_file=effective_interests_file)
                update_result = ai_filter.update_tags(old_tags, interests_content)

                if update_result is None:
                    # The AI tag update failed; fall back to re-extracting all tags
                    print(f"[AI筛选] AI 标签更新失败,回退到重新提取")
                    tags_data = ai_filter.extract_tags(interests_content)
                    if not tags_data:
                        storage.end_batch()
                        return AIFilterResult(success=False, error="标签提取失败")
                    tags_data = self._with_ordered_priorities(tags_data, start_priority=1)
                    deprecated_count = storage.deprecate_all_ai_filter_tags(interests_file=effective_interests_file)
                    storage.clear_analyzed_news(interests_file=effective_interests_file)
                    saved_count = storage.save_ai_filter_tags(tags_data, new_version, current_hash, interests_file=effective_interests_file)
                    print(f"[AI筛选] 废弃 {deprecated_count} 个旧标签, 保存 {saved_count} 个新标签 (版本 {new_version})")
                else:
                    change_ratio = update_result["change_ratio"]
                    keep_tags = update_result["keep"]
                    add_tags = update_result["add"]
                    remove_tags = update_result["remove"]

                    if debug:
                        print(f"[AI筛选][DEBUG] AI 标签更新: keep={len(keep_tags)}, add={len(add_tags)}, remove={len(remove_tags)}, change_ratio={change_ratio:.2f}, threshold={threshold:.2f}")

                    if change_ratio >= threshold:
                        # Full reclassification: deprecate all old tags and re-extract with extract_tags
                        print(f"[AI筛选] 兴趣文件变更: {effective_interests_file} (AI change_ratio={change_ratio:.2f} >= threshold={threshold:.2f} → 全量重分类)")
                        tags_data = ai_filter.extract_tags(interests_content)
                        if not tags_data:
                            storage.end_batch()
                            return AIFilterResult(success=False, error="标签提取失败")
                        tags_data = self._with_ordered_priorities(tags_data, start_priority=1)
                        deprecated_count = storage.deprecate_all_ai_filter_tags(interests_file=effective_interests_file)
                        storage.clear_analyzed_news(interests_file=effective_interests_file)
                        saved_count = storage.save_ai_filter_tags(tags_data, new_version, current_hash, interests_file=effective_interests_file)
                        print(f"[AI筛选] 废弃 {deprecated_count} 个旧标签, 保存 {saved_count} 个新标签 (版本 {new_version})")
                    else:
                        # Incremental update: follow the AI's instructions
                        print(f"[AI筛选] 兴趣文件变更: {effective_interests_file} (AI change_ratio={change_ratio:.2f} < threshold={threshold:.2f} → 增量更新)")
                        print(f"[AI筛选] 保留 {len(keep_tags)} 个标签, 新增 {len(add_tags)} 个, 废弃 {len(remove_tags)} 个")

                        # Deprecate the tags the AI marked for removal
                        if remove_tags:
                            remove_set = set(remove_tags)
                            removed_ids = [t["id"] for t in old_tags if t["tag"] in remove_set]
                            if removed_ids:
                                storage.deprecate_specific_ai_filter_tags(removed_ids)
                                if debug:
                                    print(f"[AI筛选][DEBUG] 废弃标签 IDs: {removed_ids}")

                        # Update the descriptions (and priorities) of the kept tags
                        keep_with_priority = []
                        if keep_tags:
                            storage.update_ai_filter_tag_descriptions(keep_tags, interests_file=effective_interests_file)
                            keep_with_priority = self._with_ordered_priorities(keep_tags, start_priority=1)
                            storage.update_ai_filter_tag_priorities(keep_with_priority, interests_file=effective_interests_file)

                        # Save the newly added tags, numbered after the kept ones
                        if add_tags:
                            add_start = keep_with_priority[-1]["priority"] + 1 if keep_with_priority else 1
                            add_with_priority = self._with_ordered_priorities(add_tags, start_priority=add_start)
                            saved_count = storage.save_ai_filter_tags(add_with_priority, new_version,
                                current_hash, interests_file=effective_interests_file)
                            if debug:
                                print(f"[AI筛选][DEBUG] 新增保存 {saved_count} 个标签")

                        # Update the hash of the kept tags (marks them as processed)
                        storage.update_ai_filter_tags_hash(effective_interests_file, current_hash)

                        # Incremental update: clear the analysis records of unmatched news so
                        # that they can be re-analyzed against the new tag set
                        if add_tags:
                            cleared = storage.clear_unmatched_analyzed_news(interests_file=effective_interests_file)
                            if cleared > 0:
                                print(f"[AI筛选] 清除 {cleared} 条不匹配记录,将在新标签下重新分析")

        # 3. Fetch the currently active tags
        active_tags = storage.get_active_ai_filter_tags(interests_file=effective_interests_file)

        if debug:
            print(f"[AI筛选][DEBUG] 从数据库获取 active 标签: {len(active_tags)} 个")
            for t in active_tags:
                print(f"[AI筛选][DEBUG] id={t['id']} tag={t['tag']} priority={t.get('priority', 9999)} version={t.get('version')} hash={t.get('prompt_hash', '')[:8]}...")

        if not active_tags:
            storage.end_batch()
            return AIFilterResult(success=False, error="没有可用的标签")

        print(f"[AI筛选] 使用 {len(active_tags)} 个标签")

        # 4. Collect the news items awaiting classification
        # Hotlist
        all_news = storage.get_all_news_ids()
        analyzed_hotlist = storage.get_analyzed_news_ids("hotlist", interests_file=effective_interests_file)
        pending_news = [n for n in all_news if n["id"] not in analyzed_hotlist]

        # RSS (apply the freshness filter first, then drop already classified items)
        pending_rss = []
        freshness_filtered_rss = 0
        if self.rss_enabled:
            all_rss = storage.get_all_rss_ids()

            # Apply the freshness filter (consistent with the push stage)
            rss_config = self.rss_config
            freshness_config = rss_config.get("FRESHNESS_FILTER", {})
            freshness_enabled = freshness_config.get("ENABLED", True)
            default_max_age_days = freshness_config.get("MAX_AGE_DAYS", 3)
            timezone = self.config.get("TIMEZONE", DEFAULT_TIMEZONE)

            # Build the feed_id -> max_age_days mapping
            feed_max_age_map = {}
            for feed_cfg in self.rss_feeds:
                feed_id = feed_cfg.get("id", "")
                max_age = feed_cfg.get("max_age_days")
                if max_age is not None:
                    try:
                        feed_max_age_map[feed_id] = int(max_age)
                    except (ValueError, TypeError):
                        pass

            fresh_rss = []
            for n in all_rss:
                published_at = n.get("published_at", "")
                feed_id = n.get("source_id", "")
                max_days = feed_max_age_map.get(feed_id, default_max_age_days)
                if freshness_enabled and max_days > 0 and published_at:
                    if not is_within_days(published_at, max_days, timezone):
                        freshness_filtered_rss += 1
                        continue
                fresh_rss.append(n)

            analyzed_rss = storage.get_analyzed_news_ids("rss", interests_file=effective_interests_file)
            pending_rss = [n for n in fresh_rss if n["id"] not in analyzed_rss]

        # Always print the total / already analyzed / pending breakdown
        hotlist_total = len(all_news)
        hotlist_skipped = len(analyzed_hotlist)
        hotlist_pending = len(pending_news)
        print(f"[AI筛选] 热榜: 总计 {hotlist_total} 条, 已分析跳过 {hotlist_skipped} 条, 本次发送AI分析 {hotlist_pending} 条")

        if self.rss_enabled:
            rss_total = len(all_rss)
            rss_skipped = len(analyzed_rss)
            rss_pending = len(pending_rss)
            freshness_info = f", 新鲜度过滤 {freshness_filtered_rss} 条" if freshness_filtered_rss > 0 else ""
            print(f"[AI筛选] RSS: 总计 {rss_total} 条{freshness_info}, 已分析跳过 {rss_skipped} 条, 本次发送AI分析 {rss_pending} 条")

        total_pending = len(pending_news) + len(pending_rss)
        if total_pending == 0:
            print("[AI筛选] 没有新增新闻需要分类")

        # 5. Classify in batches
        batch_size = filter_config.get("BATCH_SIZE", 200)
        batch_interval = filter_config.get("BATCH_INTERVAL", 5)
        total_results = []
        batch_count = 0  # Global batch counter shared by the hotlist and RSS loops

        # Process the hotlist
        for i in range(0, len(pending_news), batch_size):
            if batch_count > 0 and batch_interval > 0:
                import time
                print(f"[AI筛选] 批次间隔等待 {batch_interval} 秒...")
                time.sleep(batch_interval)
            batch = pending_news[i:i + batch_size]
            titles_for_ai = [
                {"id": n["id"], "title": n["title"], "source": n.get("source_name", "")}
                for n in batch
            ]
            batch_results = ai_filter.classify_batch(titles_for_ai, active_tags, interests_content)
            for r in batch_results:
                r["source_type"] = "hotlist"
            total_results.extend(batch_results)
            batch_count += 1
            print(f"[AI筛选] 热榜批次 {i // batch_size + 1}: {len(batch)} 条 → {len(batch_results)} 条匹配")

        # Process RSS
        for i in range(0, len(pending_rss), batch_size):
            if batch_count > 0 and batch_interval > 0:
                import time
                print(f"[AI筛选] 批次间隔等待 {batch_interval} 秒...")
                time.sleep(batch_interval)
            batch = pending_rss[i:i + batch_size]
            titles_for_ai = [
                {"id": n["id"], "title":
                    n["title"], "source": n.get("source_name", "")}
                for n in batch
            ]
            batch_results = ai_filter.classify_batch(titles_for_ai, active_tags, interests_content)
            for r in batch_results:
                r["source_type"] = "rss"
            total_results.extend(batch_results)
            batch_count += 1
            print(f"[AI筛选] RSS 批次 {i // batch_size + 1}: {len(batch)} 条 → {len(batch_results)} 条匹配")

        # 6. Save the results
        if total_results:
            saved = storage.save_ai_filter_results(total_results)
            print(f"[AI筛选] 保存 {saved} 条分类结果")
            if debug and saved != len(total_results):
                print(f"[AI筛选][DEBUG] !! 保存数量不一致: 期望 {len(total_results)}, 实际 {saved}(可能有重复记录被跳过)")

        # 6.5 Record every analyzed news item (matched and unmatched) for deduplication
        matched_hotlist_ids = {r["news_item_id"] for r in total_results if r.get("source_type") == "hotlist"}
        matched_rss_ids = {r["news_item_id"] for r in total_results if r.get("source_type") == "rss"}

        if pending_news:
            hotlist_ids = [n["id"] for n in pending_news]
            storage.save_analyzed_news(
                hotlist_ids, "hotlist", effective_interests_file,
                current_hash, matched_hotlist_ids
            )

        if pending_rss:
            rss_ids = [n["id"] for n in pending_rss]
            storage.save_analyzed_news(
                rss_ids, "rss", effective_interests_file,
                current_hash, matched_rss_ids
            )

        if pending_news or pending_rss:
            total_analyzed = len(pending_news) + len(pending_rss)
            total_matched = len(matched_hotlist_ids) + len(matched_rss_ids)
            print(f"[AI筛选] 已记录 {total_analyzed} 条新闻分析状态 (匹配 {total_matched}, 不匹配 {total_analyzed - total_matched})")

        # 7. Leave batch mode (upload the database to remote storage in one go)
        storage.end_batch()

        # 8.
        # Query and assemble the returned result
        all_results = storage.get_active_ai_filter_results(interests_file=effective_interests_file)

        if debug:
            print(f"[AI筛选][DEBUG] === 最终汇总 ===")
            print(f"[AI筛选][DEBUG] 数据库 active 分类结果: {len(all_results)} 条")
            # Count per tag
            tag_counts: dict = {}
            for r in all_results:
                tag_name = r.get("tag", "?")
                src_type = r.get("source_type", "?")
                key = f"{tag_name}({src_type})"
                tag_counts[key] = tag_counts.get(key, 0) + 1
            for key, count in sorted(tag_counts.items()):
                print(f"[AI筛选][DEBUG] {key}: {count} 条")

        return self._build_filter_result(all_results, active_tags, total_pending)

    def _build_filter_result(
        self,
        raw_results: List[Dict],
        tags: List[Dict],
        total_processed: int,
    ) -> AIFilterResult:
        """Assemble the database query results into an AIFilterResult."""
        priority_sort_enabled = self.ai_priority_sort_enabled
        tag_priority_map = {}
        for idx, t in enumerate(tags, start=1):
            tag_name = str(t.get("tag", "")).strip() if isinstance(t, dict) else ""
            if not tag_name:
                continue
            try:
                tag_priority_map[tag_name] = int(t.get("priority", idx))
            except (TypeError, ValueError):
                tag_priority_map[tag_name] = idx

        # Group by tag
        tag_groups: Dict[str, Dict] = {}
        seen_titles: Dict[str, set] = {}  # Deduplicate within each tag

        for r in raw_results:
            tag_name = r["tag"]
            if tag_name not in tag_groups:
                raw_priority = r.get("tag_priority", tag_priority_map.get(tag_name, 9999))
                try:
                    tag_position = int(raw_priority)
                except (TypeError, ValueError):
                    tag_position = 9999
                tag_groups[tag_name] = {
                    "tag": tag_name,
                    "description": r.get("tag_description", ""),
                    "position": tag_position,
                    "count": 0,
                    "items": [],
                }
                seen_titles[tag_name] = set()

            title = r["title"]
            if title in seen_titles[tag_name]:
                continue
            seen_titles[tag_name].add(title)

            tag_groups[tag_name]["items"].append({
                "title": title,
                "source_id": r.get("source_id", ""),
                "source_name": r.get("source_name", ""),
                "url": r.get("url", ""),
                "mobile_url": r.get("mobile_url", ""),
                "rank": r.get("rank", 0),
                "ranks": r.get("ranks", []),
                "first_time": r.get("first_time", ""),
                "last_time": r.get("last_time", ""),
                "count": r.get("count",
                    1),
                "relevance_score": r.get("relevance_score", 0),
                "source_type": r.get("source_type", "hotlist"),
            })
            tag_groups[tag_name]["count"] += 1

        # Sort per configuration: position first / count first
        if priority_sort_enabled:
            sorted_tags = sorted(
                tag_groups.values(),
                key=lambda x: (x.get("position", 9999), -x["count"], x["tag"]),
            )
        else:
            sorted_tags = sorted(
                tag_groups.values(),
                key=lambda x: (-x["count"], x.get("position", 9999), x["tag"]),
            )

        total_matched = sum(t["count"] for t in sorted_tags)

        return AIFilterResult(
            tags=sorted_tags,
            total_matched=total_matched,
            total_processed=total_processed,
            success=True,
        )

    def convert_ai_filter_to_report_data(
        self,
        ai_filter_result: AIFilterResult,
        mode: str = "daily",
        new_titles: Optional[Dict] = None,
        rss_new_urls: Optional[set] = None,
    ) -> tuple:
        """
        Convert the AI filter result into the same data structure that keyword matching produces.

        Each tag in AIFilterResult.tags corresponds to one "word" (keyword group).
        Entries in tag.items with source_type="hotlist" go into the hotlist stats;
        entries with source_type="rss" go into the rss_items stats.

        Args:
            ai_filter_result: AI filter result
            mode: Report mode ("daily" | "current" | "incremental")
            new_titles: New hotlist titles {source_id: {title: data}}, used for is_new detection
            rss_new_urls: Set of URLs of new RSS entries, used for is_new detection

        Returns:
            (hotlist_stats, rss_stats):
            - hotlist_stats: same format as the output of count_word_frequency()
            - rss_stats: same format as rss_items
        """
        hotlist_stats = []
        rss_stats = []
        max_news = self.config.get("MAX_NEWS_PER_KEYWORD", 0)
        min_score = self.ai_filter_config.get("MIN_SCORE", 0)

        # current mode: find the latest time and keep only hotlist news that is still ranked
        # (aligned with the filtering logic of count_word_frequency(mode="current"))
        latest_time = None
        if mode == "current":
            for tag_data in ai_filter_result.tags:
                for item in tag_data.get("items", []):
                    if item.get("source_type", "hotlist") == "hotlist":
                        last_time = item.get("last_time", "")
                        if last_time and (latest_time is None or last_time > latest_time):
                            latest_time = last_time
            if latest_time:
                print(f"[AI筛选] current 模式:最新时间 {latest_time},过滤已下榜新闻")

        # RSS freshness filter configuration (consistent with the push stage)
        rss_config = self.rss_config
        freshness_config = rss_config.get("FRESHNESS_FILTER", {})
        freshness_enabled = freshness_config.get("ENABLED", True)
        default_max_age_days = freshness_config.get("MAX_AGE_DAYS", 3)
        timezone = self.config.get("TIMEZONE", DEFAULT_TIMEZONE)

        feed_max_age_map = {}
        for feed_cfg in self.rss_feeds:
            feed_id = feed_cfg.get("id", "")
            max_age = feed_cfg.get("max_age_days")
            if max_age is not None:
                try:
                    feed_max_age_map[feed_id] = int(max_age)
                except (ValueError, TypeError):
                    pass

        filtered_count = 0
        for tag_data in ai_filter_result.tags:
            tag_name = tag_data.get("tag", "")
            items = tag_data.get("items", [])
            if not items:
                continue

            hotlist_titles = []
            rss_titles = []

            for item in items:
                source_type = item.get("source_type", "hotlist")

                # current mode: skip hotlist news that has dropped off the ranking
                if mode == "current" and latest_time and source_type == "hotlist":
                    if item.get("last_time", "") != latest_time:
                        filtered_count += 1
                        continue

                # Score threshold: skip news whose relevance is below min_score
                if min_score > 0:
                    score = item.get("relevance_score", 0)
                    if score < min_score:
                        continue

                # Build the time display
                first_time = item.get("first_time", "")
                last_time = item.get("last_time", "")

                if source_type == "rss":
                    # RSS freshness filter: skip articles older than max_age_days
                    if freshness_enabled and first_time:
                        feed_id = item.get("source_id", "")
                        max_days = feed_max_age_map.get(feed_id, default_max_age_days)
                        if max_days > 0 and not is_within_days(first_time, max_days, timezone):
                            continue
                    # RSS entry: first_time is in ISO format; show it in a friendly format
                    if first_time:
                        time_display = format_iso_time_friendly(first_time, timezone, include_date=True)
                    else:
                        time_display = ""
                else:
                    # Hotlist entry: use the [HH:MM ~ HH:MM] format (same as keyword mode)
                    if first_time and last_time and first_time != last_time:
                        first_display = convert_time_for_display(first_time)
                        last_display = convert_time_for_display(last_time)
                        time_display = f"[{first_display} ~ {last_display}]"
                    elif first_time:
                        time_display = convert_time_for_display(first_time)
                    else:
                        time_display = ""

                # Compute is_new (aligned with keyword mode, core/analyzer.py:335-342)
                if source_type == "rss":
                    is_new = False
                    if rss_new_urls:
                        item_url = item.get("url", "")
                        is_new = item_url in rss_new_urls if item_url else False
                else:
                    is_new = False
                    if new_titles:
                        item_source_id = item.get("source_id", "")
                        item_title = item.get("title", "")
                        if item_source_id in new_titles:
                            is_new = item_title in new_titles[item_source_id]

                # In incremental mode, keep only the entries newly matched in this round.
                # run_ai_filter() returns the set of active results, so previously matched
                # entries must be filtered out explicitly here to match keyword-mode behavior.
                if mode == "incremental" and not is_new:
                    continue

                title_entry = {
                    "title": item.get("title", ""),
                    "source_name": item.get("source_name", ""),
                    "url": item.get("url", ""),
                    "mobile_url": item.get("mobile_url", ""),
                    "ranks": item.get("ranks", []),
                    "rank_threshold": self.rank_threshold,
                    "count": item.get("count", 1),
                    "is_new": is_new,
                    "time_display": time_display,
                    "matched_keyword": tag_name,
                }

                if source_type == "rss":
                    rss_titles.append(title_entry)
                else:
                    hotlist_titles.append(title_entry)

            if hotlist_titles:
                if max_news > 0:
                    hotlist_titles = hotlist_titles[:max_news]
                hotlist_stats.append({
                    "word": tag_name,
                    "count": len(hotlist_titles),
                    "position": tag_data.get("position", 9999),
                    "titles": hotlist_titles,
                })

            if rss_titles:
                if max_news > 0:
                    rss_titles = rss_titles[:max_news]
                rss_stats.append({
                    "word": tag_name,
                    "count": len(rss_titles),
                    "position": tag_data.get("position", 9999),
                    "titles": rss_titles,
                })

        if mode == "current" and filtered_count > 0:
            total_kept = sum(s["count"] for s in hotlist_stats)
            print(f"[AI筛选] current 模式:过滤 {filtered_count} 条已下榜新闻,保留 {total_kept} 条当前在榜")

        if min_score > 0:
            hotlist_kept = sum(s["count"] for s in hotlist_stats)
            rss_kept = sum(s["count"] for s in rss_stats)
            total_kept = hotlist_kept + rss_kept
            parts = [f"热榜 {hotlist_kept} 条"]
            if rss_kept > 0:
                parts.append(f"RSS {rss_kept} 条")
            print(f"[AI筛选] 分数过滤:min_score={min_score},保留 {total_kept} 条 score≥{min_score} ({', '.join(parts)})")

        priority_sort_enabled = self.ai_priority_sort_enabled
        if priority_sort_enabled:
            hotlist_stats.sort(key=lambda x: (x.get("position", 9999), -x["count"], x["word"]))
            rss_stats.sort(key=lambda x: (x.get("position", 9999), -x["count"], x["word"]))
        else:
            hotlist_stats.sort(key=lambda x:
                (-x["count"], x.get("position", 9999), x["word"]))
            rss_stats.sort(key=lambda x: (-x["count"], x.get("position", 9999), x["word"]))

        return hotlist_stats, rss_stats

    # === Resource cleanup ===

    def cleanup(self):
        """Release resources."""
        if self._storage_manager:
            self._storage_manager.cleanup_old_data()
            self._storage_manager.cleanup()
            self._storage_manager = None


================================================
FILE: trendradar/core/__init__.py
================================================
# coding=utf-8
"""
Core module - configuration management and core utilities
"""

from trendradar.core.config import (
    parse_multi_account_config,
    validate_paired_configs,
    limit_accounts,
    get_account_at_index,
)
from trendradar.core.loader import load_config
from trendradar.core.frequency import load_frequency_words, matches_word_groups
from trendradar.core.scheduler import Scheduler, ResolvedSchedule
from trendradar.core.data import (
    read_all_today_titles_from_storage,
    read_all_today_titles,
    detect_latest_new_titles_from_storage,
    detect_latest_new_titles,
)
from trendradar.core.analyzer import (
    calculate_news_weight,
    format_time_display,
    count_word_frequency,
    count_rss_frequency,
)

__all__ = [
    "parse_multi_account_config",
    "validate_paired_configs",
    "limit_accounts",
    "get_account_at_index",
    "load_config",
    "load_frequency_words",
    "matches_word_groups",
    # Data processing
    "read_all_today_titles_from_storage",
    "read_all_today_titles",
    "detect_latest_new_titles_from_storage",
    "detect_latest_new_titles",
    # Statistics and analysis
    "calculate_news_weight",
    "format_time_display",
    "count_word_frequency",
    "count_rss_frequency",
    # Scheduler
    "Scheduler",
    "ResolvedSchedule",
]


================================================
FILE: trendradar/core/analyzer.py
================================================
# coding=utf-8
"""
Statistics and analysis module

Provides news statistics and analysis functions:
- calculate_news_weight: compute the weight of a news item
- format_time_display: format the time display
- count_word_frequency: count word frequency
"""

from typing import Dict, List, Tuple, Optional, Callable

from trendradar.core.frequency import matches_word_groups, _word_matches
from trendradar.utils.time import DEFAULT_TIMEZONE


def calculate_news_weight(
    title_data: Dict,
    rank_threshold: int,
    weight_config: Dict,
) -> float:
    """
    Compute the weight of a news item, used for sorting.

    Args:
        title_data: Title data containing ranks and count
        rank_threshold: Rank threshold
        weight_config: Weight configuration {RANK_WEIGHT, FREQUENCY_WEIGHT, HOTNESS_WEIGHT}

    Returns:
        float: The computed weight
    """
    ranks = title_data.get("ranks", [])
    if not ranks:
        return 0.0

    count = title_data.get("count", len(ranks))

    # Rank weight: Σ(11 - min(rank, 10)) / number of appearances
    rank_scores = []
    for rank in ranks:
        score = 11 - min(rank, 10)
        rank_scores.append(score)

    rank_weight = sum(rank_scores) / len(ranks) if ranks else 0

    # Frequency weight: min(number of appearances, 10) × 10
    frequency_weight = min(count, 10) * 10

    # Hotness bonus: high-rank appearances / total appearances × 100
    high_rank_count = sum(1 for rank in ranks if rank <= rank_threshold)
    hotness_ratio = high_rank_count / len(ranks) if ranks else 0
    hotness_weight = hotness_ratio * 100

    total_weight = (
        rank_weight * weight_config["RANK_WEIGHT"]
        + frequency_weight * weight_config["FREQUENCY_WEIGHT"]
        + hotness_weight * weight_config["HOTNESS_WEIGHT"]
    )

    return total_weight


def format_time_display(
    first_time: str,
    last_time: str,
    convert_time_func: Callable[[str], str],
) -> str:
    """
    Format the time display (converts HH-MM to HH:MM).

    Args:
        first_time: Time of first appearance
        last_time: Time of last appearance
        convert_time_func: Time format conversion function

    Returns:
        str: The formatted time display string
    """
    if not first_time:
        return ""

    # Convert to the display format
    first_display = convert_time_func(first_time)
    last_display = convert_time_func(last_time)

    if first_display == last_display or not last_display:
        return first_display
    else:
        return f"[{first_display} ~ {last_display}]"


def count_word_frequency(
    results: Dict,
    word_groups: List[Dict],
    filter_words: List[str],
    id_to_name: Dict,
    title_info: Optional[Dict] = None,
    rank_threshold: int = 3,
    new_titles: Optional[Dict] = None,
    mode: str = "daily",
    global_filters: Optional[List[str]] = None,
    weight_config: Optional[Dict] = None,
    max_news_per_keyword: int = 0,
    sort_by_position_first: bool = False,
    is_first_crawl_func: Optional[Callable[[], bool]] = None,
    convert_time_func:
        Optional[Callable[[str], str]] = None,
    quiet: bool = False,
) -> Tuple[List[Dict], int]:
    """
    Count word frequency. Supports required words, frequency words, filter words,
    and global filter words, and flags newly added titles.

    Args:
        results: Crawl results {source_id: {title: title_data}}
        word_groups: List of word group configurations
        filter_words: List of filter words
        id_to_name: Mapping from ID to name
        title_info: Title statistics (optional)
        rank_threshold: Rank threshold
        new_titles: Newly added titles (optional)
        mode: Report mode (daily/incremental/current)
        global_filters: Global filter words (optional)
        weight_config: Weight configuration
        max_news_per_keyword: Maximum number of items shown per keyword
        sort_by_position_first: Whether to sort by configured position first
        is_first_crawl_func: Function that checks whether this is the first crawl of the day
        convert_time_func: Time format conversion function
        quiet: Quiet mode (do not print logs)

    Returns:
        Tuple[List[Dict], int]: (list of statistics, total number of titles)
    """
    # Default weight configuration
    if weight_config is None:
        weight_config = {
            "RANK_WEIGHT": 0.4,
            "FREQUENCY_WEIGHT": 0.3,
            "HOTNESS_WEIGHT": 0.3,
        }

    # Default time conversion function
    if convert_time_func is None:
        convert_time_func = lambda x: x

    # Default first-crawl detection function
    if is_first_crawl_func is None:
        is_first_crawl_func = lambda: True

    # With no word groups configured, create a virtual group that contains all news
    if not word_groups:
        print("频率词配置为空,将显示所有新闻")
        word_groups = [{"required": [], "normal": [], "group_key": "全部新闻"}]
        filter_words = []  # Clear the filter words so that all news is shown

    is_first_today = is_first_crawl_func()

    # Decide which data to process and how to flag new items
    if mode == "incremental":
        if is_first_today:
            # Incremental mode + first crawl of the day: process all news, all flagged as new
            results_to_process = results
            all_news_are_new = True
        else:
            # Incremental mode + later crawl of the day: process only the new items
            results_to_process = new_titles if new_titles else {}
            all_news_are_new = True
    elif mode == "current":
        # current mode: process only the latest batch, but take statistics from the full history
        if title_info:
            latest_time = None
            for source_titles in title_info.values():
                for title_data in source_titles.values():
                    last_time = title_data.get("last_time", "")
                    if last_time:
                        if latest_time is None or last_time > latest_time:
                            latest_time = last_time

            # Process only the news whose last_time equals the latest time
            if latest_time:
                results_to_process = {}
                for source_id, source_titles in results.items():
                    if source_id in title_info:
                        filtered_titles = {}
                        for title, title_data in source_titles.items():
                            if title in title_info[source_id]:
                                info = title_info[source_id][title]
                                if info.get("last_time") == latest_time:
                                    filtered_titles[title] = title_data
                        if filtered_titles:
                            results_to_process[source_id] = filtered_titles

                if not quiet:
                    print(
                        f"当前榜单模式:最新时间 {latest_time},筛选出 {sum(len(titles) for titles in results_to_process.values())} 条当前榜单新闻"
                    )
            else:
                results_to_process = results
        else:
            results_to_process = results
        all_news_are_new = False
    else:
        # Daily summary mode: process all news
        results_to_process = results
        all_news_are_new = False
        total_input_news = sum(len(titles) for titles in results.values())
        filter_status = (
            "全部显示"
            if len(word_groups) == 1 and word_groups[0]["group_key"] == "全部新闻"
            else "频率词过滤"
        )
        print(f"当日汇总模式:处理 {total_input_news} 条新闻,模式:{filter_status}")

    word_stats = {}
    total_titles = 0
    processed_titles = {}
    matched_new_count = 0

    if title_info is None:
        title_info = {}
    if new_titles is None:
        new_titles = {}

    for group in word_groups:
        group_key = group["group_key"]
        word_stats[group_key] = {"count": 0, "titles": {}}

    for source_id, titles_data in results_to_process.items():
        total_titles += len(titles_data)

        if source_id not in processed_titles:
            processed_titles[source_id] = {}

        for title, title_data in titles_data.items():
            if title in processed_titles.get(source_id, {}):
                continue

            # Use the unified matching logic
            matches_frequency_words = matches_word_groups(
                title, word_groups, filter_words, global_filters
            )

            if not matches_frequency_words:
                continue

            # In incremental mode, or on the first crawl in current mode, count the matched new items
            if (mode == "incremental" and all_news_are_new) or (
                mode == "current" and is_first_today
            ):
                matched_new_count += 1

            source_ranks = title_data.get("ranks", [])
            source_url = title_data.get("url", "")
            source_mobile_url = title_data.get("mobileUrl", "")

            # Find the matching word group (defensive conversion keeps the type safe)
            title_lower = str(title).lower() if not isinstance(title, str) else title.lower()
            for group in word_groups:
                required_words = group["required"]
                normal_words = group["normal"]

                # In "全部新闻" (all news) mode, every title matches the first (and only) group
                if len(word_groups) == 1 and word_groups[0]["group_key"] == "全部新闻":
                    group_key = group["group_key"]
                    word_stats[group_key]["count"] += 1
                    if source_id not in word_stats[group_key]["titles"]:
                        word_stats[group_key]["titles"][source_id] = []
                else:
                    # Original matching logic (supports regex syntax)
                    if required_words:
                        all_required_present = all(
                            _word_matches(req_item, title_lower)
                            for req_item in required_words
                        )
                        if not all_required_present:
                            continue

                    if normal_words:
                        any_normal_present = any(
                            _word_matches(normal_item, title_lower)
                            for normal_item in normal_words
                        )
                        if not any_normal_present:
                            continue

                    group_key = group["group_key"]
                    word_stats[group_key]["count"] += 1
                    if source_id not in word_stats[group_key]["titles"]:
                        word_stats[group_key]["titles"][source_id] = []

                first_time = ""
                last_time = ""
                count_info = 1
                ranks = source_ranks if source_ranks else []
                url = source_url
                mobile_url = source_mobile_url
                rank_timeline = []

                # In current mode, take the complete data from the historical statistics
                if (
                    mode == "current"
                    and title_info
                    and source_id in title_info
                    and title in title_info[source_id]
                ):
                    info = title_info[source_id][title]
                    first_time = info.get("first_time", "")
                    last_time = info.get("last_time", "")
                    count_info = info.get("count", 1)
                    if "ranks" in info and info["ranks"]:
                        ranks = info["ranks"]
                    url = info.get("url", source_url)
                    mobile_url = info.get("mobileUrl", source_mobile_url)
                    rank_timeline = info.get("rank_timeline", [])
                elif (
                    title_info
                    and source_id in title_info
                    and title in title_info[source_id]
                ):
                    info = title_info[source_id][title]
                    first_time = info.get("first_time", "")
                    last_time = info.get("last_time", "")
                    count_info = info.get("count", 1)
                    if "ranks" in info and info["ranks"]:
                        ranks = info["ranks"]
                    url = info.get("url", source_url)
                    mobile_url = info.get("mobileUrl", source_mobile_url)
                    rank_timeline = info.get("rank_timeline", [])

                if not ranks:
                    ranks = [99]

                time_display = format_time_display(first_time, last_time, convert_time_func)

                source_name = id_to_name.get(source_id, source_id)

                # Decide whether the item is new
                is_new = False
                if all_news_are_new:
                    # In incremental mode every processed item is new, as is everything
                    # from the first crawl of the day
                    is_new = True
                elif new_titles and source_id in new_titles:
                    # Check whether the title is in the list of new titles
                    new_titles_for_source = new_titles[source_id]
                    is_new = title in new_titles_for_source

                word_stats[group_key]["titles"][source_id].append(
                    {
                        "title": title,
                        "source_name": source_name,
                        "first_time": first_time,
                        "last_time": last_time,
                        "time_display": time_display,
                        "count": count_info,
                        "ranks": ranks,
                        "rank_threshold": rank_threshold,
                        "url": url,
                        "mobileUrl": mobile_url,
                        "is_new": is_new,
                        "rank_timeline": rank_timeline,
                    }
                )

                if source_id not in processed_titles:
                    processed_titles[source_id] = {}
                processed_titles[source_id][title] = True

                break

    # Print the summary once, at the end
    if mode == "incremental":
        if is_first_today:
            total_input_news = sum(len(titles) for titles in results.values())
            filter_status = (
                "全部显示"
                if len(word_groups) == 1 and word_groups[0]["group_key"] == "全部新闻"
                else "频率词匹配"
            )
            if not quiet:
                print(
                    f"增量模式:当天第一次爬取,{total_input_news} 条新闻中有 {matched_new_count} 条{filter_status}"
                )
        else:
            if new_titles:
                total_new_count = sum(len(titles) for titles in new_titles.values())
                filter_status = (
                    "全部显示"
                    if len(word_groups) == 1
                    and word_groups[0]["group_key"] == "全部新闻"
                    else "匹配频率词"
                )
                if not quiet:
                    print(
                        f"增量模式:{total_new_count} 条新增新闻中,有 {matched_new_count} 条{filter_status}"
                    )
                    if matched_new_count == 0 and len(word_groups) > 1:
                        print("增量模式:没有新增新闻匹配频率词,将不会发送通知")
            else:
                if not quiet:
                    print("增量模式:未检测到新增新闻")
    elif mode == "current":
        total_input_news = sum(len(titles) for titles in results_to_process.values())
        if is_first_today:
            filter_status = (
                "全部显示"
                if len(word_groups) == 1 and word_groups[0]["group_key"] == "全部新闻"
                else "频率词匹配"
            )
            if not quiet:
                print(
                    f"当前榜单模式:当天第一次爬取,{total_input_news} 条当前榜单新闻中有 {matched_new_count} 条{filter_status}"
                )
        else:
            matched_count = sum(stat["count"] for stat in word_stats.values())
            filter_status = (
                "全部显示"
                if len(word_groups) == 1 and word_groups[0]["group_key"] == "全部新闻"
                else "频率词匹配"
            )
            if not quiet:
                print(
                    f"当前榜单模式:{total_input_news} 条当前榜单新闻中有 {matched_count} 条{filter_status}"
                )

    stats = []
    # Map each group_key to its position, maximum count, and display name
    group_key_to_position = {
        group["group_key"]: idx for idx, group in enumerate(word_groups)
    }
    group_key_to_max_count = {
        group["group_key"]: group.get("max_count", 0) for group in word_groups
    }
    group_key_to_display_name = {
        group["group_key"]: group.get("display_name") for group in word_groups
    }

    for group_key, data in word_stats.items():
        all_titles = []
        for source_id, title_list in data["titles"].items():
            all_titles.extend(title_list)

        # Sort by weight, then best rank, then appearance count
        sorted_titles = sorted(
            all_titles,
            key=lambda x: (
                -calculate_news_weight(x, rank_threshold, weight_config),
                min(x["ranks"]) if x["ranks"] else 999,
                -x["count"],
            ),
        )

        # Apply the display limit (priority: per-group setting > global setting)
        group_max_count = group_key_to_max_count.get(group_key, 0)
        if group_max_count == 0:
            # Fall back to the global setting
            group_max_count = max_news_per_keyword

        if group_max_count > 0:
            sorted_titles = sorted_titles[:group_max_count]

        # Prefer display_name, otherwise use group_key
        display_word = group_key_to_display_name.get(group_key) or group_key

        stats.append(
            {
                "word": display_word,
                "count": data["count"],
                "position": group_key_to_position.get(group_key, 999),
                "titles": sorted_titles,
                "percentage": (
                    round(data["count"] / total_titles * 100, 2)
                    if total_titles > 0
                    else 0
                ),
            }
        )

    # Choose the sort priority according to the configuration
    if sort_by_position_first:
        # Configured position first, then match count
        stats.sort(key=lambda x: (x["position"], -x["count"]))
    else:
        # Match count first, then configured position (original behavior)
        stats.sort(key=lambda x: (-x["count"], x["position"]))

    # Print the number of matched news items after filtering
    matched_news_count = sum(len(stat["titles"]) for stat in stats if stat["count"] > 0)
    if not quiet and mode == "daily":
        print(f"当日汇总模式:处理 {total_titles} 条新闻,模式:频率词过滤")
        print(f"频率词过滤后:{matched_news_count} 条新闻匹配")

    return stats, total_titles


def count_rss_frequency(
    rss_items: List[Dict],
    word_groups: List[Dict],
    filter_words: List[str],
    global_filters: Optional[List[str]] = None,
    new_items: Optional[List[Dict]] = None,
    max_news_per_keyword: int = 0,
    sort_by_position_first: bool = False,
    timezone: str = DEFAULT_TIMEZONE,
    rank_threshold: int = 5,
    quiet: bool = False,
) -> Tuple[List[Dict], int]:
    """
    Count RSS items grouped by keyword (same output format as the hotlist statistics)

    Args:
        rss_items: list of RSS items, each containing:
            - title: title
            - feed_id: RSS feed ID
            - feed_name: RSS feed name
            - url: article link
            - published_at: publication time (ISO format)
        word_groups: list of word-group configurations
        filter_words: list of filter words
        global_filters: global filter words (optional)
        new_items: list of new items (optional, used to set is_new)
        max_news_per_keyword: maximum number of items shown per keyword
        sort_by_position_first: whether to sort by configured position first
        timezone: timezone name (used for time formatting)
        rank_threshold: rank threshold copied into every returned item
        quiet: whether to suppress log output

    Returns:
        Tuple[List[Dict], int]: (list of statistics, total item count)
        The statistics follow the hotlist format:
        [
            {
                "word": "keyword",
                "count": 5,
                "position": 0,
                "titles": [
                    {
                        "title": "title",
                        "source_name": "Hacker News",
                        "time_display": "12-29 08:20",
                        "count": 1,
                        "ranks": [1],  # RSS uses publication order as the rank
                        "rank_threshold": 50,
                        "url": "...",
                        "mobile_url": "",
                        "is_new": True/False
                    }
                ],
                "percentage": 10.0
            }
        ]
    """
    from trendradar.utils.time import format_iso_time_friendly

    if not rss_items:
        return [], 0

    # With no word groups configured, create a virtual group that holds every item
    if not word_groups:
        if not quiet:
            print("[RSS] 频率词配置为空,将显示所有 RSS 条目")
        word_groups = [{"required": [], "normal": [], "group_key": "全部 RSS"}]
        filter_words = []

    # Build a set of new-item URLs for fast lookup
    new_urls = set()
    if new_items:
        for item in new_items:
            if item.get("url"):
                new_urls.add(item["url"])

    # Initialize the per-group statistics
    word_stats = {}
    for group in word_groups:
        group_key = group["group_key"]
        word_stats[group_key] = {"count": 0, "titles": []}

    total_items = len(rss_items)
    processed_urls = set()  # used for deduplication

    # Give every item a "rank" based on its publication time:
    # sort by publication time, newest first
    sorted_items = sorted(
        rss_items,
        key=lambda x: x.get("published_at", ""),
        reverse=True
    )
    url_to_rank = {item.get("url", ""): idx + 1 for idx, item in enumerate(sorted_items)}

    for item in rss_items:
        title = item.get("title", "")
        url = item.get("url", "")

        # Deduplicate by URL
        if url and url in processed_urls:
            continue
        if url:
            processed_urls.add(url)

        # Use the shared matching logic
        if not matches_word_groups(title, word_groups, filter_words, global_filters):
            continue

        # Find the matching word group
        title_lower = title.lower()
        for group in word_groups:
            required_words = group["required"]
            normal_words = group["normal"]
            group_key = group["group_key"]

            # "全部 RSS" (all RSS) mode: every item matches
            if len(word_groups) == 1 and word_groups[0]["group_key"] == "全部 RSS":
                matched = True
            else:
                # Check required words (supports regex syntax)
                if required_words:
                    all_required_present = all(
                        _word_matches(req_item, title_lower)
                        for req_item in required_words
                    )
                    if not all_required_present:
                        continue

                # Check normal words (supports regex syntax)
                if normal_words:
                    any_normal_present = any(
                        _word_matches(normal_item, title_lower)
                        for normal_item in normal_words
                    )
                    if not any_normal_present:
                        continue

                matched = True

            if matched:
                word_stats[group_key]["count"] += 1

                # Format the time for display
                published_at = item.get("published_at", "")
                time_display = format_iso_time_friendly(published_at, timezone, include_date=True) if published_at else ""

                # Decide whether the item is new
                is_new = url in new_urls if url else False

                # Look up the rank (based on publication order)
                rank = url_to_rank.get(url, 99) if url else 99

                title_data = {
                    "title": title,
                    "source_name": item.get("feed_name", item.get("feed_id", "RSS")),
                    "time_display": time_display,
                    "count": 1,  # an RSS item normally appears only once
                    "ranks": [rank],
                    "rank_threshold": rank_threshold,
                    "url": url,
                    "mobile_url": "",
                    "is_new": is_new,
                }
                word_stats[group_key]["titles"].append(title_data)
                break  # an item matches only the first word group

    # Build the statistics
    stats = []
    group_key_to_position = {
        group["group_key"]: idx for idx, group in enumerate(word_groups)
    }
    group_key_to_max_count = {
        group["group_key"]: group.get("max_count", 0) for group in word_groups
    }
    group_key_to_display_name = {
        group["group_key"]: group.get("display_name") for group in word_groups
    }

    for group_key, data in word_stats.items():
        if data["count"] == 0:
            continue

        # Sort by publication time (newest first)
        sorted_titles = sorted(
            data["titles"],
            key=lambda x: x["ranks"][0] if x["ranks"] else 999
        )

        # Apply the display limit
        group_max_count = group_key_to_max_count.get(group_key, 0)
        if group_max_count == 0:
            group_max_count = max_news_per_keyword

        if group_max_count > 0:
            sorted_titles = sorted_titles[:group_max_count]

        # Prefer display_name, otherwise use group_key
        display_word = group_key_to_display_name.get(group_key) or group_key

        stats.append({
            "word": display_word,
            "count": data["count"],
            "position": group_key_to_position.get(group_key, 999),
            "titles": sorted_titles,
            "percentage": round(data["count"] / total_items * 100, 2) if total_items
            > 0 else 0,
        })

    # Sort
    if sort_by_position_first:
        stats.sort(key=lambda x: (x["position"], -x["count"]))
    else:
        stats.sort(key=lambda x: (-x["count"], x["position"]))

    matched_count = sum(stat["count"] for stat in stats)
    if not quiet:
        print(f"[RSS] 关键词分组统计:{matched_count}/{total_items} 条匹配")

    return stats, total_items


def convert_keyword_stats_to_platform_stats(
    keyword_stats: List[Dict],
    weight_config: Dict,
    rank_threshold: int = 5,
) -> List[Dict]:
    """
    Convert statistics grouped by keyword into statistics grouped by platform

    Args:
        keyword_stats: original statistics grouped by keyword
        weight_config: weight configuration
        rank_threshold: rank threshold

    Returns:
        Statistics grouped by platform, in the same format as the original stats
    """
    # 1. Collect all news items, grouped by platform
    platform_map: Dict[str, List[Dict]] = {}

    for stat in keyword_stats:
        keyword = stat["word"]
        for title_data in stat["titles"]:
            source_name = title_data["source_name"]

            if source_name not in platform_map:
                platform_map[source_name] = []

            # Copy title_data and record the matched keyword
            title_with_keyword = title_data.copy()
            title_with_keyword["matched_keyword"] = keyword
            platform_map[source_name].append(title_with_keyword)

    # 2. Deduplicate (keep one entry per title within a platform,
    #    with the first matched keyword)
    for source_name, titles in platform_map.items():
        seen_titles: Dict[str, bool] = {}
        unique_titles = []
        for title_data in titles:
            title_text = title_data["title"]
            if title_text not in seen_titles:
                seen_titles[title_text] = True
                unique_titles.append(title_data)
        platform_map[source_name] = unique_titles

    # 3. Sort the news within each platform by weight
    for source_name, titles in platform_map.items():
        platform_map[source_name] = sorted(
            titles,
            key=lambda x: (
                -calculate_news_weight(x, rank_threshold, weight_config),
                min(x["ranks"]) if x["ranks"] else 999,
                -x["count"],
            ),
        )

    # 4. Build the per-platform statistics
    platform_stats = []
    for source_name, titles in platform_map.items():
        platform_stats.append({
            "word": source_name,  # the platform name is the group identifier
            "count": len(titles),
            "titles": titles,
            "percentage": 0,  # can be computed later
        })

    # 5.
    # Sort platforms by number of news items (descending)
    platform_stats.sort(key=lambda x: -x["count"])

    return platform_stats


================================================
FILE: trendradar/core/config.py
================================================
# coding=utf-8
"""
Configuration utilities: multi-account config parsing and validation

Parses, validates, and limits multi-account push configurations
"""

from typing import Dict, List, Optional, Tuple


def parse_multi_account_config(config_value: str, separator: str = ";") -> List[str]:
    """
    Parse a multi-account configuration value into a list of accounts

    Args:
        config_value: configuration string; multiple accounts are joined by the separator
        separator: separator, defaults to ;

    Returns:
        List of accounts; empty strings are kept as placeholders

    Examples:
        >>> parse_multi_account_config("url1;url2;url3")
        ['url1', 'url2', 'url3']
        >>> parse_multi_account_config(";token2")  # first account has no token
        ['', 'token2']
        >>> parse_multi_account_config("")
        []
    """
    if not config_value:
        return []
    # Keep empty strings as placeholders (";token2" means the first account has no token)
    accounts = [acc.strip() for acc in config_value.split(separator)]
    # Treat a value made up only of empty entries as unconfigured
    if all(not acc for acc in accounts):
        return []
    return accounts


def validate_paired_configs(
    configs: Dict[str, List[str]],
    channel_name: str,
    required_keys: Optional[List[str]] = None
) -> Tuple[bool, int]:
    """
    Check that paired configuration items have the same number of accounts

    For channels that need several paired settings (such as Telegram's token
    and chat_id), verify that every setting lists the same number of accounts.

    Args:
        configs: configuration dict; key is the setting name, value is the account list
        channel_name: channel name, used in log output
        required_keys: settings that must have a value

    Returns:
        (whether validation passed, number of accounts)

    Examples:
        >>> validate_paired_configs({
        ...     "token": ["t1", "t2"],
        ...     "chat_id": ["c1", "c2"]
        ... }, "Telegram", ["token", "chat_id"])
        (True, 2)
        >>> validate_paired_configs({
        ...     "token": ["t1", "t2"],
        ...     "chat_id": ["c1"]  # count mismatch
        ... }, "Telegram", ["token", "chat_id"])
        (False, 0)
    """
    # Drop empty lists
    non_empty_configs = {k: v for k, v in configs.items() if v}

    if not non_empty_configs:
        return True, 0

    # Check required settings
    if required_keys:
        for key in required_keys:
            if key not in non_empty_configs or not non_empty_configs[key]:
                return True, 0  # a required setting is empty: treat the channel as unconfigured

    # Collect the length of every non-empty setting
    lengths = {k: len(v) for k, v in non_empty_configs.items()}
    unique_lengths = set(lengths.values())

    if len(unique_lengths) > 1:
        print(f"❌ {channel_name} 配置错误:配对配置数量不一致,将跳过该渠道推送")
        for key, length in lengths.items():
            print(f"   - {key}: {length} 个")
        return False, 0

    return True, list(unique_lengths)[0] if unique_lengths else 0


def limit_accounts(
    accounts: List[str],
    max_count: int,
    channel_name: str
) -> List[str]:
    """
    Limit the number of accounts

    When more accounts are configured than allowed, use only the first N
    and print a warning.

    Args:
        accounts: account list
        max_count: maximum number of accounts
        channel_name: channel name, used in log output

    Returns:
        The limited account list

    Examples:
        >>> limit_accounts(["a1", "a2", "a3"], 2, "飞书")
        ⚠️ 飞书 配置了 3 个账号,超过最大限制 2,只使用前 2 个
        ['a1', 'a2']
    """
    if len(accounts) > max_count:
        print(f"⚠️ {channel_name} 配置了 {len(accounts)} 个账号,超过最大限制 {max_count},只使用前 {max_count} 个")
        print(f"   ⚠️ 警告:如果你是 fork 用户,过多账号可能导致 GitHub Actions 运行时间过长,存在账号风险")
        return accounts[:max_count]
    return accounts


def get_account_at_index(accounts: List[str], index: int, default: str = "") -> str:
    """
    Safely get the account value at the given index

    Returns the default when the index is out of range or the value is empty.

    Args:
        accounts: account list
        index: index
        default: default value

    Returns:
        The account value or the default

    Examples:
        >>> get_account_at_index(["a", "b", "c"], 1)
        'b'
        >>> get_account_at_index(["a", "", "c"], 1, "default")
        'default'
        >>> get_account_at_index(["a"], 5, "default")
        'default'
    """
    if index < len(accounts):
        return accounts[index] if accounts[index] else default
    return default


================================================
FILE: trendradar/core/data.py
================================================
# coding=utf-8
"""
Data processing module

Provides data reading and detection:
- read_all_today_titles: read all of today's titles from the storage backend
- detect_latest_new_titles: detect the new titles in the latest batch

Author: TrendRadar Team
"""

from typing import Dict, List, Tuple, Optional


def read_all_today_titles_from_storage(
    storage_manager,
    current_platform_ids: Optional[List[str]] = None,
) -> Tuple[Dict, Dict, Dict]:
    """
    Read all of today's titles from the storage backend (SQLite data)

    Args:
        storage_manager: storage manager instance
        current_platform_ids: IDs of the platforms currently monitored (used as a filter)

    Returns:
        Tuple[Dict, Dict, Dict]: (all_results, id_to_name, title_info)
    """
    try:
        news_data = storage_manager.get_today_all_data()

        if not news_data or not news_data.items:
            return {}, {}, {}

        all_results = {}
        final_id_to_name = {}
        title_info = {}

        for source_id, news_list in news_data.items.items():
            # Filter by platform
            if current_platform_ids is not None and source_id not in current_platform_ids:
                continue

            # Look up the source name
            source_name = news_data.id_to_name.get(source_id, source_id)
            final_id_to_name[source_id] = source_name

            if source_id not in all_results:
                all_results[source_id] = {}
                title_info[source_id] = {}

            for item in news_list:
                title = item.title
                ranks = item.ranks or [item.rank]
                first_time = item.first_time or item.crawl_time
                last_time = item.last_time or item.crawl_time
                count = item.count
                rank_timeline = item.rank_timeline

                all_results[source_id][title] = {
                    "ranks": ranks,
                    "url": item.url or "",
                    "mobileUrl": item.mobile_url or "",
                }

                title_info[source_id][title] = {
                    "first_time": first_time,
                    "last_time": last_time,
                    "count": count,
                    "ranks": ranks,
                    "url": item.url or "",
                    "mobileUrl": item.mobile_url or "",
                    "rank_timeline": rank_timeline,
                }

        return all_results, final_id_to_name, title_info

    except Exception as e:
        print(f"[存储] 从存储后端读取数据失败: {e}")
        return {}, {}, {}


def read_all_today_titles(
    storage_manager,
    current_platform_ids: Optional[List[str]] = None,
    quiet: bool = False,
) -> Tuple[Dict, Dict, Dict]:
    """
    Read all of today's titles (from the storage backend)

    Args:
        storage_manager: storage manager instance
        current_platform_ids: IDs of the platforms currently monitored (used as a filter)
        quiet: whether to suppress log output

    Returns:
        Tuple[Dict, Dict, Dict]: (all_results, id_to_name, title_info)
    """
    all_results, final_id_to_name, title_info = read_all_today_titles_from_storage(
        storage_manager, current_platform_ids
    )

    if not quiet:
        if all_results:
            total_count = sum(len(titles) for titles in all_results.values())
            print(f"[存储] 已从存储后端读取 {total_count} 条标题")
        else:
            print("[存储] 当天暂无数据")

    return all_results, final_id_to_name, title_info


def detect_latest_new_titles_from_storage(
    storage_manager,
    current_platform_ids: Optional[List[str]] = None,
) -> Dict:
    """
    Detect the new titles in the latest batch from the storage backend

    Args:
        storage_manager: storage manager instance
        current_platform_ids: IDs of the platforms currently monitored (used as a filter)

    Returns:
        Dict: new titles {source_id: {title: title_data}}
    """
    try:
        # Get the latest crawl data
        latest_data = storage_manager.get_latest_crawl_data()
        if not latest_data or not latest_data.items:
            return {}

        # Get all historical data
        all_data = storage_manager.get_today_all_data()
        if not all_data or not all_data.items:
            # No historical data (first crawl): there should be no "new" titles
            return {}

        # Get the time of the latest batch
        latest_time = latest_data.crawl_time

        # Step 1: collect the titles of the latest batch
        # (titles whose last_crawl_time equals latest_time)
        latest_titles = {}
        for source_id, news_list in latest_data.items.items():
            if current_platform_ids is not None and source_id not in current_platform_ids:
                continue
            latest_titles[source_id] = {}
            for item in news_list:
                latest_titles[source_id][item.title] = {
                    "ranks": [item.rank],
                    "url": item.url or "",
                    "mobileUrl": item.mobile_url or "",
                }

        # Step 2: collect the historical titles
        # Key rule: a title is historical as soon as its first_crawl_time < latest_time.
        # Even when one title has several records (different URLs), a single
        # historical record makes the whole title historical.
        historical_titles = {}
        for source_id, news_list in all_data.items.items():
            if current_platform_ids is not None and source_id not in current_platform_ids:
                continue

            historical_titles[source_id] = set()
            for item in news_list:
                first_time = item.first_time or item.crawl_time
                # A record first seen before the latest batch marks the title as historical
                if first_time < latest_time:
                    historical_titles[source_id].add(item.title)

        # Check whether this is the first crawl of the day (no historical titles at all).
        # If every platform's historical set is empty, there is only one crawl batch.
        # In that case treat every title of the latest batch as "new"
        # (used for the first push in incremental mode).
        has_historical_data = any(len(titles) > 0 for titles in historical_titles.values())
        if not has_historical_data:
            # First crawl: return all latest titles as "new"
            return latest_titles

        # Step 3: new titles = latest-batch titles minus historical titles
        new_titles = {}
        for source_id, source_latest_titles in latest_titles.items():
            historical_set = historical_titles.get(source_id, set())
            source_new_titles = {}

            for title, title_data in source_latest_titles.items():
                if title not in historical_set:
                    source_new_titles[title] = title_data

            if source_new_titles:
                new_titles[source_id] = source_new_titles

        return new_titles

    except Exception as e:
        print(f"[存储] 从存储后端检测新标题失败: {e}")
        return {}


def detect_latest_new_titles(
    storage_manager,
    current_platform_ids: Optional[List[str]] = None,
    quiet: bool = False,
) -> Dict:
    """
    Detect the new titles in today's latest batch (from the storage backend)

    Args:
        storage_manager: storage manager instance
        current_platform_ids: IDs of the platforms currently monitored (used as a filter)
        quiet: whether to suppress log output

    Returns:
        Dict: new titles {source_id: {title: title_data}}
    """
    new_titles = detect_latest_new_titles_from_storage(storage_manager, current_platform_ids)
    if new_titles and not quiet:
        total_new = sum(len(titles) for titles in new_titles.values())
        print(f"[存储] 从存储后端检测到 {total_new} 条新增标题")
    return new_titles


================================================
FILE: trendradar/core/frequency.py
================================================
# coding=utf-8
"""
Frequency-word configuration loader

Loads frequency-word rules from the configuration file. Supported syntax:
- Normal word groups
- Required words (+ prefix)
- Filter words (! prefix)
- Global filter words ([GLOBAL_FILTER] section)
- Maximum display count (@ prefix)
- Regular expressions (/pattern/ syntax)
- Display names (=> alias syntax)
- Group aliases ([alias] syntax, as the first line of a word group)
"""

import os
import re
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Union


def _parse_word(word: str) -> Dict:
    """
    Parse a single word: detect whether it is a regular expression and
    extract an optional display name

    Args:
        word: raw configuration line (e.g. "/京东|刘强东/ => 京东")

    Returns:
        Dict: contains word, is_regex, pattern, display_name
    """
    display_name = None

    # 1. Handle the display name (=>) first:
    #    split the line into "config content" and "display name"
    if '=>' in word:
        parts = re.split(r'\s*=>\s*', word, 1)
        word_config = parts[0].strip()
        # Only use the right-hand side as display_name when it is non-empty
        if len(parts) > 1 and parts[1].strip():
            display_name = parts[1].strip()
    else:
        word_config = word.strip()

    # 2.
    # Parse the regular expression.
    # Rule: starts with /, ends with / (optionally followed by flags);
    # the content in between is captured greedily.
    # [a-z]*$ allows trailing flags (such as i, g), but the code below ignores
    # them: every pattern is compiled with re.IGNORECASE.
    regex_match = re.match(r'^/(.+)/[a-z]*$', word_config)

    if regex_match:
        pattern_str = regex_match.group(1)
        try:
            pattern = re.compile(pattern_str, re.IGNORECASE)
            return {
                "word": pattern_str,
                "is_regex": True,
                "pattern": pattern,
                "display_name": display_name,
            }
        except re.error as e:
            # Invalid pattern: warn and fall through to plain-substring handling
            print(f"Warning: Invalid regex pattern '/{pattern_str}/': {e}")

    return {
        "word": word_config,
        "is_regex": False,
        "pattern": None,
        "display_name": display_name
    }


def _word_matches(word_config: Union[str, Dict], title_lower: str) -> bool:
    """
    Check whether a word matches the title

    Args:
        word_config: word configuration (string or dict)
        title_lower: lowercased title

    Returns:
        Whether the word matches
    """
    if isinstance(word_config, str):
        # Backward compatibility: plain string
        return word_config.lower() in title_lower

    if word_config.get("is_regex") and word_config.get("pattern"):
        # Regex match
        return bool(word_config["pattern"].search(title_lower))
    else:
        # Substring match
        return word_config["word"].lower() in title_lower


def load_frequency_words(
    frequency_file: Optional[str] = None,
) -> Tuple[List[Dict], List[Dict], List[str]]:
    """
    Load the frequency-word configuration

    Configuration file format:
    - Word groups are separated by blank lines
    - The [GLOBAL_FILTER] section defines global filter words
    - The [WORD_GROUPS] section defines word groups (default)

    Word-group syntax:
    - Normal word: written as-is; any one of them matching is enough
    - +word: required word; all required words must match
    - !word: filter word; a match excludes the title
    - @number: maximum number of items shown for this group

    Args:
        frequency_file: path to the frequency-word file. Defaults to the
            FREQUENCY_WORDS_PATH environment variable, or config/frequency_words.txt.
            A short file name is looked up in config/custom/keyword/.

    Returns:
        (word groups, in-group filter words, global filter words)

    Raises:
        FileNotFoundError: the frequency-word file does not exist
    """
    if frequency_file is None:
        frequency_file = os.environ.get(
            "FREQUENCY_WORDS_PATH", "config/frequency_words.txt"
        )

    frequency_path = Path(frequency_file)
    if not frequency_path.exists():
        # Try it as a short file name under config/custom/keyword/
        custom_path = Path("config/custom/keyword") / frequency_file
        if custom_path.exists():
            frequency_path = custom_path
        else:
            raise FileNotFoundError(f"频率词文件 {frequency_file} 不存在")

    with open(frequency_path, "r", encoding="utf-8") as f:
        content = f.read()

    word_groups = [
        group.strip() for group in content.split("\n\n") if
        group.strip()
    ]

    processed_groups = []
    filter_words = []
    global_filters = []

    # Default section (backward compatible)
    current_section = "WORD_GROUPS"

    for group in word_groups:
        # Drop blank lines and comment lines (starting with #)
        lines = [line.strip() for line in group.split("\n") if line.strip() and not line.strip().startswith("#")]

        if not lines:
            continue

        # Check for a section marker
        if lines[0].startswith("[") and lines[0].endswith("]"):
            section_name = lines[0][1:-1].upper()
            if section_name in ("GLOBAL_FILTER", "WORD_GROUPS"):
                current_section = section_name
                lines = lines[1:]  # remove the marker line

        # Handle the global filter section
        if current_section == "GLOBAL_FILTER":
            # Add every non-empty line to the global filter list
            for line in lines:
                # Skip lines with special prefixes; only plain text is collected
                if line.startswith(("!", "+", "@")):
                    continue  # the global filter section does not support special syntax
                if line:
                    global_filters.append(line)
            continue

        # Handle the word-group section
        words = lines
        group_alias = None  # group alias ([alias] syntax)

        # Check whether the first line is a group alias (not a section marker)
        if words and words[0].startswith("[") and words[0].endswith("]"):
            potential_alias = words[0][1:-1].strip()
            # Exclude section markers (GLOBAL_FILTER, WORD_GROUPS)
            if potential_alias.upper() not in ("GLOBAL_FILTER", "WORD_GROUPS"):
                group_alias = potential_alias
                words = words[1:]  # remove the alias line

        group_required_words = []
        group_normal_words = []
        group_max_count = 0  # unlimited by default

        for word in words:
            if word.startswith("@"):
                # Parse the maximum display count (positive integers only)
                try:
                    count = int(word[1:])
                    if count > 0:
                        group_max_count = count
                except (ValueError, IndexError):
                    pass  # ignore an invalid @number
            elif word.startswith("!"):
                # Filter word (supports regex syntax)
                filter_word = word[1:]
                parsed = _parse_word(filter_word)
                filter_words.append(parsed)
            elif word.startswith("+"):
                # Required word (supports regex syntax)
                req_word = word[1:]
                group_required_words.append(_parse_word(req_word))
            else:
                # Normal word (supports regex syntax)
                group_normal_words.append(_parse_word(word))

        if group_required_words or group_normal_words:
            if group_normal_words:
                group_key = " ".join(w["word"] for w in group_normal_words)
            else:
                group_key = " ".join(w["word"] for w in group_required_words)

            # Build the display name
            # Priority: group alias > joined line aliases > joined keywords
            if group_alias:
                # A group alias is set: use it directly
                display_name = group_alias
            else:
                # No group alias: join each line's display name
                # (the line alias, or the keyword itself)
                all_words = group_normal_words + group_required_words
                display_parts = []
                for w in all_words:
                    # Prefer the line alias, otherwise use the keyword itself
                    part = w.get("display_name") or w["word"]
                    display_parts.append(part)
                # Join multiple words with " / "
                display_name = " / ".join(display_parts) if display_parts else None

            processed_groups.append(
                {
                    "required": group_required_words,
                    "normal": group_normal_words,
                    "group_key": group_key,
                    "display_name": display_name,  # may be None
                    "max_count": group_max_count,
                }
            )

    return processed_groups, filter_words, global_filters


def matches_word_groups(
    title: str,
    word_groups: List[Dict],
    filter_words: List,
    global_filters: Optional[List[str]] = None
) -> bool:
    """
    Check whether a title matches the word-group rules

    Args:
        title: title text
        word_groups: list of word groups
        filter_words: list of filter words (strings or dicts)
        global_filters: list of global filter words

    Returns:
        Whether the title matches
    """
    # Defensive type check: make sure title is a valid string
    if not isinstance(title, str):
        title = str(title) if title is not None else ""
    if not title.strip():
        return False

    title_lower = title.lower()

    # Global filter check (highest priority)
    if global_filters:
        if any(global_word.lower() in title_lower for global_word in global_filters):
            return False

    # With no word groups configured, every title matches (shows all news)
    if not word_groups:
        return True

    # Filter word check (supports both the old and the new format)
    for filter_item in filter_words:
        if _word_matches(filter_item, title_lower):
            return False

    # Word-group match check
    for group in word_groups:
        required_words = group["required"]
        normal_words = group["normal"]

        # Required words: all must be present
        if required_words:
            all_required_present = all(
                _word_matches(req_item, title_lower) for req_item in required_words
            )
            if not all_required_present:
                continue

        # Normal words: at least one must be present
        if normal_words:
            any_normal_present = any(
                _word_matches(normal_item, title_lower) for normal_item in normal_words
            )
            if not any_normal_present:
                continue

        return True

    return False


================================================
FILE: trendradar/core/loader.py
================================================
# coding=utf-8
"""
Configuration loading module

Loads configuration from the YAML config file and environment variables.
"""

import os
from pathlib import Path
from typing import Dict, Any, Optional

import yaml

from .config import parse_multi_account_config, validate_paired_configs
from trendradar.utils.time import DEFAULT_TIMEZONE


def _get_env_bool(key: str) -> Optional[bool]:
    """Read a boolean from an environment variable; return None when unset"""
    value = os.environ.get(key, "").strip().lower()
    if not value:
        return None
    return value in ("true", "1")


def _get_env_int(key: str, default: int = 0) -> int:
    """Read an integer from an environment variable"""
    value = os.environ.get(key, "").strip()
    if not value:
        return default
    try:
        return int(value)
    except ValueError:
        return default


def _get_env_int_or_none(key: str) -> Optional[int]:
    """Read an integer from an environment variable; return None when unset or invalid"""
    value = os.environ.get(key, "").strip()
    if not value:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def _get_env_str(key: str, default: str = "") -> str:
    """Read a string from an environment variable"""
    return os.environ.get(key, "").strip() or default


def _load_app_config(config_data: Dict) -> Dict:
    """Load the application configuration"""
    app_config = config_data.get("app", {})
    advanced = config_data.get("advanced", {})
    debug_env = _get_env_bool("DEBUG")
    return {
        "VERSION_CHECK_URL": advanced.get("version_check_url", ""),
        "CONFIGS_VERSION_CHECK_URL": advanced.get("configs_version_check_url", ""),
        "SHOW_VERSION_UPDATE": app_config.get("show_version_update", True),
        "TIMEZONE": _get_env_str("TIMEZONE") or app_config.get("timezone", DEFAULT_TIMEZONE),
        "DEBUG": debug_env if debug_env is not None else advanced.get("debug", False),
    }


def _load_crawler_config(config_data: Dict) -> Dict:
    """Load the crawler configuration"""
    advanced = config_data.get("advanced", {})
    crawler_config = advanced.get("crawler", {})
    platforms_config = config_data.get("platforms", {})
    return {
        "REQUEST_INTERVAL": crawler_config.get("request_interval", 100),
        "USE_PROXY": crawler_config.get("use_proxy", False),
        "DEFAULT_PROXY": crawler_config.get("default_proxy", ""),
        "ENABLE_CRAWLER": platforms_config.get("enabled", True),
    }


def _load_report_config(config_data: Dict) -> Dict:
    """Load the report configuration"""
    report_config = config_data.get("report", {})

    # Environment variable overrides
    sort_by_position_env = _get_env_bool("SORT_BY_POSITION_FIRST")
    max_news_env = _get_env_int("MAX_NEWS_PER_KEYWORD")

    return {
        "REPORT_MODE": report_config.get("mode", "daily"),
        "DISPLAY_MODE": report_config.get("display_mode", "keyword"),
        "RANK_THRESHOLD": report_config.get("rank_threshold", 10),
        "SORT_BY_POSITION_FIRST": sort_by_position_env if sort_by_position_env is not None else report_config.get("sort_by_position_first", False),
        "MAX_NEWS_PER_KEYWORD": max_news_env or report_config.get("max_news_per_keyword", 0),
    }


def _load_notification_config(config_data: Dict) -> Dict:
    """Load the notification configuration"""
    notification = config_data.get("notification", {})
    advanced = config_data.get("advanced", {})
    batch_size = advanced.get("batch_size", {})

    return {
        "ENABLE_NOTIFICATION": notification.get("enabled", True),
        "MESSAGE_BATCH_SIZE": batch_size.get("default", 4000),
        "DINGTALK_BATCH_SIZE": batch_size.get("dingtalk", 20000),
        "FEISHU_BATCH_SIZE": batch_size.get("feishu", 29000),
        "BARK_BATCH_SIZE": batch_size.get("bark", 3600),
        "SLACK_BATCH_SIZE": batch_size.get("slack", 4000),
        "BATCH_SEND_INTERVAL": advanced.get("batch_send_interval", 1.0),
        "FEISHU_MESSAGE_SEPARATOR": advanced.get("feishu_message_separator", "---"),
        "MAX_ACCOUNTS_PER_CHANNEL": _get_env_int("MAX_ACCOUNTS_PER_CHANNEL") or advanced.get("max_accounts_per_channel", 3),
    }


def _load_schedule_config(config_data: Dict) -> Dict:
    """
    Load the unified scheduling configuration

    Read from the schedule section of config.yaml; environment variables can override it.
    """
    schedule = config_data.get("schedule", {})

    # Environment variable overrides
    enabled_env = _get_env_bool("SCHEDULE_ENABLED")
    preset_env = _get_env_str("SCHEDULE_PRESET")

    enabled = enabled_env if enabled_env is not None else schedule.get("enabled", False)
    preset = preset_env or schedule.get("preset", "always_on")

    return {
        "enabled": enabled,
        "preset": preset,
    }


def _load_timeline_data(config_dir: str = "config") -> Dict:
    """
    Load timeline.yaml

    Args:
        config_dir: path to the configuration directory

    Returns:
        The full content of timeline.yaml, or an empty template when the file is missing
    """
    timeline_path = Path(config_dir) / "timeline.yaml"
    if not timeline_path.exists():
        print(f"[调度] timeline.yaml 未找到: {timeline_path},使用空模板")
        return {
            "presets": {},
            "custom": {
                "default": {
                    "collect": True,
                    "analyze": False,
                    "push": False,
                    "report_mode": "current",
                    "ai_mode": "follow_report",
                    "once": {"analyze": False, "push": False},
                },
                "periods": {},
                "day_plans": {"all_day": {"periods": []}},
                "week_map": {i: "all_day" for i in range(1, 8)},
            },
        }

    with open(timeline_path, "r", encoding="utf-8") as f:
        data = yaml.safe_load(f)

    print(f"[调度] timeline.yaml 加载成功: {timeline_path}")
    return data or {}


def _load_weight_config(config_data: Dict) -> Dict:
    """Load the weight configuration"""
    advanced = config_data.get("advanced", {})
    weight = advanced.get("weight", {})
    return {
        "RANK_WEIGHT": weight.get("rank", 0.6),
        "FREQUENCY_WEIGHT": weight.get("frequency", 0.3),
        "HOTNESS_WEIGHT": weight.get("hotness", 0.1),
    }


def _load_rss_config(config_data: Dict) -> Dict:
    """Load the RSS configuration"""
    rss = config_data.get("rss", {})
    advanced = config_data.get("advanced", {})
    advanced_rss = advanced.get("rss", {})
    advanced_crawler = advanced.get("crawler", {})

    # RSS proxy: prefer the RSS-specific proxy, otherwise reuse the crawler's default_proxy
    rss_proxy_url = advanced_rss.get("proxy_url", "") or advanced_crawler.get("default_proxy", "")

    # Freshness filter configuration
    freshness_filter = rss.get("freshness_filter", {})

    # Validate max_age_days and apply the default
    raw_max_age = freshness_filter.get("max_age_days", 3)
    try:
        max_age_days = int(raw_max_age)
        if max_age_days < 0:
            print(f"[警告] RSS freshness_filter.max_age_days 为负数 ({max_age_days}),使用默认值 3")
            max_age_days = 3
    except (ValueError, TypeError):
        print(f"[警告] RSS freshness_filter.max_age_days 格式错误 ({raw_max_age}),使用默认值 3")
        max_age_days = 3

    # RSS settings are read directly from config.yaml; environment variables are no longer supported
    return {
        "ENABLED": rss.get("enabled", False),
        "REQUEST_INTERVAL": advanced_rss.get("request_interval", 2000),
        "TIMEOUT": advanced_rss.get("timeout", 15),
        "USE_PROXY": advanced_rss.get("use_proxy", False),
        "PROXY_URL": rss_proxy_url,
        "FEEDS": rss.get("feeds", []),
        "FRESHNESS_FILTER": {
            "ENABLED": freshness_filter.get("enabled", True),  # enabled by default
            "MAX_AGE_DAYS": max_age_days,
        },
    }


def _load_display_config(config_data: Dict) -> Dict:
    """Load the push content display configuration"""
    display = config_data.get("display", {})
    regions = display.get("regions", {})
    standalone = display.get("standalone", {})

    # Default region order
    default_region_order = ["hotlist", "rss", "new_items", "standalone", "ai_analysis"]
    region_order = display.get("region_order", default_region_order)

    # Keep only the valid values in region_order
    valid_regions = {"hotlist", "rss", "new_items", "standalone", "ai_analysis"}
    region_order = [r for r in region_order if r in valid_regions]

    # Fall back to the default order when nothing valid remains
    if not region_order:
        region_order = default_region_order

    return {
        # Region display order
        "REGION_ORDER": region_order,
        # Region switches
        "REGIONS": {
            "HOTLIST": regions.get("hotlist", True),
            "NEW_ITEMS": regions.get("new_items", True),
            "RSS": regions.get("rss", True),
            "STANDALONE": regions.get("standalone", False),
            "AI_ANALYSIS": regions.get("ai_analysis", True),
        },
        # Standalone display area configuration
        "STANDALONE": {
            "PLATFORMS": standalone.get("platforms", []),
            "RSS_FEEDS": standalone.get("rss_feeds", []),
            "MAX_ITEMS": standalone.get("max_items", 20),
        },
    }


def _load_ai_config(config_data: Dict) -> Dict:
    """Load the AI model configuration (LiteLLM format)"""
    ai_config = config_data.get("ai", {})

    timeout_env = _get_env_int_or_none("AI_TIMEOUT")

    return {
        # LiteLLM core settings
        "MODEL": _get_env_str("AI_MODEL") or ai_config.get("model", ""),
        "API_KEY": _get_env_str("AI_API_KEY") or ai_config.get("api_key", ""),
        "API_BASE": _get_env_str("AI_API_BASE") or ai_config.get("api_base", ""),

        # Generation parameters
        "TIMEOUT": timeout_env if timeout_env is not None else ai_config.get("timeout", 120),
        "TEMPERATURE": ai_config.get("temperature", 1.0),
        "MAX_TOKENS": ai_config.get("max_tokens", 5000),

        # LiteLLM advanced options
        "NUM_RETRIES": ai_config.get("num_retries", 2),
        "FALLBACK_MODELS": ai_config.get("fallback_models", []),
        "EXTRA_PARAMS": ai_config.get("extra_params", {}),
    }


def _load_ai_analysis_config(config_data: Dict) -> Dict:
    """Load the AI analysis configuration (feature settings; model settings are in _load_ai_config)"""
    ai_config = config_data.get("ai_analysis", {})

    enabled_env = _get_env_bool("AI_ANALYSIS_ENABLED")

    return {
        "ENABLED": enabled_env if enabled_env is not None else
ai_config.get("enabled", False), "LANGUAGE": ai_config.get("language", "Chinese"), "PROMPT_FILE": ai_config.get("prompt_file", "ai_analysis_prompt.txt"), "MODE": ai_config.get("mode", "follow_report"), "MAX_NEWS_FOR_ANALYSIS": ai_config.get("max_news_for_analysis", 50), "INCLUDE_RSS": ai_config.get("include_rss", True), "INCLUDE_RANK_TIMELINE": ai_config.get("include_rank_timeline", False), "INCLUDE_STANDALONE": ai_config.get("include_standalone", False), } def _load_ai_translation_config(config_data: Dict) -> Dict: """加载 AI 翻译配置(功能配置,模型配置见 _load_ai_config)""" trans_config = config_data.get("ai_translation", {}) enabled_env = _get_env_bool("AI_TRANSLATION_ENABLED") scope = trans_config.get("scope", {}) return { "ENABLED": enabled_env if enabled_env is not None else trans_config.get("enabled", False), "LANGUAGE": _get_env_str("AI_TRANSLATION_LANGUAGE") or trans_config.get("language", "English"), "PROMPT_FILE": trans_config.get("prompt_file", "ai_translation_prompt.txt"), "SCOPE": { "HOTLIST": scope.get("hotlist", True), "RSS": scope.get("rss", True), "STANDALONE": scope.get("standalone", True), }, } def _load_ai_filter_config(config_data: Dict) -> Dict: """加载 AI 智能筛选配置(由 filter.method 控制是否启用)""" ai_filter = config_data.get("ai_filter", {}) return { "BATCH_SIZE": ai_filter.get("batch_size", 200), "BATCH_INTERVAL": ai_filter.get("batch_interval", 5), "INTERESTS_FILE": ai_filter.get("interests_file"), # None = 使用默认 config/ai_interests.txt "PROMPT_FILE": ai_filter.get("prompt_file", "prompt.txt"), "EXTRACT_PROMPT_FILE": ai_filter.get("extract_prompt_file", "extract_prompt.txt"), "UPDATE_TAGS_PROMPT_FILE": ai_filter.get("update_tags_prompt_file", "update_tags_prompt.txt"), "RECLASSIFY_THRESHOLD": ai_filter.get("reclassify_threshold", 0.6), "MIN_SCORE": float(ai_filter.get("min_score", 0)), } def _load_filter_config(config_data: Dict) -> Dict: """加载筛选策略配置""" filter_cfg = config_data.get("filter", {}) # 环境变量兼容:AI_FILTER_ENABLED=true → method=ai env_ai_filter = 
_get_env_bool("AI_FILTER_ENABLED") method = filter_cfg.get("method", "keyword") if env_ai_filter is True: method = "ai" # 兼容旧配置:如果 ai_filter.enabled=true 且未显式设置 filter.method if method == "keyword" and not filter_cfg.get("method"): ai_filter = config_data.get("ai_filter", {}) if ai_filter.get("enabled", False): method = "ai" return { "METHOD": method, # "keyword" | "ai" "PRIORITY_SORT_ENABLED": filter_cfg.get("priority_sort_enabled", False), # AI 模式标签优先级排序开关 } def _load_storage_config(config_data: Dict) -> Dict: """加载存储配置""" storage = config_data.get("storage", {}) formats = storage.get("formats", {}) local = storage.get("local", {}) remote = storage.get("remote", {}) pull = storage.get("pull", {}) txt_enabled_env = _get_env_bool("STORAGE_TXT_ENABLED") html_enabled_env = _get_env_bool("STORAGE_HTML_ENABLED") pull_enabled_env = _get_env_bool("PULL_ENABLED") return { "BACKEND": _get_env_str("STORAGE_BACKEND") or storage.get("backend", "auto"), "FORMATS": { "SQLITE": formats.get("sqlite", True), "TXT": txt_enabled_env if txt_enabled_env is not None else formats.get("txt", True), "HTML": html_enabled_env if html_enabled_env is not None else formats.get("html", True), }, "LOCAL": { "DATA_DIR": local.get("data_dir", "output"), "RETENTION_DAYS": _get_env_int("LOCAL_RETENTION_DAYS") or local.get("retention_days", 0), }, "REMOTE": { "ENDPOINT_URL": _get_env_str("S3_ENDPOINT_URL") or remote.get("endpoint_url", ""), "BUCKET_NAME": _get_env_str("S3_BUCKET_NAME") or remote.get("bucket_name", ""), "ACCESS_KEY_ID": _get_env_str("S3_ACCESS_KEY_ID") or remote.get("access_key_id", ""), "SECRET_ACCESS_KEY": _get_env_str("S3_SECRET_ACCESS_KEY") or remote.get("secret_access_key", ""), "REGION": _get_env_str("S3_REGION") or remote.get("region", ""), "RETENTION_DAYS": _get_env_int("REMOTE_RETENTION_DAYS") or remote.get("retention_days", 0), }, "PULL": { "ENABLED": pull_enabled_env if pull_enabled_env is not None else pull.get("enabled", False), "DAYS": _get_env_int("PULL_DAYS") or 
pull.get("days", 7), }, } def _load_webhook_config(config_data: Dict) -> Dict: """加载 Webhook 配置""" notification = config_data.get("notification", {}) channels = notification.get("channels", {}) # 各渠道配置 feishu = channels.get("feishu", {}) dingtalk = channels.get("dingtalk", {}) wework = channels.get("wework", {}) telegram = channels.get("telegram", {}) email = channels.get("email", {}) ntfy = channels.get("ntfy", {}) bark = channels.get("bark", {}) slack = channels.get("slack", {}) generic = channels.get("generic_webhook", {}) return { # 飞书 "FEISHU_WEBHOOK_URL": _get_env_str("FEISHU_WEBHOOK_URL") or feishu.get("webhook_url", ""), # 钉钉 "DINGTALK_WEBHOOK_URL": _get_env_str("DINGTALK_WEBHOOK_URL") or dingtalk.get("webhook_url", ""), # 企业微信 "WEWORK_WEBHOOK_URL": _get_env_str("WEWORK_WEBHOOK_URL") or wework.get("webhook_url", ""), "WEWORK_MSG_TYPE": _get_env_str("WEWORK_MSG_TYPE") or wework.get("msg_type", "markdown"), # Telegram "TELEGRAM_BOT_TOKEN": _get_env_str("TELEGRAM_BOT_TOKEN") or telegram.get("bot_token", ""), "TELEGRAM_CHAT_ID": _get_env_str("TELEGRAM_CHAT_ID") or telegram.get("chat_id", ""), # 邮件 "EMAIL_FROM": _get_env_str("EMAIL_FROM") or email.get("from", ""), "EMAIL_PASSWORD": _get_env_str("EMAIL_PASSWORD") or email.get("password", ""), "EMAIL_TO": _get_env_str("EMAIL_TO") or email.get("to", ""), "EMAIL_SMTP_SERVER": _get_env_str("EMAIL_SMTP_SERVER") or email.get("smtp_server", ""), "EMAIL_SMTP_PORT": _get_env_str("EMAIL_SMTP_PORT") or email.get("smtp_port", ""), # ntfy "NTFY_SERVER_URL": _get_env_str("NTFY_SERVER_URL") or ntfy.get("server_url") or "https://ntfy.sh", "NTFY_TOPIC": _get_env_str("NTFY_TOPIC") or ntfy.get("topic", ""), "NTFY_TOKEN": _get_env_str("NTFY_TOKEN") or ntfy.get("token", ""), # Bark "BARK_URL": _get_env_str("BARK_URL") or bark.get("url", ""), # Slack "SLACK_WEBHOOK_URL": _get_env_str("SLACK_WEBHOOK_URL") or slack.get("webhook_url", ""), # 通用 Webhook "GENERIC_WEBHOOK_URL": _get_env_str("GENERIC_WEBHOOK_URL") or 
generic.get("webhook_url", ""), "GENERIC_WEBHOOK_TEMPLATE": _get_env_str("GENERIC_WEBHOOK_TEMPLATE") or generic.get("payload_template", ""), } def _print_notification_sources(config: Dict) -> None: """打印通知渠道配置来源信息""" notification_sources = [] max_accounts = config["MAX_ACCOUNTS_PER_CHANNEL"] if config["FEISHU_WEBHOOK_URL"]: accounts = parse_multi_account_config(config["FEISHU_WEBHOOK_URL"]) count = min(len(accounts), max_accounts) source = "环境变量" if os.environ.get("FEISHU_WEBHOOK_URL") else "配置文件" notification_sources.append(f"飞书({source}, {count}个账号)") if config["DINGTALK_WEBHOOK_URL"]: accounts = parse_multi_account_config(config["DINGTALK_WEBHOOK_URL"]) count = min(len(accounts), max_accounts) source = "环境变量" if os.environ.get("DINGTALK_WEBHOOK_URL") else "配置文件" notification_sources.append(f"钉钉({source}, {count}个账号)") if config["WEWORK_WEBHOOK_URL"]: accounts = parse_multi_account_config(config["WEWORK_WEBHOOK_URL"]) count = min(len(accounts), max_accounts) source = "环境变量" if os.environ.get("WEWORK_WEBHOOK_URL") else "配置文件" notification_sources.append(f"企业微信({source}, {count}个账号)") if config["TELEGRAM_BOT_TOKEN"] and config["TELEGRAM_CHAT_ID"]: tokens = parse_multi_account_config(config["TELEGRAM_BOT_TOKEN"]) chat_ids = parse_multi_account_config(config["TELEGRAM_CHAT_ID"]) valid, count = validate_paired_configs( {"bot_token": tokens, "chat_id": chat_ids}, "Telegram", required_keys=["bot_token", "chat_id"] ) if valid and count > 0: count = min(count, max_accounts) token_source = "环境变量" if os.environ.get("TELEGRAM_BOT_TOKEN") else "配置文件" notification_sources.append(f"Telegram({token_source}, {count}个账号)") if config["EMAIL_FROM"] and config["EMAIL_PASSWORD"] and config["EMAIL_TO"]: from_source = "环境变量" if os.environ.get("EMAIL_FROM") else "配置文件" notification_sources.append(f"邮件({from_source})") if config["NTFY_SERVER_URL"] and config["NTFY_TOPIC"]: topics = parse_multi_account_config(config["NTFY_TOPIC"]) tokens = parse_multi_account_config(config["NTFY_TOKEN"]) 
if tokens: valid, count = validate_paired_configs( {"topic": topics, "token": tokens}, "ntfy" ) if valid and count > 0: count = min(count, max_accounts) server_source = "环境变量" if os.environ.get("NTFY_SERVER_URL") else "配置文件" notification_sources.append(f"ntfy({server_source}, {count}个账号)") else: count = min(len(topics), max_accounts) server_source = "环境变量" if os.environ.get("NTFY_SERVER_URL") else "配置文件" notification_sources.append(f"ntfy({server_source}, {count}个账号)") if config["BARK_URL"]: accounts = parse_multi_account_config(config["BARK_URL"]) count = min(len(accounts), max_accounts) bark_source = "环境变量" if os.environ.get("BARK_URL") else "配置文件" notification_sources.append(f"Bark({bark_source}, {count}个账号)") if config["SLACK_WEBHOOK_URL"]: accounts = parse_multi_account_config(config["SLACK_WEBHOOK_URL"]) count = min(len(accounts), max_accounts) slack_source = "环境变量" if os.environ.get("SLACK_WEBHOOK_URL") else "配置文件" notification_sources.append(f"Slack({slack_source}, {count}个账号)") if config.get("GENERIC_WEBHOOK_URL"): accounts = parse_multi_account_config(config["GENERIC_WEBHOOK_URL"]) count = min(len(accounts), max_accounts) source = "环境变量" if os.environ.get("GENERIC_WEBHOOK_URL") else "配置文件" notification_sources.append(f"通用Webhook({source}, {count}个账号)") if notification_sources: print(f"通知渠道配置来源: {', '.join(notification_sources)}") print(f"每个渠道最大账号数: {max_accounts}") else: print("未配置任何通知渠道") def load_config(config_path: Optional[str] = None) -> Dict[str, Any]: """ 加载配置文件 Args: config_path: 配置文件路径,默认从环境变量 CONFIG_PATH 获取或使用 config/config.yaml Returns: 包含所有配置的字典 Raises: FileNotFoundError: 配置文件不存在 """ if config_path is None: config_path = os.environ.get("CONFIG_PATH", "config/config.yaml") if not Path(config_path).exists(): raise FileNotFoundError(f"配置文件 {config_path} 不存在") with open(config_path, "r", encoding="utf-8") as f: config_data = yaml.safe_load(f) print(f"配置文件加载成功: {config_path}") # 合并所有配置 config = {} # 应用配置 config.update(_load_app_config(config_data)) 
    # Crawler configuration
    config.update(_load_crawler_config(config_data))

    # Report configuration
    config.update(_load_report_config(config_data))

    # Notification configuration
    config.update(_load_notification_config(config_data))

    # Unified scheduling configuration
    config["SCHEDULE"] = _load_schedule_config(config_data)
    config["_TIMELINE_DATA"] = _load_timeline_data(
        str(Path(config_path).parent) if config_path else "config"
    )

    # Weight configuration
    config["WEIGHT_CONFIG"] = _load_weight_config(config_data)

    # Platform configuration
    platforms_config = config_data.get("platforms", {})
    config["PLATFORMS"] = platforms_config.get("sources", [])

    # RSS configuration
    config["RSS"] = _load_rss_config(config_data)

    # Shared AI model configuration
    config["AI"] = _load_ai_config(config_data)

    # AI analysis configuration
    config["AI_ANALYSIS"] = _load_ai_analysis_config(config_data)

    # AI translation configuration
    config["AI_TRANSLATION"] = _load_ai_translation_config(config_data)

    # AI smart-filter configuration
    config["AI_FILTER"] = _load_ai_filter_config(config_data)

    # Filter strategy configuration
    config["FILTER"] = _load_filter_config(config_data)

    # Push content display configuration
    config["DISPLAY"] = _load_display_config(config_data)

    # Storage configuration
    config["STORAGE"] = _load_storage_config(config_data)

    # Webhook configuration
    config.update(_load_webhook_config(config_data))

    # Print where each notification channel's configuration came from
    _print_notification_sources(config)

    return config


================================================
FILE: trendradar/core/scheduler.py
================================================
# coding=utf-8
"""
Timeline scheduler

A unified timeline scheduling system that replaces the scattered
push_window / analysis_window logic.
Implements flexible time-period scheduling on top of the
periods + day_plans + week_map model.
"""

import copy
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional
from datetime import datetime


@dataclass
class ResolvedSchedule:
    """Scheduling result resolved for the current time"""
    period_key: Optional[str]       # key of the matched period; None = default configuration
    period_name: Optional[str]      # display name of the matched period
    day_plan: str                   # current day plan
    collect: bool
    analyze: bool
    push: bool
    report_mode: str
    ai_mode: str
    once_analyze: bool
    once_push: bool
    frequency_file: Optional[str] = None  # frequency-words file path; None = use the default
    filter_method: Optional[str] = None   # filter strategy: "keyword" | "ai"; None = use the global configuration
    interests_file: Optional[str] = None  # AI filter interests file; None = use the default


class Scheduler:
    """
    Timeline scheduler

    Resolves the behavior to execute at the current time from the timeline
    configuration (periods + day_plans + week_map).

    Supports:
    - Preset templates plus a custom mode
    - Cross-midnight periods (e.g. 22:00-07:00)
    - Per-day / per-week differentiated configuration
    - once-only execution dedup (analyze / push tracked independently)
    - Conflict policies (error_on_overlap / last_wins)
    """

    def __init__(
        self,
        schedule_config: Dict[str, Any],
        timeline_data: Dict[str, Any],
        storage_backend: Any,
        get_time_func: Callable[[], datetime],
        fallback_report_mode: str = "current",
    ):
        """
        Initialize the scheduler

        Args:
            schedule_config: the schedule section of config.yaml (including preset, etc.)
            timeline_data: the full contents of timeline.yaml
            storage_backend: storage backend (used for once-dedup records)
            get_time_func: function returning the current time (should use the configured timezone)
            fallback_report_mode: report_mode to fall back to when scheduling is disabled
                                  (taken from report.mode in config.yaml)
        """
        self.schedule_config = schedule_config
        self.storage = storage_backend
        self.get_time = get_time_func
        self.enabled = schedule_config.get("enabled", True)
        self.fallback_report_mode = fallback_report_mode

        # Load and build the final timeline
        self.timeline = self._build_timeline(schedule_config, timeline_data)

        if self.enabled:
            self._validate_timeline(self.timeline)

    def _build_timeline(
        self,
        schedule_config: Dict[str, Any],
        timeline_data: Dict[str, Any],
    ) -> Dict[str, Any]:
        """Build the timeline from a preset or from the custom section"""
        preset = schedule_config.get("preset", "always_on")

        if preset == "custom":
            timeline = copy.deepcopy(timeline_data.get("custom", {}))
        else:
            presets = timeline_data.get("presets", {})
            if preset not in presets:
                raise ValueError(
                    f"Unknown preset template: '{preset}', available values: "
                    f"{', '.join(presets.keys())}, custom"
                )
            timeline = copy.deepcopy(presets[preset])

        # Make sure periods is a dict (it may be an empty {})
        if timeline.get("periods") is None:
            timeline["periods"] = {}

        return timeline

    def resolve(self) -> ResolvedSchedule:
        """
        Resolve the scheduling configuration for the current time

        Returns:
            ResolvedSchedule describing the behavior to execute now
        """
        if not self.enabled:
            # When scheduling is disabled, return the default full-feature configuration;
            # report_mode falls back to report.mode from config.yaml
            return ResolvedSchedule(
                period_key=None,
                period_name=None,
                day_plan="disabled",
                collect=True,
                analyze=True,
                push=True,
                report_mode=self.fallback_report_mode,
                ai_mode="follow_report",
                once_analyze=False,
                once_push=False,
            )

        now = self.get_time()
        weekday = now.isoweekday()  # 1 = Monday ... 7 = Sunday
        now_hhmm = now.strftime("%H:%M")

        # Look up today's day plan
        day_plan_key = self.timeline["week_map"].get(weekday)
        if day_plan_key is None:
            raise ValueError(f"week_map is missing a mapping for weekday: {weekday}")

        day_plan = self.timeline["day_plans"].get(day_plan_key)
        if day_plan is None:
            raise ValueError(f"week_map[{weekday}] references a nonexistent day_plan: {day_plan_key}")

        # Find the currently active period
        period_key = self._find_active_period(now_hhmm, day_plan)

        # Merge the default configuration with the period configuration
        merged = self._merge_with_default(period_key)

        # Print the scheduling log
        weekday_names = {
            1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
            5: "Friday", 6: "Saturday", 7: "Sunday",
        }
        period_display = "default configuration (no period matched)"
        if period_key:
            period_cfg = self.timeline["periods"][period_key]
            period_name = period_cfg.get("name", period_key)
            start = period_cfg.get("start", "?")
            end = period_cfg.get("end", "?")
            period_display = f"{period_name} ({start}-{end})"

        print(f"[Scheduler] {weekday_names.get(weekday, '?')}, day plan: {day_plan_key}")
        print(f"[Scheduler] Current period: {period_display}")

        resolved = ResolvedSchedule(
            period_key=period_key,
            period_name=(
                self.timeline["periods"][period_key].get("name")
                if period_key
                else None
            ),
            day_plan=day_plan_key,
            collect=merged.get("collect", True),
            analyze=merged.get("analyze", False),
            push=merged.get("push", False),
            report_mode=merged.get("report_mode", "current"),
            ai_mode=self._resolve_ai_mode(merged),
            once_analyze=merged.get("once", {}).get("analyze", False),
            once_push=merged.get("once", {}).get("push", False),
            frequency_file=merged.get("frequency_file"),
            filter_method=merged.get("filter_method"),
            interests_file=merged.get("interests_file"),
        )

        # Print a summary of the actions
        actions = []
        if resolved.collect:
            actions.append("collect")
        if resolved.analyze:
            actions.append(f"analyze (AI: {resolved.ai_mode})")
        if resolved.push:
            actions.append(f"push (mode: {resolved.report_mode})")
        print(f"[Scheduler] Actions: {', '.join(actions) if actions else 'none'}")
        if resolved.frequency_file:
            print(f"[Scheduler] Frequency-words file: {resolved.frequency_file}")

        return resolved

    def _find_active_period(
        self, now_hhmm: str, day_plan: Dict[str, Any]
    ) -> Optional[str]:
        """
        Find the active period that the current time falls into

        Args:
            now_hhmm: current time as HH:MM
            day_plan: day plan configuration

        Returns:
            The matched period key, or None
        """
        candidates = []
        for idx, key in enumerate(day_plan.get("periods", [])):
            period = self.timeline["periods"].get(key)
            if period is None:
                continue
            if self._in_range(now_hhmm, period["start"], period["end"]):
                candidates.append((idx, key))

        if not candidates:
            return None

        # Check for conflicts
        if len(candidates) > 1:
            policy = self.timeline.get("overlap", {}).get("policy", "error_on_overlap")
            conflicting = [c[1] for c in candidates]

            if policy == "error_on_overlap":
                raise ValueError(
                    f"Overlapping periods detected: {', '.join(conflicting)} overlap at {now_hhmm}. "
                    f"Adjust the period configuration, or set overlap.policy to 'last_wins'"
                )

            # last_wins: print an overlap warning; the later entry in the list takes precedence
            print(
                f"[Scheduler] Overlapping periods detected: {', '.join(conflicting)} overlap at {now_hhmm}"
            )
            winner = candidates[-1]
            print(f"[Scheduler] Conflict policy: last_wins, effective period: {winner[1]}")
            return winner[1]

        return candidates[0][1]

    @staticmethod
    def _in_range(now_hhmm: str, start: str, end: str) -> bool:
        """
        Check whether a time falls within a range (cross-midnight ranges supported)

        Zero-padded HH:MM strings compare correctly as plain strings,
        and both endpoints are inclusive.

        Args:
            now_hhmm: current time as HH:MM
            start: start time as HH:MM
            end: end time as HH:MM

        Returns:
            Whether the time is within the range
        """
        if start <= end:
            # Normal range, e.g. 08:00-09:00
            return start <= now_hhmm <= end
        else:
            # Cross-midnight range, e.g. 22:00-07:00
            return now_hhmm >= start or now_hhmm <= end

    def _merge_with_default(self, period_key: Optional[str]) -> Dict[str, Any]:
        """Merge the default configuration with the period configuration"""
        base = copy.deepcopy(self.timeline.get("default", {}))

        if not period_key:
            return base

        period = copy.deepcopy(self.timeline["periods"][period_key])

        # Merge the once sub-object first
        merged_once = dict(base.get("once", {}))
        merged_once.update(period.get("once", {}))

        # Scalar fields from the period override the defaults
        base.update(period)

        # Restore the merged once
        if merged_once:
            base["once"] = merged_once

        return base

    @staticmethod
    def _resolve_ai_mode(cfg: Dict[str, Any]) -> str:
        """Resolve the final AI mode"""
        ai_mode = cfg.get("ai_mode", "follow_report")
        if ai_mode == "follow_report":
            return cfg.get("report_mode", "current")
        return ai_mode

    def already_executed(self, period_key: str, action: str,
                         date_str: str) -> bool:
        """
        Check whether an action of the given period has already run today

        Args:
            period_key: period key
            action: action type (analyze / push)
            date_str: date as YYYY-MM-DD

        Returns:
            Whether it has already run
        """
        return self.storage.has_period_executed(date_str, period_key, action)

    def record_execution(self, period_key: str, action: str, date_str: str) -> None:
        """
        Record that an action of a period has run

        Args:
            period_key: period key
            action: action type (analyze / push)
            date_str: date as YYYY-MM-DD
        """
        self.storage.record_period_execution(date_str, period_key, action)

    # ========================================
    # Validation
    # ========================================

    def _validate_timeline(self, timeline: Dict[str, Any]) -> None:
        """
        Validate the timeline configuration at startup

        Raises:
            ValueError: raised when the configuration is invalid
        """
        required_top_keys = ["default", "periods", "day_plans", "week_map"]
        for key in required_top_keys:
            if key not in timeline:
                raise ValueError(f"timeline is missing required field: {key}")

        # week_map must cover 1..7
        for day in range(1, 8):
            if day not in timeline["week_map"]:
                raise ValueError(f"week_map is missing a mapping for weekday: {day}")

        # day_plan reference integrity
        for day, plan_key in timeline["week_map"].items():
            if plan_key not in timeline["day_plans"]:
                raise ValueError(
                    f"week_map[{day}] references a nonexistent day_plan: {plan_key}"
                )

        # period reference integrity
        for plan_key, plan in timeline["day_plans"].items():
            for period_key in plan.get("periods", []):
                if period_key not in timeline["periods"]:
                    raise ValueError(
                        f"day_plan[{plan_key}] references a nonexistent period: {period_key}"
                    )

        # Time format validation
        for period_key, period in timeline["periods"].items():
            if "start" not in period or "end" not in period:
                raise ValueError(
                    f"period '{period_key}' is missing the start or end field"
                )
            self._validate_hhmm(period["start"], f"{period_key}.start")
            self._validate_hhmm(period["end"], f"{period_key}.end")

            if period["start"] == period["end"]:
                raise ValueError(
                    f"period '{period_key}' must not have identical start and end: {period['start']}"
                )

        # Check for overlaps under the conflict policy
        policy = timeline.get("overlap", {}).get("policy", "error_on_overlap")
        if policy == "error_on_overlap":
            self._check_period_overlaps(timeline)

    def _check_period_overlaps(self, timeline: Dict[str, Any]) -> None:
        """
        Check whether the periods within each day plan overlap

        Only called when overlap.policy == "error_on_overlap"
        """
        periods = timeline.get("periods", {})

        for plan_key, plan in timeline["day_plans"].items():
            period_keys = plan.get("periods", [])
            if len(period_keys) <= 1:
                continue

            # Collect the range of each period
            ranges = []
            for pk in period_keys:
                p = periods.get(pk, {})
                if "start" in p and "end" in p:
                    ranges.append((pk, p["start"], p["end"]))

            # Pairwise overlap check
            for i in range(len(ranges)):
                for j in range(i + 1, len(ranges)):
                    if self._ranges_overlap(
                        ranges[i][1], ranges[i][2],
                        ranges[j][1], ranges[j][2],
                    ):
                        raise ValueError(
                            f"In day_plan '{plan_key}', period '{ranges[i][0]}' "
                            f"({ranges[i][1]}-{ranges[i][2]}) overlaps with '{ranges[j][0]}' "
                            f"({ranges[j][1]}-{ranges[j][2]}). "
                            f"Adjust the periods, or set overlap.policy to 'last_wins'"
                        )

    @staticmethod
    def _ranges_overlap(s1: str, e1: str, s2: str, e2: str) -> bool:
        """Check whether two time ranges overlap (cross-midnight ranges supported)"""
        def to_minutes(t: str) -> int:
            h, m = t.split(":")
            return int(h) * 60 + int(m)

        def expand_range(start: str, end: str) -> List[tuple]:
            """Expand a time range into a list of minute segments; a cross-midnight range is split in two"""
            s = to_minutes(start)
            e = to_minutes(end)
            if s <= e:
                return [(s, e)]
            else:
                # Cross-midnight: split into [start, 23:59] and [00:00, end]
                return [(s, 24 * 60 - 1), (0, e)]

        segs1 = expand_range(s1, e1)
        segs2 = expand_range(s2, e2)

        for a_start, a_end in segs1:
            for b_start, b_end in segs2:
                # Two closed intervals overlap when each starts no later than the other ends
                if a_start <= b_end and b_start <= a_end:
                    return True
        return False

    @staticmethod
    def _validate_hhmm(value: str, field_name: str) -> None:
        """Validate the HH:MM format"""
        if not re.match(r"^\d{2}:\d{2}$", value):
            raise ValueError(f"{field_name} has an invalid format: '{value}', expected HH:MM")
        h, m = value.split(":")
        if not (0 <= int(h) <= 23 and 0 <= int(m) <= 59):
            raise ValueError(f"{field_name} time value is out of range: '{value}'")


================================================
FILE: trendradar/crawler/__init__.py
================================================
# coding=utf-8
"""
Crawler module - data fetching
"""

from trendradar.crawler.fetcher import DataFetcher

__all__ = ["DataFetcher"]


================================================
FILE: trendradar/crawler/fetcher.py
================================================
# coding=utf-8
"""
Data fetcher module

Fetches news data from the NewsNow API. Supports:
- Fetching data for a single platform
- Batch crawling of multiple platforms
- Automatic retries
- Proxy support
"""

import json
import random
import time
from typing import Dict, List, Tuple, Optional, Union

import requests


class DataFetcher:
    """Data fetcher"""

    # Default API URL
    DEFAULT_API_URL = "https://newsnow.busiyi.world/api/s"

    # Default request headers
    DEFAULT_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Connection": "keep-alive",
        "Cache-Control": "no-cache",
    }

    def __init__(
        self,
        proxy_url: Optional[str] = None,
        api_url: Optional[str] = None,
    ):
        """
        Initialize the data fetcher

        Args:
            proxy_url: proxy server URL (optional)
            api_url: API base URL (optional, defaults to DEFAULT_API_URL)
        """
        self.proxy_url = proxy_url
        self.api_url = api_url or self.DEFAULT_API_URL

    def fetch_data(
        self,
        id_info: Union[str, Tuple[str, str]],
        max_retries: int = 2,
        min_retry_wait: int = 3,
        max_retry_wait: int = 5,
    ) -> Tuple[Optional[str], str, str]:
        """
        Fetch data for the given ID, with retries

        Args:
            id_info: platform ID, or a (platform ID, alias) tuple
            max_retries: maximum number of retries
            min_retry_wait: minimum wait before a retry (seconds)
            max_retry_wait: maximum wait before a retry (seconds)

        Returns:
            (response text, platform ID, alias) tuple; the response text is None on failure
        """
        if isinstance(id_info, tuple):
            id_value, alias = id_info
        else:
            id_value = id_info
            alias = id_value

        url = f"{self.api_url}?id={id_value}&latest"

        proxies = None
        if self.proxy_url:
            proxies = {"http": self.proxy_url, "https": self.proxy_url}

        retries = 0
        while retries <= max_retries:
            try:
                response = requests.get(
                    url,
                    proxies=proxies,
                    headers=self.DEFAULT_HEADERS,
                    timeout=10,
                )
                response.raise_for_status()

                data_text = response.text
                data_json = json.loads(data_text)

                status = data_json.get("status", "unknown")
                if status not in ["success", "cache"]:
                    raise ValueError(f"Unexpected response status: {status}")

                status_info = "fresh data" if status == "success" else "cached data"
                print(f"Fetched {id_value} successfully ({status_info})")
                return data_text, id_value, alias

            except Exception as e:
                retries += 1
                if retries <= max_retries:
                    # Random base wait plus an extra delay that grows with each retry
                    base_wait = random.uniform(min_retry_wait, max_retry_wait)
                    additional_wait = (retries - 1) * random.uniform(1, 2)
                    wait_time = base_wait + additional_wait
                    print(f"Request for {id_value} failed: {e}. Retrying in {wait_time:.2f}s...")
                    time.sleep(wait_time)
                else:
                    print(f"Request for {id_value} failed: {e}")
                    return None, id_value, alias

        return None, id_value, alias

    def crawl_websites(
        self,
        ids_list: List[Union[str, Tuple[str, str]]],
        request_interval: int = 100,
    ) -> Tuple[Dict, Dict, List]:
        """
        Crawl data from multiple sites

        Args:
            ids_list: list of platform IDs; each element is a string or a (platform ID, alias) tuple
            request_interval: interval between requests (milliseconds)

        Returns:
            (results dict, ID-to-name mapping, list of failed IDs) tuple
        """
        results = {}
        id_to_name = {}
        failed_ids = []

        for i, id_info in enumerate(ids_list):
            if isinstance(id_info, tuple):
                id_value, name = id_info
            else:
                id_value = id_info
                name = id_value

            id_to_name[id_value] = name
            response, _, _ = self.fetch_data(id_info)

            if response:
                try:
                    data = json.loads(response)
                    results[id_value] = {}

                    for index, item in enumerate(data.get("items", []), 1):
                        title = item.get("title")
                        # Skip invalid titles (None, float, empty string)
                        if title is None or isinstance(title, float) or not str(title).strip():
                            continue
                        title = str(title).strip()
                        url = item.get("url", "")
                        mobile_url = item.get("mobileUrl", "")

                        if title in results[id_value]:
                            # Duplicate title: record the additional rank
                            results[id_value][title]["ranks"].append(index)
                        else:
                            results[id_value][title] = {
                                "ranks": [index],
                                "url": url,
                                "mobileUrl": mobile_url,
                            }
                except json.JSONDecodeError:
                    print(f"Failed to parse the response for {id_value}")
                    failed_ids.append(id_value)
                except Exception as e:
                    print(f"Error while processing data for {id_value}: {e}")
                    failed_ids.append(id_value)
            else:
                failed_ids.append(id_value)

            # Interval between requests (skipped after the last one)
            if i < len(ids_list) - 1:
                actual_interval = request_interval + random.randint(-10, 20)
                actual_interval = max(50, actual_interval)
                time.sleep(actual_interval / 1000)

        print(f"Succeeded: {list(results.keys())}, failed: {failed_ids}")
        return results, id_to_name, failed_ids


================================================
FILE:
trendradar/crawler/rss/__init__.py
================================================
# coding=utf-8
"""
RSS fetching module

Provides parsing and fetching of RSS 2.0, Atom, and JSON Feed 1.1 feeds
"""

from .parser import RSSParser
from .fetcher import RSSFetcher, RSSFeedConfig

__all__ = ["RSSParser", "RSSFetcher", "RSSFeedConfig"]


================================================
FILE: trendradar/crawler/rss/fetcher.py
================================================
# coding=utf-8
"""
RSS fetcher

Fetches data from the configured RSS feeds and converts it to the standard format
"""

import time
import random
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple

import requests

from .parser import RSSParser
from trendradar.storage.base import RSSItem, RSSData
from trendradar.utils.time import get_configured_time, is_within_days, DEFAULT_TIMEZONE


@dataclass
class RSSFeedConfig:
    """RSS feed configuration"""
    id: str                     # feed ID
    name: str                   # display name
    url: str                    # RSS URL
    max_items: int = 0          # maximum number of items (0 = unlimited)
    enabled: bool = True        # whether the feed is enabled
    max_age_days: Optional[int] = None  # maximum article age in days, overriding the global setting; None = use global, 0 = disable filtering


class RSSFetcher:
    """RSS fetcher"""

    def __init__(
        self,
        feeds: List[RSSFeedConfig],
        request_interval: int = 2000,
        timeout: int = 15,
        use_proxy: bool = False,
        proxy_url: str = "",
        timezone: str = DEFAULT_TIMEZONE,
        freshness_enabled: bool = True,
        default_max_age_days: int = 3,
    ):
        """
        Initialize the fetcher

        Args:
            feeds: list of RSS feed configurations
            request_interval: interval between requests (milliseconds)
            timeout: request timeout (seconds)
            use_proxy: whether to use a proxy
            proxy_url: proxy URL
            timezone: timezone setting (e.g. 'Asia/Shanghai')
            freshness_enabled: whether freshness filtering is enabled
            default_max_age_days: default maximum article age (days)
        """
        self.feeds = [f for f in feeds if f.enabled]
        self.request_interval = request_interval
        self.timeout = timeout
        self.use_proxy = use_proxy
        self.proxy_url = proxy_url
        self.timezone = timezone
        self.freshness_enabled = freshness_enabled
        self.default_max_age_days = default_max_age_days

        self.parser = RSSParser()
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        """Create the request session"""
        session = requests.Session()
        session.headers.update({
            "User-Agent": "TrendRadar/2.0 RSS Reader (https://github.com/trendradar)",
            "Accept": "application/feed+json, application/json, application/rss+xml, application/atom+xml, application/xml, text/xml, */*",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        })

        if self.use_proxy and self.proxy_url:
            session.proxies = {
                "http": self.proxy_url,
                "https": self.proxy_url,
            }

        return session

    def _filter_by_freshness(
        self,
        items: List[RSSItem],
        feed: RSSFeedConfig,
    ) -> Tuple[List[RSSItem], int]:
        """
        Filter articles by freshness

        Args:
            items: list of articles to filter
            feed: RSS feed configuration

        Returns:
            (filtered article list, number of articles filtered out)
        """
        # If filtering is disabled globally, return everything
        if not self.freshness_enabled:
            return items, 0

        # Determine max_age_days for this feed
        max_days = feed.max_age_days
        if max_days is None:
            max_days = self.default_max_age_days

        # A value of 0 disables filtering for this feed
        if max_days == 0:
            return items, 0

        # Filtering logic: articles without a publish time are kept
        filtered = []
        for item in items:
            if not item.published_at:
                # No publish time: keep
                filtered.append(item)
            elif is_within_days(item.published_at, max_days, self.timezone):
                # Within the allowed number of days: keep
                filtered.append(item)
            # Otherwise: filter out

        filtered_count = len(items) - len(filtered)
        return filtered, filtered_count

    def fetch_feed(self, feed: RSSFeedConfig) -> Tuple[List[RSSItem], Optional[str]]:
        """
        Fetch a single RSS feed

        Args:
            feed: RSS feed configuration

        Returns:
            (item list, error message) tuple
        """
        try:
            response = self.session.get(feed.url, timeout=self.timeout)
            response.raise_for_status()

            parsed_items = self.parser.parse(response.text, feed.url)

            # Limit the number of items (0 = unlimited)
            if feed.max_items > 0:
                parsed_items = parsed_items[:feed.max_items]

            # Convert to RSSItem (using the configured timezone)
            now = get_configured_time(self.timezone)
            crawl_time = now.strftime("%H:%M")

            items = []
            for parsed in parsed_items:
                item = RSSItem(
                    title=parsed.title,
                    feed_id=feed.id,
                    feed_name=feed.name,
                    url=parsed.url,
                    published_at=parsed.published_at or "",
                    summary=parsed.summary or "",
                    author=parsed.author or "",
                    crawl_time=crawl_time,
                    first_time=crawl_time,
                    last_time=crawl_time,
                    count=1,
                )
                items.append(item)

            # Note: freshness filtering has moved to the push stage (_convert_rss_items_to_list),
            # so every article is stored in the database but old articles are not pushed
            print(f"[RSS] {feed.name}: fetched {len(items)} items")
            return items, None

        except requests.Timeout:
            error = f"Request timed out ({self.timeout}s)"
            print(f"[RSS] {feed.name}: {error}")
            return [], error

        except requests.RequestException as e:
            error = f"Request failed: {e}"
            print(f"[RSS] {feed.name}: {error}")
            return [], error

        except ValueError as e:
            error = f"Parsing failed: {e}"
            print(f"[RSS] {feed.name}: {error}")
            return [], error

        except Exception as e:
            error = f"Unknown error: {e}"
            print(f"[RSS] {feed.name}: {error}")
            return [], error

    def fetch_all(self) -> RSSData:
        """
        Fetch all RSS feeds

        Returns:
            RSSData object
        """
        all_items: Dict[str, List[RSSItem]] = {}
        id_to_name: Dict[str, str] = {}
        failed_ids: List[str] = []

        # Use the configured timezone
        now = get_configured_time(self.timezone)
        crawl_time = now.strftime("%H:%M")
        crawl_date = now.strftime("%Y-%m-%d")

        print(f"[RSS] Starting to fetch {len(self.feeds)} RSS feeds...")

        for i, feed in enumerate(self.feeds):
            # Interval between requests (with random jitter of up to 20%)
            if i > 0:
                interval = self.request_interval / 1000
                jitter = random.uniform(-0.2, 0.2) * interval
                time.sleep(interval + jitter)

            items, error = self.fetch_feed(feed)

            id_to_name[feed.id] = feed.name

            if error:
                failed_ids.append(feed.id)
            else:
                all_items[feed.id] = items

        total_items = sum(len(items) for items in all_items.values())
        print(f"[RSS] Fetch complete: {len(all_items)} feeds succeeded, {len(failed_ids)} failed, {total_items} items in total")

        return RSSData(
            date=crawl_date,
            crawl_time=crawl_time,
            items=all_items,
            id_to_name=id_to_name,
            failed_ids=failed_ids,
        )

    @classmethod
    def from_config(cls, config: Dict) -> "RSSFetcher":
        """
        Create a fetcher from a configuration dict

        Args:
            config: configuration dict in the following format:
                {
                    "enabled": true,
                    "request_interval": 2000,
                    "freshness_filter": {
                        "enabled": true,
                        "max_age_days": 3
                    },
                    "feeds": [
                        {"id": "hacker-news", "name": "Hacker News", "url": "...", "max_age_days": 1}
                    ]
                }

        Returns:
            RSSFetcher instance
        """
        # Read the freshness filter configuration
        freshness_config = config.get("freshness_filter", {})
        freshness_enabled = freshness_config.get("enabled", True)  # enabled by default
        default_max_age_days = freshness_config.get("max_age_days", 3)  # 3 days by default

        feeds = []
        for feed_config in config.get("feeds", []):
            # Read and validate the per-feed max_age_days (optional)
            max_age_days_raw = feed_config.get("max_age_days")
            max_age_days = None
            if max_age_days_raw is not None:
                try:
                    max_age_days = int(max_age_days_raw)
                    if max_age_days < 0:
                        feed_id = feed_config.get("id", "unknown")
                        print(f"[Warning] max_age_days for RSS feed '{feed_id}' is negative; using the global default")
                        max_age_days = None
                except (ValueError, TypeError):
                    feed_id = feed_config.get("id", "unknown")
                    print(f"[Warning] max_age_days for RSS feed '{feed_id}' has an invalid format: {max_age_days_raw}")
                    max_age_days = None

            feed = RSSFeedConfig(
                id=feed_config.get("id", ""),
                name=feed_config.get("name", ""),
                url=feed_config.get("url", ""),
                max_items=feed_config.get("max_items", 0),  # 0 = unlimited
                enabled=feed_config.get("enabled", True),
                max_age_days=max_age_days,  # None = use global, 0 = disabled, >0 = override
            )
            if feed.id and feed.url:
                feeds.append(feed)

        return cls(
            feeds=feeds,
            request_interval=config.get("request_interval", 2000),
            timeout=config.get("timeout", 15),
            use_proxy=config.get("use_proxy", False),
            proxy_url=config.get("proxy_url", ""),
            timezone=config.get("timezone", DEFAULT_TIMEZONE),
            freshness_enabled=freshness_enabled,
            default_max_age_days=default_max_age_days,
        )


================================================
FILE: trendradar/crawler/rss/parser.py
================================================
# coding=utf-8
"""
RSS parser

Supports parsing the RSS 2.0, Atom, and JSON Feed 1.1 formats
"""

import re
import html
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Dict, Any
from email.utils import parsedate_to_datetime

try:
    import feedparser
    HAS_FEEDPARSER = True
except ImportError:
    HAS_FEEDPARSER = False
    feedparser = None


@dataclass
class ParsedRSSItem:
    """A parsed RSS item"""
    title: str
    url: str
    published_at: Optional[str] = None
    summary: Optional[str] = None
    author: Optional[str] = None
    guid: Optional[str] = None


class RSSParser:
    """RSS parser"""

    def __init__(self, max_summary_length: int = 500):
        """
        Initialize the parser

        Args:
            max_summary_length: maximum summary length
        """
        if not HAS_FEEDPARSER:
            raise ImportError("RSS parsing requires feedparser: pip install feedparser")

        self.max_summary_length = max_summary_length

    def parse(self, content: str, feed_url: str = "") -> List[ParsedRSSItem]:
        """
        Parse RSS/Atom/JSON Feed content.

        Args:
            content: Feed content (XML or JSON)
            feed_url: Feed URL (used in error messages)

        Returns:
            List of parsed entries.
        """
        # Check for JSON Feed first
        if self._is_json_feed(content):
            return self._parse_json_feed(content, feed_url)

        # Parse RSS/Atom with feedparser
        feed = feedparser.parse(content)

        # Only fail when the feed is malformed AND yielded no entries
        if feed.bozo and not feed.entries:
            raise ValueError(f"RSS 解析失败 ({feed_url}): {feed.bozo_exception}")

        items = []
        for entry in feed.entries:
            item = self._parse_entry(entry)
            if item:
                items.append(item)

        return items

    def _is_json_feed(self, content: str) -> bool:
        """
        Detect whether the content is a JSON Feed.

        A JSON Feed must carry a version field whose value is
        https://jsonfeed.org/version/1 or https://jsonfeed.org/version/1.1
        """
        content = content.strip()
        if not content.startswith("{"):
            return False

        try:
            data = json.loads(content)
            version = data.get("version", "")
            return "jsonfeed.org" in version
        except (json.JSONDecodeError, TypeError):
            return False

    def _parse_json_feed(self, content: str, feed_url: str = "") -> List[ParsedRSSItem]:
        """
        Parse the JSON Feed 1.1 format.

        JSON Feed spec: https://www.jsonfeed.org/version/1.1/

        Args:
            content: JSON Feed content
            feed_url: Feed URL (used in error messages)

        Returns:
            List of parsed entries.
        """
        try:
            data = json.loads(content)
        except json.JSONDecodeError as e:
            raise ValueError(f"JSON Feed 解析失败 ({feed_url}): {e}") from e

        items_data = data.get("items", [])
        if not items_data:
            return []

        items = []
        for item_data in items_data:
            item = self._parse_json_feed_item(item_data)
            if item:
                items.append(item)

        return items

    def _parse_json_feed_item(self, item_data: Dict[str, Any]) -> Optional[ParsedRSSItem]:
        """Parse a single JSON Feed item."""
        # Title: prefer title, otherwise the first 100 characters of content_text
        title = item_data.get("title", "")
        if not title:
            content_text = item_data.get("content_text", "")
            if content_text:
                title = content_text[:100] + ("..." if len(content_text) > 100 else "")

        title = self._clean_text(title)
        if not title:
            return None

        # URL
        url = item_data.get("url", "") or item_data.get("external_url", "")

        # Publish time (ISO 8601)
        published_at = None
        date_str = item_data.get("date_published") or item_data.get("date_modified")
        if date_str:
            published_at = self._parse_iso_date(date_str)

        # Summary: prefer summary, otherwise content_text, then content_html
        summary = item_data.get("summary", "")
        if not summary:
            content_text = item_data.get("content_text", "")
            content_html = item_data.get("content_html", "")
            summary = content_text or self._clean_text(content_html)

        if summary:
            summary = self._clean_text(summary)
            if len(summary) > self.max_summary_length:
                summary = summary[:self.max_summary_length] + "..."

        # Author
        author = None
        authors = item_data.get("authors", [])
        if authors:
            names = [a.get("name", "") for a in authors if isinstance(a, dict) and a.get("name")]
            if names:
                author = ", ".join(names)

        # GUID: fall back to the URL when the item has no id
        guid = item_data.get("id", "") or url

        return ParsedRSSItem(
            title=title,
            url=url,
            published_at=published_at,
            summary=summary or None,
            author=author,
            guid=guid,
        )

    def _parse_iso_date(self, date_str: str) -> Optional[str]:
        """Parse an ISO 8601 date string."""
        if not date_str:
            return None

        try:
            # Handle common ISO 8601 forms: replace a trailing Z with +00:00
            date_str = date_str.replace("Z", "+00:00")
            dt = datetime.fromisoformat(date_str)
            return dt.isoformat()
        except (ValueError, TypeError):
            pass

        return None

    def parse_url(self, url: str, timeout: int = 10) -> List[ParsedRSSItem]:
        """
        Fetch and parse RSS from a URL.

        Args:
            url: RSS URL
            timeout: Timeout in seconds

        Returns:
            List of parsed entries.
        """
        import requests

        response = requests.get(url, timeout=timeout, headers={
            "User-Agent": "TrendRadar/2.0 RSS Reader"
        })
        response.raise_for_status()

        return self.parse(response.text, url)

    def _parse_entry(self, entry: Any) -> Optional[ParsedRSSItem]:
        """Parse a single feedparser entry."""
        title = self._clean_text(entry.get("title", ""))
        if not title:
            return None

        url = entry.get("link", "")
        if not url:
            # Fall back to the links list
            links = entry.get("links", [])
            for link in links:
                if link.get("rel") == "alternate" or link.get("type", "").startswith("text/html"):
                    url = link.get("href", "")
                    break
            if not url and links:
                url = links[0].get("href", "")

        published_at = self._parse_date(entry)
        summary = self._parse_summary(entry)
        author = self._parse_author(entry)
        guid = entry.get("id") or entry.get("guid", {}).get("value") or url

        return ParsedRSSItem(
            title=title,
            url=url,
            published_at=published_at,
            summary=summary,
            author=author,
            guid=guid,
        )

    def _clean_text(self, text: str) -> str:
        """Clean text: decode entities, strip tags, collapse whitespace."""
        if not text:
            return ""

        # Decode HTML entities
        text = html.unescape(text)

        # Remove HTML tags
        text = re.sub(r'<[^>]+>', '', text)

        # Collapse redundant whitespace
        text = re.sub(r'\s+', ' ', text)

        return text.strip()

    def _parse_date(self, entry: Any) -> Optional[str]:
        """Parse the publish date."""
        # feedparser parses dates into published_parsed automatically
        date_struct = entry.get("published_parsed") or entry.get("updated_parsed")
        if date_struct:
            try:
                dt = datetime(*date_struct[:6])
                return dt.isoformat()
            except (ValueError, TypeError):
                pass

        # Try parsing manually (RFC 2822 first)
        date_str = entry.get("published") or entry.get("updated")
        if date_str:
            try:
                dt = parsedate_to_datetime(date_str)
                return dt.isoformat()
            except (ValueError, TypeError):
                pass

            # Then try ISO format
            try:
                dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
                return dt.isoformat()
            except (ValueError, TypeError):
                pass

        return None

    def _parse_summary(self, entry: Any) -> Optional[str]:
        """Parse the summary."""
        summary = entry.get("summary") or entry.get("description", "")
        if not summary:
            # Fall back to the content field
            content = entry.get("content", [])
            if content and isinstance(content, list):
                summary = content[0].get("value", "")

        if not summary:
            return None

        summary = self._clean_text(summary)

        # Truncate overly long summaries
        if len(summary) > self.max_summary_length:
            summary = summary[:self.max_summary_length] + "..."
        return summary

    def _parse_author(self, entry: Any) -> Optional[str]:
        """Parse the author."""
        author = entry.get("author")
        if author:
            return self._clean_text(author)

        # Try dc:creator
        author = entry.get("dc_creator")
        if author:
            return self._clean_text(author)

        # Try the authors list
        authors = entry.get("authors", [])
        if authors:
            names = [a.get("name", "") for a in authors if a.get("name")]
            if names:
                return ", ".join(names)

        return None


================================================
FILE: trendradar/notification/__init__.py
================================================
# coding=utf-8
"""
Notification push module

Provides multi-channel notification delivery, including:
- Feishu, DingTalk, WeCom
- Telegram, Slack
- Email, ntfy, Bark

Module layout:
- formatters: content format conversion
- batch: batch-handling helpers
- renderer: notification content rendering
- splitter: splitting messages into batches
- senders: message senders (one send function per channel)
- dispatcher: multi-account notification dispatcher
"""

from trendradar.notification.formatters import (
    strip_markdown,
    convert_markdown_to_mrkdwn,
)
from trendradar.notification.batch import (
    get_batch_header,
    get_max_batch_header_size,
    truncate_to_bytes,
    add_batch_headers,
)
from trendradar.notification.renderer import (
    render_feishu_content,
    render_dingtalk_content,
)
from trendradar.notification.splitter import (
    split_content_into_batches,
    DEFAULT_BATCH_SIZES,
)
from trendradar.notification.senders import (
    send_to_feishu,
    send_to_dingtalk,
    send_to_wework,
    send_to_telegram,
    send_to_email,
    send_to_ntfy,
    send_to_bark,
    send_to_slack,
    SMTP_CONFIGS,
)
from trendradar.notification.dispatcher import NotificationDispatcher

__all__ = [
    # Format conversion
    "strip_markdown",
    "convert_markdown_to_mrkdwn",
    # Batch handling
    "get_batch_header",
    "get_max_batch_header_size",
    "truncate_to_bytes",
    "add_batch_headers",
    # Content rendering
    "render_feishu_content",
    "render_dingtalk_content",
    # Message splitting
    "split_content_into_batches",
    "DEFAULT_BATCH_SIZES",
    # Message senders
    "send_to_feishu",
    "send_to_dingtalk",
    "send_to_wework",
    "send_to_telegram",
    "send_to_email",
    "send_to_ntfy",
    "send_to_bark",
    "send_to_slack",
    "SMTP_CONFIGS",
    # Notification dispatcher
    "NotificationDispatcher",
]


================================================
FILE: trendradar/notification/batch.py
================================================
# coding=utf-8
"""
Batch-handling module

Helper functions for sending messages in batches.
"""

from typing import List


def get_batch_header(format_type: str, batch_num: int, total_batches: int) -> str:
    """Build the batch header in the format matching format_type.

    Args:
        format_type: Push type (telegram, slack, wework_text, bark, feishu, dingtalk, ntfy, wework)
        batch_num: Current batch number
        total_batches: Total number of batches

    Returns:
        The formatted batch header string.
    """
    if format_type == "telegram":
        return f"[第 {batch_num}/{total_batches} 批次]\n\n"
    elif format_type == "slack":
        return f"*[第 {batch_num}/{total_batches} 批次]*\n\n"
    elif format_type in ("wework_text", "bark"):
        # WeCom text mode and Bark use plain text
        return f"[第 {batch_num}/{total_batches} 批次]\n\n"
    else:
        # Feishu, DingTalk, ntfy, WeCom markdown mode
        return f"**[第 {batch_num}/{total_batches} 批次]**\n\n"


def get_max_batch_header_size(format_type: str) -> int:
    """Estimate the maximum byte size of a batch header (assuming at most 99 batches).

    Used to reserve space while splitting, so content does not have to be
    truncated afterwards at the cost of its integrity.

    Args:
        format_type: Push type

    Returns:
        Maximum header size in bytes.
    """
    # Build the worst-case header (batch 99/99)
    max_header = get_batch_header(format_type, 99, 99)
    return len(max_header.encode("utf-8"))


def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Safely truncate a string to a byte budget without splitting a multi-byte character.

    Args:
        text: Text to truncate
        max_bytes: Maximum number of bytes

    Returns:
        The truncated text.
    """
    text_bytes = text.encode("utf-8")
    if len(text_bytes) <= max_bytes:
        return text

    # Cut at the byte budget
    truncated = text_bytes[:max_bytes]

    # Drop up to 3 trailing bytes until the result decodes as valid UTF-8
    for i in range(min(4, len(truncated))):
        try:
            return truncated[: len(truncated) - i].decode("utf-8")
        except UnicodeDecodeError:
            continue

    # Extreme case: return an empty string
    return ""


def add_batch_headers(
    batches: List[str], format_type: str, max_bytes: int
) -> List[str]:
    """Prepend a header to each batch, sizing content so the total stays within the limit.

    Args:
        batches: Original list of batches
        format_type: Push type (bark, telegram, feishu, etc.)
        max_bytes: Maximum byte limit for this push type

    Returns:
        List of batches with headers added.
    """
    # A single batch needs no "n/total" header
    if len(batches) <= 1:
        return batches

    total = len(batches)
    result = []

    for i, content in enumerate(batches, 1):
        # Build the batch header
        header = get_batch_header(format_type, i, total)
        header_size = len(header.encode("utf-8"))

        # Compute the content budget for this batch
        max_content_size = max_bytes - header_size
        content_size = len(content.encode("utf-8"))

        # Truncate to a safe size if over budget
        if content_size > max_content_size:
            print(
                f"警告:{format_type} 第 {i}/{total} 批次内容({content_size}字节) + 头部({header_size}字节) 超出限制({max_bytes}字节),截断到 {max_content_size} 字节"
            )
            content = truncate_to_bytes(content, max_content_size)

        result.append(header + content)

    return result


================================================
FILE: trendradar/notification/dispatcher.py
================================================
# coding=utf-8
"""
Notification dispatcher module

Provides a unified notification dispatch interface.
Every notification channel supports multi-account configuration; separate accounts with `;`.

Usage example:
    dispatcher = NotificationDispatcher(config, get_time_func, split_content_func)
    results = dispatcher.dispatch_all(report_data, report_type, ...)
"""

from __future__ import annotations

from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional

from trendradar.core.config import (
    get_account_at_index,
    limit_accounts,
    parse_multi_account_config,
    validate_paired_configs,
)

from .senders import (
    send_to_bark,
    send_to_dingtalk,
    send_to_email,
    send_to_feishu,
    send_to_ntfy,
    send_to_slack,
    send_to_telegram,
    send_to_wework,
    send_to_generic_webhook,
)
from .renderer import (
    render_rss_feishu_content,
    render_rss_dingtalk_content,
    render_rss_markdown_content,
)

# Imported for type checking only, not at runtime (avoids a circular import)
if TYPE_CHECKING:
    from trendradar.ai import AIAnalysisResult, AITranslator


class NotificationDispatcher:
    """
    Unified multi-account notification dispatcher.

    Wraps the multi-account sending logic behind a simple dispatch_all interface.
    Account parsing, account-count limits, and paired-config validation are handled internally.
    """

    def __init__(
        self,
        config: Dict[str, Any],
        get_time_func: Callable,
        split_content_func: Callable,
        translator: Optional["AITranslator"] = None,
    ):
        """
        Initialize the notification dispatcher.

        Args:
            config: Full config dict containing the settings of every notification channel
            get_time_func: Function returning the current time
            split_content_func: Function that splits content into batches
            translator: AI translator instance (optional)
        """
        self.config = config
        self.get_time_func = get_time_func
        self.split_content_func = split_content_func
        self.max_accounts = config.get("MAX_ACCOUNTS_PER_CHANNEL", 3)
        self.translator = translator

    def _translate_content(
        self,
        report_data: Dict,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        standalone_data: Optional[Dict] = None,
        display_regions: Optional[Dict] = None,
    ) -> tuple:
        """
        Translate the push content.

        Args:
            report_data: Report data
            rss_items: RSS stats items
            rss_new_items: New RSS items
            standalone_data: Standalone display-region data
            display_regions: Region display config (regions that are not shown are not translated)

        Returns:
            tuple: (translated report_data, rss_items, rss_new_items, standalone_data)
        """
        if not self.translator or not self.translator.enabled:
            return report_data, rss_items, rss_new_items, standalone_data

        import copy
        print(f"[翻译] 开始翻译内容到 {self.translator.target_language}...")

        scope = self.translator.scope
        display_regions = display_regions or {}

        # Deep-copy so the original data is not modified
        report_data = copy.deepcopy(report_data)
        rss_items = copy.deepcopy(rss_items) if rss_items else None
        rss_new_items = copy.deepcopy(rss_new_items) if rss_new_items else None
        standalone_data = copy.deepcopy(standalone_data) if standalone_data else None

        # Collect every title that needs translating
        titles_to_translate = []
        title_locations = []  # where each title came from, used to write results back

        # 1. Hotlist titles (scope enabled AND region shown)
        if scope.get("HOTLIST", True) and display_regions.get("HOTLIST", True):
            for stat_idx, stat in enumerate(report_data.get("stats", [])):
                for title_idx, title_data in enumerate(stat.get("titles", [])):
                    titles_to_translate.append(title_data.get("title", ""))
                    title_locations.append(("stats", stat_idx, title_idx))

            # 2. New hot-topic titles
            for source_idx, source in enumerate(report_data.get("new_titles", [])):
                for title_idx, title_data in enumerate(source.get("titles", [])):
                    titles_to_translate.append(title_data.get("title", ""))
                    title_locations.append(("new_titles", source_idx, title_idx))

        # 3. RSS stats titles (same structure as stats: [{word, count, titles: [{title, ...}]}])
        if rss_items and scope.get("RSS", True) and display_regions.get("RSS", True):
            for stat_idx, stat in enumerate(rss_items):
                for title_idx, title_data in enumerate(stat.get("titles", [])):
                    titles_to_translate.append(title_data.get("title", ""))
                    title_locations.append(("rss_items", stat_idx, title_idx))

        # 4. New RSS titles (same structure as stats)
        if rss_new_items and scope.get("RSS", True) and display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True):
            for stat_idx, stat in enumerate(rss_new_items):
                for title_idx, title_data in enumerate(stat.get("titles", [])):
                    titles_to_translate.append(title_data.get("title", ""))
                    title_locations.append(("rss_new_items", stat_idx, title_idx))

        # 5. Standalone region - hotlist platforms
        if standalone_data and scope.get("STANDALONE", True) and display_regions.get("STANDALONE", False):
            for plat_idx, platform in enumerate(standalone_data.get("platforms", [])):
                for item_idx, item in enumerate(platform.get("items", [])):
                    titles_to_translate.append(item.get("title", ""))
                    title_locations.append(("standalone_platforms", plat_idx, item_idx))

            # 6. Standalone region - RSS feeds
            for feed_idx, feed in enumerate(standalone_data.get("rss_feeds", [])):
                for item_idx, item in enumerate(feed.get("items", [])):
                    titles_to_translate.append(item.get("title", ""))
                    title_locations.append(("standalone_rss", feed_idx, item_idx))

        if not titles_to_translate:
            print("[翻译] 没有需要翻译的内容")
            return report_data, rss_items, rss_new_items, standalone_data

        print(f"[翻译] 共 {len(titles_to_translate)} 条标题待翻译")

        # Translate in one batch
        result = self.translator.translate_batch(titles_to_translate)

        if result.success_count == 0:
            print(f"[翻译] 翻译失败: {result.results[0].error if result.results else '未知错误'}")
            return report_data, rss_items, rss_new_items, standalone_data

        print(f"[翻译] 翻译完成: {result.success_count}/{result.total_count} 成功")

        # Debug mode: print the full prompt, the raw AI response, and a per-item comparison
        if self.config.get("DEBUG", False):
            if result.prompt:
                print("[翻译][DEBUG] === 发送给 AI 的 Prompt ===")
                print(result.prompt)
                print("[翻译][DEBUG] === Prompt 结束 ===")
            if result.raw_response:
                print("[翻译][DEBUG] === AI 原始响应 ===")
                print(result.raw_response)
                print("[翻译][DEBUG] === 响应结束 ===")

            # Warn when the number of returned lines does not match
            expected = len(titles_to_translate)
            if result.parsed_count != expected:
                print(f"[翻译][DEBUG] ⚠️ 行数不匹配:期望 {expected} 条,AI 返回 {result.parsed_count} 条")

            # Per-item comparison
            unchanged_count = 0
            for i, res in enumerate(result.results):
                if not res.success and res.error:
                    print(f"[翻译][DEBUG] [{i+1}] !! 失败: {res.error}")
                elif res.original_text == res.translated_text:
                    unchanged_count += 1
                else:
                    print(f"[翻译][DEBUG] [{i+1}] {res.original_text} => {res.translated_text}")
            if unchanged_count > 0:
                print(f"[翻译][DEBUG] (另有 {unchanged_count} 条未变化,已省略)")

        # Write the translations back to where each title came from
        for i, (loc_type, idx1, idx2) in enumerate(title_locations):
            if i < len(result.results) and result.results[i].success:
                translated = result.results[i].translated_text
                if loc_type == "stats":
                    report_data["stats"][idx1]["titles"][idx2]["title"] = translated
                elif loc_type == "new_titles":
                    report_data["new_titles"][idx1]["titles"][idx2]["title"] = translated
                elif loc_type == "rss_items" and rss_items:
                    rss_items[idx1]["titles"][idx2]["title"] = translated
                elif loc_type == "rss_new_items" and rss_new_items:
                    rss_new_items[idx1]["titles"][idx2]["title"] = translated
                elif loc_type == "standalone_platforms" and standalone_data:
                    standalone_data["platforms"][idx1]["items"][idx2]["title"] = translated
                elif loc_type == "standalone_rss" and standalone_data:
                    standalone_data["rss_feeds"][idx1]["items"][idx2]["title"] = translated

        return report_data, rss_items, rss_new_items, standalone_data

    def dispatch_all(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict] = None,
        proxy_url: Optional[str] = None,
        mode: str = "daily",
        html_file_path: Optional[str] = None,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        standalone_data: Optional[Dict] = None,
    ) -> Dict[str, bool]:
        """
        Dispatch notifications to every configured channel (hotlist + RSS merged push, AI analysis, standalone region).

        Args:
            report_data: Report data (produced by prepare_report_data)
            report_type: Report type (e.g. "全天汇总", "当前榜单", "增量分析")
            update_info: Version update info (optional)
            proxy_url: Proxy URL (optional)
            mode: Report mode (daily/current/incremental)
            html_file_path: Path to the HTML report file (used by email)
            rss_items: RSS stats items (for the RSS stats block)
            rss_new_items: New RSS items (for the RSS new-items block)
            ai_analysis: AI analysis result (optional)
            standalone_data: Standalone display-region data (optional)
        Returns:
            Dict[str, bool]: Send result per channel; the key is the channel name, the value is whether sending succeeded.
        """
        results = {}

        # Read the region display config
        display_regions = self.config.get("DISPLAY", {}).get("REGIONS", {})

        # Translate (if enabled; regions hidden by display_regions are skipped)
        report_data, rss_items, rss_new_items, standalone_data = self._translate_content(
            report_data, rss_items, rss_new_items, standalone_data, display_regions
        )

        # Feishu
        if self.config.get("FEISHU_WEBHOOK_URL"):
            results["feishu"] = self._send_feishu(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # DingTalk
        if self.config.get("DINGTALK_WEBHOOK_URL"):
            results["dingtalk"] = self._send_dingtalk(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # WeCom
        if self.config.get("WEWORK_WEBHOOK_URL"):
            results["wework"] = self._send_wework(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # Telegram (paired config required)
        if self.config.get("TELEGRAM_BOT_TOKEN") and self.config.get("TELEGRAM_CHAT_ID"):
            results["telegram"] = self._send_telegram(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # ntfy (paired config required)
        if self.config.get("NTFY_SERVER_URL") and self.config.get("NTFY_TOPIC"):
            results["ntfy"] = self._send_ntfy(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # Bark
        if self.config.get("BARK_URL"):
            results["bark"] = self._send_bark(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # Slack
        if self.config.get("SLACK_WEBHOOK_URL"):
            results["slack"] = self._send_slack(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # Generic webhook
        if self.config.get("GENERIC_WEBHOOK_URL"):
            results["generic_webhook"] = self._send_generic_webhook(
                report_data, report_type, update_info, proxy_url, mode, rss_items, rss_new_items,
                ai_analysis, display_regions, standalone_data
            )

        # Email (original logic kept; multiple recipients already supported, AI analysis already embedded in the HTML)
        if (
            self.config.get("EMAIL_FROM")
            and self.config.get("EMAIL_PASSWORD")
            and self.config.get("EMAIL_TO")
        ):
            results["email"] = self._send_email(report_type, html_file_path)

        return results

    def _send_to_multi_accounts(
        self,
        channel_name: str,
        config_value: str,
        send_func: Callable[..., bool],
        **kwargs,
    ) -> bool:
        """
        Generic multi-account sending logic.

        Args:
            channel_name: Channel name (used in logs and account-limit messages)
            config_value: Config value (may hold several accounts separated by ;)
            send_func: Send function with signature (account, account_label=..., **kwargs) -> bool
            **kwargs: Extra arguments passed through to the send function

        Returns:
            bool: True if sending succeeded for at least one account.
        """
        accounts = parse_multi_account_config(config_value)
        if not accounts:
            return False

        accounts = limit_accounts(accounts, self.max_accounts, channel_name)
        results = []

        for i, account in enumerate(accounts):
            if account:
                account_label = f"账号{i+1}" if len(accounts) > 1 else ""
                result = send_func(account, account_label=account_label, **kwargs)
                results.append(result)

        return any(results) if results else False

    def _send_feishu(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to Feishu (multi-account; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        # When the hotlist region is hidden, send an empty report skeleton instead
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        return self._send_to_multi_accounts(
            channel_name="飞书",
            config_value=self.config["FEISHU_WEBHOOK_URL"],
            send_func=lambda url, account_label: send_to_feishu(
                webhook_url=url,
                report_data=report_data,
                report_type=report_type,
                update_info=update_info,
                proxy_url=proxy_url,
                mode=mode,
                account_label=account_label,
                batch_size=self.config.get("FEISHU_BATCH_SIZE", 29000),
                batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0),
                split_content_func=self.split_content_func,
                get_time_func=self.get_time_func,
                rss_items=rss_items if display_regions.get("RSS", True) else None,
                rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                display_regions=display_regions,
                standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
            ),
        )

    def _send_dingtalk(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to DingTalk (multi-account; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        return self._send_to_multi_accounts(
            channel_name="钉钉",
            config_value=self.config["DINGTALK_WEBHOOK_URL"],
            send_func=lambda url, account_label: send_to_dingtalk(
                webhook_url=url,
                report_data=report_data,
                report_type=report_type,
                update_info=update_info,
                proxy_url=proxy_url,
                mode=mode,
                account_label=account_label,
                batch_size=self.config.get("DINGTALK_BATCH_SIZE", 20000),
                batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0),
                split_content_func=self.split_content_func,
                rss_items=rss_items if display_regions.get("RSS", True) else None,
                rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                display_regions=display_regions,
                standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
            ),
        )

    def _send_wework(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to WeCom (multi-account; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        return self._send_to_multi_accounts(
            channel_name="企业微信",
            config_value=self.config["WEWORK_WEBHOOK_URL"],
            send_func=lambda url, account_label: send_to_wework(
                webhook_url=url,
                report_data=report_data,
                report_type=report_type,
                update_info=update_info,
                proxy_url=proxy_url,
                mode=mode,
                account_label=account_label,
                batch_size=self.config.get("MESSAGE_BATCH_SIZE", 4000),
                batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0),
                msg_type=self.config.get("WEWORK_MSG_TYPE", "markdown"),
                split_content_func=self.split_content_func,
                rss_items=rss_items if display_regions.get("RSS", True) else None,
                rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                display_regions=display_regions,
                standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
            ),
        )

    def _send_telegram(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to Telegram (multi-account; token and chat_id must pair up; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        telegram_tokens = parse_multi_account_config(self.config["TELEGRAM_BOT_TOKEN"])
        telegram_chat_ids = parse_multi_account_config(self.config["TELEGRAM_CHAT_ID"])

        if not telegram_tokens or not telegram_chat_ids:
            return False

        valid, count = validate_paired_configs(
            {"bot_token": telegram_tokens, "chat_id": telegram_chat_ids},
            "Telegram",
            required_keys=["bot_token", "chat_id"],
        )
        if not valid or count == 0:
            return False

        telegram_tokens = limit_accounts(telegram_tokens, self.max_accounts, "Telegram")
        telegram_chat_ids = telegram_chat_ids[: len(telegram_tokens)]

        results = []
        for i in range(len(telegram_tokens)):
            token = telegram_tokens[i]
            chat_id = telegram_chat_ids[i]
            if token and chat_id:
                account_label = f"账号{i+1}" if len(telegram_tokens) > 1 else ""
                result = send_to_telegram(
                    bot_token=token,
                    chat_id=chat_id,
                    report_data=report_data,
                    report_type=report_type,
                    update_info=update_info,
                    proxy_url=proxy_url,
                    mode=mode,
                    account_label=account_label,
                    batch_size=self.config.get("MESSAGE_BATCH_SIZE", 4000),
                    batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0),
                    split_content_func=self.split_content_func,
                    rss_items=rss_items if display_regions.get("RSS", True) else None,
                    rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                    ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                    display_regions=display_regions,
                    standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
                )
                results.append(result)

        return any(results) if results else False

    def _send_ntfy(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to ntfy (multi-account; topic and token must pair up; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        ntfy_server_url = self.config["NTFY_SERVER_URL"]
        ntfy_topics = parse_multi_account_config(self.config["NTFY_TOPIC"])
        ntfy_tokens = parse_multi_account_config(self.config.get("NTFY_TOKEN", ""))

        if not ntfy_server_url or not ntfy_topics:
            return False

        # Tokens are optional, but when given there must be exactly one per topic
        if ntfy_tokens and len(ntfy_tokens) != len(ntfy_topics):
            print(
                f"❌ ntfy 配置错误:topic 数量({len(ntfy_topics)})与 token 数量({len(ntfy_tokens)})不一致,跳过 ntfy 推送"
            )
            return False

        ntfy_topics = limit_accounts(ntfy_topics, self.max_accounts, "ntfy")
        if ntfy_tokens:
            ntfy_tokens = ntfy_tokens[: len(ntfy_topics)]

        results = []
        for i, topic in enumerate(ntfy_topics):
            if topic:
                token = get_account_at_index(ntfy_tokens, i, "") if ntfy_tokens else ""
                account_label = f"账号{i+1}" if len(ntfy_topics) > 1 else ""
                result = send_to_ntfy(
                    server_url=ntfy_server_url,
                    topic=topic,
                    token=token,
                    report_data=report_data,
                    report_type=report_type,
                    update_info=update_info,
                    proxy_url=proxy_url,
                    mode=mode,
                    account_label=account_label,
                    batch_size=3800,
                    split_content_func=self.split_content_func,
                    rss_items=rss_items if display_regions.get("RSS", True) else None,
                    rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                    ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                    display_regions=display_regions,
                    standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
                )
                results.append(result)

        return any(results) if results else False

    def _send_bark(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to Bark (multi-account; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        return self._send_to_multi_accounts(
            channel_name="Bark",
            config_value=self.config["BARK_URL"],
            send_func=lambda url, account_label: send_to_bark(
                bark_url=url,
                report_data=report_data,
                report_type=report_type,
                update_info=update_info,
                proxy_url=proxy_url,
                mode=mode,
                account_label=account_label,
                batch_size=self.config.get("BARK_BATCH_SIZE", 3600),
                batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0),
                split_content_func=self.split_content_func,
                rss_items=rss_items if display_regions.get("RSS", True) else None,
                rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                display_regions=display_regions,
                standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
            ),
        )

    def _send_slack(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to Slack (multi-account; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        return self._send_to_multi_accounts(
            channel_name="Slack",
            config_value=self.config["SLACK_WEBHOOK_URL"],
            send_func=lambda url, account_label: send_to_slack(
                webhook_url=url,
                report_data=report_data,
                report_type=report_type,
                update_info=update_info,
                proxy_url=proxy_url,
                mode=mode,
                account_label=account_label,
                batch_size=self.config.get("SLACK_BATCH_SIZE", 4000),
                batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0),
                split_content_func=self.split_content_func,
                rss_items=rss_items if display_regions.get("RSS", True) else None,
                rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None,
                ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None,
                display_regions=display_regions,
                standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None,
            ),
        )

    def _send_generic_webhook(
        self,
        report_data: Dict,
        report_type: str,
        update_info: Optional[Dict],
        proxy_url: Optional[str],
        mode: str,
        rss_items: Optional[List[Dict]] = None,
        rss_new_items: Optional[List[Dict]] = None,
        ai_analysis: Optional[AIAnalysisResult] = None,
        display_regions: Optional[Dict] = None,
        standalone_data: Optional[Dict] = None,
    ) -> bool:
        """Send to the generic webhook (multi-account; supports hotlist + RSS merge, AI analysis, standalone region)."""
        display_regions = display_regions or {}
        if not display_regions.get("HOTLIST", True):
            report_data = {"stats": [], "failed_ids": [], "new_titles": [], "id_to_name": {}}

        urls = parse_multi_account_config(self.config.get("GENERIC_WEBHOOK_URL", ""))
        templates = parse_multi_account_config(self.config.get("GENERIC_WEBHOOK_TEMPLATE", ""))

        if not urls:
            return False

        urls = limit_accounts(urls, self.max_accounts, "通用Webhook")

        results = []
        for i, url in enumerate(urls):
            if not url:
                continue

            # Use the template at the same index; a single template is shared by all URLs
            template = ""
            if templates:
                if i < len(templates):
                    template = templates[i]
                elif len(templates) == 1:
                    template = templates[0]

            account_label = f"账号{i+1}" if len(urls) > 1 else ""

            result = send_to_generic_webhook(
                webhook_url=url,
                payload_template=template,
                report_data=report_data,
                report_type=report_type,
                update_info=update_info,
                proxy_url=proxy_url,
                mode=mode,
                account_label=account_label,
                batch_size=self.config.get("MESSAGE_BATCH_SIZE", 4000),
batch_interval=self.config.get("BATCH_SEND_INTERVAL", 1.0), split_content_func=self.split_content_func, rss_items=rss_items if display_regions.get("RSS", True) else None, rss_new_items=rss_new_items if (display_regions.get("RSS", True) and display_regions.get("NEW_ITEMS", True)) else None, ai_analysis=ai_analysis if display_regions.get("AI_ANALYSIS", True) else None, display_regions=display_regions, standalone_data=standalone_data if display_regions.get("STANDALONE", False) else None, ) results.append(result) return any(results) if results else False def _send_email( self, report_type: str, html_file_path: Optional[str], ) -> bool: """发送邮件(保持原有逻辑,已支持多收件人) Note: AI 分析内容已在 HTML 生成时嵌入,无需在此传递 """ return send_to_email( from_email=self.config["EMAIL_FROM"], password=self.config["EMAIL_PASSWORD"], to_email=self.config["EMAIL_TO"], report_type=report_type, html_file_path=html_file_path, custom_smtp_server=self.config.get("EMAIL_SMTP_SERVER", ""), custom_smtp_port=self.config.get("EMAIL_SMTP_PORT", ""), get_time_func=self.get_time_func, ) # === RSS 通知方法 === def dispatch_rss( self, rss_items: List[Dict], feeds_info: Optional[Dict[str, str]] = None, proxy_url: Optional[str] = None, html_file_path: Optional[str] = None, ) -> Dict[str, bool]: """ 分发 RSS 通知到所有已配置的渠道 Args: rss_items: RSS 条目列表,每个条目包含: - title: 标题 - feed_id: RSS 源 ID - feed_name: RSS 源名称 - url: 链接 - published_at: 发布时间 - summary: 摘要(可选) - author: 作者(可选) feeds_info: RSS 源 ID 到名称的映射 proxy_url: 代理 URL(可选) html_file_path: HTML 报告文件路径(邮件使用) Returns: Dict[str, bool]: 每个渠道的发送结果 """ if not rss_items: print("[RSS通知] 没有 RSS 内容,跳过通知") return {} results = {} report_type = "RSS 订阅更新" # 飞书 if self.config.get("FEISHU_WEBHOOK_URL"): results["feishu"] = self._send_rss_feishu( rss_items, feeds_info, proxy_url ) # 钉钉 if self.config.get("DINGTALK_WEBHOOK_URL"): results["dingtalk"] = self._send_rss_dingtalk( rss_items, feeds_info, proxy_url ) # 企业微信 if self.config.get("WEWORK_WEBHOOK_URL"): results["wework"] = self._send_rss_markdown( 
rss_items, feeds_info, proxy_url, "wework" ) # Telegram if self.config.get("TELEGRAM_BOT_TOKEN") and self.config.get("TELEGRAM_CHAT_ID"): results["telegram"] = self._send_rss_markdown( rss_items, feeds_info, proxy_url, "telegram" ) # ntfy if self.config.get("NTFY_SERVER_URL") and self.config.get("NTFY_TOPIC"): results["ntfy"] = self._send_rss_markdown( rss_items, feeds_info, proxy_url, "ntfy" ) # Bark if self.config.get("BARK_URL"): results["bark"] = self._send_rss_markdown( rss_items, feeds_info, proxy_url, "bark" ) # Slack if self.config.get("SLACK_WEBHOOK_URL"): results["slack"] = self._send_rss_markdown( rss_items, feeds_info, proxy_url, "slack" ) # 邮件 if ( self.config.get("EMAIL_FROM") and self.config.get("EMAIL_PASSWORD") and self.config.get("EMAIL_TO") ): results["email"] = self._send_email(report_type, html_file_path) return results def _send_rss_feishu( self, rss_items: List[Dict], feeds_info: Optional[Dict[str, str]], proxy_url: Optional[str], ) -> bool: """发送 RSS 到飞书""" import requests content = render_rss_feishu_content( rss_items=rss_items, feeds_info=feeds_info, get_time_func=self.get_time_func, ) webhooks = parse_multi_account_config(self.config["FEISHU_WEBHOOK_URL"]) webhooks = limit_accounts(webhooks, self.max_accounts, "飞书") results = [] for i, webhook_url in enumerate(webhooks): if not webhook_url: continue account_label = f"账号{i+1}" if len(webhooks) > 1 else "" try: # 分批发送 batches = self.split_content_func( content, self.config.get("FEISHU_BATCH_SIZE", 29000) ) for batch_idx, batch_content in enumerate(batches): payload = { "msg_type": "interactive", "card": { "header": { "title": { "tag": "plain_text", "content": f"📰 RSS 订阅更新 {f'({batch_idx + 1}/{len(batches)})' if len(batches) > 1 else ''}", }, "template": "green", }, "elements": [ {"tag": "markdown", "content": batch_content} ], }, } proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None resp = requests.post(webhook_url, json=payload, proxies=proxies, timeout=30) 
resp.raise_for_status() print(f"✅ 飞书{account_label} RSS 通知发送成功") results.append(True) except Exception as e: print(f"❌ 飞书{account_label} RSS 通知发送失败: {e}") results.append(False) return any(results) if results else False def _send_rss_dingtalk( self, rss_items: List[Dict], feeds_info: Optional[Dict[str, str]], proxy_url: Optional[str], ) -> bool: """发送 RSS 到钉钉""" import requests content = render_rss_dingtalk_content( rss_items=rss_items, feeds_info=feeds_info, get_time_func=self.get_time_func, ) webhooks = parse_multi_account_config(self.config["DINGTALK_WEBHOOK_URL"]) webhooks = limit_accounts(webhooks, self.max_accounts, "钉钉") results = [] for i, webhook_url in enumerate(webhooks): if not webhook_url: continue account_label = f"账号{i+1}" if len(webhooks) > 1 else "" try: batches = self.split_content_func( content, self.config.get("DINGTALK_BATCH_SIZE", 20000) ) for batch_idx, batch_content in enumerate(batches): title = f"📰 RSS 订阅更新 {f'({batch_idx + 1}/{len(batches)})' if len(batches) > 1 else ''}" payload = { "msgtype": "markdown", "markdown": { "title": title, "text": batch_content, }, } proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None resp = requests.post(webhook_url, json=payload, proxies=proxies, timeout=30) resp.raise_for_status() print(f"✅ 钉钉{account_label} RSS 通知发送成功") results.append(True) except Exception as e: print(f"❌ 钉钉{account_label} RSS 通知发送失败: {e}") results.append(False) return any(results) if results else False def _send_rss_markdown( self, rss_items: List[Dict], feeds_info: Optional[Dict[str, str]], proxy_url: Optional[str], channel: str, ) -> bool: """发送 RSS 到 Markdown 兼容渠道(企业微信、Telegram、ntfy、Bark、Slack)""" content = render_rss_markdown_content( rss_items=rss_items, feeds_info=feeds_info, get_time_func=self.get_time_func, ) try: if channel == "wework": return self._send_rss_wework(content, proxy_url) elif channel == "telegram": return self._send_rss_telegram(content, proxy_url) elif channel == "ntfy": return 
self._send_rss_ntfy(content, proxy_url) elif channel == "bark": return self._send_rss_bark(content, proxy_url) elif channel == "slack": return self._send_rss_slack(content, proxy_url) except Exception as e: print(f"❌ {channel} RSS 通知发送失败: {e}") return False return False def _send_rss_wework(self, content: str, proxy_url: Optional[str]) -> bool: """发送 RSS 到企业微信""" import requests webhooks = parse_multi_account_config(self.config["WEWORK_WEBHOOK_URL"]) webhooks = limit_accounts(webhooks, self.max_accounts, "企业微信") results = [] for i, webhook_url in enumerate(webhooks): if not webhook_url: continue account_label = f"账号{i+1}" if len(webhooks) > 1 else "" try: batches = self.split_content_func( content, self.config.get("MESSAGE_BATCH_SIZE", 4000) ) for batch_content in batches: payload = { "msgtype": "markdown", "markdown": {"content": batch_content}, } proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None resp = requests.post(webhook_url, json=payload, proxies=proxies, timeout=30) resp.raise_for_status() print(f"✅ 企业微信{account_label} RSS 通知发送成功") results.append(True) except Exception as e: print(f"❌ 企业微信{account_label} RSS 通知发送失败: {e}") results.append(False) return any(results) if results else False def _send_rss_telegram(self, content: str, proxy_url: Optional[str]) -> bool: """发送 RSS 到 Telegram""" import requests tokens = parse_multi_account_config(self.config["TELEGRAM_BOT_TOKEN"]) chat_ids = parse_multi_account_config(self.config["TELEGRAM_CHAT_ID"]) if not tokens or not chat_ids: return False results = [] for i in range(min(len(tokens), len(chat_ids), self.max_accounts)): token = tokens[i] chat_id = chat_ids[i] if not token or not chat_id: continue account_label = f"账号{i+1}" if len(tokens) > 1 else "" try: batches = self.split_content_func( content, self.config.get("MESSAGE_BATCH_SIZE", 4000) ) for batch_content in batches: url = f"https://api.telegram.org/bot{token}/sendMessage" payload = { "chat_id": chat_id, "text": batch_content, 
"parse_mode": "Markdown", } proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None resp = requests.post(url, json=payload, proxies=proxies, timeout=30) resp.raise_for_status() print(f"✅ Telegram{account_label} RSS 通知发送成功") results.append(True) except Exception as e: print(f"❌ Telegram{account_label} RSS 通知发送失败: {e}") results.append(False) return any(results) if results else False def _send_rss_ntfy(self, content: str, proxy_url: Optional[str]) -> bool: """发送 RSS 到 ntfy""" import requests server_url = self.config["NTFY_SERVER_URL"] topics = parse_multi_account_config(self.config["NTFY_TOPIC"]) tokens = parse_multi_account_config(self.config.get("NTFY_TOKEN", "")) if not server_url or not topics: return False topics = limit_accounts(topics, self.max_accounts, "ntfy") results = [] for i, topic in enumerate(topics): if not topic: continue token = tokens[i] if tokens and i < len(tokens) else "" account_label = f"账号{i+1}" if len(topics) > 1 else "" try: batches = self.split_content_func(content, 3800) for batch_content in batches: url = f"{server_url.rstrip('/')}/{topic}" headers = {"Title": "RSS 订阅更新", "Markdown": "yes"} if token: headers["Authorization"] = f"Bearer {token}" proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None resp = requests.post( url, data=batch_content.encode("utf-8"), headers=headers, proxies=proxies, timeout=30 ) resp.raise_for_status() print(f"✅ ntfy{account_label} RSS 通知发送成功") results.append(True) except Exception as e: print(f"❌ ntfy{account_label} RSS 通知发送失败: {e}") results.append(False) return any(results) if results else False def _send_rss_bark(self, content: str, proxy_url: Optional[str]) -> bool: """发送 RSS 到 Bark""" import requests import urllib.parse urls = parse_multi_account_config(self.config["BARK_URL"]) urls = limit_accounts(urls, self.max_accounts, "Bark") results = [] for i, bark_url in enumerate(urls): if not bark_url: continue account_label = f"账号{i+1}" if len(urls) > 1 else "" try: batches = 
self.split_content_func(
                    content, self.config.get("BARK_BATCH_SIZE", 3600)
                )
                for batch_content in batches:
                    title = urllib.parse.quote("📰 RSS 订阅更新")
                    body = urllib.parse.quote(batch_content)
                    url = f"{bark_url.rstrip('/')}/{title}/{body}"
                    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
                    resp = requests.get(url, proxies=proxies, timeout=30)
                    resp.raise_for_status()
                print(f"✅ Bark{account_label} RSS 通知发送成功")
                results.append(True)
            except Exception as e:
                print(f"❌ Bark{account_label} RSS 通知发送失败: {e}")
                results.append(False)
        return any(results) if results else False

    def _send_rss_slack(self, content: str, proxy_url: Optional[str]) -> bool:
        """Send RSS content to Slack."""
        import requests

        webhooks = parse_multi_account_config(self.config["SLACK_WEBHOOK_URL"])
        webhooks = limit_accounts(webhooks, self.max_accounts, "Slack")
        results = []
        for i, webhook_url in enumerate(webhooks):
            if not webhook_url:
                continue
            account_label = f"账号{i+1}" if len(webhooks) > 1 else ""
            try:
                batches = self.split_content_func(
                    content, self.config.get("SLACK_BATCH_SIZE", 4000)
                )
                for batch_content in batches:
                    payload = {
                        "blocks": [
                            {
                                "type": "section",
                                "text": {
                                    "type": "mrkdwn",
                                    "text": batch_content,
                                },
                            }
                        ]
                    }
                    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
                    resp = requests.post(webhook_url, json=payload, proxies=proxies, timeout=30)
                    resp.raise_for_status()
                print(f"✅ Slack{account_label} RSS 通知发送成功")
                results.append(True)
            except Exception as e:
                print(f"❌ Slack{account_label} RSS 通知发送失败: {e}")
                results.append(False)
        return any(results) if results else False


================================================
FILE: trendradar/notification/formatters.py
================================================
# coding=utf-8
"""
Notification content format conversion module.

Provides format conversion between the different push platforms.
"""

import re


def strip_markdown(text: str) -> str:
    """Strip Markdown syntax from text, for pushes to personal WeChat.

    Args:
        text: Text containing Markdown formatting.

    Returns:
        Plain-text content.
    """
    # Convert links: [text](url) -> text url (keeps the URL)
    text = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1 \2', text)

    # Protect URLs first so that the Markdown cleanup below does not damage
    # underscores and similar characters inside links
    protected_urls: list[str] = []

    def _protect_url(match: re.Match) -> str:
        protected_urls.append(match.group(0))
        return f"@@URLTOKEN{len(protected_urls) - 1}@@"

    text = re.sub(r'https?://[^\s<>\]]+', _protect_url, text)

    # Strip bold: **text** or __text__
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)
    # text = re.sub(r'(?…  [the rest of this pattern and the rules up to the image rule are missing from the source]

    # Strip images: ![alt](url) -> alt
    text = re.sub(r'!\[(.+?)\]\(.+?\)', r'\1', text)

    # Strip inline code: `code`
    text = re.sub(r'`(.+?)`', r'\1', text)

    # Strip blockquote markers: >
    text = re.sub(r'^>\s*', '', text, flags=re.MULTILINE)

    # Strip heading markers: #, ##, ### and so on
    text = re.sub(r'^#+\s*', '', text, flags=re.MULTILINE)

    # Strip horizontal rules: --- or ***
    text = re.sub(r'^[\-\*]{3,}\s*$', '', text, flags=re.MULTILINE)

    # Strip HTML tags: <font ...>text</font> -> text
    text = re.sub(r'<font[^>]*>(.+?)</font>', r'\1', text)
    text = re.sub(r'<[^>]+>', '', text)

    # Collapse runs of three or more newlines into two (at most one blank line)
    text = re.sub(r'\n{3,}', '\n\n', text)

    # Restore the URLs protected earlier
    for idx, url in enumerate(protected_urls):
        text = text.replace(f"@@URLTOKEN{idx}@@", url)

    return text.strip()


def convert_markdown_to_mrkdwn(content: str) -> str:
    """
    Convert standard Markdown to Slack's mrkdwn format.

    Conversion rules:
    - **bold** → *bold*
    - [text](url) → <url|text>
    - Other formatting (code blocks, lists, etc.) is left unchanged

    Args:
        content: Content in Markdown format.

    Returns:
        Content in Slack mrkdwn format.
    """
    # 1. Convert links: [text](url) → <url|text>
    content = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<\2|\1>', content)

    # 2. Convert bold: **text** → *text*
    content = re.sub(r'\*\*([^*]+)\*\*', r'*\1*', content)

    return content


================================================
FILE: trendradar/notification/renderer.py
================================================
# coding=utf-8
"""
Notification content rendering module.

Renders notification content for multiple platforms and produces formatted push messages.
"""

from datetime import datetime
from typing import Dict, List, Optional, Callable

from trendradar.report.formatter import format_title_for_platform


# Default region order
DEFAULT_REGION_ORDER = ["hotlist", "rss", "new_items", "standalone", "ai_analysis"]


def render_feishu_content(
    report_data: Dict,
    update_info: Optional[Dict] = None,
    mode: str = "daily",
    separator: str = "---",
    region_order: Optional[List[str]] = None,
    get_time_func: Optional[Callable[[], datetime]] = None,
    rss_items: Optional[list] = None,
    show_new_section: bool = True,
) -> str:
    """Render Feishu notification content (supports merged hotlist + RSS).

    Args:
        report_data: Report data dict containing stats, new_titles, failed_ids, total_new_count.
        update_info: Version update info (optional).
        mode: Report mode ("daily", "incremental", "current").
        separator: Content separator.
        region_order: List giving the display order of the regions.
        get_time_func: Function returning the current time (optional; defaults to datetime.now()).
        rss_items: List of RSS items (optional; used for merged pushes).
        show_new_section: Whether to show the newly added hot news section.

    Returns:
        Formatted Feishu message content.
    """
    if region_order is None:
        region_order = DEFAULT_REGION_ORDER

    # Build the hot-keyword statistics section
    stats_content = ""
    if report_data["stats"]:
        stats_content += "📊 **热点词汇统计**\n\n"
        total_count = len(report_data["stats"])
        for i, stat in enumerate(report_data["stats"]):
            word = stat["word"]
            count = stat["count"]
            sequence_display = f"[{i + 1}/{total_count}]"
            if count >= 10:
                stats_content += f"🔥 {sequence_display} **{word}** : {count} 条\n\n"
            elif count >= 5:
                stats_content += f"📈 {sequence_display} **{word}** : {count} 条\n\n"
            else:
                stats_content += f"📌 {sequence_display} **{word}** : {count} 条\n\n"
            for j, title_data in
enumerate(stat["titles"], 1): formatted_title = format_title_for_platform( "feishu", title_data, show_source=True ) stats_content += f" {j}. {formatted_title}\n" if j < len(stat["titles"]): stats_content += "\n" if i < len(report_data["stats"]) - 1: stats_content += f"\n{separator}\n\n" # 生成新增新闻部分 new_titles_content = "" if show_new_section and report_data["new_titles"]: new_titles_content += ( f"🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" ) for source_data in report_data["new_titles"]: new_titles_content += ( f"**{source_data['source_name']}** ({len(source_data['titles'])} 条):\n" ) for j, title_data in enumerate(source_data["titles"], 1): title_data_copy = title_data.copy() title_data_copy["is_new"] = False formatted_title = format_title_for_platform( "feishu", title_data_copy, show_source=False ) new_titles_content += f" {j}. {formatted_title}\n" new_titles_content += "\n" # RSS 内容 rss_content = "" if rss_items: rss_content = _render_rss_section_feishu(rss_items, separator) # 准备各区域内容映射 region_contents = { "hotlist": stats_content, "new_items": new_titles_content, "rss": rss_content, } # 按 region_order 顺序组装内容 text_content = "" for region in region_order: content = region_contents.get(region, "") if content: if text_content: text_content += f"\n{separator}\n\n" text_content += content if not text_content: if mode == "incremental": mode_text = "增量模式下暂无新增匹配的热点词汇" elif mode == "current": mode_text = "当前榜单模式下暂无匹配的热点词汇" else: mode_text = "暂无匹配的热点词汇" text_content = f"📭 {mode_text}\n\n" if report_data["failed_ids"]: if text_content and "暂无匹配" not in text_content: text_content += f"\n{separator}\n\n" text_content += "⚠️ **数据获取失败的平台:**\n\n" for i, id_value in enumerate(report_data["failed_ids"], 1): text_content += f" • {id_value}\n" # 获取当前时间 now = get_time_func() if get_time_func else datetime.now() text_content += ( f"\n\n更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" ) if update_info: text_content += f"\nTrendRadar 发现新版本 {update_info['remote_version']},当前 
{update_info['current_version']}" return text_content def render_dingtalk_content( report_data: Dict, update_info: Optional[Dict] = None, mode: str = "daily", region_order: Optional[List[str]] = None, get_time_func: Optional[Callable[[], datetime]] = None, rss_items: Optional[list] = None, show_new_section: bool = True, ) -> str: """渲染钉钉通知内容(支持热榜+RSS合并) Args: report_data: 报告数据字典,包含 stats, new_titles, failed_ids, total_new_count update_info: 版本更新信息(可选) mode: 报告模式 ("daily", "incremental", "current") region_order: 区域显示顺序列表 get_time_func: 获取当前时间的函数(可选,默认使用 datetime.now()) rss_items: RSS 条目列表(可选,用于合并推送) show_new_section: 是否显示新增热点区域 Returns: 格式化的钉钉消息内容 """ if region_order is None: region_order = DEFAULT_REGION_ORDER total_titles = sum( len(stat["titles"]) for stat in report_data["stats"] if stat["count"] > 0 ) now = get_time_func() if get_time_func else datetime.now() # 头部信息 header_content = f"**总新闻数:** {total_titles}\n\n" header_content += f"**时间:** {now.strftime('%Y-%m-%d %H:%M:%S')}\n\n" header_content += "**类型:** 热点分析报告\n\n" header_content += "---\n\n" # 生成热点词汇统计部分 stats_content = "" if report_data["stats"]: stats_content += "📊 **热点词汇统计**\n\n" total_count = len(report_data["stats"]) for i, stat in enumerate(report_data["stats"]): word = stat["word"] count = stat["count"] sequence_display = f"[{i + 1}/{total_count}]" if count >= 10: stats_content += f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" elif count >= 5: stats_content += f"📈 {sequence_display} **{word}** : **{count}** 条\n\n" else: stats_content += f"📌 {sequence_display} **{word}** : {count} 条\n\n" for j, title_data in enumerate(stat["titles"], 1): formatted_title = format_title_for_platform( "dingtalk", title_data, show_source=True ) stats_content += f" {j}. 
{formatted_title}\n" if j < len(stat["titles"]): stats_content += "\n" if i < len(report_data["stats"]) - 1: stats_content += "\n---\n\n" # 生成新增新闻部分 new_titles_content = "" if show_new_section and report_data["new_titles"]: new_titles_content += ( f"🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" ) for source_data in report_data["new_titles"]: new_titles_content += f"**{source_data['source_name']}** ({len(source_data['titles'])} 条):\n\n" for j, title_data in enumerate(source_data["titles"], 1): title_data_copy = title_data.copy() title_data_copy["is_new"] = False formatted_title = format_title_for_platform( "dingtalk", title_data_copy, show_source=False ) new_titles_content += f" {j}. {formatted_title}\n" new_titles_content += "\n" # RSS 内容 rss_content = "" if rss_items: rss_content = _render_rss_section_markdown(rss_items) # 准备各区域内容映射 region_contents = { "hotlist": stats_content, "new_items": new_titles_content, "rss": rss_content, } # 按 region_order 顺序组装内容 text_content = header_content has_content = False for region in region_order: content = region_contents.get(region, "") if content: if has_content: text_content += "\n---\n\n" text_content += content has_content = True if not has_content: if mode == "incremental": mode_text = "增量模式下暂无新增匹配的热点词汇" elif mode == "current": mode_text = "当前榜单模式下暂无匹配的热点词汇" else: mode_text = "暂无匹配的热点词汇" text_content += f"📭 {mode_text}\n\n" if report_data["failed_ids"]: if "暂无匹配" not in text_content: text_content += "\n---\n\n" text_content += "⚠️ **数据获取失败的平台:**\n\n" for i, id_value in enumerate(report_data["failed_ids"], 1): text_content += f" • **{id_value}**\n" text_content += f"\n\n> 更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" if update_info: text_content += f"\n> TrendRadar 发现新版本 **{update_info['remote_version']}**,当前 **{update_info['current_version']}**" return text_content def render_rss_feishu_content( rss_items: list, feeds_info: Optional[Dict] = None, separator: str = "---", get_time_func: Optional[Callable[[], 
datetime]] = None, ) -> str: """渲染 RSS 飞书通知内容 Args: rss_items: RSS 条目列表,每个条目包含: - title: 标题 - feed_id: RSS 源 ID - feed_name: RSS 源名称 - url: 链接 - published_at: 发布时间 - summary: 摘要(可选) - author: 作者(可选) feeds_info: RSS 源 ID 到名称的映射 separator: 内容分隔符 get_time_func: 获取当前时间的函数(可选) Returns: 格式化的飞书消息内容 """ if not rss_items: now = get_time_func() if get_time_func else datetime.now() return f"📭 暂无新的 RSS 订阅内容\n\n更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" # 按 feed_id 分组 feeds_map: Dict[str, list] = {} for item in rss_items: feed_id = item.get("feed_id", "unknown") if feed_id not in feeds_map: feeds_map[feed_id] = [] feeds_map[feed_id].append(item) text_content = f"📰 **RSS 订阅更新** (共 {len(rss_items)} 条)\n\n" text_content += f"{separator}\n\n" for feed_id, items in feeds_map.items(): feed_name = items[0].get("feed_name", feed_id) if items else feed_id if feeds_info and feed_id in feeds_info: feed_name = feeds_info[feed_id] text_content += f"**{feed_name}** ({len(items)} 条)\n\n" for i, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") published_at = item.get("published_at", "") if url: text_content += f" {i}. [{title}]({url})" else: text_content += f" {i}. 
{title}" if published_at: text_content += f" - {published_at}" text_content += "\n" if i < len(items): text_content += "\n" text_content += f"\n{separator}\n\n" now = get_time_func() if get_time_func else datetime.now() text_content += f"更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" return text_content def render_rss_dingtalk_content( rss_items: list, feeds_info: Optional[Dict] = None, get_time_func: Optional[Callable[[], datetime]] = None, ) -> str: """渲染 RSS 钉钉通知内容 Args: rss_items: RSS 条目列表 feeds_info: RSS 源 ID 到名称的映射 get_time_func: 获取当前时间的函数(可选) Returns: 格式化的钉钉消息内容 """ now = get_time_func() if get_time_func else datetime.now() if not rss_items: return f"📭 暂无新的 RSS 订阅内容\n\n> 更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" # 按 feed_id 分组 feeds_map: Dict[str, list] = {} for item in rss_items: feed_id = item.get("feed_id", "unknown") if feed_id not in feeds_map: feeds_map[feed_id] = [] feeds_map[feed_id].append(item) # 头部信息 text_content = f"**总条目数:** {len(rss_items)}\n\n" text_content += f"**时间:** {now.strftime('%Y-%m-%d %H:%M:%S')}\n\n" text_content += "**类型:** RSS 订阅更新\n\n" text_content += "---\n\n" for feed_id, items in feeds_map.items(): feed_name = items[0].get("feed_name", feed_id) if items else feed_id if feeds_info and feed_id in feeds_info: feed_name = feeds_info[feed_id] text_content += f"📰 **{feed_name}** ({len(items)} 条)\n\n" for i, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") published_at = item.get("published_at", "") if url: text_content += f" {i}. [{title}]({url})" else: text_content += f" {i}. 
{title}" if published_at: text_content += f" - {published_at}" text_content += "\n" if i < len(items): text_content += "\n" text_content += "\n---\n\n" text_content += f"> 更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" return text_content def render_rss_markdown_content( rss_items: list, feeds_info: Optional[Dict] = None, get_time_func: Optional[Callable[[], datetime]] = None, ) -> str: """渲染 RSS 通用 Markdown 格式内容(企业微信、Bark、ntfy、Slack) Args: rss_items: RSS 条目列表 feeds_info: RSS 源 ID 到名称的映射 get_time_func: 获取当前时间的函数(可选) Returns: 格式化的 Markdown 消息内容 """ now = get_time_func() if get_time_func else datetime.now() if not rss_items: return f"📭 暂无新的 RSS 订阅内容\n\n更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" # 按 feed_id 分组 feeds_map: Dict[str, list] = {} for item in rss_items: feed_id = item.get("feed_id", "unknown") if feed_id not in feeds_map: feeds_map[feed_id] = [] feeds_map[feed_id].append(item) text_content = f"📰 **RSS 订阅更新** (共 {len(rss_items)} 条)\n\n" for feed_id, items in feeds_map.items(): feed_name = items[0].get("feed_name", feed_id) if items else feed_id if feeds_info and feed_id in feeds_info: feed_name = feeds_info[feed_id] text_content += f"**{feed_name}** ({len(items)} 条)\n" for i, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") published_at = item.get("published_at", "") if url: text_content += f" {i}. [{title}]({url})" else: text_content += f" {i}. 
{title}" if published_at: text_content += f" `{published_at}`" text_content += "\n" text_content += "\n" text_content += f"更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" return text_content # === RSS 内容渲染辅助函数(用于合并推送) === def _render_rss_section_feishu(rss_items: list, separator: str = "---") -> str: """渲染 RSS 内容区块(飞书格式,用于合并推送)""" if not rss_items: return "" # 按 feed_id 分组 feeds_map: Dict[str, list] = {} for item in rss_items: feed_id = item.get("feed_id", "unknown") if feed_id not in feeds_map: feeds_map[feed_id] = [] feeds_map[feed_id].append(item) text_content = f"📰 **RSS 订阅更新** (共 {len(rss_items)} 条)\n\n" for feed_id, items in feeds_map.items(): feed_name = items[0].get("feed_name", feed_id) if items else feed_id text_content += f"**{feed_name}** ({len(items)} 条)\n\n" for i, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") published_at = item.get("published_at", "") if url: text_content += f" {i}. [{title}]({url})" else: text_content += f" {i}. {title}" if published_at: text_content += f" - {published_at}" text_content += "\n" if i < len(items): text_content += "\n" text_content += "\n" return text_content.rstrip("\n") def _render_rss_section_markdown(rss_items: list) -> str: """渲染 RSS 内容区块(通用 Markdown 格式,用于合并推送)""" if not rss_items: return "" # 按 feed_id 分组 feeds_map: Dict[str, list] = {} for item in rss_items: feed_id = item.get("feed_id", "unknown") if feed_id not in feeds_map: feeds_map[feed_id] = [] feeds_map[feed_id].append(item) text_content = f"📰 **RSS 订阅更新** (共 {len(rss_items)} 条)\n\n" for feed_id, items in feeds_map.items(): feed_name = items[0].get("feed_name", feed_id) if items else feed_id text_content += f"**{feed_name}** ({len(items)} 条)\n" for i, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") published_at = item.get("published_at", "") if url: text_content += f" {i}. [{title}]({url})" else: text_content += f" {i}. 
{title}" if published_at: text_content += f" `{published_at}`" text_content += "\n" text_content += "\n" return text_content.rstrip("\n") ================================================ FILE: trendradar/notification/senders.py ================================================ # coding=utf-8 """ 消息发送器模块 将报告数据发送到各种通知渠道: - 飞书 (Feishu/Lark) - 钉钉 (DingTalk) - 企业微信 (WeCom/WeWork) - Telegram - 邮件 (Email) - ntfy - Bark - Slack 每个发送函数都支持分批发送,并通过参数化配置实现与 CONFIG 的解耦。 """ import smtplib import time import json from datetime import datetime from email.header import Header from email.mime.multipart import MIMEMultipart from email.mime.text import MIMEText from email.utils import formataddr, formatdate, make_msgid from pathlib import Path from typing import Any, Callable, Dict, Optional from urllib.parse import urlparse import requests from .batch import add_batch_headers, get_max_batch_header_size from .formatters import convert_markdown_to_mrkdwn, strip_markdown def _render_ai_analysis(ai_analysis: Any, channel: str) -> str: """渲染 AI 分析内容为指定渠道格式""" if not ai_analysis: return "" try: from trendradar.ai.formatter import get_ai_analysis_renderer renderer = get_ai_analysis_renderer(channel) return renderer(ai_analysis) except ImportError: return "" # === SMTP 邮件配置 === SMTP_CONFIGS = { # Gmail(使用 STARTTLS) "gmail.com": {"server": "smtp.gmail.com", "port": 587, "encryption": "TLS"}, # QQ邮箱(使用 SSL,更稳定) "qq.com": {"server": "smtp.qq.com", "port": 465, "encryption": "SSL"}, # Outlook(使用 STARTTLS) "outlook.com": {"server": "smtp-mail.outlook.com", "port": 587, "encryption": "TLS"}, "hotmail.com": {"server": "smtp-mail.outlook.com", "port": 587, "encryption": "TLS"}, "live.com": {"server": "smtp-mail.outlook.com", "port": 587, "encryption": "TLS"}, # 网易邮箱(使用 SSL,更稳定) "163.com": {"server": "smtp.163.com", "port": 465, "encryption": "SSL"}, "126.com": {"server": "smtp.126.com", "port": 465, "encryption": "SSL"}, # 新浪邮箱(使用 SSL) "sina.com": {"server": "smtp.sina.com", "port": 465, 
"encryption": "SSL"}, # 搜狐邮箱(使用 SSL) "sohu.com": {"server": "smtp.sohu.com", "port": 465, "encryption": "SSL"}, # 天翼邮箱(使用 SSL) "189.cn": {"server": "smtp.189.cn", "port": 465, "encryption": "SSL"}, # 阿里云邮箱(使用 TLS) "aliyun.com": {"server": "smtp.aliyun.com", "port": 465, "encryption": "TLS"}, # Yandex邮箱(使用 TLS) "yandex.com": {"server": "smtp.yandex.com", "port": 465, "encryption": "TLS"}, # iCloud邮箱(使用 SSL) "icloud.com": {"server": "smtp.mail.me.com", "port": 587, "encryption": "SSL"}, } def send_to_feishu( webhook_url: str, report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 29000, batch_interval: float = 1.0, split_content_func: Callable = None, get_time_func: Callable = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到飞书(支持分批发送,支持热榜+RSS合并+独立展示区) Args: webhook_url: 飞书 Webhook URL report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) split_content_func: 内容分批函数 get_time_func: 获取当前时间的函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ headers = {"Content-Type": "application/json"} proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 日志前缀 log_prefix = f"飞书{account_label}" if account_label else "飞书" # 渲染 AI 分析内容(如果有) ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "feishu") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": 
getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 预留批次头部空间,避免添加头部后超限 header_reserve = get_max_batch_header_size("feishu") batches = split_content_func( report_data, "feishu", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = add_batch_headers(batches, "feishu", batch_size) print(f"{log_prefix}消息分为 {len(batches)} 批次发送 [{report_type}]") # 逐批发送 for i, batch_content in enumerate(batches, 1): content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 {i}/{len(batches)} 批次,大小:{content_size} 字节 [{report_type}]" ) # 飞书 webhook 只显示 content.text,所有信息都整合到 text 中 payload = { "msg_type": "interactive", "content": { "text": batch_content, }, } try: response = requests.post( webhook_url, headers=headers, json=payload, proxies=proxies, timeout=30 ) if response.status_code == 200: result = response.json() # 检查飞书的响应状态 if result.get("StatusCode") == 0 or result.get("code") == 0: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送成功 [{report_type}]") # 批次间间隔 if i < len(batches): time.sleep(batch_interval) else: error_msg = result.get("msg") or result.get("StatusMessage", "未知错误") print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],错误:{error_msg}" ) return False else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],状态码:{response.status_code}" ) return False except Exception as e: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送出错 [{report_type}]:{e}") return False print(f"{log_prefix}所有 {len(batches)} 批次发送完成 [{report_type}]") return True def send_to_dingtalk( webhook_url: str, report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 20000, 
batch_interval: float = 1.0, split_content_func: Callable = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到钉钉(支持分批发送,支持热榜+RSS合并+独立展示区) Args: webhook_url: 钉钉 Webhook URL report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ headers = {"Content-Type": "application/json"} proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 日志前缀 log_prefix = f"钉钉{account_label}" if account_label else "钉钉" # 渲染 AI 分析内容(如果有) ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "dingtalk") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 预留批次头部空间,避免添加头部后超限 header_reserve = get_max_batch_header_size("dingtalk") batches = split_content_func( report_data, "dingtalk", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = add_batch_headers(batches, "dingtalk", batch_size) print(f"{log_prefix}消息分为 {len(batches)} 批次发送 [{report_type}]") # 逐批发送 for i, batch_content in enumerate(batches, 1): content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 
{i}/{len(batches)} 批次,大小:{content_size} 字节 [{report_type}]" ) payload = { "msgtype": "markdown", "markdown": { "title": f"TrendRadar 热点分析报告 - {report_type}", "text": batch_content, }, } try: response = requests.post( webhook_url, headers=headers, json=payload, proxies=proxies, timeout=30 ) if response.status_code == 200: result = response.json() if result.get("errcode") == 0: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送成功 [{report_type}]") # 批次间间隔 if i < len(batches): time.sleep(batch_interval) else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],错误:{result.get('errmsg')}" ) return False else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],状态码:{response.status_code}" ) return False except Exception as e: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送出错 [{report_type}]:{e}") return False print(f"{log_prefix}所有 {len(batches)} 批次发送完成 [{report_type}]") return True def send_to_wework( webhook_url: str, report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 4000, batch_interval: float = 1.0, msg_type: str = "markdown", split_content_func: Callable = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到企业微信(支持分批发送,支持 markdown 和 text 两种格式,支持热榜+RSS合并+独立展示区) Args: webhook_url: 企业微信 Webhook URL report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) msg_type: 消息类型 (markdown/text) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ headers = {"Content-Type": "application/json"} proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 日志前缀 log_prefix = 
f"企业微信{account_label}" if account_label else "企业微信" # 获取消息类型配置(markdown 或 text) is_text_mode = msg_type.lower() == "text" if is_text_mode: print(f"{log_prefix}使用 text 格式(个人微信模式)[{report_type}]") else: print(f"{log_prefix}使用 markdown 格式(群机器人模式)[{report_type}]") # text 模式使用 wework_text,markdown 模式使用 wework header_format_type = "wework_text" if is_text_mode else "wework" # 渲染 AI 分析内容(如果有) ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "wework") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 获取分批内容,预留批次头部空间 header_reserve = get_max_batch_header_size(header_format_type) batches = split_content_func( report_data, "wework", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = add_batch_headers(batches, header_format_type, batch_size) print(f"{log_prefix}消息分为 {len(batches)} 批次发送 [{report_type}]") # 逐批发送 for i, batch_content in enumerate(batches, 1): # 根据消息类型构建 payload if is_text_mode: # text 格式:去除 markdown 语法 plain_content = strip_markdown(batch_content) payload = {"msgtype": "text", "text": {"content": plain_content}} content_size = len(plain_content.encode("utf-8")) else: # markdown 格式:保持原样 payload = {"msgtype": "markdown", "markdown": {"content": batch_content}} content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 {i}/{len(batches)} 批次,大小:{content_size} 字节 [{report_type}]" ) try: response = requests.post( webhook_url, headers=headers, 
json=payload, proxies=proxies, timeout=30 ) if response.status_code == 200: result = response.json() if result.get("errcode") == 0: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送成功 [{report_type}]") # 批次间间隔 if i < len(batches): time.sleep(batch_interval) else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],错误:{result.get('errmsg')}" ) return False else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],状态码:{response.status_code}" ) return False except Exception as e: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送出错 [{report_type}]:{e}") return False print(f"{log_prefix}所有 {len(batches)} 批次发送完成 [{report_type}]") return True def send_to_telegram( bot_token: str, chat_id: str, report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 4000, batch_interval: float = 1.0, split_content_func: Callable = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到 Telegram(支持分批发送,支持热榜+RSS合并+独立展示区) Args: bot_token: Telegram Bot Token chat_id: Telegram Chat ID report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ headers = {"Content-Type": "application/json"} url = f"https://api.telegram.org/bot{bot_token}/sendMessage" proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 日志前缀 log_prefix = f"Telegram{account_label}" if account_label else "Telegram" # 渲染 AI 分析内容(如果有) ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "telegram") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if 
getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 获取分批内容,预留批次头部空间 header_reserve = get_max_batch_header_size("telegram") batches = split_content_func( report_data, "telegram", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = add_batch_headers(batches, "telegram", batch_size) print(f"{log_prefix}消息分为 {len(batches)} 批次发送 [{report_type}]") # 逐批发送 for i, batch_content in enumerate(batches, 1): content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 {i}/{len(batches)} 批次,大小:{content_size} 字节 [{report_type}]" ) payload = { "chat_id": chat_id, "text": batch_content, "parse_mode": "HTML", "disable_web_page_preview": True, } try: response = requests.post( url, headers=headers, json=payload, proxies=proxies, timeout=30 ) if response.status_code == 200: result = response.json() if result.get("ok"): print(f"{log_prefix}第 {i}/{len(batches)} 批次发送成功 [{report_type}]") # 批次间间隔 if i < len(batches): time.sleep(batch_interval) else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],错误:{result.get('description')}" ) return False else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],状态码:{response.status_code}" ) return False except Exception as e: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送出错 [{report_type}]:{e}") return False print(f"{log_prefix}所有 {len(batches)} 批次发送完成 [{report_type}]") return True def send_to_email( from_email: str, password: str, to_email: str, report_type: str, html_file_path: str, 
custom_smtp_server: Optional[str] = None, custom_smtp_port: Optional[int] = None, *, get_time_func: Callable = None, ) -> bool: """ 发送邮件通知 Args: from_email: 发件人邮箱 password: 邮箱密码/授权码 to_email: 收件人邮箱(多个用逗号分隔) report_type: 报告类型 html_file_path: HTML 报告文件路径 custom_smtp_server: 自定义 SMTP 服务器(可选) custom_smtp_port: 自定义 SMTP 端口(可选) get_time_func: 获取当前时间的函数 Returns: bool: 发送是否成功 Note: AI 分析内容已在 HTML 生成时嵌入,无需再追加 """ try: if not html_file_path or not Path(html_file_path).exists(): print(f"错误:HTML文件不存在或未提供: {html_file_path}") return False print(f"使用HTML文件: {html_file_path}") with open(html_file_path, "r", encoding="utf-8") as f: html_content = f.read() domain = from_email.split("@")[-1].lower() if custom_smtp_server and custom_smtp_port: # 使用自定义 SMTP 配置 smtp_server = custom_smtp_server smtp_port = int(custom_smtp_port) # 根据端口判断加密方式:465=SSL, 587=TLS if smtp_port == 465: use_tls = False # SSL 模式(SMTP_SSL) elif smtp_port == 587: use_tls = True # TLS 模式(STARTTLS) else: # 其他端口优先尝试 TLS(更安全,更广泛支持) use_tls = True elif domain in SMTP_CONFIGS: # 使用预设配置 config = SMTP_CONFIGS[domain] smtp_server = config["server"] smtp_port = config["port"] use_tls = config["encryption"] == "TLS" else: print(f"未识别的邮箱服务商: {domain},使用通用 SMTP 配置") smtp_server = f"smtp.{domain}" smtp_port = 587 use_tls = True msg = MIMEMultipart("alternative") # 严格按照 RFC 标准设置 From header sender_name = "TrendRadar" msg["From"] = formataddr((sender_name, from_email)) # 设置收件人 recipients = [addr.strip() for addr in to_email.split(",")] if len(recipients) == 1: msg["To"] = recipients[0] else: msg["To"] = ", ".join(recipients) # 设置邮件主题 now = get_time_func() if get_time_func else datetime.now() subject = f"TrendRadar 热点分析报告 - {report_type} - {now.strftime('%m月%d日 %H:%M')}" msg["Subject"] = Header(subject, "utf-8") # 设置其他标准 header msg["MIME-Version"] = "1.0" msg["Date"] = formatdate(localtime=True) msg["Message-ID"] = make_msgid() # 添加纯文本部分(作为备选) text_content = f""" TrendRadar 热点分析报告 ======================== 报告类型:{report_type} 
生成时间:{now.strftime('%Y-%m-%d %H:%M:%S')} 请使用支持HTML的邮件客户端查看完整报告内容。 """ text_part = MIMEText(text_content, "plain", "utf-8") msg.attach(text_part) html_part = MIMEText(html_content, "html", "utf-8") msg.attach(html_part) print(f"正在发送邮件到 {to_email}...") print(f"SMTP 服务器: {smtp_server}:{smtp_port}") print(f"发件人: {from_email}") try: if use_tls: # TLS 模式 server = smtplib.SMTP(smtp_server, smtp_port, timeout=30) server.set_debuglevel(0) # 设为1可以查看详细调试信息 server.ehlo() server.starttls() server.ehlo() else: # SSL 模式 server = smtplib.SMTP_SSL(smtp_server, smtp_port, timeout=30) server.set_debuglevel(0) server.ehlo() # 登录 server.login(from_email, password) # 发送邮件 server.send_message(msg) server.quit() print(f"邮件发送成功 [{report_type}] -> {to_email}") return True except smtplib.SMTPServerDisconnected: print("邮件发送失败:服务器意外断开连接,请检查网络或稍后重试") return False except smtplib.SMTPAuthenticationError as e: print("邮件发送失败:认证错误,请检查邮箱和密码/授权码") print(f"详细错误: {str(e)}") return False except smtplib.SMTPRecipientsRefused as e: print(f"邮件发送失败:收件人地址被拒绝 {e}") return False except smtplib.SMTPSenderRefused as e: print(f"邮件发送失败:发件人地址被拒绝 {e}") return False except smtplib.SMTPDataError as e: print(f"邮件发送失败:邮件数据错误 {e}") return False except smtplib.SMTPConnectError as e: print(f"邮件发送失败:无法连接到 SMTP 服务器 {smtp_server}:{smtp_port}") print(f"详细错误: {str(e)}") return False except Exception as e: print(f"邮件发送失败 [{report_type}]:{e}") import traceback traceback.print_exc() return False def send_to_ntfy( server_url: str, topic: str, token: Optional[str], report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 3800, split_content_func: Callable = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到 ntfy(支持分批发送,严格遵守4KB限制,支持热榜+RSS合并+独立展示区) Args: server_url: ntfy 
服务器 URL topic: ntfy 主题 token: ntfy 访问令牌(可选) report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ # 日志前缀 log_prefix = f"ntfy{account_label}" if account_label else "ntfy" # 避免 HTTP header 编码问题 report_type_en_map = { "全天汇总": "Daily Summary", "当前榜单": "Current Ranking", "增量分析": "Incremental Update", "通知连通性测试": "Notification Test", } report_type_en = report_type_en_map.get(report_type, "News Report") headers = { "Content-Type": "text/plain; charset=utf-8", "Markdown": "yes", "Title": report_type_en, "Priority": "default", "Tags": "news", } if token: headers["Authorization"] = f"Bearer {token}" # 构建完整URL,确保格式正确 base_url = server_url.rstrip("/") if not base_url.startswith(("http://", "https://")): base_url = f"https://{base_url}" url = f"{base_url}/{topic}" proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 渲染 AI 分析内容(如果有),合并到主内容中 ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "ntfy") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 获取分批内容,预留批次头部空间 header_reserve = get_max_batch_header_size("ntfy") batches = split_content_func( report_data, "ntfy", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = 
add_batch_headers(batches, "ntfy", batch_size) total_batches = len(batches) print(f"{log_prefix}消息分为 {total_batches} 批次发送 [{report_type}]") # 反转批次顺序,使得在ntfy客户端显示时顺序正确 # ntfy显示最新消息在上面,所以我们从最后一批开始推送 reversed_batches = list(reversed(batches)) print(f"{log_prefix}将按反向顺序推送(最后批次先推送),确保客户端显示顺序正确") # 逐批发送(反向顺序) success_count = 0 for idx, batch_content in enumerate(reversed_batches, 1): # 计算正确的批次编号(用户视角的编号) actual_batch_num = total_batches - idx + 1 content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 {actual_batch_num}/{total_batches} 批次(推送顺序: {idx}/{total_batches}),大小:{content_size} 字节 [{report_type}]" ) # 检查消息大小,确保不超过4KB if content_size > 4096: print(f"警告:{log_prefix}第 {actual_batch_num} 批次消息过大({content_size} 字节),可能被拒绝") # 更新 headers 的批次标识 current_headers = headers.copy() if total_batches > 1: current_headers["Title"] = f"{report_type_en} ({actual_batch_num}/{total_batches})" try: response = requests.post( url, headers=current_headers, data=batch_content.encode("utf-8"), proxies=proxies, timeout=30, ) if response.status_code == 200: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次发送成功 [{report_type}]") success_count += 1 if idx < total_batches: # 公共服务器建议 2-3 秒,自托管可以更短 interval = 2 if "ntfy.sh" in server_url else 1 time.sleep(interval) elif response.status_code == 429: print( f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次速率限制 [{report_type}],等待后重试" ) time.sleep(10) # 等待10秒后重试 # 重试一次 retry_response = requests.post( url, headers=current_headers, data=batch_content.encode("utf-8"), proxies=proxies, timeout=30, ) if retry_response.status_code == 200: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次重试成功 [{report_type}]") success_count += 1 else: print( f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次重试失败,状态码:{retry_response.status_code}" ) elif response.status_code == 413: print( f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次消息过大被拒绝 [{report_type}],消息大小:{content_size} 字节" ) else: print( f"{log_prefix}第 
{actual_batch_num}/{total_batches} 批次发送失败 [{report_type}],状态码:{response.status_code}" ) try: print(f"错误详情:{response.text}") except: pass except requests.exceptions.ConnectTimeout: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次连接超时 [{report_type}]") except requests.exceptions.ReadTimeout: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次读取超时 [{report_type}]") except requests.exceptions.ConnectionError as e: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次连接错误 [{report_type}]:{e}") except Exception as e: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次发送异常 [{report_type}]:{e}") # 判断整体发送是否成功 if success_count == total_batches: print(f"{log_prefix}所有 {total_batches} 批次发送完成 [{report_type}]") elif success_count > 0: print(f"{log_prefix}部分发送成功:{success_count}/{total_batches} 批次 [{report_type}]") else: print(f"{log_prefix}发送完全失败 [{report_type}]") return False return True def send_to_bark( bark_url: str, report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 3600, batch_interval: float = 1.0, split_content_func: Callable = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到 Bark(支持分批发送,使用 markdown 格式,支持热榜+RSS合并+独立展示区) Args: bark_url: Bark URL(包含 device_key) report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ # 日志前缀 log_prefix = f"Bark{account_label}" if account_label else "Bark" proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 解析 Bark URL,提取 device_key 和 API 端点 # Bark URL 格式: 
https://api.day.app/device_key 或 https://bark.day.app/device_key parsed_url = urlparse(bark_url) device_key = parsed_url.path.strip('/').split('/')[0] if parsed_url.path else None if not device_key: print(f"{log_prefix} URL 格式错误,无法提取 device_key: {bark_url}") return False # 构建正确的 API 端点 api_endpoint = f"{parsed_url.scheme}://{parsed_url.netloc}/push" # 渲染 AI 分析内容(如果有),合并到主内容中 ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "bark") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 获取分批内容,预留批次头部空间 header_reserve = get_max_batch_header_size("bark") batches = split_content_func( report_data, "bark", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = add_batch_headers(batches, "bark", batch_size) total_batches = len(batches) print(f"{log_prefix}消息分为 {total_batches} 批次发送 [{report_type}]") # 反转批次顺序,使得在Bark客户端显示时顺序正确 # Bark显示最新消息在上面,所以我们从最后一批开始推送 reversed_batches = list(reversed(batches)) print(f"{log_prefix}将按反向顺序推送(最后批次先推送),确保客户端显示顺序正确") # 逐批发送(反向顺序) success_count = 0 for idx, batch_content in enumerate(reversed_batches, 1): # 计算正确的批次编号(用户视角的编号) actual_batch_num = total_batches - idx + 1 content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 {actual_batch_num}/{total_batches} 批次(推送顺序: {idx}/{total_batches}),大小:{content_size} 字节 [{report_type}]" ) # 检查消息大小(Bark使用APNs,限制4KB) if content_size > 4096: print( f"警告:{log_prefix}第 
{actual_batch_num}/{total_batches} 批次消息过大({content_size} 字节),可能被拒绝" ) # 构建JSON payload payload = { "title": report_type, "markdown": batch_content, "device_key": device_key, "sound": "default", "group": "TrendRadar", "action": "none", # 点击推送跳到 APP 不弹出弹框,方便阅读 } try: response = requests.post( api_endpoint, json=payload, proxies=proxies, timeout=30, ) if response.status_code == 200: result = response.json() if result.get("code") == 200: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次发送成功 [{report_type}]") success_count += 1 # 批次间间隔 if idx < total_batches: time.sleep(batch_interval) else: print( f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次发送失败 [{report_type}],错误:{result.get('message', '未知错误')}" ) else: print( f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次发送失败 [{report_type}],状态码:{response.status_code}" ) try: print(f"错误详情:{response.text}") except: pass except requests.exceptions.ConnectTimeout: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次连接超时 [{report_type}]") except requests.exceptions.ReadTimeout: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次读取超时 [{report_type}]") except requests.exceptions.ConnectionError as e: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次连接错误 [{report_type}]:{e}") except Exception as e: print(f"{log_prefix}第 {actual_batch_num}/{total_batches} 批次发送异常 [{report_type}]:{e}") # 判断整体发送是否成功 if success_count == total_batches: print(f"{log_prefix}所有 {total_batches} 批次发送完成 [{report_type}]") elif success_count > 0: print(f"{log_prefix}部分发送成功:{success_count}/{total_batches} 批次 [{report_type}]") else: print(f"{log_prefix}发送完全失败 [{report_type}]") return False return True def send_to_slack( webhook_url: str, report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 4000, batch_interval: float = 1.0, split_content_func: Callable = None, rss_items: Optional[list] = None, 
rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到 Slack(支持分批发送,使用 mrkdwn 格式,支持热榜+RSS合并+独立展示区) Args: webhook_url: Slack Webhook URL report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ headers = {"Content-Type": "application/json"} proxies = None if proxy_url: proxies = {"http": proxy_url, "https": proxy_url} # 日志前缀 log_prefix = f"Slack{account_label}" if account_label else "Slack" # 渲染 AI 分析内容(如果有),合并到主内容中 ai_content = None ai_stats = None if ai_analysis: ai_content = _render_ai_analysis(ai_analysis, "slack") # 提取 AI 分析统计数据(只要 AI 分析成功就显示) if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), "ai_mode": getattr(ai_analysis, "ai_mode", ""), } # 获取分批内容,预留批次头部空间 header_reserve = get_max_batch_header_size("slack") batches = split_content_func( report_data, "slack", update_info, max_bytes=batch_size - header_reserve, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部(已预留空间,不会超限) batches = add_batch_headers(batches, "slack", batch_size) print(f"{log_prefix}消息分为 {len(batches)} 批次发送 [{report_type}]") # 逐批发送 for i, batch_content in enumerate(batches, 1): # 转换 Markdown 到 mrkdwn 格式 mrkdwn_content = convert_markdown_to_mrkdwn(batch_content) content_size = len(mrkdwn_content.encode("utf-8")) print( f"发送{log_prefix}第 
{i}/{len(batches)} 批次,大小:{content_size} 字节 [{report_type}]" ) # 构建 Slack payload(使用简单的 text 字段,支持 mrkdwn) payload = {"text": mrkdwn_content} try: response = requests.post( webhook_url, headers=headers, json=payload, proxies=proxies, timeout=30 ) # Slack Incoming Webhooks 成功时返回 "ok" 文本 if response.status_code == 200 and response.text == "ok": print(f"{log_prefix}第 {i}/{len(batches)} 批次发送成功 [{report_type}]") # 批次间间隔 if i < len(batches): time.sleep(batch_interval) else: error_msg = response.text if response.text else f"状态码:{response.status_code}" print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],错误:{error_msg}" ) return False except Exception as e: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送出错 [{report_type}]:{e}") return False print(f"{log_prefix}所有 {len(batches)} 批次发送完成 [{report_type}]") return True def send_to_generic_webhook( webhook_url: str, payload_template: Optional[str], report_data: Dict, report_type: str, update_info: Optional[Dict] = None, proxy_url: Optional[str] = None, mode: str = "daily", account_label: str = "", *, batch_size: int = 4000, batch_interval: float = 1.0, split_content_func: Optional[Callable] = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, ai_analysis: Any = None, display_regions: Optional[Dict] = None, standalone_data: Optional[Dict] = None, ) -> bool: """ 发送到通用 Webhook(支持分批发送,支持自定义 JSON 模板,支持热榜+RSS合并+独立展示区) Args: webhook_url: Webhook URL payload_template: JSON 模板字符串,支持 {title} 和 {content} 占位符 report_data: 报告数据 report_type: 报告类型 update_info: 更新信息(可选) proxy_url: 代理 URL(可选) mode: 报告模式 (daily/current) account_label: 账号标签(多账号时显示) batch_size: 批次大小(字节) batch_interval: 批次发送间隔(秒) split_content_func: 内容分批函数 rss_items: RSS 统计条目列表(可选,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) Returns: bool: 发送是否成功 """ if split_content_func is None: raise ValueError("split_content_func is required") headers = {"Content-Type": "application/json"} proxies = None if proxy_url: proxies = {"http": proxy_url, "https": 
proxy_url} # 日志前缀 log_prefix = f"通用Webhook{account_label}" if account_label else "通用Webhook" # 渲染 AI 分析内容(如果有) ai_content = None ai_stats = None if ai_analysis: # 通用 Webhook 使用 markdown 格式渲染 AI 分析 ai_content = _render_ai_analysis(ai_analysis, "wework") # 提取 AI 分析统计数据 if getattr(ai_analysis, "success", False): ai_stats = { "total_news": getattr(ai_analysis, "total_news", 0), "analyzed_news": getattr(ai_analysis, "analyzed_news", 0), "max_news_limit": getattr(ai_analysis, "max_news_limit", 0), "hotlist_count": getattr(ai_analysis, "hotlist_count", 0), "rss_count": getattr(ai_analysis, "rss_count", 0), } # 获取分批内容 # 使用 'wework' 作为 format_type 以获取 markdown 格式的通用输出 # 预留一定空间给模板外壳 template_overhead = 200 batches = split_content_func( report_data, "wework", update_info, max_bytes=batch_size - template_overhead, mode=mode, rss_items=rss_items, rss_new_items=rss_new_items, ai_content=ai_content, standalone_data=standalone_data, ai_stats=ai_stats, report_type=report_type, ) # 统一添加批次头部 batches = add_batch_headers(batches, "wework", batch_size) print(f"{log_prefix}消息分为 {len(batches)} 批次发送 [{report_type}]") # 逐批发送 for i, batch_content in enumerate(batches, 1): content_size = len(batch_content.encode("utf-8")) print( f"发送{log_prefix}第 {i}/{len(batches)} 批次,大小:{content_size} 字节 [{report_type}]" ) try: # 构建 payload if payload_template: # 简单的字符串替换 # 注意:content 可能包含 JSON 特殊字符,需要先转义 json_content = json.dumps(batch_content)[1:-1] # 去掉首尾引号 json_title = json.dumps(report_type)[1:-1] payload_str = payload_template.replace("{content}", json_content).replace("{title}", json_title) # 尝试解析为 JSON 对象以验证有效性 try: payload = json.loads(payload_str) except json.JSONDecodeError as e: print(f"{log_prefix} JSON 模板解析失败: {e}") # 回退到默认格式 payload = {"title": report_type, "content": batch_content} else: # 默认格式 payload = {"title": report_type, "content": batch_content} response = requests.post( webhook_url, headers=headers, json=payload, proxies=proxies, timeout=30 ) if response.status_code >= 200 and 
response.status_code < 300: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送成功 [{report_type}]") if i < len(batches): time.sleep(batch_interval) else: print( f"{log_prefix}第 {i}/{len(batches)} 批次发送失败 [{report_type}],状态码:{response.status_code}, 响应: {response.text}" ) return False except Exception as e: print(f"{log_prefix}第 {i}/{len(batches)} 批次发送出错 [{report_type}]:{e}") return False print(f"{log_prefix}所有 {len(batches)} 批次发送完成 [{report_type}]") return True ================================================ FILE: trendradar/notification/splitter.py ================================================ # coding=utf-8 """ 消息分批处理模块 提供消息内容分批拆分功能,确保消息大小不超过各平台限制 """ from datetime import datetime from typing import Dict, List, Optional, Callable from trendradar.report.formatter import format_title_for_platform from trendradar.report.helpers import format_rank_display from trendradar.utils.time import DEFAULT_TIMEZONE, format_iso_time_friendly, convert_time_for_display # 默认批次大小配置 DEFAULT_BATCH_SIZES = { "dingtalk": 20000, "feishu": 29000, "ntfy": 3800, "default": 4000, } # 默认区域顺序 DEFAULT_REGION_ORDER = ["hotlist", "rss", "new_items", "standalone", "ai_analysis"] def split_content_into_batches( report_data: Dict, format_type: str, update_info: Optional[Dict] = None, max_bytes: Optional[int] = None, mode: str = "daily", batch_sizes: Optional[Dict[str, int]] = None, feishu_separator: str = "---", region_order: Optional[List[str]] = None, get_time_func: Optional[Callable[[], datetime]] = None, rss_items: Optional[list] = None, rss_new_items: Optional[list] = None, timezone: str = DEFAULT_TIMEZONE, display_mode: str = "keyword", ai_content: Optional[str] = None, standalone_data: Optional[Dict] = None, rank_threshold: int = 10, ai_stats: Optional[Dict] = None, report_type: str = "热点分析报告", show_new_section: bool = True, ) -> List[str]: """分批处理消息内容,确保词组标题+至少第一条新闻的完整性(支持热榜+RSS合并+AI分析+独立展示区) 热榜统计与RSS统计并列显示,热榜新增与RSS新增并列显示。 region_order 控制各区域的显示顺序。 AI分析内容根据 region_order 中的位置显示。 独立展示区根据 
region_order 中的位置显示。 Args: report_data: 报告数据字典,包含 stats, new_titles, failed_ids, total_new_count format_type: 格式类型 (feishu, dingtalk, wework, telegram, ntfy, bark, slack) update_info: 版本更新信息(可选) max_bytes: 最大字节数(可选,如果不指定则使用默认配置) mode: 报告模式 (daily, incremental, current) batch_sizes: 批次大小配置字典(可选) feishu_separator: 飞书消息分隔符 region_order: 区域显示顺序列表 get_time_func: 获取当前时间的函数(可选) rss_items: RSS 统计条目列表(按源分组,用于合并推送) rss_new_items: RSS 新增条目列表(可选,用于新增区块) timezone: 时区名称(用于 RSS 时间格式化) display_mode: 显示模式 (keyword=按关键词分组, platform=按平台分组) ai_content: AI 分析内容(已渲染的字符串,可选) standalone_data: 独立展示区数据(可选),包含 platforms 和 rss_feeds 列表 ai_stats: AI 分析统计数据(可选),包含 total_news, analyzed_news, max_news_limit 等 Returns: 分批后的消息内容列表 """ if region_order is None: region_order = DEFAULT_REGION_ORDER # 合并批次大小配置 sizes = {**DEFAULT_BATCH_SIZES, **(batch_sizes or {})} if max_bytes is None: if format_type == "dingtalk": max_bytes = sizes.get("dingtalk", 20000) elif format_type == "feishu": max_bytes = sizes.get("feishu", 29000) elif format_type == "ntfy": max_bytes = sizes.get("ntfy", 3800) else: max_bytes = sizes.get("default", 4000) batches = [] total_hotlist_count = sum( len(stat["titles"]) for stat in report_data["stats"] if stat["count"] > 0 ) total_titles = total_hotlist_count # 累加 RSS 条目数 if rss_items: total_titles += sum(stat.get("count", 0) for stat in rss_items) now = get_time_func() if get_time_func else datetime.now() # 构建头部信息 base_header = "" # 准备 AI 分析统计行(如果存在) ai_stats_line = "" if ai_stats and ai_stats.get("analyzed_news", 0) > 0: analyzed_news = ai_stats.get("analyzed_news", 0) total_news = ai_stats.get("total_news", 0) ai_mode = ai_stats.get("ai_mode", "") # 构建分析数显示:如果被截断则显示 "实际分析数/总可分析数" if total_news > analyzed_news: news_display = f"{analyzed_news}/{total_news}" else: news_display = str(analyzed_news) # 如果 AI 模式与推送模式不同,显示模式标识 mode_suffix = "" if ai_mode and ai_mode != mode: mode_map = { "daily": "全天汇总", "current": "当前榜单", "incremental": "增量分析" } mode_label = mode_map.get(ai_mode, 
ai_mode) mode_suffix = f" ({mode_label})" if format_type in ("wework", "bark", "ntfy", "feishu", "dingtalk"): ai_stats_line = f"**AI 分析数:** {news_display}{mode_suffix}\n" elif format_type == "slack": ai_stats_line = f"*AI 分析数:* {news_display}{mode_suffix}\n" elif format_type == "telegram": ai_stats_line = f"AI 分析数: {news_display}{mode_suffix}\n" # 构建统一的头部(总是显示总新闻数、时间和类型) if format_type in ("wework", "bark"): base_header = f"**总新闻数:** {total_titles}\n" base_header += ai_stats_line base_header += f"**时间:** {now.strftime('%Y-%m-%d %H:%M:%S')}\n" base_header += f"**类型:** {report_type}\n\n" elif format_type == "telegram": base_header = f"总新闻数: {total_titles}\n" base_header += ai_stats_line base_header += f"时间: {now.strftime('%Y-%m-%d %H:%M:%S')}\n" base_header += f"类型: {report_type}\n\n" elif format_type == "ntfy": base_header = f"**总新闻数:** {total_titles}\n" base_header += ai_stats_line base_header += f"**时间:** {now.strftime('%Y-%m-%d %H:%M:%S')}\n" base_header += f"**类型:** {report_type}\n\n" elif format_type == "feishu": base_header = f"**总新闻数:** {total_titles}\n" base_header += ai_stats_line base_header += f"**时间:** {now.strftime('%Y-%m-%d %H:%M:%S')}\n" base_header += f"**类型:** {report_type}\n\n" base_header += "---\n\n" elif format_type == "dingtalk": base_header = f"**总新闻数:** {total_titles}\n" base_header += ai_stats_line base_header += f"**时间:** {now.strftime('%Y-%m-%d %H:%M:%S')}\n" base_header += f"**类型:** {report_type}\n\n" base_header += "---\n\n" elif format_type == "slack": base_header = f"*总新闻数:* {total_titles}\n" base_header += ai_stats_line base_header += f"*时间:* {now.strftime('%Y-%m-%d %H:%M:%S')}\n" base_header += f"*类型:* {report_type}\n\n" base_footer = "" if format_type in ("wework", "bark"): base_footer = f"\n\n\n> 更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" if update_info: base_footer += f"\n> TrendRadar 发现新版本 **{update_info['remote_version']}**,当前 **{update_info['current_version']}**" elif format_type == "telegram": base_footer = 
f"\n\n更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" if update_info: base_footer += f"\nTrendRadar 发现新版本 {update_info['remote_version']},当前 {update_info['current_version']}" elif format_type == "ntfy": base_footer = f"\n\n> 更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" if update_info: base_footer += f"\n> TrendRadar 发现新版本 **{update_info['remote_version']}**,当前 **{update_info['current_version']}**" elif format_type == "feishu": base_footer = f"\n\n更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" if update_info: base_footer += f"\nTrendRadar 发现新版本 {update_info['remote_version']},当前 {update_info['current_version']}" elif format_type == "dingtalk": base_footer = f"\n\n> 更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}" if update_info: base_footer += f"\n> TrendRadar 发现新版本 **{update_info['remote_version']}**,当前 **{update_info['current_version']}**" elif format_type == "slack": base_footer = f"\n\n_更新时间:{now.strftime('%Y-%m-%d %H:%M:%S')}_" if update_info: base_footer += f"\n_TrendRadar 发现新版本 *{update_info['remote_version']}*,当前 *{update_info['current_version']}*_" # Choose the stats title based on display_mode stats_title = "热点词汇统计" if display_mode == "keyword" else "热点新闻统计" stats_header = "" if report_data["stats"]: if format_type in ("wework", "bark"): stats_header = f"📊 **{stats_title}** (共 {total_hotlist_count} 条)\n\n" elif format_type == "telegram": stats_header = f"📊 {stats_title} (共 {total_hotlist_count} 条)\n\n" elif format_type == "ntfy": stats_header = f"📊 **{stats_title}** (共 {total_hotlist_count} 条)\n\n" elif format_type == "feishu": stats_header = f"📊 **{stats_title}** (共 {total_hotlist_count} 条)\n\n" elif format_type == "dingtalk": stats_header = f"📊 **{stats_title}** (共 {total_hotlist_count} 条)\n\n" elif format_type == "slack": stats_header = f"📊 *{stats_title}* (共 {total_hotlist_count} 条)\n\n" current_batch = base_header current_batch_has_content = False # Handling when there is no hotlist data # Note: if ai_content is present, do not return the "no matches" message; continue on to process the AI content if ( not report_data["stats"] and not report_data["new_titles"] and not 
report_data["failed_ids"] and not ai_content # 有 AI 内容时不返回"暂无匹配" and not rss_items # 有 RSS 内容时也不返回 and not standalone_data # 有独立展示区数据时也不返回 ): if mode == "incremental": mode_text = "增量模式下暂无新增匹配的热点词汇" elif mode == "current": mode_text = "当前榜单模式下暂无匹配的热点词汇" else: mode_text = "暂无匹配的热点词汇" simple_content = f"📭 {mode_text}\n\n" final_content = base_header + simple_content + base_footer batches.append(final_content) return batches # 定义处理热点词汇统计的函数 def process_stats_section(current_batch, current_batch_has_content, batches, add_separator=True): """处理热点词汇统计""" if not report_data["stats"]: return current_batch, current_batch_has_content, batches total_count = len(report_data["stats"]) # 根据 add_separator 决定是否添加前置分割线 actual_stats_header = "" if add_separator and current_batch_has_content: # 需要添加分割线 if format_type == "feishu": actual_stats_header = f"\n{feishu_separator}\n\n{stats_header}" elif format_type == "dingtalk": actual_stats_header = f"\n---\n\n{stats_header}" elif format_type in ("wework", "bark"): actual_stats_header = f"\n\n\n\n{stats_header}" else: actual_stats_header = f"\n\n{stats_header}" else: # 不需要分割线(第一个区域) actual_stats_header = stats_header # 添加统计标题 test_content = current_batch + actual_stats_header if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) < max_bytes ): current_batch = test_content current_batch_has_content = True else: if current_batch_has_content: batches.append(current_batch + base_footer) # 新批次开头不需要分割线,使用原始 stats_header current_batch = base_header + stats_header current_batch_has_content = True # 逐个处理词组(确保词组标题+第一条新闻的原子性) for i, stat in enumerate(report_data["stats"]): word = stat["word"] count = stat["count"] sequence_display = f"[{i + 1}/{total_count}]" # 构建词组标题 word_header = "" if format_type in ("wework", "bark"): if count >= 10: word_header = ( f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" ) elif count >= 5: word_header = ( f"📈 {sequence_display} **{word}** : **{count}** 条\n\n" ) else: word_header = f"📌 
{sequence_display} **{word}** : {count} 条\n\n" elif format_type == "telegram": if count >= 10: word_header = f"🔥 {sequence_display} {word} : {count} 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} {word} : {count} 条\n\n" else: word_header = f"📌 {sequence_display} {word} : {count} 条\n\n" elif format_type == "ntfy": if count >= 10: word_header = ( f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" ) elif count >= 5: word_header = ( f"📈 {sequence_display} **{word}** : **{count}** 条\n\n" ) else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "feishu": if count >= 10: word_header = f"🔥 {sequence_display} **{word}** : {count} 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} **{word}** : {count} 条\n\n" else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "dingtalk": if count >= 10: word_header = ( f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" ) elif count >= 5: word_header = ( f"📈 {sequence_display} **{word}** : **{count}** 条\n\n" ) else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "slack": if count >= 10: word_header = ( f"🔥 {sequence_display} *{word}* : *{count}* 条\n\n" ) elif count >= 5: word_header = ( f"📈 {sequence_display} *{word}* : *{count}* 条\n\n" ) else: word_header = f"📌 {sequence_display} *{word}* : {count} 条\n\n" # 构建第一条新闻 # display_mode: keyword=显示来源, platform=显示关键词 show_source = display_mode == "keyword" show_keyword = display_mode == "platform" first_news_line = "" if stat["titles"]: first_title_data = stat["titles"][0] if format_type in ("wework", "bark"): formatted_title = format_title_for_platform( "wework", first_title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "telegram": formatted_title = format_title_for_platform( "telegram", first_title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "ntfy": formatted_title = 
format_title_for_platform( "ntfy", first_title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "feishu": formatted_title = format_title_for_platform( "feishu", first_title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "dingtalk": formatted_title = format_title_for_platform( "dingtalk", first_title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "slack": formatted_title = format_title_for_platform( "slack", first_title_data, show_source=show_source, show_keyword=show_keyword ) else: formatted_title = f"{first_title_data['title']}" first_news_line = f" 1. {formatted_title}\n" if len(stat["titles"]) > 1: first_news_line += "\n" # 原子性检查:词组标题+第一条新闻必须一起处理 word_with_first_news = word_header + first_news_line test_content = current_batch + word_with_first_news if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): # 当前批次容纳不下,开启新批次 if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + stats_header + word_with_first_news current_batch_has_content = True start_index = 1 else: current_batch = test_content current_batch_has_content = True start_index = 1 # 处理剩余新闻条目 for j in range(start_index, len(stat["titles"])): title_data = stat["titles"][j] if format_type in ("wework", "bark"): formatted_title = format_title_for_platform( "wework", title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "telegram": formatted_title = format_title_for_platform( "telegram", title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "ntfy": formatted_title = format_title_for_platform( "ntfy", title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "feishu": formatted_title = format_title_for_platform( "feishu", title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "dingtalk": formatted_title = 
format_title_for_platform( "dingtalk", title_data, show_source=show_source, show_keyword=show_keyword ) elif format_type == "slack": formatted_title = format_title_for_platform( "slack", title_data, show_source=show_source, show_keyword=show_keyword ) else: formatted_title = f"{title_data['title']}" news_line = f" {j + 1}. {formatted_title}\n" if j < len(stat["titles"]) - 1: news_line += "\n" test_content = current_batch + news_line if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + stats_header + word_header + news_line current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True # 词组间分隔符 if i < len(report_data["stats"]) - 1: separator = "" if format_type in ("wework", "bark"): separator = f"\n\n\n\n" elif format_type == "telegram": separator = f"\n\n" elif format_type == "ntfy": separator = f"\n\n" elif format_type == "feishu": separator = f"\n{feishu_separator}\n\n" elif format_type == "dingtalk": separator = f"\n---\n\n" elif format_type == "slack": separator = f"\n\n" test_content = current_batch + separator if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) < max_bytes ): current_batch = test_content return current_batch, current_batch_has_content, batches # 定义处理新增新闻的函数 def process_new_titles_section(current_batch, current_batch_has_content, batches, add_separator=True): """处理新增新闻""" if not show_new_section or not report_data["new_titles"]: return current_batch, current_batch_has_content, batches # 根据 add_separator 决定是否添加前置分割线 new_header = "" if add_separator and current_batch_has_content: # 需要添加分割线 if format_type in ("wework", "bark"): new_header = f"\n\n\n\n🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "telegram": new_header = ( f"\n\n🆕 本次新增热点新闻 (共 {report_data['total_new_count']} 条)\n\n" ) elif format_type == "ntfy": 
new_header = f"\n\n🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "feishu": new_header = f"\n{feishu_separator}\n\n🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "dingtalk": new_header = f"\n---\n\n🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "slack": new_header = f"\n\n🆕 *本次新增热点新闻* (共 {report_data['total_new_count']} 条)\n\n" else: # 不需要分割线(第一个区域) if format_type in ("wework", "bark"): new_header = f"🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "telegram": new_header = f"🆕 本次新增热点新闻 (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "ntfy": new_header = f"🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "feishu": new_header = f"🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "dingtalk": new_header = f"🆕 **本次新增热点新闻** (共 {report_data['total_new_count']} 条)\n\n" elif format_type == "slack": new_header = f"🆕 *本次新增热点新闻* (共 {report_data['total_new_count']} 条)\n\n" test_content = current_batch + new_header if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + new_header current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True # 逐个处理新增新闻来源 for source_data in report_data["new_titles"]: source_header = "" if format_type in ("wework", "bark"): source_header = f"**{source_data['source_name']}** ({len(source_data['titles'])} 条):\n\n" elif format_type == "telegram": source_header = f"{source_data['source_name']} ({len(source_data['titles'])} 条):\n\n" elif format_type == "ntfy": source_header = f"**{source_data['source_name']}** ({len(source_data['titles'])} 条):\n\n" elif format_type == "feishu": source_header = f"**{source_data['source_name']}** ({len(source_data['titles'])} 条):\n\n" elif 
format_type == "dingtalk": source_header = f"**{source_data['source_name']}** ({len(source_data['titles'])} 条):\n\n" elif format_type == "slack": source_header = f"*{source_data['source_name']}* ({len(source_data['titles'])} 条):\n\n" # Build the first new news item first_news_line = "" if source_data["titles"]: first_title_data = source_data["titles"][0] title_data_copy = first_title_data.copy() title_data_copy["is_new"] = False if format_type in ("wework", "bark"): formatted_title = format_title_for_platform( "wework", title_data_copy, show_source=False ) elif format_type == "telegram": formatted_title = format_title_for_platform( "telegram", title_data_copy, show_source=False ) elif format_type == "ntfy": formatted_title = format_title_for_platform( "ntfy", title_data_copy, show_source=False ) elif format_type == "feishu": formatted_title = format_title_for_platform( "feishu", title_data_copy, show_source=False ) elif format_type == "dingtalk": formatted_title = format_title_for_platform( "dingtalk", title_data_copy, show_source=False ) elif format_type == "slack": formatted_title = format_title_for_platform( "slack", title_data_copy, show_source=False ) else: formatted_title = f"{title_data_copy['title']}" first_news_line = f" 1. 
{formatted_title}\n" # Atomicity check: source header + first news item source_with_first_news = source_header + first_news_line test_content = current_batch + source_with_first_news if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + new_header + source_with_first_news current_batch_has_content = True start_index = 1 else: current_batch = test_content current_batch_has_content = True start_index = 1 # Process the remaining new news items for j in range(start_index, len(source_data["titles"])): title_data = source_data["titles"][j] title_data_copy = title_data.copy() title_data_copy["is_new"] = False if format_type in ("wework", "bark"): formatted_title = format_title_for_platform( "wework", title_data_copy, show_source=False ) elif format_type == "telegram": formatted_title = format_title_for_platform( "telegram", title_data_copy, show_source=False ) elif format_type == "ntfy": formatted_title = format_title_for_platform( "ntfy", title_data_copy, show_source=False ) elif format_type == "feishu": formatted_title = format_title_for_platform( "feishu", title_data_copy, show_source=False ) elif format_type == "dingtalk": formatted_title = format_title_for_platform( "dingtalk", title_data_copy, show_source=False ) elif format_type == "slack": formatted_title = format_title_for_platform( "slack", title_data_copy, show_source=False ) else: formatted_title = f"{title_data_copy['title']}" news_line = f" {j + 1}. 
{formatted_title}\n" test_content = current_batch + news_line if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + new_header + source_header + news_line current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True current_batch += "\n" return current_batch, current_batch_has_content, batches # 定义处理 AI 分析的函数 def process_ai_section(current_batch, current_batch_has_content, batches, add_separator=True): """处理 AI 分析内容""" nonlocal ai_content if not ai_content: return current_batch, current_batch_has_content, batches # 根据 add_separator 决定是否添加前置分割线 ai_separator = "" if add_separator and current_batch_has_content: # 需要添加分割线 if format_type == "feishu": ai_separator = f"\n{feishu_separator}\n\n" elif format_type == "dingtalk": ai_separator = "\n---\n\n" elif format_type in ("wework", "bark"): ai_separator = "\n\n\n\n" elif format_type in ("telegram", "ntfy", "slack"): ai_separator = "\n\n" # 如果不需要分割线,ai_separator 保持为空字符串 # 尝试将 AI 内容添加到当前批次 test_content = current_batch + ai_separator + ai_content if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) < max_bytes ): current_batch = test_content current_batch_has_content = True else: # 当前批次容纳不下,开启新批次 if current_batch_has_content: batches.append(current_batch + base_footer) # AI 内容可能很长,需要考虑是否需要进一步分割 ai_with_header = base_header + ai_content current_batch = ai_with_header current_batch_has_content = True return current_batch, current_batch_has_content, batches # 定义处理独立展示区的函数 def process_standalone_section_wrapper(current_batch, current_batch_has_content, batches, add_separator=True): """处理独立展示区""" if not standalone_data: return current_batch, current_batch_has_content, batches return _process_standalone_section( standalone_data, format_type, feishu_separator, base_header, base_footer, max_bytes, current_batch, 
current_batch_has_content, batches, timezone, rank_threshold, add_separator ) # 定义处理 RSS 统计的函数 def process_rss_stats_wrapper(current_batch, current_batch_has_content, batches, add_separator=True): """处理 RSS 统计""" if not rss_items: return current_batch, current_batch_has_content, batches return _process_rss_stats_section( rss_items, format_type, feishu_separator, base_header, base_footer, max_bytes, current_batch, current_batch_has_content, batches, timezone, add_separator ) # 定义处理 RSS 新增的函数 def process_rss_new_wrapper(current_batch, current_batch_has_content, batches, add_separator=True): """处理 RSS 新增""" if not rss_new_items: return current_batch, current_batch_has_content, batches return _process_rss_new_titles_section( rss_new_items, format_type, feishu_separator, base_header, base_footer, max_bytes, current_batch, current_batch_has_content, batches, timezone, add_separator ) # 按 region_order 顺序处理各区域 # 记录是否已有区域内容(用于决定是否添加分割线) has_region_content = False for region in region_order: # 记录处理前的状态,用于判断该区域是否产生了内容 batch_before = current_batch has_content_before = current_batch_has_content batches_len_before = len(batches) # 决定是否需要添加分割线(第一个有内容的区域不需要) add_separator = has_region_content if region == "hotlist": # 处理热榜统计 current_batch, current_batch_has_content, batches = process_stats_section( current_batch, current_batch_has_content, batches, add_separator ) elif region == "rss": # 处理 RSS 统计 current_batch, current_batch_has_content, batches = process_rss_stats_wrapper( current_batch, current_batch_has_content, batches, add_separator ) elif region == "new_items": # 处理热榜新增 current_batch, current_batch_has_content, batches = process_new_titles_section( current_batch, current_batch_has_content, batches, add_separator ) # 处理 RSS 新增(跟随 new_items,继承 add_separator 逻辑) # 如果热榜新增产生了内容,RSS 新增需要分割线 new_batch_changed = ( current_batch != batch_before or current_batch_has_content != has_content_before or len(batches) != batches_len_before ) rss_new_separator = new_batch_changed or 
has_region_content current_batch, current_batch_has_content, batches = process_rss_new_wrapper( current_batch, current_batch_has_content, batches, rss_new_separator ) elif region == "standalone": # Process the standalone display area current_batch, current_batch_has_content, batches = process_standalone_section_wrapper( current_batch, current_batch_has_content, batches, add_separator ) elif region == "ai_analysis": # Process AI analysis current_batch, current_batch_has_content, batches = process_ai_section( current_batch, current_batch_has_content, batches, add_separator ) # Check whether this region produced content region_produced_content = ( current_batch != batch_before or current_batch_has_content != has_content_before or len(batches) != batches_len_before ) if region_produced_content: has_region_content = True if report_data["failed_ids"]: failed_header = "" if format_type in ("wework", "bark"): failed_header = f"\n\n\n\n⚠️ **数据获取失败的平台:**\n\n" elif format_type == "telegram": failed_header = f"\n\n⚠️ 数据获取失败的平台:\n\n" elif format_type == "ntfy": failed_header = f"\n\n⚠️ **数据获取失败的平台:**\n\n" elif format_type == "feishu": failed_header = f"\n{feishu_separator}\n\n⚠️ **数据获取失败的平台:**\n\n" elif format_type == "dingtalk": failed_header = f"\n---\n\n⚠️ **数据获取失败的平台:**\n\n" elif format_type == "slack": failed_header = f"\n\n⚠️ *数据获取失败的平台:*\n\n" test_content = current_batch + failed_header if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + failed_header current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True for i, id_value in enumerate(report_data["failed_ids"], 1): if format_type == "feishu": failed_line = f" • {id_value}\n" elif format_type == "dingtalk": failed_line = f" • **{id_value}**\n" else: failed_line = f" • {id_value}\n" test_content = current_batch + failed_line if ( len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes ): if current_batch_has_content: batches.append(current_batch + base_footer) 
current_batch = base_header + failed_header + failed_line current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True # 完成最后批次 if current_batch_has_content: batches.append(current_batch + base_footer) return batches def _process_rss_stats_section( rss_stats: list, format_type: str, feishu_separator: str, base_header: str, base_footer: str, max_bytes: int, current_batch: str, current_batch_has_content: bool, batches: List[str], timezone: str = DEFAULT_TIMEZONE, add_separator: bool = True, ) -> tuple: """处理 RSS 统计区块(按关键词分组,与热榜统计格式一致) Args: rss_stats: RSS 关键词统计列表,格式与热榜 stats 一致: [{"word": "AI", "count": 5, "titles": [...]}] format_type: 格式类型 feishu_separator: 飞书分隔符 base_header: 基础头部 base_footer: 基础尾部 max_bytes: 最大字节数 current_batch: 当前批次内容 current_batch_has_content: 当前批次是否有内容 batches: 已完成的批次列表 timezone: 时区名称 add_separator: 是否在区块前添加分割线(第一个区域时为 False) Returns: (current_batch, current_batch_has_content, batches) 元组 """ if not rss_stats: return current_batch, current_batch_has_content, batches # 计算总条目数 total_items = sum(stat["count"] for stat in rss_stats) total_keywords = len(rss_stats) # RSS 统计区块标题(根据 add_separator 决定是否添加前置分割线) rss_header = "" if add_separator and current_batch_has_content: # 需要添加分割线 if format_type == "feishu": rss_header = f"\n{feishu_separator}\n\n📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" elif format_type == "dingtalk": rss_header = f"\n---\n\n📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" elif format_type in ("wework", "bark"): rss_header = f"\n\n\n\n📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" elif format_type == "telegram": rss_header = f"\n\n📰 RSS 订阅统计 (共 {total_items} 条)\n\n" elif format_type == "slack": rss_header = f"\n\n📰 *RSS 订阅统计* (共 {total_items} 条)\n\n" else: rss_header = f"\n\n📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" else: # 不需要分割线(第一个区域) if format_type == "feishu": rss_header = f"📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" elif format_type == "dingtalk": rss_header = f"📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" elif 
format_type == "telegram": rss_header = f"📰 RSS 订阅统计 (共 {total_items} 条)\n\n" elif format_type == "slack": rss_header = f"📰 *RSS 订阅统计* (共 {total_items} 条)\n\n" else: rss_header = f"📰 **RSS 订阅统计** (共 {total_items} 条)\n\n" # 添加 RSS 标题 test_content = current_batch + rss_header if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) < max_bytes: current_batch = test_content current_batch_has_content = True else: if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + rss_header current_batch_has_content = True # 逐个处理关键词组(与热榜一致) for i, stat in enumerate(rss_stats): word = stat["word"] count = stat["count"] sequence_display = f"[{i + 1}/{total_keywords}]" # 构建关键词标题(与热榜格式一致) word_header = "" if format_type in ("wework", "bark"): if count >= 10: word_header = f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} **{word}** : **{count}** 条\n\n" else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "telegram": if count >= 10: word_header = f"🔥 {sequence_display} {word} : {count} 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} {word} : {count} 条\n\n" else: word_header = f"📌 {sequence_display} {word} : {count} 条\n\n" elif format_type == "ntfy": if count >= 10: word_header = f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} **{word}** : **{count}** 条\n\n" else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "feishu": if count >= 10: word_header = f"🔥 {sequence_display} **{word}** : {count} 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} **{word}** : {count} 条\n\n" else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "dingtalk": if count >= 10: word_header = f"🔥 {sequence_display} **{word}** : **{count}** 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} 
**{word}** : **{count}** 条\n\n" else: word_header = f"📌 {sequence_display} **{word}** : {count} 条\n\n" elif format_type == "slack": if count >= 10: word_header = f"🔥 {sequence_display} *{word}* : *{count}* 条\n\n" elif count >= 5: word_header = f"📈 {sequence_display} *{word}* : *{count}* 条\n\n" else: word_header = f"📌 {sequence_display} *{word}* : {count} 条\n\n" # 构建第一条新闻(使用 format_title_for_platform) first_news_line = "" if stat["titles"]: first_title_data = stat["titles"][0] if format_type in ("wework", "bark"): formatted_title = format_title_for_platform("wework", first_title_data, show_source=True) elif format_type == "telegram": formatted_title = format_title_for_platform("telegram", first_title_data, show_source=True) elif format_type == "ntfy": formatted_title = format_title_for_platform("ntfy", first_title_data, show_source=True) elif format_type == "feishu": formatted_title = format_title_for_platform("feishu", first_title_data, show_source=True) elif format_type == "dingtalk": formatted_title = format_title_for_platform("dingtalk", first_title_data, show_source=True) elif format_type == "slack": formatted_title = format_title_for_platform("slack", first_title_data, show_source=True) else: formatted_title = f"{first_title_data['title']}" first_news_line = f" 1. 
{formatted_title}\n" if len(stat["titles"]) > 1: first_news_line += "\n" # 原子性检查:关键词标题 + 第一条新闻必须一起处理 word_with_first_news = word_header + first_news_line test_content = current_batch + word_with_first_news if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes: if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + rss_header + word_with_first_news current_batch_has_content = True start_index = 1 else: current_batch = test_content current_batch_has_content = True start_index = 1 # 处理剩余新闻条目 for j in range(start_index, len(stat["titles"])): title_data = stat["titles"][j] if format_type in ("wework", "bark"): formatted_title = format_title_for_platform("wework", title_data, show_source=True) elif format_type == "telegram": formatted_title = format_title_for_platform("telegram", title_data, show_source=True) elif format_type == "ntfy": formatted_title = format_title_for_platform("ntfy", title_data, show_source=True) elif format_type == "feishu": formatted_title = format_title_for_platform("feishu", title_data, show_source=True) elif format_type == "dingtalk": formatted_title = format_title_for_platform("dingtalk", title_data, show_source=True) elif format_type == "slack": formatted_title = format_title_for_platform("slack", title_data, show_source=True) else: formatted_title = f"{title_data['title']}" news_line = f" {j + 1}. 
{formatted_title}\n" if j < len(stat["titles"]) - 1: news_line += "\n" test_content = current_batch + news_line if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes: if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + rss_header + word_header + news_line current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True # 关键词间分隔符 if i < len(rss_stats) - 1: separator = "" if format_type in ("wework", "bark"): separator = "\n\n\n\n" elif format_type == "telegram": separator = "\n\n" elif format_type == "ntfy": separator = "\n\n" elif format_type == "feishu": separator = f"\n{feishu_separator}\n\n" elif format_type == "dingtalk": separator = "\n---\n\n" elif format_type == "slack": separator = "\n\n" test_content = current_batch + separator if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) < max_bytes: current_batch = test_content return current_batch, current_batch_has_content, batches def _process_rss_new_titles_section( rss_new_stats: list, format_type: str, feishu_separator: str, base_header: str, base_footer: str, max_bytes: int, current_batch: str, current_batch_has_content: bool, batches: List[str], timezone: str = DEFAULT_TIMEZONE, add_separator: bool = True, ) -> tuple: """处理 RSS 新增区块(按来源分组,与热榜新增格式一致) Args: rss_new_stats: RSS 新增关键词统计列表,格式与热榜 stats 一致: [{"word": "AI", "count": 5, "titles": [...]}] format_type: 格式类型 feishu_separator: 飞书分隔符 base_header: 基础头部 base_footer: 基础尾部 max_bytes: 最大字节数 current_batch: 当前批次内容 current_batch_has_content: 当前批次是否有内容 batches: 已完成的批次列表 timezone: 时区名称 add_separator: 是否在区块前添加分割线(第一个区域时为 False) Returns: (current_batch, current_batch_has_content, batches) 元组 """ if not rss_new_stats: return current_batch, current_batch_has_content, batches # 从关键词分组中提取所有条目,重新按来源分组 source_map = {} for stat in rss_new_stats: for title_data in stat.get("titles", []): source_name = title_data.get("source_name", 
"未知来源") if source_name not in source_map: source_map[source_name] = [] source_map[source_name].append(title_data) if not source_map: return current_batch, current_batch_has_content, batches # 计算总条目数 total_items = sum(len(titles) for titles in source_map.values()) # RSS 新增区块标题(根据 add_separator 决定是否添加前置分割线) new_header = "" if add_separator and current_batch_has_content: # 需要添加分割线 if format_type in ("wework", "bark"): new_header = f"\n\n\n\n🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "telegram": new_header = f"\n\n🆕 RSS 本次新增 (共 {total_items} 条)\n\n" elif format_type == "ntfy": new_header = f"\n\n🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "feishu": new_header = f"\n{feishu_separator}\n\n🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "dingtalk": new_header = f"\n---\n\n🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "slack": new_header = f"\n\n🆕 *RSS 本次新增* (共 {total_items} 条)\n\n" else: # 不需要分割线(第一个区域) if format_type in ("wework", "bark"): new_header = f"🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "telegram": new_header = f"🆕 RSS 本次新增 (共 {total_items} 条)\n\n" elif format_type == "ntfy": new_header = f"🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "feishu": new_header = f"🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "dingtalk": new_header = f"🆕 **RSS 本次新增** (共 {total_items} 条)\n\n" elif format_type == "slack": new_header = f"🆕 *RSS 本次新增* (共 {total_items} 条)\n\n" # 添加 RSS 新增标题 test_content = current_batch + new_header if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes: if current_batch_has_content: batches.append(current_batch + base_footer) current_batch = base_header + new_header current_batch_has_content = True else: current_batch = test_content current_batch_has_content = True # 按来源分组显示(与热榜新增格式一致) source_list = list(source_map.items()) for i, (source_name, titles) in enumerate(source_list): count = len(titles) # 构建来源标题(与热榜新增格式一致) 
        source_header = ""
        if format_type in ("wework", "bark"):
            source_header = f"**{source_name}** ({count} 条):\n\n"
        elif format_type == "telegram":
            source_header = f"{source_name} ({count} 条):\n\n"
        elif format_type == "ntfy":
            source_header = f"**{source_name}** ({count} 条):\n\n"
        elif format_type == "feishu":
            source_header = f"**{source_name}** ({count} 条):\n\n"
        elif format_type == "dingtalk":
            source_header = f"**{source_name}** ({count} 条):\n\n"
        elif format_type == "slack":
            source_header = f"*{source_name}* ({count} 条):\n\n"

        # Build the first news item (source hidden, "new" emoji disabled)
        first_news_line = ""
        if titles:
            first_title_data = titles[0].copy()
            first_title_data["is_new"] = False
            if format_type in ("wework", "bark"):
                formatted_title = format_title_for_platform("wework", first_title_data, show_source=False)
            elif format_type == "telegram":
                formatted_title = format_title_for_platform("telegram", first_title_data, show_source=False)
            elif format_type == "ntfy":
                formatted_title = format_title_for_platform("ntfy", first_title_data, show_source=False)
            elif format_type == "feishu":
                formatted_title = format_title_for_platform("feishu", first_title_data, show_source=False)
            elif format_type == "dingtalk":
                formatted_title = format_title_for_platform("dingtalk", first_title_data, show_source=False)
            elif format_type == "slack":
                formatted_title = format_title_for_platform("slack", first_title_data, show_source=False)
            else:
                formatted_title = f"{first_title_data['title']}"
            first_news_line = f" 1. {formatted_title}\n"

        # Atomicity check: the source header and the first item must land in the same batch
        source_with_first_news = source_header + first_news_line
        test_content = current_batch + source_with_first_news
        if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes:
            if current_batch_has_content:
                batches.append(current_batch + base_footer)
            current_batch = base_header + new_header + source_with_first_news
            current_batch_has_content = True
            start_index = 1
        else:
            current_batch = test_content
            current_batch_has_content = True
            start_index = 1

        # Process the remaining items ("new" emoji disabled)
        for j in range(start_index, len(titles)):
            title_data = titles[j].copy()
            title_data["is_new"] = False
            if format_type in ("wework", "bark"):
                formatted_title = format_title_for_platform("wework", title_data, show_source=False)
            elif format_type == "telegram":
                formatted_title = format_title_for_platform("telegram", title_data, show_source=False)
            elif format_type == "ntfy":
                formatted_title = format_title_for_platform("ntfy", title_data, show_source=False)
            elif format_type == "feishu":
                formatted_title = format_title_for_platform("feishu", title_data, show_source=False)
            elif format_type == "dingtalk":
                formatted_title = format_title_for_platform("dingtalk", title_data, show_source=False)
            elif format_type == "slack":
                formatted_title = format_title_for_platform("slack", title_data, show_source=False)
            else:
                formatted_title = f"{title_data['title']}"

            news_line = f" {j + 1}. {formatted_title}\n"
            test_content = current_batch + news_line
            if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes:
                if current_batch_has_content:
                    batches.append(current_batch + base_footer)
                current_batch = base_header + new_header + source_header + news_line
                current_batch_has_content = True
            else:
                current_batch = test_content
                current_batch_has_content = True

        # Blank line between sources (same format as the hotlist new-items section)
        current_batch += "\n"

    return current_batch, current_batch_has_content, batches


def _format_rss_item_line(
    item: Dict,
    index: int,
    format_type: str,
    timezone: str = DEFAULT_TIMEZONE,
) -> str:
    """Format a single RSS entry.

    Args:
        item: RSS entry dict
        index: ordinal number
        format_type: format type
        timezone: time zone name

    Returns:
        The formatted entry line
    """
    title = item.get("title", "")
    url = item.get("url", "")
    published_at = item.get("published_at", "")

    # Use the friendly time format
    if published_at:
        friendly_time = format_iso_time_friendly(published_at, timezone, include_date=True)
    else:
        friendly_time = ""

    # Build the entry line
    if format_type == "feishu":
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if friendly_time:
            item_line += f" - {friendly_time}"
    elif format_type == "telegram":
        if url:
            item_line = f" {index}. {title} ({url})"
        else:
            item_line = f" {index}. {title}"
        if friendly_time:
            item_line += f" - {friendly_time}"
    else:
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if friendly_time:
            item_line += f" `{friendly_time}`"

    item_line += "\n"
    return item_line


def _process_standalone_section(
    standalone_data: Dict,
    format_type: str,
    feishu_separator: str,
    base_header: str,
    base_footer: str,
    max_bytes: int,
    current_batch: str,
    current_batch_has_content: bool,
    batches: List[str],
    timezone: str = DEFAULT_TIMEZONE,
    rank_threshold: int = 10,
    add_separator: bool = True,
) -> tuple:
    """Process the standalone display section.

    The standalone display section shows the full hotlist of the selected platforms,
    or the full content of the selected RSS feeds, without keyword filtering.
    Hotlists are ordered by original rank; RSS items are ordered by publish time.

    Args:
        standalone_data: standalone display data, in the form:
            {
                "platforms": [{"id": "zhihu", "name": "知乎热榜", "items": [...]}],
                "rss_feeds": [{"id": "hacker-news", "name": "Hacker News", "items": [...]}]
            }
        format_type: format type
        feishu_separator: Feishu separator
        base_header: base header
        base_footer: base footer
        max_bytes: maximum number of bytes per batch
        current_batch: content of the current batch
        current_batch_has_content: whether the current batch already has content
        batches: list of completed batches
        timezone: time zone name
        rank_threshold: rank highlight threshold
        add_separator: whether to put a divider before the section (False for the first region)

    Returns:
        (current_batch, current_batch_has_content, batches) tuple
    """
    if not standalone_data:
        return current_batch, current_batch_has_content, batches

    platforms = standalone_data.get("platforms", [])
    rss_feeds = standalone_data.get("rss_feeds", [])

    if not platforms and not rss_feeds:
        return current_batch, current_batch_has_content, batches

    # Total number of entries
    total_platform_items = sum(len(p.get("items", [])) for p in platforms)
    total_rss_items = sum(len(f.get("items", [])) for f in rss_feeds)
    total_items = total_platform_items + total_rss_items

    # Section header (add_separator decides whether a divider precedes it)
    section_header = ""
    if add_separator and current_batch_has_content:
        # Divider needed
        if format_type == "feishu":
            section_header = f"\n{feishu_separator}\n\n📋 **独立展示区** (共 {total_items} 条)\n\n"
        elif format_type == "dingtalk":
            section_header = f"\n---\n\n📋 **独立展示区** (共 {total_items} 条)\n\n"
        elif format_type in ("wework", "bark"):
            section_header = f"\n\n\n\n📋 **独立展示区** (共 {total_items} 条)\n\n"
        elif format_type == "telegram":
            section_header = f"\n\n📋 独立展示区 (共 {total_items} 条)\n\n"
        elif format_type == "slack":
            section_header = f"\n\n📋 *独立展示区* (共 {total_items} 条)\n\n"
        else:
            section_header = f"\n\n📋 **独立展示区** (共 {total_items} 条)\n\n"
    else:
        # No divider needed (first region)
        if format_type == "feishu":
            section_header = f"📋 **独立展示区** (共 {total_items} 条)\n\n"
        elif format_type == "dingtalk":
            section_header = f"📋 **独立展示区** (共 {total_items} 条)\n\n"
        elif format_type == "telegram":
            section_header = f"📋 独立展示区 (共 {total_items} 条)\n\n"
        elif format_type == "slack":
            section_header = f"📋 *独立展示区* (共 {total_items} 条)\n\n"
        else:
            section_header = f"📋 **独立展示区** (共 {total_items} 条)\n\n"

    # Append the section header
    test_content = current_batch + section_header
    if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) < max_bytes:
        current_batch = test_content
        current_batch_has_content = True
    else:
        if current_batch_has_content:
            batches.append(current_batch + base_footer)
        current_batch = base_header + section_header
        current_batch_has_content = True

    # Process hotlist platforms
    for platform in platforms:
        platform_name = platform.get("name", platform.get("id", ""))
        items = platform.get("items", [])
        if not items:
            continue

        # Platform header
        platform_header = ""
        if format_type in ("wework", "bark"):
            platform_header = f"**{platform_name}** ({len(items)} 条):\n\n"
        elif format_type == "telegram":
            platform_header = f"{platform_name} ({len(items)} 条):\n\n"
        elif format_type == "ntfy":
            platform_header = f"**{platform_name}** ({len(items)} 条):\n\n"
        elif format_type == "feishu":
            platform_header = f"**{platform_name}** ({len(items)} 条):\n\n"
        elif format_type == "dingtalk":
            platform_header = f"**{platform_name}** ({len(items)} 条):\n\n"
        elif format_type == "slack":
            platform_header = f"*{platform_name}* ({len(items)} 条):\n\n"

        # Build the first item
        first_item_line = ""
        if items:
            first_item_line = _format_standalone_platform_item(items[0], 1, format_type, rank_threshold)

        # Atomicity check: the platform header and the first item must land in the same batch
        platform_with_first = platform_header + first_item_line
        test_content = current_batch + platform_with_first
        if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes:
            if current_batch_has_content:
                batches.append(current_batch + base_footer)
            current_batch = base_header + section_header + platform_with_first
            current_batch_has_content = True
            start_index = 1
        else:
            current_batch = test_content
            current_batch_has_content = True
            start_index = 1

        # Process the remaining items
        for j in range(start_index, len(items)):
            item_line = _format_standalone_platform_item(items[j], j + 1, format_type, rank_threshold)
            test_content = current_batch + item_line
            if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes:
                if current_batch_has_content:
                    batches.append(current_batch + base_footer)
                current_batch = base_header + section_header + platform_header + item_line
                current_batch_has_content = True
            else:
                current_batch = test_content
                current_batch_has_content = True

        current_batch += "\n"

    # Process RSS feeds
    for feed in rss_feeds:
        feed_name = feed.get("name", feed.get("id", ""))
        items = feed.get("items", [])
        if not items:
            continue

        # RSS feed header
        feed_header = ""
        if format_type in ("wework", "bark"):
            feed_header = f"**{feed_name}** ({len(items)} 条):\n\n"
        elif format_type == "telegram":
            feed_header = f"{feed_name} ({len(items)} 条):\n\n"
        elif format_type == "ntfy":
            feed_header = f"**{feed_name}** ({len(items)} 条):\n\n"
        elif format_type == "feishu":
            feed_header = f"**{feed_name}** ({len(items)} 条):\n\n"
        elif format_type == "dingtalk":
            feed_header = f"**{feed_name}** ({len(items)} 条):\n\n"
        elif format_type == "slack":
            feed_header = f"*{feed_name}* ({len(items)} 条):\n\n"

        # Build the first RSS item
        first_item_line = ""
        if items:
            first_item_line = _format_standalone_rss_item(items[0], 1, format_type, timezone)

        # Atomicity check: the feed header and the first item must land in the same batch
        feed_with_first = feed_header + first_item_line
        test_content = current_batch + feed_with_first
        if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes:
            if current_batch_has_content:
                batches.append(current_batch + base_footer)
            current_batch = base_header + section_header + feed_with_first
            current_batch_has_content = True
            start_index = 1
        else:
            current_batch = test_content
            current_batch_has_content = True
            start_index = 1

        # Process the remaining items
        for j in range(start_index, len(items)):
            item_line = _format_standalone_rss_item(items[j], j + 1, format_type, timezone)
            test_content = current_batch + item_line
            if len(test_content.encode("utf-8")) + len(base_footer.encode("utf-8")) >= max_bytes:
                if current_batch_has_content:
                    batches.append(current_batch + base_footer)
                current_batch = base_header + section_header + feed_header + item_line
                current_batch_has_content = True
            else:
                current_batch = test_content
                current_batch_has_content = True

        current_batch += "\n"

    return current_batch, current_batch_has_content, batches


def _format_standalone_platform_item(item: Dict, index: int, format_type: str, rank_threshold: int = 10) -> str:
    """Format a hotlist entry for the standalone display section (reuses the hot-keyword stats style).

    Args:
        item: hotlist entry containing title, url, rank, ranks, first_time, last_time, count
        index: ordinal number
        format_type: format type
        rank_threshold: rank highlight threshold

    Returns:
        The formatted entry line
    """
    title = item.get("title", "")
    url = item.get("url", "") or item.get("mobileUrl", "")
    ranks = item.get("ranks", [])
    rank = item.get("rank", 0)
    first_time = item.get("first_time", "")
    last_time = item.get("last_time", "")
    count = item.get("count", 1)

    # Format the rank with format_rank_display (reuses the hot-keyword stats logic).
    # If there is no ranks list, build one from the single rank.
    if not ranks and rank > 0:
        ranks = [rank]
    rank_display = format_rank_display(ranks, rank_threshold, format_type) if ranks else ""

    # Build the time display (a range joined with "~", as in the hot-keyword stats section).
    # Times are converted from HH-MM to HH:MM.
    time_display = ""
    if first_time and last_time and first_time != last_time:
        first_time_display = convert_time_for_display(first_time)
        last_time_display = convert_time_for_display(last_time)
        time_display = f"{first_time_display}~{last_time_display}"
    elif first_time:
        time_display = convert_time_for_display(first_time)

    # Build the occurrence count display, formatted as (N次), as in the hot-keyword stats section
    count_display = f"({count}次)" if count > 1 else ""

    # Build the entry line per format type (reuses the hot-keyword stats style)
    if format_type == "feishu":
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if rank_display:
            item_line += f" {rank_display}"
        if time_display:
            item_line += f" - {time_display}"
        if count_display:
            item_line += f" {count_display}"
    elif format_type == "dingtalk":
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if rank_display:
            item_line += f" {rank_display}"
        if time_display:
            item_line += f" - {time_display}"
        if count_display:
            item_line += f" {count_display}"
    elif format_type == "telegram":
        if url:
            item_line = f" {index}. {title} ({url})"
        else:
            item_line = f" {index}. {title}"
        if rank_display:
            item_line += f" {rank_display}"
        if time_display:
            item_line += f" - {time_display}"
        if count_display:
            item_line += f" {count_display}"
    elif format_type == "slack":
        if url:
            item_line = f" {index}. <{url}|{title}>"
        else:
            item_line = f" {index}. {title}"
        if rank_display:
            item_line += f" {rank_display}"
        if time_display:
            item_line += f" _{time_display}_"
        if count_display:
            item_line += f" {count_display}"
    else:
        # wework, bark, ntfy
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if rank_display:
            item_line += f" {rank_display}"
        if time_display:
            item_line += f" - {time_display}"
        if count_display:
            item_line += f" {count_display}"

    item_line += "\n"
    return item_line


def _format_standalone_rss_item(
    item: Dict, index: int, format_type: str, timezone: str = "Asia/Shanghai"
) -> str:
    """Format an RSS entry for the standalone display section.

    Args:
        item: RSS entry containing title, url, published_at, author
        index: ordinal number
        format_type: format type
        timezone: time zone name

    Returns:
        The formatted entry line
    """
    title = item.get("title", "")
    url = item.get("url", "")
    published_at = item.get("published_at", "")
    author = item.get("author", "")

    # Use the friendly time format
    friendly_time = ""
    if published_at:
        friendly_time = format_iso_time_friendly(published_at, timezone, include_date=True)

    # Build the metadata string
    meta_parts = []
    if friendly_time:
        meta_parts.append(friendly_time)
    if author:
        meta_parts.append(author)
    meta_str = ", ".join(meta_parts)

    # Build the entry line per format type
    if format_type == "feishu":
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if meta_str:
            item_line += f" - {meta_str}"
    elif format_type == "telegram":
        if url:
            item_line = f" {index}. {title} ({url})"
        else:
            item_line = f" {index}. {title}"
        if meta_str:
            item_line += f" - {meta_str}"
    elif format_type == "slack":
        if url:
            item_line = f" {index}. <{url}|{title}>"
        else:
            item_line = f" {index}. {title}"
        if meta_str:
            item_line += f" _{meta_str}_"
    else:
        # wework, bark, ntfy, dingtalk
        if url:
            item_line = f" {index}. [{title}]({url})"
        else:
            item_line = f" {index}. {title}"
        if meta_str:
            item_line += f" `{meta_str}`"

    item_line += "\n"
    return item_line


================================================
FILE: trendradar/report/__init__.py
================================================
# coding=utf-8
"""
Report generation module

Provides report generation and formatting, including:
- HTML report generation
- Title formatting utilities

Module structure:
- helpers: report helper functions (cleaning, escaping, formatting)
- formatter: platform title formatting
- html: HTML report rendering
- generator: report generator
"""

from trendradar.report.helpers import (
    clean_title,
    html_escape,
    format_rank_display,
)
from trendradar.report.formatter import format_title_for_platform
from trendradar.report.html import render_html_content
from trendradar.report.generator import (
    prepare_report_data,
    generate_html_report,
)

__all__ = [
    # Helper functions
    "clean_title",
    "html_escape",
    "format_rank_display",
    # Formatting functions
    "format_title_for_platform",
    # HTML rendering
    "render_html_content",
    # Report generator
    "prepare_report_data",
    "generate_html_report",
]


================================================
FILE: trendradar/report/formatter.py
================================================
# coding=utf-8
"""
Platform title formatting module

Provides title formatting for multiple platforms
"""

from typing import Dict

from trendradar.report.helpers import clean_title, html_escape, format_rank_display


def format_title_for_platform(
    platform: str, title_data: Dict, show_source: bool = True, show_keyword: bool = False
) -> str:
    """Unified title formatting.

    Builds the title string in the format each platform expects.

    Args:
        platform: target platform, one of:
            - "feishu": Feishu
            - "dingtalk": DingTalk
            - "wework": WeCom (WeChat Work)
            - "bark": Bark
            - "telegram": Telegram
            - "ntfy": ntfy
            - "slack": Slack
            - "html": HTML report
        title_data: title data dict with the following fields:
            - title: title text
            - source_name: source name
            - time_display: time display
            - count: number of occurrences
            - ranks: list of ranks
            - rank_threshold: highlight threshold
            - url: desktop link
            - mobile_url: mobile link (preferred)
            - is_new: whether the title is new (optional)
            - matched_keyword: matched keyword (optional, used in platform mode)
        show_source: whether to show the source name (used in keyword mode)
        show_keyword: whether to show the keyword tag (used in platform mode)

    Returns:
        The formatted title string
    """
    rank_display = format_rank_display(
        title_data["ranks"], title_data["rank_threshold"], platform
    )

    link_url = title_data["mobile_url"] or title_data["url"]
    cleaned_title = clean_title(title_data["title"])

    # Keyword tag (used in platform mode)
    keyword = title_data.get("matched_keyword", "") if show_keyword else ""

    if platform == "feishu":
        if link_url:
            formatted_title = f"[{cleaned_title}]({link_url})"
        else:
            formatted_title = cleaned_title

        title_prefix = "🆕 " if title_data.get("is_new") else ""

        if show_source:
            result = f"[{title_data['source_name']}] {title_prefix}{formatted_title}"
        elif show_keyword and keyword:
            result = f"[{keyword}] {title_prefix}{formatted_title}"
        else:
            result = f"{title_prefix}{formatted_title}"

        if rank_display:
            result += f" {rank_display}"
        if title_data["time_display"]:
            result += f" - {title_data['time_display']}"
        if title_data["count"] > 1:
            result += f" ({title_data['count']}次)"

        return result

    elif platform == "dingtalk":
        if link_url:
            formatted_title = f"[{cleaned_title}]({link_url})"
        else:
            formatted_title = cleaned_title

        title_prefix = "🆕 " if title_data.get("is_new") else ""

        if show_source:
            result = f"[{title_data['source_name']}] {title_prefix}{formatted_title}"
        elif show_keyword and keyword:
            result = f"[{keyword}] {title_prefix}{formatted_title}"
        else:
            result = f"{title_prefix}{formatted_title}"

        if rank_display:
            result += f" {rank_display}"
        if title_data["time_display"]:
            result += f" - {title_data['time_display']}"
        if title_data["count"] > 1:
            result += f" ({title_data['count']}次)"

        return result

    elif platform in ("wework", "bark"):
        # WeWork and Bark use markdown
        if link_url:
            formatted_title = f"[{cleaned_title}]({link_url})"
        else:
            formatted_title = cleaned_title

        title_prefix = "🆕 " if title_data.get("is_new") else ""

        if show_source:
            result = f"[{title_data['source_name']}] {title_prefix}{formatted_title}"
        elif show_keyword and keyword:
            result = f"[{keyword}] {title_prefix}{formatted_title}"
        else:
            result = f"{title_prefix}{formatted_title}"

        if rank_display:
            result += f" {rank_display}"
        if title_data["time_display"]:
            result += f" - {title_data['time_display']}"
        if title_data["count"] > 1:
            result += f" ({title_data['count']}次)"

        return result

    elif platform == "telegram":
        if link_url:
            # Assumption: the anchor markup is missing from this copy of the source;
            # a plain <a href> link (Telegram HTML parse mode) is reconstructed here.
            formatted_title = f'<a href="{link_url}">{html_escape(cleaned_title)}</a>'
        else:
            formatted_title = cleaned_title

        title_prefix = "🆕 " if title_data.get("is_new") else ""

        if show_source:
            result = f"[{title_data['source_name']}] {title_prefix}{formatted_title}"
        elif show_keyword and keyword:
            result = f"[{html_escape(keyword)}] {title_prefix}{formatted_title}"
        else:
            result = f"{title_prefix}{formatted_title}"

        if rank_display:
            result += f" {rank_display}"
        if title_data["time_display"]:
            result += f" - {title_data['time_display']}"
        if title_data["count"] > 1:
            result += f"({title_data['count']}次)"

        return result

    elif platform == "ntfy":
        if link_url:
            formatted_title = f"[{cleaned_title}]({link_url})"
        else:
            formatted_title = cleaned_title

        title_prefix = "🆕 " if title_data.get("is_new") else ""

        if show_source:
            result = f"[{title_data['source_name']}] {title_prefix}{formatted_title}"
        elif show_keyword and keyword:
            result = f"[{keyword}] {title_prefix}{formatted_title}"
        else:
            result = f"{title_prefix}{formatted_title}"

        if rank_display:
            result += f" {rank_display}"
        if title_data["time_display"]:
            result += f" `- {title_data['time_display']}`"
        if title_data["count"] > 1:
            result += f" `({title_data['count']}次)`"

        return result

    elif platform == "slack":
        # Slack uses mrkdwn
        if link_url:
            # Slack link format: <url|text>
            formatted_title = f"<{link_url}|{cleaned_title}>"
        else:
            formatted_title = cleaned_title

        title_prefix = "🆕 " if title_data.get("is_new") else ""

        if show_source:
            result = f"[{title_data['source_name']}] {title_prefix}{formatted_title}"
        elif show_keyword and keyword:
            result = f"*[{keyword}]* {title_prefix}{formatted_title}"
        else:
            result = f"{title_prefix}{formatted_title}"

        # Rank (bold with *)
        rank_display = format_rank_display(
            title_data["ranks"], title_data["rank_threshold"], "slack"
        )
        if rank_display:
            result += f" {rank_display}"
        if title_data["time_display"]:
            result += f" `- {title_data['time_display']}`"
        if title_data["count"] > 1:
            result += f" `({title_data['count']}次)`"

        return result

    elif platform == "html":
        rank_display = format_rank_display(
            title_data["ranks"], title_data["rank_threshold"], "html"
        )

        link_url = title_data["mobile_url"] or title_data["url"]

        escaped_title = html_escape(cleaned_title)
        escaped_source_name = html_escape(title_data["source_name"])

        # Build the prefix (source or keyword)
        if show_source:
            prefix = f'[{escaped_source_name}] '
        elif show_keyword and keyword:
            escaped_keyword = html_escape(keyword)
            prefix = f'[{escaped_keyword}] '
        else:
            prefix = ""

        if link_url:
            escaped_url = html_escape(link_url)
            # Assumption: the anchor markup is missing from this copy of the source;
            # a bare <a href> link is reconstructed, any original attributes are unknown.
            formatted_title = f'{prefix}<a href="{escaped_url}">{escaped_title}</a>'
        else:
            formatted_title = f'{prefix}{escaped_title}'

        if rank_display:
            formatted_title += f" {rank_display}"
        if title_data["time_display"]:
            escaped_time = html_escape(title_data["time_display"])
            formatted_title += f" - {escaped_time}"
        if title_data["count"] > 1:
            formatted_title += f" ({title_data['count']}次)"

        if title_data.get("is_new"):
            formatted_title = f" 🆕 {formatted_title}"

        return formatted_title

    else:
        return cleaned_title


================================================
FILE: trendradar/report/generator.py
================================================
# coding=utf-8
"""
Report generation module

Provides report data preparation and HTML generation:
- prepare_report_data: prepare the report data
- generate_html_report: generate the HTML report
"""

from pathlib import Path
from typing import Dict, List, Optional, Callable


def prepare_report_data(
    stats: List[Dict],
    failed_ids: Optional[List] = None,
    new_titles: Optional[Dict] = None,
    id_to_name: Optional[Dict] = None,
    mode: str = "daily",
    rank_threshold: int = 3,
    matches_word_groups_func: Optional[Callable] = None,
    load_frequency_words_func: Optional[Callable] = None,
    show_new_section: bool = True,
) -> Dict:
    """
    Prepare the report data.

    Args:
        stats: list of statistics results
        failed_ids: list of failed IDs
        new_titles: new titles
        id_to_name: mapping from ID to name
        mode: report mode (daily/incremental/current)
        rank_threshold: rank threshold
        matches_word_groups_func: word-group matching function
        load_frequency_words_func: function that loads the frequency words
        show_new_section: whether to show the new-hotspots section

    Returns:
        Dict: the prepared report data
    """
    processed_new_titles = []

    # Hide the new-items section in incremental mode or when it is disabled in the config
    hide_new_section = mode == "incremental" or not show_new_section

    # Only process new items when the section is not hidden
    if not hide_new_section:
        filtered_new_titles = {}
        if new_titles and id_to_name:
            # If a matching function is provided, filter with it
            if matches_word_groups_func and load_frequency_words_func:
                word_groups, filter_words, global_filters = load_frequency_words_func()
                for source_id, titles_data in new_titles.items():
                    filtered_titles = {}
                    for title, title_data in titles_data.items():
                        if matches_word_groups_func(title, word_groups, filter_words, global_filters):
                            filtered_titles[title] = title_data
                    if filtered_titles:
                        filtered_new_titles[source_id] = filtered_titles
            else:
                # Without a matching function, use everything
                filtered_new_titles = new_titles

            # Log the number of new hotspots left after filtering (matches what is pushed)
            original_new_count = sum(len(titles) for titles in new_titles.values()) if new_titles else 0
            filtered_new_count = sum(len(titles) for titles in filtered_new_titles.values()) if filtered_new_titles else 0
            if original_new_count > 0:
                print(f"频率词过滤后:{filtered_new_count} 条新增热点匹配(原始 {original_new_count} 条)")

        if filtered_new_titles and id_to_name:
            for source_id, titles_data in filtered_new_titles.items():
                source_name = id_to_name.get(source_id, source_id)
                source_titles = []

                for title, title_data in titles_data.items():
                    url = title_data.get("url", "")
                    mobile_url = title_data.get("mobileUrl", "")
                    ranks = title_data.get("ranks", [])

                    processed_title = {
                        "title": title,
                        "source_name": source_name,
                        "time_display": "",
                        "count": 1,
                        "ranks": ranks,
                        "rank_threshold": rank_threshold,
                        "url": url,
                        "mobile_url": mobile_url,
                        "is_new": True,
                    }
                    source_titles.append(processed_title)

                if source_titles:
                    processed_new_titles.append(
                        {
                            "source_id": source_id,
                            "source_name": source_name,
                            "titles": source_titles,
                        }
                    )

    processed_stats = []
    for stat in stats:
        if stat["count"] <= 0:
            continue

        processed_titles = []
        for title_data in stat["titles"]:
            processed_title = {
                "title": title_data["title"],
                "source_name": title_data["source_name"],
                "time_display": title_data["time_display"],
                "count": title_data["count"],
                "ranks": title_data["ranks"],
                "rank_threshold": title_data["rank_threshold"],
                "url": title_data.get("url", ""),
                "mobile_url": title_data.get("mobileUrl", ""),
                "is_new": title_data.get("is_new", False),
            }
            processed_titles.append(processed_title)

        processed_stats.append(
            {
                "word": stat["word"],
                "count": stat["count"],
                "percentage": stat.get("percentage", 0),
                "titles": processed_titles,
            }
        )

    return {
        "stats": processed_stats,
        "new_titles": processed_new_titles,
        "failed_ids": failed_ids or [],
        "total_new_count": sum(
            len(source["titles"]) for source in processed_new_titles
        ),
    }


def generate_html_report(
    stats: List[Dict],
    total_titles: int,
    failed_ids: Optional[List] = None,
    new_titles: Optional[Dict] = None,
    id_to_name: Optional[Dict] = None,
    mode: str = "daily",
    update_info: Optional[Dict] = None,
    rank_threshold: int = 3,
    output_dir: str = "output",
    date_folder: str = "",
    time_filename: str = "",
    render_html_func: Optional[Callable] = None,
    matches_word_groups_func: Optional[Callable] = None,
    load_frequency_words_func: Optional[Callable] = None,
) -> str:
    """
    Generate the HTML report.

    After each HTML generation this function:
    1. Saves a timestamped snapshot to output/html/<date>/<time>.html (history)
    2. Copies it to output/html/latest/{mode}.html (latest report)
    3. Copies it to output/index.html and the root index.html (entry point)

    Args:
        stats: list of statistics results
        total_titles: total number of titles
        failed_ids: list of failed IDs
        new_titles: new titles
        id_to_name: mapping from ID to name
        mode: report mode (daily/incremental/current)
        update_info: update information
        rank_threshold: rank threshold
        output_dir: output directory
        date_folder: name of the date folder
        time_filename: time-based file name
        render_html_func: HTML rendering function
        matches_word_groups_func: word-group matching function
        load_frequency_words_func: function that loads the frequency words

    Returns:
        str: path of the generated HTML file (the timestamped snapshot)
    """
    # File name of the timestamped snapshot
    snapshot_filename = f"{time_filename}.html"

    # Build the output path (flat layout: output/html/<date>/)
    snapshot_path = Path(output_dir) / "html" / date_folder
    snapshot_path.mkdir(parents=True, exist_ok=True)
    snapshot_file = str(snapshot_path / snapshot_filename)

    # Prepare the report data
    report_data = prepare_report_data(
        stats,
        failed_ids,
        new_titles,
        id_to_name,
        mode,
        rank_threshold,
        matches_word_groups_func,
        load_frequency_words_func,
    )

    # Render the HTML content
    if render_html_func:
        html_content = render_html_func(
            report_data, total_titles, mode, update_info
        )
    else:
        # Default simple HTML.
        # Assumption: the tags are missing from this copy of the source; only the text
        # "Report" and {report_data} survive, so this minimal wrapper is a reconstruction.
        html_content = f"<html><body><h1>Report</h1><pre>{report_data}</pre></body></html>"

    # 1. Save the timestamped snapshot (history)
    with open(snapshot_file, "w", encoding="utf-8") as f:
        f.write(html_content)

    # 2. Copy to html/latest/{mode}.html (latest report)
    latest_dir = Path(output_dir) / "html" / "latest"
    latest_dir.mkdir(parents=True, exist_ok=True)
    latest_file = latest_dir / f"{mode}.html"
    with open(latest_file, "w", encoding="utf-8") as f:
        f.write(html_content)

    # 3. Copy to index.html (entry point)
    # output/index.html (served through the Docker volume mount)
    output_index = Path(output_dir) / "index.html"
    with open(output_index, "w", encoding="utf-8") as f:
        f.write(html_content)

    # Root index.html (served by GitHub Pages)
    root_index = Path("index.html")
    with open(root_index, "w", encoding="utf-8") as f:
        f.write(html_content)

    return snapshot_file


================================================
FILE: trendradar/report/helpers.py
================================================
# coding=utf-8
"""
Report helper functions module

Provides general-purpose helpers for report generation
"""

import re
from typing import List


def clean_title(title: str) -> str:
    """Clean special characters out of a title.

    Rules:
    - Replace line breaks (\n, \r) with spaces
    - Collapse runs of whitespace into a single space
    - Strip leading and trailing whitespace

    Args:
        title: raw title string

    Returns:
        The cleaned title string
    """
    if not isinstance(title, str):
        title = str(title)
    cleaned_title = title.replace("\n", " ").replace("\r", " ")
    cleaned_title = re.sub(r"\s+", " ", cleaned_title)
    cleaned_title = cleaned_title.strip()
    return cleaned_title


def html_escape(text: str) -> str:
    """Escape HTML special characters.

    Rules (applied in this order):
    - & → &amp;
    - < → &lt;
    - > → &gt;
    - " → &quot;
    - ' → &#x27;

    Args:
        text: raw text

    Returns:
        The escaped text
    """
    if not isinstance(text, str):
        text = str(text)

    # "&" must be replaced first so the entities produced below are not escaped again
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#x27;")
    )


def format_rank_display(ranks: List[int], rank_threshold: int, format_type: str) -> str:
    """Format the rank display.

    Builds the rank string in the format each platform expects.
    When the best (lowest) rank is at or below the threshold, the highlight format is used.

    Args:
        ranks: list of ranks (may contain duplicates)
        rank_threshold: highlight threshold; ranks at or below this value are highlighted
        format_type: platform type, one of:
            - "html": HTML
            - "feishu": Feishu
            - "dingtalk": DingTalk
            - "wework": WeCom (WeChat Work)
            - "telegram": Telegram
            - "slack": Slack
            - anything else: default markdown

    Returns:
        The formatted rank string, e.g. "[1]" or "[1 - 5]";
        an empty string if the rank list is empty
    """
    if not ranks:
        return ""

    unique_ranks = sorted(set(ranks))
    min_rank = unique_ranks[0]
    max_rank = unique_ranks[-1]

    # Pick the highlight markup for the platform
    if format_type == "html":
        # Assumption: the original markup is missing from this copy of the source;
        # <strong> is a minimal stand-in for the HTML highlight.
        highlight_start = "<strong>"
        highlight_end = "</strong>"
    elif format_type == "feishu":
        highlight_start = "**"
        highlight_end = "**"
    elif format_type == "dingtalk":
        highlight_start = "**"
        highlight_end = "**"
    elif format_type == "wework":
        highlight_start = "**"
        highlight_end = "**"
    elif format_type == "telegram":
        # Assumption: the original markup is missing from this copy of the source;
        # <b> is the bold tag of Telegram's HTML parse mode.
        highlight_start = "<b>"
        highlight_end = "</b>"
    elif format_type == "slack":
        highlight_start = "*"
        highlight_end = "*"
    else:
        # Default markdown
        highlight_start = "**"
        highlight_end = "**"

    # Build the rank display
    rank_str = ""
    if min_rank <= rank_threshold:
        if min_rank == max_rank:
            rank_str = f"{highlight_start}[{min_rank}]{highlight_end}"
        else:
            rank_str = f"{highlight_start}[{min_rank} - {max_rank}]{highlight_end}"
    else:
        if min_rank == max_rank:
            rank_str = f"[{min_rank}]"
        else:
            rank_str = f"[{min_rank} - {max_rank}]"

    # Compute the popularity trend from the last two ranks
    trend_arrow = ""
    if len(ranks) >= 2:
        prev_rank = ranks[-2]
        curr_rank = ranks[-1]
        if curr_rank < prev_rank:
            trend_arrow = "🔺"  # Rank improved (number got smaller)
        elif curr_rank > prev_rank:
            trend_arrow = "🔻"  # Rank dropped (number got larger)
        else:
            trend_arrow = "➖"  # Rank unchanged
    # With len(ranks) == 1 no arrow is shown (new entries are handled by the is_new field in formatter.py)

    return f"{rank_str} {trend_arrow}" if trend_arrow else rank_str


================================================
FILE: trendradar/report/html.py
================================================
# coding=utf-8
"""
HTML report rendering module

Provides HTML report generation for trending news
"""

from datetime import datetime
from typing import Any, Dict, List, Optional, Callable

from trendradar.report.helpers import html_escape
from trendradar.utils.time import convert_time_for_display
from trendradar.ai.formatter import render_ai_analysis_html_rich


def render_html_content(
    report_data: Dict,
    total_titles: int,
    mode: str = "daily",
    update_info: Optional[Dict] = None,
    *,
    region_order: Optional[List[str]] = None,
    get_time_func: Optional[Callable[[], datetime]] = None,
rss_items: Optional[List[Dict]] = None, rss_new_items: Optional[List[Dict]] = None, display_mode: str = "keyword", standalone_data: Optional[Dict] = None, ai_analysis: Optional[Any] = None, show_new_section: bool = True, ) -> str: """渲染HTML内容 Args: report_data: 报告数据字典,包含 stats, new_titles, failed_ids, total_new_count total_titles: 新闻总数 mode: 报告模式 ("daily", "current", "incremental") update_info: 更新信息(可选) region_order: 区域显示顺序列表 get_time_func: 获取当前时间的函数(可选,默认使用 datetime.now) rss_items: RSS 统计条目列表(可选) rss_new_items: RSS 新增条目列表(可选) display_mode: 显示模式 ("keyword"=按关键词分组, "platform"=按平台分组) standalone_data: 独立展示区数据(可选),包含 platforms 和 rss_feeds ai_analysis: AI 分析结果对象(可选),AIAnalysisResult 实例 show_new_section: 是否显示新增热点区域 Returns: 渲染后的 HTML 字符串 """ # 默认区域顺序 default_region_order = ["hotlist", "rss", "new_items", "standalone", "ai_analysis"] if region_order is None: region_order = default_region_order html = """热点新闻分析 """ return html ================================================ FILE: trendradar/report/rss_html.py ================================================ # coding=utf-8 """ RSS HTML 报告渲染模块 提供 RSS 订阅内容的 HTML 格式报告生成功能 """ from datetime import datetime from typing import Dict, List, Optional, Callable from trendradar.report.helpers import html_escape def render_rss_html_content( rss_items: List[Dict], total_count: int, feeds_info: Optional[Dict[str, str]] = None, *, get_time_func: Optional[Callable[[], datetime]] = None, ) -> str: """渲染 RSS HTML 内容 Args: rss_items: RSS 条目列表,每个条目包含: - title: 标题 - feed_id: RSS 源 ID - feed_name: RSS 源名称 - url: 链接 - published_at: 发布时间 - summary: 摘要(可选) - author: 作者(可选) total_count: 条目总数 feeds_info: RSS 源 ID 到名称的映射 get_time_func: 获取当前时间的函数(可选,默认使用 datetime.now) Returns: 渲染后的 HTML 字符串 """ html = """热点新闻分析报告类型 """ # 处理报告类型显示(根据 mode 直接显示) if mode == "current": html += "当前榜单" elif mode == "incremental": html += "增量分析" else: html += "全天汇总" html += """新闻总数 """ html += f"{total_titles} 条" # 计算筛选后的热点新闻数量 hot_news_count = sum(len(stat["titles"]) for 
stat in report_data["stats"]) html += """热点新闻 """ html += f"{hot_news_count} 条" html += """生成时间 """ # 使用提供的时间函数或默认 datetime.now if get_time_func: now = get_time_func() else: now = datetime.now() html += now.strftime("%m-%d %H:%M") html += """""" # 处理失败ID错误信息 if report_data["failed_ids"]: html += """""" # 生成热点词汇统计部分的HTML stats_html = "" if report_data["stats"]: total_count = len(report_data["stats"]) for i, stat in enumerate(report_data["stats"], 1): count = stat["count"] # 确定热度等级 if count >= 10: count_class = "hot" elif count >= 5: count_class = "warm" else: count_class = "" escaped_word = html_escape(stat["word"]) stats_html += f"""⚠️ 请求失败的平台""" for id_value in report_data["failed_ids"]: html += f'
- {html_escape(id_value)}
' html += """""" # 给热榜统计添加外层包装 if stats_html: stats_html = f"""""" # 处理每个词组下的新闻标题,给每条新闻标上序号 for j, title_data in enumerate(stat["titles"], 1): is_new = title_data.get("is_new", False) new_class = "new" if is_new else "" stats_html += f"""{escaped_word}{count} 条{i}/{total_count}""" stats_html += """{j}""" # 根据 display_mode 决定显示来源还是关键词 if display_mode == "keyword": # keyword 模式:显示来源 stats_html += f'{html_escape(title_data["source_name"])}' else: # platform 模式:显示关键词 matched_keyword = title_data.get("matched_keyword", "") if matched_keyword: stats_html += f'[{html_escape(matched_keyword)}]' # 处理排名显示 ranks = title_data.get("ranks", []) if ranks: min_rank = min(ranks) max_rank = max(ranks) rank_threshold = title_data.get("rank_threshold", 10) # 确定排名等级 if min_rank <= 3: rank_class = "top" elif min_rank <= rank_threshold: rank_class = "high" else: rank_class = "" if min_rank == max_rank: rank_text = str(min_rank) else: rank_text = f"{min_rank}-{max_rank}" stats_html += f'{rank_text}' # 处理时间显示 time_display = title_data.get("time_display", "") if time_display: # 简化时间显示格式,将波浪线替换为~ simplified_time = ( time_display.replace(" ~ ", "~") .replace("[", "") .replace("]", "") ) stats_html += ( f'{html_escape(simplified_time)}' ) # 处理出现次数 count_info = title_data.get("count", 1) if count_info > 1: stats_html += f'{count_info}次' stats_html += """""" # 处理标题和链接 escaped_title = html_escape(title_data["title"]) link_url = title_data.get("mobile_url") or title_data.get("url", "") if link_url: escaped_url = html_escape(link_url) stats_html += f'{escaped_title}' else: stats_html += escaped_title stats_html += """{stats_html}""" # 生成新增新闻区域的HTML new_titles_html = "" if show_new_section and report_data["new_titles"]: new_titles_html += f"""""" # 生成 RSS 统计内容 def render_rss_stats_html(stats: List[Dict], title: str = "RSS 订阅更新") -> str: """渲染 RSS 统计区块 HTML Args: stats: RSS 分组统计列表,格式与热榜一致: [ { "word": "关键词", "count": 5, "titles": [ { "title": "标题", "source_name": "Feed 名称", "time_display": "12-29 
08:20", "url": "...", "is_new": True/False } ] } ] title: 区块标题 Returns: 渲染后的 HTML 字符串 """ if not stats: return "" # 计算总条目数 total_count = sum(stat.get("count", 0) for stat in stats) if total_count == 0: return "" rss_html = f"""本次新增热点 (共 {report_data['total_new_count']} 条)""" for source_data in report_data["new_titles"]: escaped_source = html_escape(source_data["source_name"]) titles_count = len(source_data["titles"]) new_titles_html += f"""""" new_titles_html += """{escaped_source} · {titles_count}条""" # 为新增新闻也添加序号 for idx, title_data in enumerate(source_data["titles"], 1): ranks = title_data.get("ranks", []) # 处理新增新闻的排名显示 rank_class = "" if ranks: min_rank = min(ranks) if min_rank <= 3: rank_class = "top" elif min_rank <= title_data.get("rank_threshold", 10): rank_class = "high" if len(ranks) == 1: rank_text = str(ranks[0]) else: rank_text = f"{min(ranks)}-{max(ranks)}" else: rank_text = "?" new_titles_html += f"""""" new_titles_html += """{idx}{rank_text}""" # 处理新增新闻的链接 escaped_title = html_escape(title_data["title"]) link_url = title_data.get("mobile_url") or title_data.get("url", "") if link_url: escaped_url = html_escape(link_url) new_titles_html += f'{escaped_title}' else: new_titles_html += escaped_title new_titles_html += """""" return rss_html # 生成独立展示区内容 def render_standalone_html(data: Optional[Dict]) -> str: """渲染独立展示区 HTML(复用热点词汇统计区样式) Args: data: 独立展示数据,格式: { "platforms": [ { "id": "zhihu", "name": "知乎热榜", "items": [ { "title": "标题", "url": "链接", "rank": 1, "ranks": [1, 2, 1], "first_time": "08:00", "last_time": "12:30", "count": 3, } ] } ], "rss_feeds": [ { "id": "hacker-news", "name": "Hacker News", "items": [ { "title": "标题", "url": "链接", "published_at": "2025-01-07T08:00:00", "author": "作者", } ] } ] } Returns: 渲染后的 HTML 字符串 """ if not data: return "" platforms = data.get("platforms", []) rss_feeds = data.get("rss_feeds", []) if not platforms and not rss_feeds: return "" # 计算总条目数 total_platform_items = sum(len(p.get("items", [])) for p in 
platforms) total_rss_items = sum(len(f.get("items", [])) for f in rss_feeds) total_count = total_platform_items + total_rss_items if total_count == 0: return "" standalone_html = f"""""" # 按关键词分组渲染(与热榜格式一致) for stat in stats: keyword = stat.get("word", "") titles = stat.get("titles", []) if not titles: continue keyword_count = len(titles) rss_html += f"""{title}{total_count} 条""" rss_html += """""" for title_data in titles: item_title = title_data.get("title", "") url = title_data.get("url", "") time_display = title_data.get("time_display", "") source_name = title_data.get("source_name", "") is_new = title_data.get("is_new", False) rss_html += """{html_escape(keyword)}{keyword_count} 条""" rss_html += """""" escaped_title = html_escape(item_title) if url: escaped_url = html_escape(url) rss_html += f'{escaped_title}' else: rss_html += escaped_title rss_html += """""" return standalone_html # 生成 RSS 统计和新增 HTML rss_stats_html = render_rss_stats_html(rss_items, "RSS 订阅更新") if rss_items else "" rss_new_html = render_rss_stats_html(rss_new_items, "RSS 新增更新") if rss_new_items else "" # 生成独立展示区 HTML standalone_html = render_standalone_html(standalone_data) # 生成 AI 分析 HTML ai_html = render_ai_analysis_html_rich(ai_analysis) if ai_analysis else "" # 准备各区域内容映射 region_contents = { "hotlist": stats_html, "rss": rss_stats_html, "new_items": (new_titles_html, rss_new_html), # 元组,分别处理 "standalone": standalone_html, "ai_analysis": ai_html, } def add_section_divider(content: str) -> str: """为内容的外层 div 添加 section-divider 类""" if not content or 'class="' not in content: return content first_class_pos = content.find('class="') if first_class_pos != -1: insert_pos = first_class_pos + len('class="') return content[:insert_pos] + "section-divider " + content[insert_pos:] return content # 按 region_order 顺序组装内容,动态添加分割线 has_previous_content = False for region in region_order: content = region_contents.get(region, "") if region == "new_items": # 特殊处理 new_items 区域(包含热榜新增和 RSS 新增两部分) new_html, 
rss_new = content if new_html: if has_previous_content: new_html = add_section_divider(new_html) html += new_html has_previous_content = True if rss_new: if has_previous_content: rss_new = add_section_divider(rss_new) html += rss_new has_previous_content = True elif content: if has_previous_content: content = add_section_divider(content) html += content has_previous_content = True html += """""" # 渲染热榜平台(复用 word-group 结构) for platform in platforms: platform_name = platform.get("name", platform.get("id", "")) items = platform.get("items", []) if not items: continue standalone_html += f"""独立展示区{total_count} 条""" # 渲染 RSS 源(复用相同结构) for feed in rss_feeds: feed_name = feed.get("name", feed.get("id", "")) items = feed.get("items", []) if not items: continue standalone_html += f"""""" # 渲染每个条目(复用 news-item 结构) for j, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") or item.get("mobileUrl", "") rank = item.get("rank", 0) ranks = item.get("ranks", []) first_time = item.get("first_time", "") last_time = item.get("last_time", "") count = item.get("count", 1) standalone_html += f"""{html_escape(platform_name)}{len(items)} 条""" standalone_html += """{j}""" # 排名显示(复用 rank-num 样式,无 # 前缀) if ranks: min_rank = min(ranks) max_rank = max(ranks) # 确定排名等级 if min_rank <= 3: rank_class = "top" elif min_rank <= 10: rank_class = "high" else: rank_class = "" if min_rank == max_rank: rank_text = str(min_rank) else: rank_text = f"{min_rank}-{max_rank}" standalone_html += f'{rank_text}' elif rank > 0: if rank <= 3: rank_class = "top" elif rank <= 10: rank_class = "high" else: rank_class = "" standalone_html += f'{rank}' # 时间显示(复用 time-info 样式,将 HH-MM 转换为 HH:MM) if first_time and last_time and first_time != last_time: first_time_display = convert_time_for_display(first_time) last_time_display = convert_time_for_display(last_time) standalone_html += f'{html_escape(first_time_display)}~{html_escape(last_time_display)}' elif first_time: first_time_display = 
convert_time_for_display(first_time) standalone_html += f'{html_escape(first_time_display)}' # 出现次数(复用 count-info 样式) if count > 1: standalone_html += f'{count}次' standalone_html += """""" # 标题和链接(复用 news-link 样式) escaped_title = html_escape(title) if url: escaped_url = html_escape(url) standalone_html += f'{escaped_title}' else: standalone_html += escaped_title standalone_html += """""" standalone_html += """""" for j, item in enumerate(items, 1): title = item.get("title", "") url = item.get("url", "") published_at = item.get("published_at", "") author = item.get("author", "") standalone_html += f"""{html_escape(feed_name)}{len(items)} 条""" standalone_html += """{j}""" # 时间显示(格式化 ISO 时间) if published_at: try: from datetime import datetime as dt if "T" in published_at: dt_obj = dt.fromisoformat(published_at.replace("Z", "+00:00")) time_display = dt_obj.strftime("%m-%d %H:%M") else: time_display = published_at except: time_display = published_at standalone_html += f'{html_escape(time_display)}' # 作者显示 if author: standalone_html += f'{html_escape(author)}' standalone_html += """""" escaped_title = html_escape(title) if url: escaped_url = html_escape(url) standalone_html += f'{escaped_title}' else: standalone_html += escaped_title standalone_html += """RSS 订阅内容 """ return html ================================================ FILE: trendradar/storage/__init__.py ================================================ # coding=utf-8 """ 存储模块 - 支持多种存储后端 支持的存储后端: - local: 本地 SQLite + TXT/HTML 文件 - remote: 远程云存储(S3 兼容协议:R2/OSS/COS/S3 等) - auto: 根据环境自动选择(GitHub Actions 用 remote,其他用 local) """ from trendradar.storage.base import ( StorageBackend, NewsItem, NewsData, RSSItem, RSSData, convert_crawl_results_to_news_data, ) from trendradar.storage.sqlite_mixin import SQLiteStorageMixin from trendradar.storage.local import LocalStorageBackend from trendradar.storage.manager import StorageManager, get_storage_manager # 远程后端可选导入(需要 boto3) try: from trendradar.storage.remote import 
RemoteStorageBackend
    HAS_REMOTE = True
except ImportError:
    RemoteStorageBackend = None
    HAS_REMOTE = False

__all__ = [
    # Base classes
    "StorageBackend",
    "NewsItem",
    "NewsData",
    "RSSItem",
    "RSSData",
    # Mixin
    "SQLiteStorageMixin",
    # Conversion helper
    "convert_crawl_results_to_news_data",
    # Backend implementations
    "LocalStorageBackend",
    "RemoteStorageBackend",
    "HAS_REMOTE",
    # Manager
    "StorageManager",
    "get_storage_manager",
]


================================================
FILE: trendradar/storage/ai_filter_schema.sql
================================================
-- Table definitions for AI smart filtering
-- Created in the news database, alongside news_items

-- ============================================
-- AI filter interest tag table
-- Stores structured tags that the AI extracts from the user's interest description
-- Versioned: when the prompt changes, older versions are marked deprecated
-- Supports isolation across multiple interest files (interests_file distinguishes each file's tag set)
-- ============================================
CREATE TABLE IF NOT EXISTS ai_filter_tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tag TEXT NOT NULL,                                        -- tag name, e.g. "AI/LLM"
    description TEXT DEFAULT '',                              -- tag description, consulted during AI classification
    priority INTEGER NOT NULL DEFAULT 9999,                   -- tag priority (lower value = higher priority)
    status TEXT DEFAULT 'active',                             -- active / deprecated
    deprecated_at TEXT,                                       -- deprecation time
    version INTEGER NOT NULL,                                 -- version number, incremented by 1 when the prompt changes
    prompt_hash TEXT NOT NULL,                                -- hash of the interest description file (format: filename:md5)
    interests_file TEXT NOT NULL DEFAULT 'ai_interests.txt',  -- associated interest file name
    created_at TEXT NOT NULL
);

-- ============================================
-- AI filter classification result table
-- One row per news item × tag
-- References news_items.id or rss_items.id (distinguished by source_type)
-- ============================================
CREATE TABLE IF NOT EXISTS ai_filter_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    news_item_id INTEGER NOT NULL,                -- references news_items.id or rss_items.id
    source_type TEXT NOT NULL DEFAULT 'hotlist',  -- hotlist / rss
    tag_id INTEGER NOT NULL,                      -- references ai_filter_tags.id
    relevance_score REAL DEFAULT 0,               -- relevance, 0.0 ~ 1.0
    status TEXT DEFAULT 'active',                 -- active / deprecated
    deprecated_at TEXT,
    created_at TEXT NOT NULL,
    UNIQUE(news_item_id, source_type, tag_id)
);

-- ============================================
-- AI filter analyzed-news record table
-- Records every news item the AI has already analyzed (whether or not it matched)
-- Used for deduplication, so items are not re-sent to the AI and tokens are not wasted
-- ============================================
CREATE TABLE IF NOT EXISTS ai_filter_analyzed_news (
    news_item_id INTEGER NOT NULL,                            -- references news_items.id or rss_items.id
    source_type TEXT NOT NULL DEFAULT 'hotlist',              -- hotlist / rss
    interests_file TEXT NOT NULL DEFAULT 'ai_interests.txt',  -- associated interest file
    prompt_hash TEXT NOT NULL,                                -- hash of the tag set used for the analysis
    matched INTEGER NOT NULL DEFAULT 0,                       -- whether it matched: 0 = no match, 1 = match
    created_at TEXT NOT NULL,
    PRIMARY KEY (news_item_id, source_type, interests_file)
);

-- ============================================
-- Indexes
-- ============================================
CREATE INDEX IF NOT EXISTS idx_ai_filter_tags_status ON ai_filter_tags(status);
CREATE INDEX IF NOT EXISTS idx_ai_filter_tags_version ON ai_filter_tags(version);
CREATE INDEX IF NOT EXISTS idx_ai_filter_tags_file ON ai_filter_tags(interests_file, status);
CREATE INDEX IF NOT EXISTS idx_ai_filter_tags_priority ON ai_filter_tags(interests_file, status, priority);
CREATE INDEX IF NOT EXISTS idx_ai_filter_results_status ON ai_filter_results(status);
CREATE INDEX IF NOT EXISTS idx_ai_filter_results_news ON ai_filter_results(news_item_id, source_type);
CREATE INDEX IF NOT EXISTS idx_ai_filter_results_tag ON ai_filter_results(tag_id);
CREATE INDEX IF NOT EXISTS idx_analyzed_news_lookup ON ai_filter_analyzed_news(source_type, interests_file);
CREATE INDEX IF NOT EXISTS idx_analyzed_news_hash ON ai_filter_analyzed_news(interests_file, prompt_hash);


================================================
FILE: trendradar/storage/base.py
================================================
# coding=utf-8
"""
Storage backend abstract base class and data models

Defines a unified storage interface; every storage backend must implement these methods
"""

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any, Set


@dataclass
class NewsItem:
    """News item data model (hot-list data)"""

    title: str  # News title
    source_id: str  # Source platform ID (e.g. toutiao, baidu)
    source_name: str = ""  # Source platform name (used at runtime, not stored in the database)
    rank: int = 0  # Rank
    url: str = ""  # Link URL
    mobile_url: str = ""  # Mobile URL
    crawl_time: str = ""  # Crawl time (HH:MM format)

    # Statistics (used for analysis)
    ranks: List[int] = field(default_factory=list)  # Historical rank list
    first_time: str = ""  # Time first seen
    last_time: str = ""  # Time last seen
    count: int = 1  # Number of appearances
    rank_timeline: List[Dict[str, Any]] = field(default_factory=list)  # Full rank timeline
    # Format: [{"time": "09:30", "rank": 1}, {"time": "10:00", "rank": 2}, ...]
    # None means the item dropped off the list: [{"time": "11:00", "rank": None}]

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a dict"""
        return {
            "title": self.title,
            "source_id": self.source_id,
            "source_name": self.source_name,
            "rank": self.rank,
            "url": self.url,
            "mobile_url": self.mobile_url,
            "crawl_time": self.crawl_time,
            "ranks": self.ranks,
            "first_time": self.first_time,
            "last_time": self.last_time,
            "count": self.count,
            "rank_timeline": self.rank_timeline,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "NewsItem":
        """Create from a dict"""
        return cls(
            title=data.get("title", ""),
            source_id=data.get("source_id", ""),
            source_name=data.get("source_name", ""),
            rank=data.get("rank", 0),
            url=data.get("url", ""),
            mobile_url=data.get("mobile_url", ""),
            crawl_time=data.get("crawl_time", ""),
            ranks=data.get("ranks", []),
            first_time=data.get("first_time", ""),
            last_time=data.get("last_time", ""),
            count=data.get("count", 1),
            rank_timeline=data.get("rank_timeline", []),
        )


@dataclass
class RSSItem:
    """RSS item data model"""

    title: str  # Title
    feed_id: str  # RSS feed ID (e.g. "hacker-news")
    feed_name: str = ""  # RSS feed name (used at runtime)
    url: str = ""  # Article link
    published_at: str = ""  # RSS publication time (ISO format)
    summary: str = ""  # Summary/description
    author: str = ""  # Author
    crawl_time: str = ""  # Crawl time (HH:MM format)

    # Statistics
    first_time: str = ""  # Time first crawled
    last_time: str = ""  # Time last crawled
    count: int = 1  # Number of times crawled

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a dict"""
        return {
            "title": self.title,
            "feed_id": self.feed_id,
            "feed_name": self.feed_name,
            "url": self.url,
            "published_at": self.published_at,
            "summary": self.summary,
            "author": self.author,
            "crawl_time": self.crawl_time,
            "first_time": self.first_time,
            "last_time": self.last_time,
            "count": self.count,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "RSSItem":
        """Create from a dict"""
        return cls(
            title=data.get("title", ""),
            feed_id=data.get("feed_id", ""),
            feed_name=data.get("feed_name", ""),
            url=data.get("url", ""),
            published_at=data.get("published_at", ""),
            summary=data.get("summary", ""),
            author=data.get("author", ""),
            crawl_time=data.get("crawl_time", ""),
            first_time=data.get("first_time", ""),
            last_time=data.get("last_time", ""),
            count=data.get("count", 1),
        )


@dataclass
class RSSData:
    """
    RSS data collection

    Structure:
    - date: date (YYYY-MM-DD)
    - crawl_time: crawl time (HH:MM)
    - items: RSS items grouped by feed_id
    - id_to_name: mapping from feed_id to name
    - failed_ids: list of failed feed_ids
    """

    date: str  # Date
    crawl_time: str  # Crawl time
    items: Dict[str, List[RSSItem]]  # Items grouped by feed_id
    id_to_name: Dict[str, str] = field(default_factory=dict)  # ID-to-name mapping
    failed_ids: List[str] = field(default_factory=list)  # Failed IDs

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a dict"""
        items_dict = {}
        for feed_id, rss_list in self.items.items():
            items_dict[feed_id] = [item.to_dict() for item in rss_list]

        return {
            "date": self.date,
            "crawl_time": self.crawl_time,
            "items": items_dict,
            "id_to_name": self.id_to_name,
            "failed_ids": self.failed_ids,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "RSSData":
        """Create from a dict"""
        items = {}
        items_data = data.get("items", {})
        for feed_id, rss_list in items_data.items():
            items[feed_id] = [RSSItem.from_dict(item) for item in rss_list]

        return cls(
            date=data.get("date", ""),
            crawl_time=data.get("crawl_time", ""),
            items=items,
            id_to_name=data.get("id_to_name", {}),
            failed_ids=data.get("failed_ids", []),
        )

    def get_total_count(self) -> int:
        """Return the total number of items"""
        return sum(len(rss_list) for rss_list in self.items.values())


@dataclass
class NewsData:
    """
    News data collection

    Structure:
    - date: date (YYYY-MM-DD)
    - crawl_time: crawl time (HH时MM分)
    - items: news items grouped by source ID
    - id_to_name: mapping from source ID to name
    - failed_ids: list of failed source IDs
    """

    date: str  # Date
    crawl_time: str  # Crawl time
    items: Dict[str, List[NewsItem]]  # News grouped by source
    id_to_name: Dict[str, str] = field(default_factory=dict)  # ID-to-name mapping
    failed_ids: List[str] = field(default_factory=list)  # Failed IDs

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a dict"""
        items_dict = {}
        for source_id, news_list in self.items.items():
            items_dict[source_id] = [item.to_dict() for item in news_list]

        return {
            "date": self.date,
            "crawl_time": self.crawl_time,
            "items": items_dict,
            "id_to_name": self.id_to_name,
            "failed_ids": self.failed_ids,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "NewsData":
        """Create from a dict"""
        items = {}
        items_data = data.get("items", {})
        for source_id, news_list in items_data.items():
            items[source_id] = [NewsItem.from_dict(item) for item in news_list]

        return cls(
            date=data.get("date", ""),
            crawl_time=data.get("crawl_time", ""),
            items=items,
            id_to_name=data.get("id_to_name", {}),
            failed_ids=data.get("failed_ids", []),
        )

    def get_total_count(self) -> int:
        """Return the total number of news items"""
        return sum(len(news_list) for news_list in self.items.values())

    def merge_with(self, other: "NewsData") -> "NewsData":
        """
        Merge another NewsData into the current data

        Merge rules:
        - News items with the same source_id + title have their rank histories merged
        - last_time and count are updated
        - The earlier first_time is kept
        """
        merged_items = {}

        # Copy the current data
        for source_id, news_list in self.items.items():
            merged_items[source_id] = {item.title: item for item in news_list}

        # Merge the other data
        for source_id, news_list in other.items.items():
            if source_id not in merged_items:
                merged_items[source_id] = {}

            for item in news_list:
                if item.title in merged_items[source_id]:
                    # Merge into the existing news item
                    existing = merged_items[source_id][item.title]

                    # Merge ranks
                    existing_ranks = set(existing.ranks) if existing.ranks else set()
                    new_ranks = set(item.ranks) if item.ranks else set()
                    merged_ranks = sorted(existing_ranks | new_ranks)
                    existing.ranks = merged_ranks

                    # Update times
                    if item.first_time and (not existing.first_time or item.first_time < existing.first_time):
                        existing.first_time = item.first_time
                    if item.last_time and (not existing.last_time or item.last_time > existing.last_time):
                        existing.last_time = item.last_time

                    # Update the count
                    existing.count += 1

                    # Keep the URL (if there was none before)
                    if not existing.url and item.url:
                        existing.url = item.url
                    if not existing.mobile_url and item.mobile_url:
                        existing.mobile_url = item.mobile_url
                else:
                    # Add the new news item
                    merged_items[source_id][item.title] = item

        # Convert back to list form
        final_items = {}
        for source_id, items_dict in merged_items.items():
            final_items[source_id] = list(items_dict.values())

        # Merge id_to_name
        merged_id_to_name = {**self.id_to_name, **other.id_to_name}

        # Merge failed_ids (deduplicated)
        merged_failed_ids = list(set(self.failed_ids + other.failed_ids))

        return NewsData(
            date=self.date or other.date,
            crawl_time=other.crawl_time,  # Use the newer crawl time
            items=final_items,
            id_to_name=merged_id_to_name,
            failed_ids=merged_failed_ids,
        )


class StorageBackend(ABC):
    """
    Storage backend abstract base class

    Every storage backend must implement these methods in order to support:
    - Saving news data
    - Reading all data for the current day
    - Detecting newly added news
    - Generating report files (TXT/HTML)
    """

    @abstractmethod
    def save_news_data(self, data: NewsData) -> bool:
        """
        Save news data

        Args:
            data: news data

        Returns:
            Whether the save succeeded
        """
        pass

    @abstractmethod
    def get_today_all_data(self, date: Optional[str] = None) -> Optional[NewsData]:
        """
        Get all news data for the given date

        Args:
            date: date string (YYYY-MM-DD), defaults to today

        Returns:
            The merged news data, or None if there is no data
        """
        pass

    @abstractmethod
    def get_latest_crawl_data(self, date: Optional[str] = None) -> Optional[NewsData]:
        """
        Get the data from the most recent crawl

        Args:
            date: date string, defaults to today

        Returns:
            The most recently crawled news data
        """
        pass

    @abstractmethod
    def detect_new_titles(self, current_data: NewsData) -> Dict[str, Dict]:
        """
        Detect newly added titles

        Args:
            current_data: data from the current crawl

        Returns:
            The newly added title data, in the form {source_id: {title: title_data}}
        """
        pass

    @abstractmethod
    def save_txt_snapshot(self, data: NewsData) -> Optional[str]:
        """
        Save a TXT snapshot (optional feature, available in local environments)

        Args:
            data: news data

        Returns:
            The saved file path, or None if unsupported
        """
        pass

    @abstractmethod
    def save_html_report(self, html_content: str, filename: str) -> Optional[str]:
        """
        Save an HTML report

        Args:
            html_content: HTML content
            filename: file name

        Returns:
            The saved file path
        """
        pass

    @abstractmethod
    def is_first_crawl_today(self, date: Optional[str] = None) -> bool:
        """
        Check whether this is the first crawl of the day

        Args:
            date: date string, defaults to today

        Returns:
            Whether this is the first crawl
        """
        pass

    @abstractmethod
    def cleanup(self) -> None:
        """
        Release resources (temporary files, database connections, etc.)
        """
        pass

    @abstractmethod
    def cleanup_old_data(self, retention_days: int) -> int:
        """
        Clean up expired data

        Args:
            retention_days: number of days to retain (0 means no cleanup)

        Returns:
            The number of date directories deleted
        """
        pass

    @property
    @abstractmethod
    def backend_name(self) -> str:
        """
        Storage backend name
        """
        pass

    @property
    @abstractmethod
    def supports_txt(self) -> bool:
        """
        Whether TXT snapshot generation is supported
        """
        pass

    # === Period execution records (scheduling system) ===

    def has_period_executed(self, date_str: str, period_key: str, action: str) -> bool:
        """
        Check whether a given action has already run for the specified period

        Args:
            date_str: date string YYYY-MM-DD
            period_key: period key
            action: action type (analyze / push)

        Returns:
            Whether it has already run
        """
        return False

    def record_period_execution(self, date_str: str, period_key: str, action: str) -> bool:
        """
        Record that an action ran for the period

        Args:
            date_str: date string YYYY-MM-DD
            period_key: period key
            action: action type (analyze / push)

        Returns:
            Whether the record was saved
        """
        return False

    # === AI smart filtering (default implementations; subclasses override via the mixin) ===

    def begin_batch(self) -> None:
        """Enter batch mode (remote backends defer uploads; local backends do nothing)"""
        pass

    def end_batch(self) -> None:
        """Leave batch mode"""
        pass

    def get_active_ai_filter_tags(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> List[Dict]:
        return []

    def get_latest_prompt_hash(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> Optional[str]:
        return None

    def get_latest_ai_filter_tag_version(self, date: Optional[str] = None) -> int:
        return 0

    def deprecate_all_ai_filter_tags(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int:
        return 0

    def save_ai_filter_tags(self, tags: List[Dict], version: int, prompt_hash: str, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int:
        return 0

    def save_ai_filter_results(self, results: List[Dict], date: Optional[str] = None) -> int:
        return 0

    def get_active_ai_filter_results(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> List[Dict]:
        return []

    def deprecate_specific_ai_filter_tags(self, tag_ids: List[int], date: Optional[str] = None) -> int:
        return 0

    def update_ai_filter_tags_hash(self, interests_file: str, new_hash: str, date: Optional[str] = None) -> int:
        return 0

    def update_ai_filter_tag_descriptions(self, tag_updates: List[Dict], date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int:
        return 0

    def update_ai_filter_tag_priorities(self, tag_priorities: List[Dict], date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int:
        return 0

    def save_analyzed_news(self, news_ids: List[str], source_type: str, interests_file: str, prompt_hash: str, matched_ids: Set[str], date: Optional[str] = None) -> int:
        return 0

    def get_analyzed_news_ids(self, source_type: str = "hotlist", date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> Set[str]:
        return set()

    def clear_analyzed_news(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int:
        return 0

    def clear_unmatched_analyzed_news(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int:
        return 0

    def get_all_news_ids(self, date: Optional[str] = None) -> List[Dict]:
        return []

    def get_all_rss_ids(self, date: Optional[str] = None) -> List[Dict]:
        return []


def convert_crawl_results_to_news_data(
    results: Dict[str, Dict],
    id_to_name: Dict[str, str],
    failed_ids: List[str],
    crawl_time: str,
    crawl_date: str,
) -> NewsData:
    """
    Convert crawler results into NewsData form

    Args:
        results: results returned by the crawler {source_id: {title: {ranks: [], url: "", mobileUrl: ""}}}
        id_to_name: mapping from source ID to name
        failed_ids: failed source IDs
        crawl_time: crawl time (HH:MM)
        crawl_date: crawl date (YYYY-MM-DD)

    Returns:
        A NewsData object
    """
    items = {}

    for source_id, titles_data in results.items():
        source_name = id_to_name.get(source_id, source_id)
        news_list = []

        for title, data in titles_data.items():
            ranks = data.get("ranks", [])
            url = data.get("url", "")
            mobile_url = data.get("mobileUrl", "")

            # Fall back to rank 99 when the crawler supplied no ranks
            rank = ranks[0] if ranks else 99

            news_item = NewsItem(
                title=title,
                source_id=source_id,
                source_name=source_name,
                rank=rank,
                url=url,
                mobile_url=mobile_url,
                crawl_time=crawl_time,
                ranks=ranks,
                first_time=crawl_time,
                last_time=crawl_time,
                count=1,
            )
            news_list.append(news_item)

        items[source_id] = news_list

    return NewsData(
        date=crawl_date,
        crawl_time=crawl_time,
        items=items,
        id_to_name=id_to_name,
        failed_ids=failed_ids,
    )


================================================
FILE: trendradar/storage/local.py
================================================
# coding=utf-8
"""
Local storage backend - SQLite + TXT/HTML

Uses SQLite as the primary store, with optional TXT snapshots and HTML reports
"""

import sqlite3
import shutil
import pytz
import re
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional

from trendradar.storage.base import StorageBackend, NewsData, RSSItem, RSSData
from trendradar.storage.sqlite_mixin import SQLiteStorageMixin
from trendradar.utils.time import (
    DEFAULT_TIMEZONE,
    get_configured_time,
    format_date_folder,
    format_time_filename,
)


class LocalStorageBackend(SQLiteStorageMixin, StorageBackend):
    """
    Local storage backend

    Stores news data in SQLite databases and supports:
    - SQLite database files organized by date
    - Optional TXT snapshots (for debugging)
    - HTML report generation
    """

    def __init__(
        self,
        data_dir: str = "output",
        enable_txt: bool = True,
        enable_html: bool = True,
        timezone: str = DEFAULT_TIMEZONE,
    ):
        """
        Initialize the local storage backend

        Args:
            data_dir: data directory path
            enable_txt: whether to enable TXT snapshots
            enable_html: whether to enable HTML reports
            timezone: timezone setting
        """
        self.data_dir = Path(data_dir)
        self.enable_txt = enable_txt
        self.enable_html = enable_html
        self.timezone = timezone
        self._db_connections: Dict[str, sqlite3.Connection] = {}

    @property
    def backend_name(self) -> str:
        return "local"

    @property
    def supports_txt(self) -> bool:
        return self.enable_txt

    # ========================================
    # SQLiteStorageMixin abstract method implementations
    # ========================================

    def _get_configured_time(self) -> datetime:
        """Return the current time in the configured timezone"""
        return get_configured_time(self.timezone)

    def _format_date_folder(self, date: Optional[str] = None) -> str:
        """Format the date folder name (ISO format: YYYY-MM-DD)"""
        return format_date_folder(date, self.timezone)

    def _format_time_filename(self) -> str:
        """Format the time file name (format: HH-MM)"""
        return format_time_filename(self.timezone)

    def _get_db_path(self, date: Optional[str] = None, db_type: str = "news") -> Path:
        """
        Get the SQLite database path

        New (flat) layout: output/{type}/{date}.db
        - output/news/2025-12-28.db
        - output/rss/2025-12-28.db

        Args:
            date: date string
            db_type: database type ("news" or "rss")

        Returns:
            The database file path
        """
        date_str = self._format_date_folder(date)
        db_dir = self.data_dir / db_type
        db_dir.mkdir(parents=True, exist_ok=True)
        return db_dir / f"{date_str}.db"

    def _get_connection(self, date: Optional[str] = None, db_type: str = "news") -> sqlite3.Connection:
        """
        Get a database connection (cached)

        Args:
            date: date string
            db_type: database type ("news" or "rss")

        Returns:
            The database connection
        """
        db_path = str(self._get_db_path(date, db_type))

        if db_path not in self._db_connections:
            conn = sqlite3.connect(db_path)
            conn.row_factory = sqlite3.Row
            self._init_tables(conn, db_type)
            self._db_connections[db_path] = conn

        return self._db_connections[db_path]

    # ========================================
    # StorageBackend interface implementation (delegated to the mixin)
    # ========================================

    def save_news_data(self, data: NewsData) -> bool:
        """Save news data to SQLite"""
        db_path = self._get_db_path(data.date)
        if not db_path.exists():
            # Make sure the directory exists
            db_path.parent.mkdir(parents=True, exist_ok=True)

        success, new_count, updated_count, title_changed_count, off_list_count = \
            self._save_news_data_impl(data, "[本地存储]")

        if success:
            # Print a detailed storage statistics log
            log_parts = [f"[本地存储] 处理完成:新增 {new_count} 条"]
            if updated_count > 0:
                log_parts.append(f"更新 {updated_count} 条")
            if title_changed_count > 0:
                log_parts.append(f"标题变更 {title_changed_count} 条")
            if off_list_count > 0:
                log_parts.append(f"脱榜 {off_list_count} 条")
            print(",".join(log_parts))

        return success

    def get_today_all_data(self, date: Optional[str] = None) -> Optional[NewsData]:
        """Get all news data for the given date (merged)"""
        db_path = self._get_db_path(date)
        if not db_path.exists():
            return None
        return self._get_today_all_data_impl(date)

    def get_latest_crawl_data(self, date: Optional[str] = None) -> Optional[NewsData]:
        """Get the data from the most recent crawl"""
        db_path = self._get_db_path(date)
        if not db_path.exists():
            return None
        return self._get_latest_crawl_data_impl(date)

    def detect_new_titles(self, current_data: NewsData) -> Dict[str, Dict]:
        """Detect newly added titles"""
        return self._detect_new_titles_impl(current_data)

    def is_first_crawl_today(self, date: Optional[str] = None) -> bool:
        """Check whether this is the first crawl of the day"""
        db_path = self._get_db_path(date)
        if not db_path.exists():
            return True
        return self._is_first_crawl_today_impl(date)

    def get_crawl_times(self, date: Optional[str] = None) -> List[str]:
        """Get the list of all crawl times for the given date"""
        db_path = self._get_db_path(date)
        if not db_path.exists():
            return []
        return self._get_crawl_times_impl(date)

    # ========================================
    # Period execution records (scheduling system)
    # ========================================

    def has_period_executed(self, date_str: str, period_key: str, action: str) -> bool:
        """Check whether a given action has already run for the specified period"""
        return self._has_period_executed_impl(date_str, period_key, action)

    def record_period_execution(self, date_str: str, period_key: str, action: str) -> bool:
        """Record that an action ran for the period"""
        success = self._record_period_execution_impl(date_str, period_key, action)
        if success:
            now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S")
            print(f"[本地存储] 时间段执行记录已保存: {period_key}/{action} at {now_str}")
        return success

    # ========================================
    # RSS data storage methods
    # ========================================

    def save_rss_data(self, data: RSSData) -> bool:
        """Save RSS data to SQLite"""
        success, new_count, updated_count = self._save_rss_data_impl(data, "[本地存储]")

        if success:
            # Print a statistics log
            log_parts = [f"[本地存储] RSS 处理完成:新增 {new_count} 条"]
            if updated_count > 0:
                log_parts.append(f"更新 {updated_count} 条")
            print(",".join(log_parts))

        return success

    def get_rss_data(self, date: Optional[str] = None) -> Optional[RSSData]:
        """Get all RSS data for the given date"""
        return self._get_rss_data_impl(date)

    def detect_new_rss_items(self, current_data: RSSData) -> Dict[str, List[RSSItem]]:
        """Detect newly added RSS items"""
        return self._detect_new_rss_items_impl(current_data)

    def get_latest_rss_data(self, date: Optional[str] = None) -> Optional[RSSData]:
        """Get the RSS data from the most recent crawl"""
        db_path = self._get_db_path(date, db_type="rss")
        if not db_path.exists():
            return None
        return self._get_latest_rss_data_impl(date)

    # ========================================
    # AI smart filtering
    # ========================================

    def get_active_ai_filter_tags(self, date=None, interests_file="ai_interests.txt"):
        return self._get_active_tags_impl(date, interests_file)

    def get_latest_prompt_hash(self, date=None, interests_file="ai_interests.txt"):
        return self._get_latest_prompt_hash_impl(date, interests_file)

    def get_latest_ai_filter_tag_version(self, date=None):
        return self._get_latest_tag_version_impl(date)

    def deprecate_all_ai_filter_tags(self, date=None, interests_file="ai_interests.txt"):
        return self._deprecate_all_tags_impl(date, interests_file)

    def save_ai_filter_tags(self, tags, version, prompt_hash, date=None, interests_file="ai_interests.txt"):
        return self._save_tags_impl(date, tags, version, prompt_hash, interests_file)

    def save_ai_filter_results(self, results, date=None):
        return self._save_filter_results_impl(date, results)

    def get_active_ai_filter_results(self, date=None, interests_file="ai_interests.txt"):
        return self._get_active_filter_results_impl(date, interests_file)

    def deprecate_specific_ai_filter_tags(self, tag_ids, date=None):
        return self._deprecate_specific_tags_impl(date, tag_ids)

    def update_ai_filter_tags_hash(self, interests_file, new_hash, date=None):
        return self._update_tags_hash_impl(date, interests_file, new_hash)

    def update_ai_filter_tag_descriptions(self, tag_updates, date=None, interests_file="ai_interests.txt"):
        return self._update_tag_descriptions_impl(date, tag_updates, interests_file)

    def update_ai_filter_tag_priorities(self, tag_priorities, date=None, interests_file="ai_interests.txt"):
        return self._update_tag_priorities_impl(date, tag_priorities, interests_file)

    def save_analyzed_news(self, news_ids, source_type, interests_file, prompt_hash, matched_ids, date=None):
        return self._save_analyzed_news_impl(date, news_ids, source_type, interests_file, prompt_hash, matched_ids)

    def get_analyzed_news_ids(self, source_type="hotlist", date=None, interests_file="ai_interests.txt"):
        return self._get_analyzed_news_ids_impl(date, source_type, interests_file)

    def clear_analyzed_news(self, date=None, interests_file="ai_interests.txt"):
        return self._clear_analyzed_news_impl(date, interests_file)

    def clear_unmatched_analyzed_news(self, date=None, interests_file="ai_interests.txt"):
        return self._clear_unmatched_analyzed_news_impl(date, interests_file)

    def get_all_news_ids(self, date=None):
        return self._get_all_news_ids_impl(date)

    def get_all_rss_ids(self, date=None):
        return self._get_all_rss_ids_impl(date)

    # ========================================
    # Local-only features: TXT/HTML snapshots
    # ========================================

    def save_txt_snapshot(self, data: NewsData) -> Optional[str]:
        """
        Save a TXT snapshot

        New layout: output/txt/{date}/{time}.txt

        Args:
            data: news data

        Returns:
            The saved file path
        """
        if not self.enable_txt:
            return None

        try:
            date_folder = self._format_date_folder(data.date)
            txt_dir = self.data_dir / "txt" / date_folder
            txt_dir.mkdir(parents=True, exist_ok=True)

            file_path = txt_dir / f"{data.crawl_time}.txt"

            with open(file_path, "w", encoding="utf-8") as f:
                for source_id, news_list in data.items.items():
                    source_name = data.id_to_name.get(source_id, source_id)

                    # Write the source header
                    if source_name and source_name != source_id:
                        f.write(f"{source_id} | {source_name}\n")
                    else:
                        f.write(f"{source_id}\n")

                    # Sort by rank
                    sorted_news = sorted(news_list, key=lambda x: x.rank)

                    for item in sorted_news:
                        line = f"{item.rank}. 
{item.title}"
                        if item.url:
                            line += f" [URL:{item.url}]"
                        if item.mobile_url:
                            line += f" [MOBILE:{item.mobile_url}]"
                        f.write(line + "\n")

                    f.write("\n")

                # Write the sources that failed
                if data.failed_ids:
                    f.write("==== 以下ID请求失败 ====\n")
                    for failed_id in data.failed_ids:
                        f.write(f"{failed_id}\n")

            print(f"[本地存储] TXT 快照已保存: {file_path}")
            return str(file_path)

        except Exception as e:
            print(f"[本地存储] 保存 TXT 快照失败: {e}")
            return None

    def save_html_report(self, html_content: str, filename: str) -> Optional[str]:
        """
        Save an HTML report.

        New layout: output/html/{date}/{filename}

        Args:
            html_content: HTML content
            filename: File name

        Returns:
            Path of the saved file
        """
        if not self.enable_html:
            return None

        try:
            date_folder = self._format_date_folder()
            html_dir = self.data_dir / "html" / date_folder
            html_dir.mkdir(parents=True, exist_ok=True)

            file_path = html_dir / filename

            with open(file_path, "w", encoding="utf-8") as f:
                f.write(html_content)

            print(f"[本地存储] HTML 报告已保存: {file_path}")
            return str(file_path)

        except Exception as e:
            print(f"[本地存储] 保存 HTML 报告失败: {e}")
            return None

    # ========================================
    # Local-only feature: resource cleanup
    # ========================================

    def cleanup(self) -> None:
        """Release resources (close database connections)."""
        for db_path, conn in self._db_connections.items():
            try:
                conn.close()
                print(f"[本地存储] 关闭数据库连接: {db_path}")
            except Exception as e:
                print(f"[本地存储] 关闭连接失败 {db_path}: {e}")

        self._db_connections.clear()

    def cleanup_old_data(self, retention_days: int) -> int:
        """
        Delete expired data.

        Cleanup logic for the new layout:
        - output/news/{date}.db  -> delete expired .db files
        - output/rss/{date}.db   -> delete expired .db files
        - output/txt/{date}/     -> delete expired date directories
        - output/html/{date}/    -> delete expired date directories

        Args:
            retention_days: Number of days to keep (0 disables cleanup)

        Returns:
            Number of files/directories deleted
        """
        if retention_days <= 0:
            return 0

        deleted_count = 0
        cutoff_date = self._get_configured_time() - timedelta(days=retention_days)

        def parse_date_from_name(name: str) -> Optional[datetime]:
            """Parse a date from a file or directory name (ISO format: YYYY-MM-DD)."""
            # Strip the .db suffix
            name = name.replace('.db', '')
            try:
                date_match = re.match(r'(\d{4})-(\d{2})-(\d{2})', name)
                if date_match:
                    # localize() attaches the zone's actual UTC offset; passing a pytz
                    # zone directly as tzinfo= would apply its historical LMT offset.
                    return pytz.timezone(self.timezone).localize(
                        datetime(
                            int(date_match.group(1)),
                            int(date_match.group(2)),
                            int(date_match.group(3)),
                        )
                    )
            except Exception:
                pass
            return None

        try:
            if not self.data_dir.exists():
                return 0

            # Clean database files (news/, rss/)
            for db_type in ["news", "rss"]:
                db_dir = self.data_dir / db_type
                if not db_dir.exists():
                    continue

                for db_file in db_dir.glob("*.db"):
                    file_date = parse_date_from_name(db_file.name)
                    if file_date and file_date < cutoff_date:
                        # Close the database connection first
                        db_path = str(db_file)
                        if db_path in self._db_connections:
                            try:
                                self._db_connections[db_path].close()
                                del self._db_connections[db_path]
                            except Exception:
                                pass

                        # Delete the file
                        try:
                            db_file.unlink()
                            deleted_count += 1
                            print(f"[本地存储] 清理过期数据: {db_type}/{db_file.name}")
                        except Exception as e:
                            print(f"[本地存储] 删除文件失败 {db_file}: {e}")

            # Clean snapshot directories (txt/, html/)
            for snapshot_type in ["txt", "html"]:
                snapshot_dir = self.data_dir / snapshot_type
                if not snapshot_dir.exists():
                    continue

                for date_folder in snapshot_dir.iterdir():
                    if not date_folder.is_dir() or date_folder.name.startswith('.'):
                        continue

                    folder_date = parse_date_from_name(date_folder.name)
                    if folder_date and folder_date < cutoff_date:
                        try:
                            shutil.rmtree(date_folder)
                            deleted_count += 1
                            print(f"[本地存储] 清理过期数据: {snapshot_type}/{date_folder.name}")
                        except Exception as e:
                            print(f"[本地存储] 删除目录失败 {date_folder}: {e}")

            if deleted_count > 0:
                print(f"[本地存储] 共清理 {deleted_count} 个过期文件/目录")

            return deleted_count

        except Exception as e:
            print(f"[本地存储] 清理过期数据失败: {e}")
            return deleted_count

    def __del__(self):
        """Destructor; makes sure connections are closed."""
        self.cleanup()


================================================
FILE: trendradar/storage/manager.py
================================================
# coding=utf-8
"""
Storage manager - unified management of storage backends

Automatically selects a suitable storage backend based on the environment and configuration
"""

import os
from typing import Optional

from trendradar.storage.base import StorageBackend, NewsData, RSSData
from trendradar.utils.time import DEFAULT_TIMEZONE


# Storage manager singleton
_storage_manager: Optional["StorageManager"] = None


class StorageManager:
    """
    Storage manager

    Features:
    - 自动检测运行环境(GitHub 
Actions / Docker / 本地)
    - Selects the storage backend based on configuration (local / remote / auto)
    - Provides a unified storage interface
    - Supports pulling data from remote storage to local
    """

    def __init__(
        self,
        backend_type: str = "auto",
        data_dir: str = "output",
        enable_txt: bool = True,
        enable_html: bool = True,
        remote_config: Optional[dict] = None,
        local_retention_days: int = 0,
        remote_retention_days: int = 0,
        pull_enabled: bool = False,
        pull_days: int = 0,
        timezone: str = DEFAULT_TIMEZONE,
    ):
        """
        Initialize the storage manager.

        Args:
            backend_type: Storage backend type (local / remote / auto)
            data_dir: Local data directory
            enable_txt: Whether to enable TXT snapshots
            enable_html: Whether to enable HTML reports
            remote_config: Remote storage configuration (endpoint_url, bucket_name, access_key_id, etc.)
            local_retention_days: Days to retain local data (0 = unlimited)
            remote_retention_days: Days to retain remote data (0 = unlimited)
            pull_enabled: Whether to pull automatically at startup
            pull_days: Pull data from the most recent N days
            timezone: Time zone setting
        """
        self.backend_type = backend_type
        self.data_dir = data_dir
        self.enable_txt = enable_txt
        self.enable_html = enable_html
        self.remote_config = remote_config or {}
        self.local_retention_days = local_retention_days
        self.remote_retention_days = remote_retention_days
        self.pull_enabled = pull_enabled
        self.pull_days = pull_days
        self.timezone = timezone

        self._backend: Optional[StorageBackend] = None
        self._remote_backend: Optional[StorageBackend] = None

    @staticmethod
    def is_github_actions() -> bool:
        """Detect whether the code is running in GitHub Actions."""
        return os.environ.get("GITHUB_ACTIONS") == "true"

    @staticmethod
    def is_docker() -> bool:
        """Detect whether the code is running inside a Docker container."""
        # Method 1: check for the /.dockerenv file
        if os.path.exists("/.dockerenv"):
            return True

        # Method 2: check cgroup (Linux)
        try:
            with open("/proc/1/cgroup", "r") as f:
                # Only a positive match is conclusive; otherwise fall through
                # to method 3 instead of returning False here.
                if "docker" in f.read():
                    return True
        except (FileNotFoundError, PermissionError):
            pass

        # Method 3: check the environment variable
        return os.environ.get("DOCKER_CONTAINER") == "true"

    def _resolve_backend_type(self) -> str:
        """Resolve the backend type that is actually used."""
        if self.backend_type == "auto":
            if self.is_github_actions():
                # GitHub Actions environment: check whether remote storage is configured
                if self._has_remote_config():
                    return "remote"
                else:
                    print("[存储管理器] GitHub Actions 环境但未配置远程存储,使用本地存储")
                    return "local"
            else:
                return "local"
        return self.backend_type

    def 
_has_remote_config(self) -> bool: """检查是否有有效的远程存储配置""" # 检查配置或环境变量 bucket_name = self.remote_config.get("bucket_name") or os.environ.get("S3_BUCKET_NAME") access_key = self.remote_config.get("access_key_id") or os.environ.get("S3_ACCESS_KEY_ID") secret_key = self.remote_config.get("secret_access_key") or os.environ.get("S3_SECRET_ACCESS_KEY") endpoint = self.remote_config.get("endpoint_url") or os.environ.get("S3_ENDPOINT_URL") # 调试日志 has_config = bool(bucket_name and access_key and secret_key and endpoint) if not has_config: print(f"[存储管理器] 远程存储配置检查失败:") print(f" - bucket_name: {'已配置' if bucket_name else '未配置'}") print(f" - access_key_id: {'已配置' if access_key else '未配置'}") print(f" - secret_access_key: {'已配置' if secret_key else '未配置'}") print(f" - endpoint_url: {'已配置' if endpoint else '未配置'}") return has_config def _create_remote_backend(self) -> Optional[StorageBackend]: """创建远程存储后端""" try: from trendradar.storage.remote import RemoteStorageBackend return RemoteStorageBackend( bucket_name=self.remote_config.get("bucket_name") or os.environ.get("S3_BUCKET_NAME", ""), access_key_id=self.remote_config.get("access_key_id") or os.environ.get("S3_ACCESS_KEY_ID", ""), secret_access_key=self.remote_config.get("secret_access_key") or os.environ.get("S3_SECRET_ACCESS_KEY", ""), endpoint_url=self.remote_config.get("endpoint_url") or os.environ.get("S3_ENDPOINT_URL", ""), region=self.remote_config.get("region") or os.environ.get("S3_REGION", ""), enable_txt=self.enable_txt, enable_html=self.enable_html, timezone=self.timezone, ) except ImportError as e: print(f"[存储管理器] 远程后端导入失败: {e}") print("[存储管理器] 请确保已安装 boto3: pip install boto3") return None except Exception as e: print(f"[存储管理器] 远程后端初始化失败: {e}") return None def get_backend(self) -> StorageBackend: """获取存储后端实例""" if self._backend is None: resolved_type = self._resolve_backend_type() if resolved_type == "remote": self._backend = self._create_remote_backend() if self._backend: print(f"[存储管理器] 使用远程存储后端") else: print("[存储管理器] 
回退到本地存储") resolved_type = "local" if resolved_type == "local" or self._backend is None: from trendradar.storage.local import LocalStorageBackend self._backend = LocalStorageBackend( data_dir=self.data_dir, enable_txt=self.enable_txt, enable_html=self.enable_html, timezone=self.timezone, ) print(f"[存储管理器] 使用本地存储后端 (数据目录: {self.data_dir})") return self._backend def pull_from_remote(self) -> int: """ 从远程拉取数据到本地 Returns: 成功拉取的文件数量 """ if not self.pull_enabled or self.pull_days <= 0: return 0 if not self._has_remote_config(): print("[存储管理器] 未配置远程存储,无法拉取") return 0 # 创建远程后端(如果还没有) if self._remote_backend is None: self._remote_backend = self._create_remote_backend() if self._remote_backend is None: print("[存储管理器] 无法创建远程后端,拉取失败") return 0 # 调用拉取方法 return self._remote_backend.pull_recent_days(self.pull_days, self.data_dir) def save_news_data(self, data: NewsData) -> bool: """保存新闻数据""" return self.get_backend().save_news_data(data) def save_rss_data(self, data: RSSData) -> bool: """保存 RSS 数据""" return self.get_backend().save_rss_data(data) def get_rss_data(self, date: Optional[str] = None) -> Optional[RSSData]: """获取指定日期的所有 RSS 数据(当日汇总模式)""" return self.get_backend().get_rss_data(date) def get_latest_rss_data(self, date: Optional[str] = None) -> Optional[RSSData]: """获取最新一次抓取的 RSS 数据(当前榜单模式)""" return self.get_backend().get_latest_rss_data(date) def detect_new_rss_items(self, current_data: RSSData) -> dict: """检测新增的 RSS 条目(增量模式)""" return self.get_backend().detect_new_rss_items(current_data) def get_today_all_data(self, date: Optional[str] = None) -> Optional[NewsData]: """获取当天所有数据""" return self.get_backend().get_today_all_data(date) def get_latest_crawl_data(self, date: Optional[str] = None) -> Optional[NewsData]: """获取最新抓取数据""" return self.get_backend().get_latest_crawl_data(date) def detect_new_titles(self, current_data: NewsData) -> dict: """检测新增标题""" return self.get_backend().detect_new_titles(current_data) def save_txt_snapshot(self, data: NewsData) -> Optional[str]: 
        """Save a TXT snapshot."""
        return self.get_backend().save_txt_snapshot(data)

    def save_html_report(self, html_content: str, filename: str) -> Optional[str]:
        """Save an HTML report."""
        return self.get_backend().save_html_report(html_content, filename)

    def is_first_crawl_today(self, date: Optional[str] = None) -> bool:
        """Check whether this is the first crawl of the day."""
        return self.get_backend().is_first_crawl_today(date)

    def cleanup(self) -> None:
        """Release resources."""
        if self._backend:
            self._backend.cleanup()
        if self._remote_backend:
            self._remote_backend.cleanup()

    def cleanup_old_data(self) -> int:
        """
        Delete expired data.

        Returns:
            Total number of items deleted, as reported by the backends
            (files and date directories)
        """
        total_deleted = 0

        # Clean local data
        if self.local_retention_days > 0:
            total_deleted += self.get_backend().cleanup_old_data(self.local_retention_days)

        # Clean remote data (if configured)
        if self.remote_retention_days > 0 and self._has_remote_config():
            if self._remote_backend is None:
                self._remote_backend = self._create_remote_backend()
            if self._remote_backend:
                total_deleted += self._remote_backend.cleanup_old_data(self.remote_retention_days)

        return total_deleted

    @property
    def backend_name(self) -> str:
        """Name of the current backend."""
        return self.get_backend().backend_name

    @property
    def supports_txt(self) -> bool:
        """Whether TXT snapshots are supported."""
        return self.get_backend().supports_txt

    def has_period_executed(self, date_str: str, period_key: str, action: str) -> bool:
        """Check whether an action has already run for the given period."""
        return self.get_backend().has_period_executed(date_str, period_key, action)

    def record_period_execution(self, date_str: str, period_key: str, action: str) -> bool:
        """Record that an action ran for the given period."""
        return self.get_backend().record_period_execution(date_str, period_key, action)

    # === AI smart-filter storage operations ===

    def begin_batch(self):
        """Begin batch mode (the remote backend defers uploads)."""
        self.get_backend().begin_batch()

    def end_batch(self):
        """End batch mode (upload all dirty databases in one pass)."""
        self.get_backend().end_batch()

    def get_active_ai_filter_tags(self, date=None, interests_file="ai_interests.txt"):
        """Get the active tags for the given interests file."""
        return self.get_backend().get_active_ai_filter_tags(date, interests_file)

    def get_latest_prompt_hash(self, date=None, 
interests_file="ai_interests.txt"): """获取指定兴趣文件的最新 prompt_hash""" return self.get_backend().get_latest_prompt_hash(date, interests_file) def get_latest_ai_filter_tag_version(self, date=None): """获取最新标签版本号""" return self.get_backend().get_latest_ai_filter_tag_version(date) def deprecate_all_ai_filter_tags(self, date=None, interests_file="ai_interests.txt"): """废弃指定兴趣文件的 active 标签和分类结果""" return self.get_backend().deprecate_all_ai_filter_tags(date, interests_file) def save_ai_filter_tags(self, tags, version, prompt_hash, date=None, interests_file="ai_interests.txt"): """保存新提取的标签""" return self.get_backend().save_ai_filter_tags(tags, version, prompt_hash, date, interests_file) def save_ai_filter_results(self, results, date=None): """保存分类结果""" return self.get_backend().save_ai_filter_results(results, date) def get_active_ai_filter_results(self, date=None, interests_file="ai_interests.txt"): """获取指定兴趣文件的 active 分类结果""" return self.get_backend().get_active_ai_filter_results(date, interests_file) def deprecate_specific_ai_filter_tags(self, tag_ids, date=None): """废弃指定 ID 的标签及其关联分类结果""" return self.get_backend().deprecate_specific_ai_filter_tags(tag_ids, date) def update_ai_filter_tags_hash(self, interests_file, new_hash, date=None): """更新指定兴趣文件所有 active 标签的 prompt_hash""" return self.get_backend().update_ai_filter_tags_hash(interests_file, new_hash, date) def update_ai_filter_tag_descriptions(self, tag_updates, date=None, interests_file="ai_interests.txt"): """按 tag 名匹配,更新 active 标签的 description""" return self.get_backend().update_ai_filter_tag_descriptions(tag_updates, date, interests_file) def update_ai_filter_tag_priorities(self, tag_priorities, date=None, interests_file="ai_interests.txt"): """按 tag 名匹配,更新 active 标签的 priority""" return self.get_backend().update_ai_filter_tag_priorities(tag_priorities, date, interests_file) def save_analyzed_news(self, news_ids, source_type, interests_file, prompt_hash, matched_ids, date=None): """批量记录已分析的新闻(匹配与不匹配都记录)""" return 
self.get_backend().save_analyzed_news(news_ids, source_type, interests_file, prompt_hash, matched_ids, date) def get_analyzed_news_ids(self, source_type="hotlist", date=None, interests_file="ai_interests.txt"): """获取已分析过的新闻 ID 集合""" return self.get_backend().get_analyzed_news_ids(source_type, date, interests_file) def clear_analyzed_news(self, date=None, interests_file="ai_interests.txt"): """清除指定兴趣文件的所有已分析记录""" return self.get_backend().clear_analyzed_news(date, interests_file) def clear_unmatched_analyzed_news(self, date=None, interests_file="ai_interests.txt"): """清除不匹配的已分析记录""" return self.get_backend().clear_unmatched_analyzed_news(date, interests_file) def get_all_news_ids(self, date=None): """获取所有新闻 ID 和标题""" return self.get_backend().get_all_news_ids(date) def get_all_rss_ids(self, date=None): """获取所有 RSS ID 和标题""" return self.get_backend().get_all_rss_ids(date) def get_storage_manager( backend_type: str = "auto", data_dir: str = "output", enable_txt: bool = True, enable_html: bool = True, remote_config: Optional[dict] = None, local_retention_days: int = 0, remote_retention_days: int = 0, pull_enabled: bool = False, pull_days: int = 0, timezone: str = DEFAULT_TIMEZONE, force_new: bool = False, ) -> StorageManager: """ 获取存储管理器单例 Args: backend_type: 存储后端类型 data_dir: 本地数据目录 enable_txt: 是否启用 TXT 快照 enable_html: 是否启用 HTML 报告 remote_config: 远程存储配置 local_retention_days: 本地数据保留天数(0 = 无限制) remote_retention_days: 远程数据保留天数(0 = 无限制) pull_enabled: 是否启用启动时自动拉取 pull_days: 拉取最近 N 天的数据 timezone: 时区配置 force_new: 是否强制创建新实例 Returns: StorageManager 实例 """ global _storage_manager if _storage_manager is None or force_new: _storage_manager = StorageManager( backend_type=backend_type, data_dir=data_dir, enable_txt=enable_txt, enable_html=enable_html, remote_config=remote_config, local_retention_days=local_retention_days, remote_retention_days=remote_retention_days, pull_enabled=pull_enabled, pull_days=pull_days, timezone=timezone, ) return _storage_manager 
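
The backend selection in `StorageManager` above can be summarized in a small standalone sketch. This is an illustration only, not part of the package: `resolve_backend_type` and `has_remote_config` are hypothetical free functions, and the environment is passed in as a plain mapping so the logic can be exercised without touching `os.environ`. The environment variable names are the ones read by `_has_remote_config`.

```python
import os
from typing import Mapping, Optional

# Config key -> environment variable fallback, as read by _has_remote_config.
REQUIRED_KEYS = {
    "bucket_name": "S3_BUCKET_NAME",
    "access_key_id": "S3_ACCESS_KEY_ID",
    "secret_access_key": "S3_SECRET_ACCESS_KEY",
    "endpoint_url": "S3_ENDPOINT_URL",
}


def has_remote_config(remote_config: Mapping[str, str], env: Mapping[str, str]) -> bool:
    """Return True only if every required setting comes from config or env."""
    return all(
        bool(remote_config.get(key) or env.get(env_name))
        for key, env_name in REQUIRED_KEYS.items()
    )


def resolve_backend_type(
    backend_type: str,
    remote_config: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical mirror of StorageManager._resolve_backend_type."""
    env = os.environ if env is None else env
    remote_config = remote_config or {}

    # An explicit choice ("local" or "remote") is passed through unchanged.
    if backend_type != "auto":
        return backend_type

    # "auto" picks remote only on GitHub Actions with a complete remote config.
    if env.get("GITHUB_ACTIONS") == "true" and has_remote_config(remote_config, env):
        return "remote"
    return "local"
```

Note that `get_backend()` adds one more step that this sketch leaves out: if the remote backend cannot be created (for example when boto3 is missing), it falls back to the local backend.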
================================================
FILE: trendradar/storage/remote.py
================================================
# coding=utf-8
"""
Remote storage backend (S3-compatible protocol)

Supports Cloudflare R2, Alibaba Cloud OSS, Tencent Cloud COS, AWS S3, MinIO, and others
Uses an S3-compatible API (boto3) to access object storage
Data flow: download the day's SQLite file → merge new data → upload back to remote storage
"""

import pytz
import re
import shutil
import sys
import tempfile
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional

try:
    import boto3
    from botocore.config import Config as BotoConfig
    from botocore.exceptions import ClientError
    HAS_BOTO3 = True
except ImportError:
    HAS_BOTO3 = False
    boto3 = None
    BotoConfig = None
    ClientError = Exception

from trendradar.storage.base import StorageBackend, NewsData, RSSItem, RSSData
from trendradar.storage.sqlite_mixin import SQLiteStorageMixin
from trendradar.utils.time import (
    DEFAULT_TIMEZONE,
    get_configured_time,
    format_date_folder,
    format_time_filename,
)


class RemoteStorageBackend(SQLiteStorageMixin, StorageBackend):
    """
    Remote cloud storage backend (S3-compatible protocol)

    Characteristics:
    - Accesses remote storage through an S3-compatible API
    - Supports Cloudflare R2, Alibaba Cloud OSS, Tencent Cloud COS, AWS S3, MinIO, and others
    - Downloads SQLite files to a temporary directory and works on them there
    - Supports merging data and uploading it back
    - Supports pulling historical data from remote storage to local
    - Cleans up temporary files automatically when the run ends
    """

    def __init__(
        self,
        bucket_name: str,
        access_key_id: str,
        secret_access_key: str,
        endpoint_url: str,
        region: str = "",
        enable_txt: bool = False,  # Remote mode does not generate TXT by default
        enable_html: bool = True,
        temp_dir: Optional[str] = None,
        timezone: str = DEFAULT_TIMEZONE,
    ):
        """
        Initialize the remote storage backend.

        Args:
            bucket_name: Bucket name
            access_key_id: Access key ID
            secret_access_key: Secret access key
            endpoint_url: Service endpoint URL
            region: Region (optional; required by some providers)
            enable_txt: Whether to enable TXT snapshots (off by default)
            enable_html: Whether to enable HTML reports
            temp_dir: Temporary directory path (defaults to the system temp directory)
            timezone: Time zone setting
        """
        if not HAS_BOTO3:
            raise ImportError("远程存储后端需要安装 boto3: pip install boto3")

        self.bucket_name = bucket_name
        self.endpoint_url = endpoint_url
        self.region = region
        self.enable_txt = enable_txt
        self.enable_html = enable_html
        self.timezone = timezone

        # Create the temporary directory
        self.temp_dir = Path(temp_dir) if temp_dir else 
Path(tempfile.mkdtemp(prefix="trendradar_")) self.temp_dir.mkdir(parents=True, exist_ok=True) # 初始化 S3 客户端 # 使用 virtual-hosted style addressing(主流) # 根据服务商选择签名版本: # - 腾讯云 COS 和 阿里云 OSS 使用 SigV2 以避免 chunked encoding 问题 # - 其他服务商(AWS S3、Cloudflare R2、MinIO 等)默认使用 SigV4 use_sigv2 = "myqcloud.com" in endpoint_url.lower() or "aliyuncs.com" in endpoint_url.lower() signature_version = 's3' if use_sigv2 else 's3v4' s3_config = BotoConfig( s3={"addressing_style": "virtual"}, signature_version=signature_version, ) client_kwargs = { "endpoint_url": endpoint_url, "aws_access_key_id": access_key_id, "aws_secret_access_key": secret_access_key, "config": s3_config, } if region: client_kwargs["region_name"] = region self.s3_client = boto3.client("s3", **client_kwargs) # 跟踪下载的文件(用于清理) self._downloaded_files: List[Path] = [] self._db_connections: Dict[str, sqlite3.Connection] = {} # 批量模式:延迟上传,避免频繁上传同一文件 self._batch_mode = False self._batch_dirty: set = set() # 待上传的 (date, db_type) 集合 print(f"[远程存储] 初始化完成,存储桶: {bucket_name},签名版本: {signature_version}") @property def backend_name(self) -> str: return "remote" @property def supports_txt(self) -> bool: return self.enable_txt # ======================================== # SQLiteStorageMixin 抽象方法实现 # ======================================== def _get_configured_time(self) -> datetime: """获取配置时区的当前时间""" return get_configured_time(self.timezone) def _format_date_folder(self, date: Optional[str] = None) -> str: """格式化日期文件夹名 (ISO 格式: YYYY-MM-DD)""" return format_date_folder(date, self.timezone) def _format_time_filename(self) -> str: """格式化时间文件名 (格式: HH-MM)""" return format_time_filename(self.timezone) def _get_remote_db_key(self, date: Optional[str] = None, db_type: str = "news") -> str: """ 获取远程存储中 SQLite 文件的对象键 Args: date: 日期字符串 db_type: 数据库类型 ("news" 或 "rss") Returns: 远程对象键,如 "news/2025-12-28.db" 或 "rss/2025-12-28.db" """ date_folder = self._format_date_folder(date) return f"{db_type}/{date_folder}.db" def _get_local_db_path(self, date: 
Optional[str] = None, db_type: str = "news") -> Path: """ 获取本地临时 SQLite 文件路径 Args: date: 日期字符串 db_type: 数据库类型 ("news" 或 "rss") Returns: 本地临时文件路径 """ date_folder = self._format_date_folder(date) db_dir = self.temp_dir / db_type db_dir.mkdir(parents=True, exist_ok=True) return db_dir / f"{date_folder}.db" def _check_object_exists(self, r2_key: str) -> bool: """ 检查远程存储中对象是否存在 Args: r2_key: 远程对象键 Returns: 是否存在 """ try: self.s3_client.head_object(Bucket=self.bucket_name, Key=r2_key) return True except ClientError as e: error_code = e.response.get("Error", {}).get("Code", "") # S3 兼容存储可能返回 404, NoSuchKey, 或其他变体 if error_code in ("404", "NoSuchKey", "Not Found"): return False # 其他错误(如权限问题)也视为不存在,但打印警告 print(f"[远程存储] 检查对象存在性失败 ({r2_key}): {e}") return False except Exception as e: print(f"[远程存储] 检查对象存在性异常 ({r2_key}): {e}") return False def _download_sqlite(self, date: Optional[str] = None, db_type: str = "news") -> Optional[Path]: """ 从远程存储下载当天的 SQLite 文件到本地临时目录 使用 get_object + iter_chunks 替代 download_file, 以正确处理腾讯云 COS 的 chunked transfer encoding。 Args: date: 日期字符串 db_type: 数据库类型 ("news" 或 "rss") Returns: 本地文件路径,如果不存在返回 None """ r2_key = self._get_remote_db_key(date, db_type) local_path = self._get_local_db_path(date, db_type) # 确保目录存在 local_path.parent.mkdir(parents=True, exist_ok=True) # 先检查文件是否存在 if not self._check_object_exists(r2_key): print(f"[远程存储] 文件不存在,将创建新数据库: {r2_key}") return None try: # 使用 get_object + iter_chunks 替代 download_file # iter_chunks 会自动处理 chunked transfer encoding response = self.s3_client.get_object(Bucket=self.bucket_name, Key=r2_key) with open(local_path, 'wb') as f: for chunk in response['Body'].iter_chunks(chunk_size=1024*1024): f.write(chunk) self._downloaded_files.append(local_path) print(f"[远程存储] 已下载: {r2_key} -> {local_path}") return local_path except ClientError as e: error_code = e.response.get("Error", {}).get("Code", "") # S3 兼容存储可能返回不同的错误码 if error_code in ("404", "NoSuchKey", "Not Found"): print(f"[远程存储] 文件不存在,将创建新数据库: {r2_key}") 
return None else: print(f"[远程存储] 下载失败 (错误码: {error_code}): {e}") raise except Exception as e: print(f"[远程存储] 下载异常: {e}") raise def begin_batch(self): """开启批量模式:延迟上传,避免频繁上传同一文件""" self._batch_mode = True self._batch_dirty.clear() def end_batch(self): """结束批量模式:统一上传所有脏数据库""" self._batch_mode = False for date, db_type in self._batch_dirty: self._upload_sqlite(date, db_type) self._batch_dirty.clear() def _upload_sqlite(self, date: Optional[str] = None, db_type: str = "news") -> bool: """ 上传本地 SQLite 文件到远程存储 批量模式下延迟上传,由 end_batch() 统一触发。 Args: date: 日期字符串 db_type: 数据库类型 ("news" 或 "rss") Returns: 是否上传成功 """ if self._batch_mode: self._batch_dirty.add((date, db_type)) return True local_path = self._get_local_db_path(date, db_type) r2_key = self._get_remote_db_key(date, db_type) if not local_path.exists(): print(f"[远程存储] 本地文件不存在,无法上传: {local_path}") return False try: # 获取本地文件大小 local_size = local_path.stat().st_size print(f"[远程存储] 准备上传: {local_path} ({local_size} bytes) -> {r2_key}") # 读取文件内容为 bytes 后上传 # 避免传入文件对象时 requests 库使用 chunked transfer encoding # 腾讯云 COS 等 S3 兼容服务可能无法正确处理 chunked encoding with open(local_path, 'rb') as f: file_content = f.read() # 使用 put_object 并明确设置 ContentLength,确保不使用 chunked encoding self.s3_client.put_object( Bucket=self.bucket_name, Key=r2_key, Body=file_content, ContentLength=local_size, ContentType='application/x-sqlite3', ) print(f"[远程存储] 已上传: {local_path} -> {r2_key}") # 验证上传成功 if self._check_object_exists(r2_key): print(f"[远程存储] 上传验证成功: {r2_key}") return True else: print(f"[远程存储] 上传验证失败: 文件未在远程存储中找到") return False except Exception as e: print(f"[远程存储] 上传失败: {e}") return False def _get_connection(self, date: Optional[str] = None, db_type: str = "news") -> sqlite3.Connection: """ 获取数据库连接 Args: date: 日期字符串 db_type: 数据库类型 ("news" 或 "rss") Returns: 数据库连接 """ local_path = self._get_local_db_path(date, db_type) db_path = str(local_path) if db_path not in self._db_connections: # 确保目录存在 local_path.parent.mkdir(parents=True, exist_ok=True) # 
如果本地不存在,尝试从远程存储下载 if not local_path.exists(): self._download_sqlite(date, db_type) conn = sqlite3.connect(db_path) conn.row_factory = sqlite3.Row self._init_tables(conn, db_type) self._db_connections[db_path] = conn return self._db_connections[db_path] # ======================================== # StorageBackend 接口实现(委托给 mixin + 上传) # ======================================== def save_news_data(self, data: NewsData) -> bool: """ 保存新闻数据到远程存储 流程:下载现有数据库 → 插入/更新数据 → 上传回远程存储 """ # 查询已有记录数 conn = self._get_connection(data.date) cursor = conn.cursor() cursor.execute("SELECT COUNT(*) as count FROM news_items") row = cursor.fetchone() existing_count = row[0] if row else 0 if existing_count > 0: print(f"[远程存储] 已有 {existing_count} 条历史记录,将合并新数据") # 使用 mixin 的实现保存数据 success, new_count, updated_count, title_changed_count, off_list_count = \ self._save_news_data_impl(data, "[远程存储]") if not success: return False # 查询合并后的总记录数 cursor.execute("SELECT COUNT(*) as count FROM news_items") row = cursor.fetchone() final_count = row[0] if row else 0 # 输出详细的存储统计日志 log_parts = [f"[远程存储] 处理完成:新增 {new_count} 条"] if updated_count > 0: log_parts.append(f"更新 {updated_count} 条") if title_changed_count > 0: log_parts.append(f"标题变更 {title_changed_count} 条") if off_list_count > 0: log_parts.append(f"脱榜 {off_list_count} 条") log_parts.append(f"(去重后总计: {final_count} 条)") print(",".join(log_parts)) # 上传到远程存储 if self._upload_sqlite(data.date): print(f"[远程存储] 数据已同步到远程存储") return True else: print(f"[远程存储] 上传远程存储失败") return False def get_today_all_data(self, date: Optional[str] = None) -> Optional[NewsData]: """获取指定日期的所有新闻数据(合并后)""" return self._get_today_all_data_impl(date) def get_latest_crawl_data(self, date: Optional[str] = None) -> Optional[NewsData]: """获取最新一次抓取的数据""" return self._get_latest_crawl_data_impl(date) def detect_new_titles(self, current_data: NewsData) -> Dict[str, Dict]: """检测新增的标题""" return self._detect_new_titles_impl(current_data) def is_first_crawl_today(self, date: Optional[str] = 
None) -> bool: """检查是否是当天第一次抓取""" return self._is_first_crawl_today_impl(date) # ======================================== # 时间段执行记录(调度系统) # ======================================== def has_period_executed(self, date_str: str, period_key: str, action: str) -> bool: """检查指定时间段的某个 action 是否已执行""" return self._has_period_executed_impl(date_str, period_key, action) def record_period_execution(self, date_str: str, period_key: str, action: str) -> bool: """记录时间段的 action 执行""" success = self._record_period_execution_impl(date_str, period_key, action) if success: now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") print(f"[远程存储] 时间段执行记录已保存: {period_key}/{action} at {now_str}") # 上传到远程存储确保记录持久化 if self._upload_sqlite(date_str): print(f"[远程存储] 时间段执行记录已同步到远程存储") return True else: print(f"[远程存储] 时间段执行记录同步到远程存储失败") return False return False # ======================================== # RSS 数据存储方法 # ======================================== def save_rss_data(self, data: RSSData) -> bool: """ 保存 RSS 数据到远程存储 流程:下载现有数据库 → 插入/更新数据 → 上传回远程存储 """ success, new_count, updated_count = self._save_rss_data_impl(data, "[远程存储]") if not success: return False # 输出统计日志 log_parts = [f"[远程存储] RSS 处理完成:新增 {new_count} 条"] if updated_count > 0: log_parts.append(f"更新 {updated_count} 条") print(",".join(log_parts)) # 上传到远程存储 if self._upload_sqlite(data.date, db_type="rss"): print(f"[远程存储] RSS 数据已同步到远程存储") return True else: print(f"[远程存储] RSS 上传远程存储失败") return False def get_rss_data(self, date: Optional[str] = None) -> Optional[RSSData]: """获取指定日期的所有 RSS 数据""" return self._get_rss_data_impl(date) def detect_new_rss_items(self, current_data: RSSData) -> Dict[str, List[RSSItem]]: """检测新增的 RSS 条目""" return self._detect_new_rss_items_impl(current_data) def get_latest_rss_data(self, date: Optional[str] = None) -> Optional[RSSData]: """获取最新一次抓取的 RSS 数据""" return self._get_latest_rss_data_impl(date) # ======================================== # AI 智能筛选存储方法 # ======================================== 
def get_active_ai_filter_tags(self, date=None, interests_file="ai_interests.txt"): return self._get_active_tags_impl(date, interests_file) def get_latest_prompt_hash(self, date=None, interests_file="ai_interests.txt"): return self._get_latest_prompt_hash_impl(date, interests_file) def get_latest_ai_filter_tag_version(self, date=None): return self._get_latest_tag_version_impl(date) def deprecate_all_ai_filter_tags(self, date=None, interests_file="ai_interests.txt"): count = self._deprecate_all_tags_impl(date, interests_file) if count > 0: self._upload_sqlite(date) return count def save_ai_filter_tags(self, tags, version, prompt_hash, date=None, interests_file="ai_interests.txt"): count = self._save_tags_impl(date, tags, version, prompt_hash, interests_file) if count > 0: self._upload_sqlite(date) return count def save_ai_filter_results(self, results, date=None): count = self._save_filter_results_impl(date, results) if count > 0: self._upload_sqlite(date) return count def get_active_ai_filter_results(self, date=None, interests_file="ai_interests.txt"): return self._get_active_filter_results_impl(date, interests_file) def deprecate_specific_ai_filter_tags(self, tag_ids, date=None): count = self._deprecate_specific_tags_impl(date, tag_ids) if count > 0: self._upload_sqlite(date) return count def update_ai_filter_tags_hash(self, interests_file, new_hash, date=None): count = self._update_tags_hash_impl(date, interests_file, new_hash) if count > 0: self._upload_sqlite(date) return count def update_ai_filter_tag_descriptions(self, tag_updates, date=None, interests_file="ai_interests.txt"): count = self._update_tag_descriptions_impl(date, tag_updates, interests_file) if count > 0: self._upload_sqlite(date) return count def update_ai_filter_tag_priorities(self, tag_priorities, date=None, interests_file="ai_interests.txt"): count = self._update_tag_priorities_impl(date, tag_priorities, interests_file) if count > 0: self._upload_sqlite(date) return count def 
save_analyzed_news(self, news_ids, source_type, interests_file, prompt_hash, matched_ids, date=None):
        count = self._save_analyzed_news_impl(date, news_ids, source_type, interests_file, prompt_hash, matched_ids)
        if count > 0:
            self._upload_sqlite(date)
        return count

    def get_analyzed_news_ids(self, source_type="hotlist", date=None, interests_file="ai_interests.txt"):
        return self._get_analyzed_news_ids_impl(date, source_type, interests_file)

    def clear_analyzed_news(self, date=None, interests_file="ai_interests.txt"):
        count = self._clear_analyzed_news_impl(date, interests_file)
        if count > 0:
            self._upload_sqlite(date)
        return count

    def clear_unmatched_analyzed_news(self, date=None, interests_file="ai_interests.txt"):
        count = self._clear_unmatched_analyzed_news_impl(date, interests_file)
        if count > 0:
            self._upload_sqlite(date)
        return count

    def get_all_news_ids(self, date=None):
        return self._get_all_news_ids_impl(date)

    def get_all_rss_ids(self, date=None):
        return self._get_all_rss_ids_impl(date)

    # ========================================
    # Remote-only features: TXT/HTML snapshots (temp directory)
    # ========================================

    def save_txt_snapshot(self, data: NewsData) -> Optional[str]:
        """Save a TXT snapshot (disabled by default in remote storage mode)."""
        if not self.enable_txt:
            return None

        # If enabled, save to the local temp directory
        try:
            date_folder = self._format_date_folder(data.date)
            txt_dir = self.temp_dir / date_folder / "txt"
            txt_dir.mkdir(parents=True, exist_ok=True)

            file_path = txt_dir / f"{data.crawl_time}.txt"

            with open(file_path, "w", encoding="utf-8") as f:
                for source_id, news_list in data.items.items():
                    source_name = data.id_to_name.get(source_id, source_id)

                    if source_name and source_name != source_id:
                        f.write(f"{source_id} | {source_name}\n")
                    else:
                        f.write(f"{source_id}\n")

                    sorted_news = sorted(news_list, key=lambda x: x.rank)

                    for item in sorted_news:
                        line = f"{item.rank}. 
{item.title}"
                        if item.url:
                            line += f" [URL:{item.url}]"
                        if item.mobile_url:
                            line += f" [MOBILE:{item.mobile_url}]"
                        f.write(line + "\n")

                    f.write("\n")

                if data.failed_ids:
                    # Marker line for failed source IDs; left untranslated because it is written into the snapshot file
                    f.write("==== 以下ID请求失败 ====\n")
                    for failed_id in data.failed_ids:
                        f.write(f"{failed_id}\n")

            print(f"[远程存储] TXT 快照已保存: {file_path}")
            return str(file_path)

        except Exception as e:
            print(f"[远程存储] 保存 TXT 快照失败: {e}")
            return None

    def save_html_report(self, html_content: str, filename: str) -> Optional[str]:
        """Save an HTML report to the temp directory."""
        if not self.enable_html:
            return None

        try:
            date_folder = self._format_date_folder()
            html_dir = self.temp_dir / date_folder / "html"
            html_dir.mkdir(parents=True, exist_ok=True)

            file_path = html_dir / filename

            with open(file_path, "w", encoding="utf-8") as f:
                f.write(html_content)

            print(f"[远程存储] HTML 报告已保存: {file_path}")
            return str(file_path)

        except Exception as e:
            print(f"[远程存储] 保存 HTML 报告失败: {e}")
            return None

    # ========================================
    # Remote-only features: resource cleanup
    # ========================================

    def cleanup(self) -> None:
        """Release resources (close connections and delete temp files)."""
        # Check whether the Python interpreter is shutting down
        if sys.meta_path is None:
            return

        # Close database connections
        db_connections = getattr(self, "_db_connections", {})
        for db_path, conn in list(db_connections.items()):
            try:
                conn.close()
                print(f"[远程存储] 关闭数据库连接: {db_path}")
            except Exception as e:
                print(f"[远程存储] 关闭连接失败 {db_path}: {e}")

        if db_connections:
            db_connections.clear()

        # Delete the temp directory
        temp_dir = getattr(self, "temp_dir", None)
        if temp_dir:
            try:
                if temp_dir.exists():
                    shutil.rmtree(temp_dir)
                    print(f"[远程存储] 临时目录已清理: {temp_dir}")
            except Exception as e:
                # Ignore errors raised while Python is shutting down
                if sys.meta_path is not None:
                    print(f"[远程存储] 清理临时目录失败: {e}")

        downloaded_files = getattr(self, "_downloaded_files", None)
        if downloaded_files:
            downloaded_files.clear()

    def cleanup_old_data(self, retention_days: int) -> int:
        """
        Delete expired data from remote storage.

        Args:
            retention_days: Number of days to keep (0 disables cleanup)

        Returns:
            Number of database files deleted
        """
        if retention_days <= 0:
            return 0

        deleted_count = 0
        cutoff_date = self._get_configured_time() - 
timedelta(days=retention_days)

        try:
            # List every object under the news/ prefix in remote storage
            paginator = self.s3_client.get_paginator('list_objects_v2')
            pages = paginator.paginate(Bucket=self.bucket_name, Prefix="news/")

            # Collect the keys of the objects to delete
            objects_to_delete = []
            deleted_dates = set()

            for page in pages:
                if 'Contents' not in page:
                    continue

                for obj in page['Contents']:
                    key = obj['Key']

                    # Parse the date (format: news/YYYY-MM-DD.db)
                    folder_date = None
                    date_str = None
                    try:
                        date_match = re.match(r'news/(\d{4})-(\d{2})-(\d{2})\.db$', key)
                        if date_match:
                            # Use localize() so the zone's real UTC offset is applied;
                            # passing a pytz zone through tzinfo= would use its LMT offset.
                            folder_date = pytz.timezone(self.timezone).localize(
                                datetime(
                                    int(date_match.group(1)),
                                    int(date_match.group(2)),
                                    int(date_match.group(3)),
                                )
                            )
                            date_str = f"{date_match.group(1)}-{date_match.group(2)}-{date_match.group(3)}"
                    except Exception:
                        continue

                    if folder_date and folder_date < cutoff_date:
                        objects_to_delete.append({'Key': key})
                        deleted_dates.add(date_str)

            # Delete objects in batches (at most 1000 per request)
            if objects_to_delete:
                batch_size = 1000
                for i in range(0, len(objects_to_delete), batch_size):
                    batch = objects_to_delete[i:i + batch_size]
                    try:
                        self.s3_client.delete_objects(
                            Bucket=self.bucket_name,
                            Delete={'Objects': batch}
                        )
                        print(f"[远程存储] 删除 {len(batch)} 个对象")
                    except Exception as e:
                        print(f"[远程存储] 批量删除失败: {e}")

                deleted_count = len(deleted_dates)
                for date_str in sorted(deleted_dates):
                    print(f"[远程存储] 清理过期数据: news/{date_str}.db")

            print(f"[远程存储] 共清理 {deleted_count} 个过期日期数据库文件")
            return deleted_count

        except Exception as e:
            print(f"[远程存储] 清理过期数据失败: {e}")
            return deleted_count

    def __del__(self):
        """Destructor."""
        # Check whether the Python interpreter is shutting down
        if sys.meta_path is None:
            return
        try:
            self.cleanup()
        except Exception:
            # Errors can occur while Python is shutting down; ignore them
            pass

    # ========================================
    # Remote-only features: pulling and listing data
    # ========================================

    def pull_recent_days(self, days: int, local_data_dir: str = "output") -> int:
        """
        Pull the last N days of data from remote storage to the local disk.

        Args:
            days: Number of days to pull
            local_data_dir: Local data directory

        Returns:
            Number of database files pulled successfully
        """
        if days <= 0:
            return 0

        local_dir = Path(local_data_dir)
        local_dir.mkdir(parents=True, exist_ok=True)

        pulled_count = 0
        now 
= self._get_configured_time()

        print(f"[远程存储] 开始拉取最近 {days} 天的数据...")

        for i in range(days):
            date = now - timedelta(days=i)
            date_str = date.strftime("%Y-%m-%d")

            # Local target path
            local_date_dir = local_dir / date_str
            local_db_path = local_date_dir / "news.db"

            # Skip if the file already exists locally
            if local_db_path.exists():
                print(f"[远程存储] 跳过(本地已存在): {date_str}")
                continue

            # Remote object key
            remote_key = f"news/{date_str}.db"

            # Check whether the object exists remotely
            if not self._check_object_exists(remote_key):
                print(f"[远程存储] 跳过(远程不存在): {date_str}")
                continue

            # Download (get_object + iter_chunks handles chunked encoding)
            try:
                local_date_dir.mkdir(parents=True, exist_ok=True)
                response = self.s3_client.get_object(Bucket=self.bucket_name, Key=remote_key)
                with open(local_db_path, 'wb') as f:
                    for chunk in response['Body'].iter_chunks(chunk_size=1024*1024):
                        f.write(chunk)
                print(f"[远程存储] 已拉取: {remote_key} -> {local_db_path}")
                pulled_count += 1
            except Exception as e:
                print(f"[远程存储] 拉取失败 ({date_str}): {e}")

        print(f"[远程存储] 拉取完成,共下载 {pulled_count} 个数据库文件")
        return pulled_count

    def list_remote_dates(self) -> List[str]:
        """
        List all dates available in remote storage.

        Returns:
            List of date strings (YYYY-MM-DD format)
        """
        dates = []

        try:
            paginator = self.s3_client.get_paginator('list_objects_v2')
            pages = paginator.paginate(Bucket=self.bucket_name, Prefix="news/")

            for page in pages:
                if 'Contents' not in page:
                    continue

                for obj in page['Contents']:
                    key = obj['Key']
                    # Parse the date
                    date_match = re.match(r'news/(\d{4}-\d{2}-\d{2})\.db$', key)
                    if date_match:
                        dates.append(date_match.group(1))

            return sorted(dates, reverse=True)

        except Exception as e:
            print(f"[远程存储] 列出远程日期失败: {e}")
            return []


================================================
FILE: trendradar/storage/rss_schema.sql
================================================
-- TrendRadar RSS database schema
-- Stores RSS/Atom feed data

-- ============================================
-- RSS feed configuration table
-- Stores basic information about each feed
-- ============================================
CREATE TABLE IF NOT EXISTS rss_feeds (
    id TEXT PRIMARY KEY,                -- Feed ID (e.g. "hacker-news")
    name TEXT NOT NULL,                 -- Display name (e.g. "Hacker News")
    feed_url 
TEXT DEFAULT '',                    -- RSS/Atom URL (optional; already present in the config file)
    is_active INTEGER DEFAULT 1,        -- Whether the feed is enabled
    last_fetch_time TEXT,               -- Time of the last fetch
    last_fetch_status TEXT,             -- Status of the last fetch (success/failed)
    item_count INTEGER DEFAULT 0,       -- Number of items for the day
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- ============================================
-- RSS items table
-- Uniquely identified by URL + feed_id; supports deduplicated storage
-- ============================================
CREATE TABLE IF NOT EXISTS rss_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,                -- Title
    feed_id TEXT NOT NULL,              -- Owning RSS feed
    url TEXT NOT NULL,                  -- Article link
    published_at TEXT,                  -- RSS publication time (ISO format)
    summary TEXT,                       -- Summary/description
    author TEXT,                        -- Author
    first_crawl_time TEXT NOT NULL,     -- Time of the first crawl
    last_crawl_time TEXT NOT NULL,      -- Time of the last crawl
    crawl_count INTEGER DEFAULT 1,      -- Number of crawls
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (feed_id) REFERENCES rss_feeds(id)
);

-- ============================================
-- Crawl records table
-- Records the time and item count of each crawl
-- ============================================
CREATE TABLE IF NOT EXISTS rss_crawl_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    crawl_time TEXT NOT NULL UNIQUE,    -- Crawl time (HH:MM)
    total_items INTEGER DEFAULT 0,      -- Total number of items
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- ============================================
-- Per-source crawl status table
-- Records the success/failure status of each RSS feed for each crawl
-- ============================================
CREATE TABLE IF NOT EXISTS rss_crawl_status (
    crawl_record_id INTEGER NOT NULL,
    feed_id TEXT NOT NULL,
    status TEXT NOT NULL CHECK(status IN ('success', 'failed')),
    error_message TEXT,                 -- Error message on failure
    PRIMARY KEY (crawl_record_id, feed_id),
    FOREIGN KEY (crawl_record_id) REFERENCES rss_crawl_records(id),
    FOREIGN KEY (feed_id) REFERENCES rss_feeds(id)
);

-- ============================================
-- Push records table
-- Used by the push_window once_per_day feature
-- and by the ai_analysis analysis_window once_per_day feature
-- ============================================
CREATE TABLE IF NOT EXISTS 
rss_push_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date TEXT NOT NULL UNIQUE,          -- Date (YYYY-MM-DD)
    pushed INTEGER DEFAULT 0,           -- Whether a push has been sent
    push_time TEXT,                     -- Push time
    ai_analyzed INTEGER DEFAULT 0,      -- Whether AI analysis has run
    ai_analysis_time TEXT,              -- AI analysis time
    ai_analysis_mode TEXT,              -- AI analysis mode
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- ============================================
-- Index definitions
-- ============================================

-- RSS feed index
CREATE INDEX IF NOT EXISTS idx_rss_feed ON rss_items(feed_id);

-- Publication time index (for sorting by time)
CREATE INDEX IF NOT EXISTS idx_rss_published ON rss_items(published_at DESC);

-- Crawl time index (for querying the latest data)
CREATE INDEX IF NOT EXISTS idx_rss_crawl_time ON rss_items(last_crawl_time);

-- Title index (for title search)
CREATE INDEX IF NOT EXISTS idx_rss_title ON rss_items(title);

-- URL + feed_id unique index (enforces deduplication)
CREATE UNIQUE INDEX IF NOT EXISTS idx_rss_url_feed ON rss_items(url, feed_id);

-- Crawl status index
CREATE INDEX IF NOT EXISTS idx_rss_crawl_status_record ON rss_crawl_status(crawl_record_id);


================================================
FILE: trendradar/storage/schema.sql
================================================
-- TrendRadar database schema

-- ============================================
-- Platforms table
-- Key point: id is stable, name may change
-- ============================================
CREATE TABLE IF NOT EXISTS platforms (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    is_active INTEGER DEFAULT 1,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- ============================================
-- News items table
-- Uniquely identified by URL + platform_id; supports deduplicated storage
-- ============================================
CREATE TABLE IF NOT EXISTS news_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    platform_id TEXT NOT NULL,
    rank INTEGER NOT NULL,
    url TEXT DEFAULT '',
    mobile_url TEXT DEFAULT '',
    first_crawl_time TEXT NOT NULL,     -- Time of the first crawl
    last_crawl_time TEXT NOT NULL,      -- Time of the last crawl
    crawl_count INTEGER DEFAULT 1,      -- Number of crawls
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY 
(platform_id) REFERENCES platforms(id)
);

-- ============================================
-- Title change history table
-- Records title changes for the same URL
-- ============================================
CREATE TABLE IF NOT EXISTS title_changes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    news_item_id INTEGER NOT NULL,
    old_title TEXT NOT NULL,
    new_title TEXT NOT NULL,
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (news_item_id) REFERENCES news_items(id)
);

-- ============================================
-- Rank history table
-- Records the rank observed at each crawl
-- ============================================
CREATE TABLE IF NOT EXISTS rank_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    news_item_id INTEGER NOT NULL,
    rank INTEGER NOT NULL,
    crawl_time TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (news_item_id) REFERENCES news_items(id)
);

-- ============================================
-- Crawl records table
-- Records the time and item count of each crawl
-- ============================================
CREATE TABLE IF NOT EXISTS crawl_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    crawl_time TEXT NOT NULL UNIQUE,
    total_items INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- ============================================
-- Per-source crawl status table
-- Records the success/failure status of each platform for each crawl
-- ============================================
CREATE TABLE IF NOT EXISTS crawl_source_status (
    crawl_record_id INTEGER NOT NULL,
    platform_id TEXT NOT NULL,
    status TEXT NOT NULL CHECK(status IN ('success', 'failed')),
    PRIMARY KEY (crawl_record_id, platform_id),
    FOREIGN KEY (crawl_record_id) REFERENCES crawl_records(id),
    FOREIGN KEY (platform_id) REFERENCES platforms(id)
);

-- ============================================
-- Period execution records table
-- Records, per day and per period, whether each action has run (used by the once feature)
-- Replaces the old push_records table
-- ============================================
CREATE TABLE IF NOT EXISTS period_executions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    execution_date TEXT NOT NULL,       -- YYYY-MM-DD
    period_key TEXT NOT NULL,           -- Stable key of the period
    action TEXT NOT NULL,               -- analyze | push
    executed_at TIMESTAMP 
DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(execution_date, period_key, action)
);

-- ============================================
-- Index definitions
-- ============================================

-- Platform index
CREATE INDEX IF NOT EXISTS idx_news_platform ON news_items(platform_id);

-- Time index (for querying the latest data)
CREATE INDEX IF NOT EXISTS idx_news_crawl_time ON news_items(last_crawl_time);

-- Title index (for title search)
CREATE INDEX IF NOT EXISTS idx_news_title ON news_items(title);

-- URL + platform_id unique index (non-empty URLs only; enforces deduplication)
CREATE UNIQUE INDEX IF NOT EXISTS idx_news_url_platform
    ON news_items(url, platform_id) WHERE url != '';

-- Crawl status index
CREATE INDEX IF NOT EXISTS idx_crawl_status_record ON crawl_source_status(crawl_record_id);

-- Rank history index
CREATE INDEX IF NOT EXISTS idx_rank_history_news ON rank_history(news_item_id);

-- Period execution records index
CREATE INDEX IF NOT EXISTS idx_period_exec_lookup
    ON period_executions(execution_date, period_key, action);


================================================
FILE: trendradar/storage/sqlite_mixin.py
================================================
# coding=utf-8
"""
SQLite storage mixin

Provides shared SQLite database logic reused by LocalStorageBackend and RemoteStorageBackend.
"""

import sqlite3
from abc import abstractmethod
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional

from trendradar.storage.base import NewsItem, NewsData, RSSItem, RSSData
from trendradar.utils.url import normalize_url


class SQLiteStorageMixin:
    """
    SQLite storage operations mixin

    Subclasses must implement the following abstract methods:
    - _get_connection(date, db_type) -> sqlite3.Connection
    - _get_configured_time() -> datetime
    - _format_date_folder(date) -> str
    - _format_time_filename() -> str
    """

    # ========================================
    # Abstract methods - subclasses must implement these
    # ========================================

    @abstractmethod
    def _get_connection(self, date: Optional[str] = None, db_type: str = "news") -> sqlite3.Connection:
        """Get a database connection."""
        pass

    @abstractmethod
    def _get_configured_time(self) -> datetime:
        """Get the current time in the configured timezone."""
        pass

    @abstractmethod
    def _format_date_folder(self, 
date: Optional[str] = None) -> str: """格式化日期文件夹名 (ISO 格式: YYYY-MM-DD)""" pass @abstractmethod def _format_time_filename(self) -> str: """格式化时间文件名 (格式: HH-MM)""" pass # ======================================== # Schema 管理 # ======================================== def _get_schema_path(self, db_type: str = "news") -> Path: """ 获取 schema.sql 文件路径 Args: db_type: 数据库类型 ("news" 或 "rss") Returns: schema 文件路径 """ if db_type == "rss": return Path(__file__).parent / "rss_schema.sql" return Path(__file__).parent / "schema.sql" def _get_ai_filter_schema_path(self) -> Path: """获取 AI 筛选 schema 文件路径""" return Path(__file__).parent / "ai_filter_schema.sql" def _init_tables(self, conn: sqlite3.Connection, db_type: str = "news") -> None: """ 从 schema.sql 初始化数据库表结构 Args: conn: 数据库连接 db_type: 数据库类型 ("news" 或 "rss") """ schema_path = self._get_schema_path(db_type) if schema_path.exists(): with open(schema_path, "r", encoding="utf-8") as f: schema_sql = f.read() conn.executescript(schema_sql) else: raise FileNotFoundError(f"Schema file not found: {schema_path}") # news 库额外加载 AI 筛选表结构 if db_type == "news": ai_filter_schema = self._get_ai_filter_schema_path() if ai_filter_schema.exists(): with open(ai_filter_schema, "r", encoding="utf-8") as f: conn.executescript(f.read()) conn.commit() # ======================================== # 新闻数据存储 # ======================================== def _save_news_data_impl(self, data: NewsData, log_prefix: str = "[存储]") -> tuple[bool, int, int, int, int]: """ 保存新闻数据到 SQLite(核心实现) Args: data: 新闻数据 log_prefix: 日志前缀 Returns: (success, new_count, updated_count, title_changed_count, off_list_count) """ try: conn = self._get_connection(data.date) cursor = conn.cursor() # 获取配置时区的当前时间 now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") # 首先同步平台信息到 platforms 表 for source_id, source_name in data.id_to_name.items(): cursor.execute(""" INSERT INTO platforms (id, name, updated_at) VALUES (?, ?, ?) 
ON CONFLICT(id) DO UPDATE SET name = excluded.name, updated_at = excluded.updated_at """, (source_id, source_name, now_str)) # 统计计数器 new_count = 0 updated_count = 0 title_changed_count = 0 success_sources = [] for source_id, news_list in data.items.items(): success_sources.append(source_id) for item in news_list: try: # 标准化 URL(去除动态参数,如微博的 band_rank) normalized_url = normalize_url(item.url, source_id) if item.url else "" # 检查是否已存在(通过标准化 URL + platform_id) if normalized_url: cursor.execute(""" SELECT id, title FROM news_items WHERE url = ? AND platform_id = ? """, (normalized_url, source_id)) existing = cursor.fetchone() if existing: # 已存在,更新记录 existing_id, existing_title = existing # 检查标题是否变化 if existing_title != item.title: # 记录标题变更 cursor.execute(""" INSERT INTO title_changes (news_item_id, old_title, new_title, changed_at) VALUES (?, ?, ?, ?) """, (existing_id, existing_title, item.title, now_str)) title_changed_count += 1 # 记录排名历史 cursor.execute(""" INSERT INTO rank_history (news_item_id, rank, crawl_time, created_at) VALUES (?, ?, ?, ?) """, (existing_id, item.rank, data.crawl_time, now_str)) # 更新现有记录 cursor.execute(""" UPDATE news_items SET title = ?, rank = ?, mobile_url = ?, last_crawl_time = ?, crawl_count = crawl_count + 1, updated_at = ? WHERE id = ? """, (item.title, item.rank, item.mobile_url, data.crawl_time, now_str, existing_id)) updated_count += 1 else: # 不存在,插入新记录(存储标准化后的 URL) cursor.execute(""" INSERT INTO news_items (title, platform_id, rank, url, mobile_url, first_crawl_time, last_crawl_time, crawl_count, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, 1, ?, ?) """, (item.title, source_id, item.rank, normalized_url, item.mobile_url, data.crawl_time, data.crawl_time, now_str, now_str)) new_id = cursor.lastrowid # 记录初始排名 cursor.execute(""" INSERT INTO rank_history (news_item_id, rank, crawl_time, created_at) VALUES (?, ?, ?, ?) 
""", (new_id, item.rank, data.crawl_time, now_str)) new_count += 1 else: # URL 为空的情况,直接插入(不做去重) cursor.execute(""" INSERT INTO news_items (title, platform_id, rank, url, mobile_url, first_crawl_time, last_crawl_time, crawl_count, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, 1, ?, ?) """, (item.title, source_id, item.rank, "", item.mobile_url, data.crawl_time, data.crawl_time, now_str, now_str)) new_id = cursor.lastrowid # 记录初始排名 cursor.execute(""" INSERT INTO rank_history (news_item_id, rank, crawl_time, created_at) VALUES (?, ?, ?, ?) """, (new_id, item.rank, data.crawl_time, now_str)) new_count += 1 except sqlite3.Error as e: print(f"{log_prefix} 保存新闻条目失败 [{item.title[:30]}...]: {e}") total_items = new_count + updated_count # ======================================== # 脱榜检测:检测上次在榜但这次不在榜的新闻 # ======================================== off_list_count = 0 # 获取上一次抓取时间 cursor.execute(""" SELECT crawl_time FROM crawl_records WHERE crawl_time < ? ORDER BY crawl_time DESC LIMIT 1 """, (data.crawl_time,)) prev_record = cursor.fetchone() if prev_record: prev_crawl_time = prev_record[0] # 对于每个成功抓取的平台,检测脱榜 for source_id in success_sources: # 获取当前抓取中该平台的所有标准化 URL current_urls = set() for item in data.items.get(source_id, []): normalized_url = normalize_url(item.url, source_id) if item.url else "" if normalized_url: current_urls.add(normalized_url) # 查询上次在榜(last_crawl_time = prev_crawl_time)但这次不在榜的新闻 # 这些新闻是"第一次脱榜",需要记录 cursor.execute(""" SELECT id, url FROM news_items WHERE platform_id = ? AND last_crawl_time = ? AND url != '' """, (source_id, prev_crawl_time)) for row in cursor.fetchall(): news_id, url = row[0], row[1] if url not in current_urls: # 插入脱榜记录(rank=0 表示脱榜) cursor.execute(""" INSERT INTO rank_history (news_item_id, rank, crawl_time, created_at) VALUES (?, 0, ?, ?) """, (news_id, data.crawl_time, now_str)) off_list_count += 1 # 记录抓取信息 cursor.execute(""" INSERT OR REPLACE INTO crawl_records (crawl_time, total_items, created_at) VALUES (?, ?, ?) 
""", (data.crawl_time, total_items, now_str)) # 获取刚插入的 crawl_record 的 ID cursor.execute(""" SELECT id FROM crawl_records WHERE crawl_time = ? """, (data.crawl_time,)) record_row = cursor.fetchone() if record_row: crawl_record_id = record_row[0] # 记录成功的来源 for source_id in success_sources: cursor.execute(""" INSERT OR REPLACE INTO crawl_source_status (crawl_record_id, platform_id, status) VALUES (?, ?, 'success') """, (crawl_record_id, source_id)) # 记录失败的来源 for failed_id in data.failed_ids: # 确保失败的平台也在 platforms 表中 cursor.execute(""" INSERT OR IGNORE INTO platforms (id, name, updated_at) VALUES (?, ?, ?) """, (failed_id, failed_id, now_str)) cursor.execute(""" INSERT OR REPLACE INTO crawl_source_status (crawl_record_id, platform_id, status) VALUES (?, ?, 'failed') """, (crawl_record_id, failed_id)) conn.commit() return True, new_count, updated_count, title_changed_count, off_list_count except Exception as e: print(f"{log_prefix} 保存失败: {e}") return False, 0, 0, 0, 0 def _get_today_all_data_impl(self, date: Optional[str] = None) -> Optional[NewsData]: """ 获取指定日期的所有新闻数据(合并后) Args: date: 日期字符串,默认为今天 Returns: 合并后的新闻数据 """ try: conn = self._get_connection(date) cursor = conn.cursor() # 获取所有新闻数据(包含 id 用于查询排名历史) cursor.execute(""" SELECT n.id, n.title, n.platform_id, p.name as platform_name, n.rank, n.url, n.mobile_url, n.first_crawl_time, n.last_crawl_time, n.crawl_count FROM news_items n LEFT JOIN platforms p ON n.platform_id = p.id ORDER BY n.platform_id, n.last_crawl_time """) rows = cursor.fetchall() if not rows: return None # 收集所有 news_item_id news_ids = [row[0] for row in rows] # 批量查询排名历史(同时获取时间和排名) # 过滤逻辑:只保留 last_crawl_time 之前的脱榜记录(rank=0) # 这样可以避免显示新闻永久脱榜后的无意义记录 rank_history_map: Dict[int, List[int]] = {} rank_timeline_map: Dict[int, List[Dict[str, Any]]] = {} if news_ids: placeholders = ",".join("?" 
* len(news_ids)) cursor.execute(f""" SELECT rh.news_item_id, rh.rank, rh.crawl_time FROM rank_history rh JOIN news_items ni ON rh.news_item_id = ni.id WHERE rh.news_item_id IN ({placeholders}) AND NOT (rh.rank = 0 AND rh.crawl_time > ni.last_crawl_time) ORDER BY rh.news_item_id, rh.crawl_time """, news_ids) for rh_row in cursor.fetchall(): news_id, rank, crawl_time = rh_row[0], rh_row[1], rh_row[2] # 构建 ranks 列表(去重,排除脱榜记录 rank=0) if news_id not in rank_history_map: rank_history_map[news_id] = [] if rank != 0 and rank not in rank_history_map[news_id]: rank_history_map[news_id].append(rank) # 构建 rank_timeline 列表(完整时间线,包含脱榜) if news_id not in rank_timeline_map: rank_timeline_map[news_id] = [] # 提取时间部分(HH:MM) time_part = crawl_time.split()[1][:5] if ' ' in crawl_time else crawl_time[:5] rank_timeline_map[news_id].append({ "time": time_part, "rank": rank if rank != 0 else None # 0 转为 None 表示脱榜 }) # 按 platform_id 分组 items: Dict[str, List[NewsItem]] = {} id_to_name: Dict[str, str] = {} crawl_date = self._format_date_folder(date) for row in rows: news_id = row[0] platform_id = row[2] title = row[1] platform_name = row[3] or platform_id id_to_name[platform_id] = platform_name if platform_id not in items: items[platform_id] = [] # 获取排名历史,如果没有则使用当前排名 ranks = rank_history_map.get(news_id, [row[4]]) rank_timeline = rank_timeline_map.get(news_id, []) items[platform_id].append(NewsItem( title=title, source_id=platform_id, source_name=platform_name, rank=row[4], url=row[5] or "", mobile_url=row[6] or "", crawl_time=row[8], # last_crawl_time ranks=ranks, first_time=row[7], # first_crawl_time last_time=row[8], # last_crawl_time count=row[9], # crawl_count rank_timeline=rank_timeline, )) final_items = items # 获取失败的来源 cursor.execute(""" SELECT DISTINCT css.platform_id FROM crawl_source_status css JOIN crawl_records cr ON css.crawl_record_id = cr.id WHERE css.status = 'failed' """) failed_ids = [row[0] for row in cursor.fetchall()] # 获取最新的抓取时间 cursor.execute(""" SELECT crawl_time FROM 
crawl_records ORDER BY crawl_time DESC LIMIT 1 """) time_row = cursor.fetchone() crawl_time = time_row[0] if time_row else self._format_time_filename() return NewsData( date=crawl_date, crawl_time=crawl_time, items=final_items, id_to_name=id_to_name, failed_ids=failed_ids, ) except Exception as e: print(f"[存储] 读取数据失败: {e}") return None def _get_latest_crawl_data_impl(self, date: Optional[str] = None) -> Optional[NewsData]: """ 获取最新一次抓取的数据 Args: date: 日期字符串,默认为今天 Returns: 最新抓取的新闻数据 """ try: conn = self._get_connection(date) cursor = conn.cursor() # 获取最新的抓取时间 cursor.execute(""" SELECT crawl_time FROM crawl_records ORDER BY crawl_time DESC LIMIT 1 """) time_row = cursor.fetchone() if not time_row: return None latest_time = time_row[0] # 获取该时间的新闻数据(包含 id 用于查询排名历史) cursor.execute(""" SELECT n.id, n.title, n.platform_id, p.name as platform_name, n.rank, n.url, n.mobile_url, n.first_crawl_time, n.last_crawl_time, n.crawl_count FROM news_items n LEFT JOIN platforms p ON n.platform_id = p.id WHERE n.last_crawl_time = ? """, (latest_time,)) rows = cursor.fetchall() if not rows: return None # 收集所有 news_item_id news_ids = [row[0] for row in rows] # 批量查询排名历史(同时获取时间和排名) # 过滤逻辑:只保留 last_crawl_time 之前的脱榜记录(rank=0) # 这样可以避免显示新闻永久脱榜后的无意义记录 rank_history_map: Dict[int, List[int]] = {} rank_timeline_map: Dict[int, List[Dict[str, Any]]] = {} if news_ids: placeholders = ",".join("?" 
* len(news_ids)) cursor.execute(f""" SELECT rh.news_item_id, rh.rank, rh.crawl_time FROM rank_history rh JOIN news_items ni ON rh.news_item_id = ni.id WHERE rh.news_item_id IN ({placeholders}) AND NOT (rh.rank = 0 AND rh.crawl_time > ni.last_crawl_time) ORDER BY rh.news_item_id, rh.crawl_time """, news_ids) for rh_row in cursor.fetchall(): news_id, rank, crawl_time = rh_row[0], rh_row[1], rh_row[2] # 构建 ranks 列表(去重,排除脱榜记录 rank=0) if news_id not in rank_history_map: rank_history_map[news_id] = [] if rank != 0 and rank not in rank_history_map[news_id]: rank_history_map[news_id].append(rank) # 构建 rank_timeline 列表(完整时间线,包含脱榜) if news_id not in rank_timeline_map: rank_timeline_map[news_id] = [] # 提取时间部分(HH:MM) time_part = crawl_time.split()[1][:5] if ' ' in crawl_time else crawl_time[:5] rank_timeline_map[news_id].append({ "time": time_part, "rank": rank if rank != 0 else None # 0 转为 None 表示脱榜 }) items: Dict[str, List[NewsItem]] = {} id_to_name: Dict[str, str] = {} crawl_date = self._format_date_folder(date) for row in rows: news_id = row[0] platform_id = row[2] platform_name = row[3] or platform_id id_to_name[platform_id] = platform_name if platform_id not in items: items[platform_id] = [] # 获取排名历史,如果没有则使用当前排名 ranks = rank_history_map.get(news_id, [row[4]]) rank_timeline = rank_timeline_map.get(news_id, []) items[platform_id].append(NewsItem( title=row[1], source_id=platform_id, source_name=platform_name, rank=row[4], url=row[5] or "", mobile_url=row[6] or "", crawl_time=row[8], # last_crawl_time ranks=ranks, first_time=row[7], # first_crawl_time last_time=row[8], # last_crawl_time count=row[9], # crawl_count rank_timeline=rank_timeline, )) # 获取失败的来源(针对最新一次抓取) cursor.execute(""" SELECT css.platform_id FROM crawl_source_status css JOIN crawl_records cr ON css.crawl_record_id = cr.id WHERE cr.crawl_time = ? 
AND css.status = 'failed' """, (latest_time,)) failed_ids = [row[0] for row in cursor.fetchall()] return NewsData( date=crawl_date, crawl_time=latest_time, items=items, id_to_name=id_to_name, failed_ids=failed_ids, ) except Exception as e: print(f"[存储] 获取最新数据失败: {e}") return None def _detect_new_titles_impl(self, current_data: NewsData) -> Dict[str, Dict]: """ 检测新增的标题 该方法比较当前抓取数据与历史数据,找出新增的标题。 关键逻辑:只有在历史批次中从未出现过的标题才算新增。 Args: current_data: 当前抓取的数据 Returns: 新增的标题数据 {source_id: {title: NewsItem}} """ try: # 获取历史数据 historical_data = self._get_today_all_data_impl(current_data.date) if not historical_data: # 没有历史数据,所有都是新的 new_titles = {} for source_id, news_list in current_data.items.items(): new_titles[source_id] = {item.title: item for item in news_list} return new_titles # 获取当前批次时间 current_time = current_data.crawl_time # 收集历史标题(first_time < current_time 的标题) # 这样可以正确处理同一标题因 URL 变化而产生多条记录的情况 historical_titles: Dict[str, set] = {} for source_id, news_list in historical_data.items.items(): historical_titles[source_id] = set() for item in news_list: first_time = item.first_time or item.crawl_time if first_time < current_time: historical_titles[source_id].add(item.title) # 检查是否有历史数据 has_historical_data = any(len(titles) > 0 for titles in historical_titles.values()) if not has_historical_data: # 第一次抓取,没有"新增"概念 return {} # 检测新增 new_titles = {} for source_id, news_list in current_data.items.items(): hist_set = historical_titles.get(source_id, set()) for item in news_list: if item.title not in hist_set: if source_id not in new_titles: new_titles[source_id] = {} new_titles[source_id][item.title] = item return new_titles except Exception as e: print(f"[存储] 检测新标题失败: {e}") return {} def _is_first_crawl_today_impl(self, date: Optional[str] = None) -> bool: """ 检查是否是当天第一次抓取 Args: date: 日期字符串,默认为今天 Returns: 是否是第一次抓取 """ try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT COUNT(*) as count FROM crawl_records """) row = cursor.fetchone() count = 
row[0] if row else 0 # 如果只有一条或没有记录,视为第一次抓取 return count <= 1 except Exception as e: print(f"[存储] 检查首次抓取失败: {e}") return True def _get_crawl_times_impl(self, date: Optional[str] = None) -> List[str]: """ 获取指定日期的所有抓取时间列表 Args: date: 日期字符串,默认为今天 Returns: 抓取时间列表(按时间排序) """ try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT crawl_time FROM crawl_records ORDER BY crawl_time """) rows = cursor.fetchall() return [row[0] for row in rows] except Exception as e: print(f"[存储] 获取抓取时间列表失败: {e}") return [] # ======================================== # 时间段执行记录(调度系统) # ======================================== def _has_period_executed_impl(self, date_str: str, period_key: str, action: str) -> bool: """ 检查指定时间段的某个 action 今天是否已执行 Args: date_str: 日期字符串 YYYY-MM-DD period_key: 时间段 key action: 动作类型 (analyze / push) Returns: 是否已执行 """ try: conn = self._get_connection(date_str) cursor = conn.cursor() # 先检查表是否存在 cursor.execute(""" SELECT name FROM sqlite_master WHERE type='table' AND name='period_executions' """) if not cursor.fetchone(): return False cursor.execute(""" SELECT 1 FROM period_executions WHERE execution_date = ? AND period_key = ? AND action = ? 
""", (date_str, period_key, action)) return cursor.fetchone() is not None except Exception as e: print(f"[存储] 检查时间段执行记录失败: {e}") return False def _record_period_execution_impl(self, date_str: str, period_key: str, action: str) -> bool: """ 记录时间段的 action 执行 Args: date_str: 日期字符串 YYYY-MM-DD period_key: 时间段 key action: 动作类型 (analyze / push) Returns: 是否记录成功 """ try: conn = self._get_connection(date_str) cursor = conn.cursor() # 确保表存在 cursor.execute(""" CREATE TABLE IF NOT EXISTS period_executions ( id INTEGER PRIMARY KEY AUTOINCREMENT, execution_date TEXT NOT NULL, period_key TEXT NOT NULL, action TEXT NOT NULL, executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, UNIQUE(execution_date, period_key, action) ) """) now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") cursor.execute(""" INSERT OR IGNORE INTO period_executions (execution_date, period_key, action, executed_at) VALUES (?, ?, ?, ?) """, (date_str, period_key, action, now_str)) conn.commit() return True except Exception as e: print(f"[存储] 记录时间段执行失败: {e}") return False # ======================================== # RSS 数据存储 # ======================================== def _save_rss_data_impl(self, data: RSSData, log_prefix: str = "[存储]") -> tuple[bool, int, int]: """ 保存 RSS 数据到 SQLite(以 URL 为唯一标识) Args: data: RSS 数据 log_prefix: 日志前缀 Returns: (success, new_count, updated_count) """ try: conn = self._get_connection(data.date, db_type="rss") cursor = conn.cursor() now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") # 同步 RSS 源信息到 rss_feeds 表 for feed_id, feed_name in data.id_to_name.items(): cursor.execute(""" INSERT INTO rss_feeds (id, name, updated_at) VALUES (?, ?, ?) 
ON CONFLICT(id) DO UPDATE SET name = excluded.name, updated_at = excluded.updated_at """, (feed_id, feed_name, now_str)) # 统计计数器 new_count = 0 updated_count = 0 for feed_id, rss_list in data.items.items(): for item in rss_list: try: # 检查是否已存在(通过 URL + feed_id) if item.url: cursor.execute(""" SELECT id, title FROM rss_items WHERE url = ? AND feed_id = ? """, (item.url, feed_id)) existing = cursor.fetchone() if existing: # 已存在,更新记录 existing_id = existing[0] cursor.execute(""" UPDATE rss_items SET title = ?, published_at = ?, summary = ?, author = ?, last_crawl_time = ?, crawl_count = crawl_count + 1, updated_at = ? WHERE id = ? """, (item.title, item.published_at, item.summary, item.author, data.crawl_time, now_str, existing_id)) updated_count += 1 else: # 不存在,插入新记录(使用 ON CONFLICT 兜底处理并发/竞争场景) cursor.execute(""" INSERT INTO rss_items (title, feed_id, url, published_at, summary, author, first_crawl_time, last_crawl_time, crawl_count, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1, ?, ?) ON CONFLICT(url, feed_id) DO UPDATE SET title = excluded.title, published_at = excluded.published_at, summary = excluded.summary, author = excluded.author, last_crawl_time = excluded.last_crawl_time, crawl_count = crawl_count + 1, updated_at = excluded.updated_at """, (item.title, feed_id, item.url, item.published_at, item.summary, item.author, data.crawl_time, data.crawl_time, now_str, now_str)) new_count += 1 else: # URL 为空,用 try-except 处理重复 try: cursor.execute(""" INSERT INTO rss_items (title, feed_id, url, published_at, summary, author, first_crawl_time, last_crawl_time, crawl_count, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1, ?, ?) 
""", (item.title, feed_id, "", item.published_at, item.summary, item.author, data.crawl_time, data.crawl_time, now_str, now_str)) new_count += 1 except sqlite3.IntegrityError: # 重复的空 URL 条目,忽略 pass except sqlite3.Error as e: print(f"{log_prefix} 保存 RSS 条目失败 [{item.title[:30]}...]: {e}") total_items = new_count + updated_count # 记录抓取信息 cursor.execute(""" INSERT OR REPLACE INTO rss_crawl_records (crawl_time, total_items, created_at) VALUES (?, ?, ?) """, (data.crawl_time, total_items, now_str)) # 记录抓取状态 cursor.execute(""" SELECT id FROM rss_crawl_records WHERE crawl_time = ? """, (data.crawl_time,)) record_row = cursor.fetchone() if record_row: crawl_record_id = record_row[0] # 记录成功的源 for feed_id in data.items.keys(): cursor.execute(""" INSERT OR REPLACE INTO rss_crawl_status (crawl_record_id, feed_id, status) VALUES (?, ?, 'success') """, (crawl_record_id, feed_id)) # 记录失败的源 for failed_id in data.failed_ids: cursor.execute(""" INSERT OR IGNORE INTO rss_feeds (id, name, updated_at) VALUES (?, ?, ?) 
""", (failed_id, failed_id, now_str)) cursor.execute(""" INSERT OR REPLACE INTO rss_crawl_status (crawl_record_id, feed_id, status) VALUES (?, ?, 'failed') """, (crawl_record_id, failed_id)) conn.commit() return True, new_count, updated_count except Exception as e: print(f"{log_prefix} 保存 RSS 数据失败: {e}") return False, 0, 0 def _get_rss_data_impl(self, date: Optional[str] = None) -> Optional[RSSData]: """ 获取指定日期的所有 RSS 数据 Args: date: 日期字符串(YYYY-MM-DD),默认为今天 Returns: RSSData 对象,如果没有数据返回 None """ try: conn = self._get_connection(date, db_type="rss") cursor = conn.cursor() # 获取所有 RSS 数据 cursor.execute(""" SELECT i.id, i.title, i.feed_id, f.name as feed_name, i.url, i.published_at, i.summary, i.author, i.first_crawl_time, i.last_crawl_time, i.crawl_count FROM rss_items i LEFT JOIN rss_feeds f ON i.feed_id = f.id ORDER BY i.published_at DESC """) rows = cursor.fetchall() if not rows: return None items: Dict[str, List[RSSItem]] = {} id_to_name: Dict[str, str] = {} crawl_date = self._format_date_folder(date) for row in rows: feed_id = row[2] feed_name = row[3] or feed_id id_to_name[feed_id] = feed_name if feed_id not in items: items[feed_id] = [] items[feed_id].append(RSSItem( title=row[1], feed_id=feed_id, feed_name=feed_name, url=row[4] or "", published_at=row[5] or "", summary=row[6] or "", author=row[7] or "", crawl_time=row[9], first_time=row[8], last_time=row[9], count=row[10], )) # 获取最新的抓取时间 cursor.execute(""" SELECT crawl_time FROM rss_crawl_records ORDER BY crawl_time DESC LIMIT 1 """) time_row = cursor.fetchone() crawl_time = time_row[0] if time_row else self._format_time_filename() # 获取失败的源 cursor.execute(""" SELECT DISTINCT cs.feed_id FROM rss_crawl_status cs JOIN rss_crawl_records cr ON cs.crawl_record_id = cr.id WHERE cs.status = 'failed' """) failed_ids = [row[0] for row in cursor.fetchall()] return RSSData( date=crawl_date, crawl_time=crawl_time, items=items, id_to_name=id_to_name, failed_ids=failed_ids, ) except Exception as e: print(f"[存储] 读取 RSS 数据失败: 
{e}") return None def _detect_new_rss_items_impl(self, current_data: RSSData) -> Dict[str, List[RSSItem]]: """ 检测新增的 RSS 条目(增量模式) 该方法比较当前抓取数据与历史数据,找出新增的 RSS 条目。 关键逻辑:只有在历史批次中从未出现过的 URL 才算新增。 Args: current_data: 当前抓取的 RSS 数据 Returns: 新增的 RSS 条目 {feed_id: [RSSItem, ...]} """ try: # 获取历史数据 historical_data = self._get_rss_data_impl(current_data.date) if not historical_data: # 没有历史数据,所有都是新的 return current_data.items.copy() # 获取当前批次时间 current_time = current_data.crawl_time # 收集历史 URL(first_time < current_time 的条目) historical_urls: Dict[str, set] = {} for feed_id, rss_list in historical_data.items.items(): historical_urls[feed_id] = set() for item in rss_list: first_time = item.first_time or item.crawl_time if first_time < current_time: if item.url: historical_urls[feed_id].add(item.url) # 检查是否有历史数据 has_historical_data = any(len(urls) > 0 for urls in historical_urls.values()) if not has_historical_data: # 第一次抓取,没有"新增"概念 return {} # 检测新增 new_items: Dict[str, List[RSSItem]] = {} for feed_id, rss_list in current_data.items.items(): hist_set = historical_urls.get(feed_id, set()) for item in rss_list: # 通过 URL 判断是否新增 if item.url and item.url not in hist_set: if feed_id not in new_items: new_items[feed_id] = [] new_items[feed_id].append(item) return new_items except Exception as e: print(f"[存储] 检测新 RSS 条目失败: {e}") return {} def _get_latest_rss_data_impl(self, date: Optional[str] = None) -> Optional[RSSData]: """ 获取最新一次抓取的 RSS 数据(当前榜单模式) Args: date: 日期字符串(YYYY-MM-DD),默认为今天 Returns: 最新抓取的 RSS 数据,如果没有数据返回 None """ try: conn = self._get_connection(date, db_type="rss") cursor = conn.cursor() # 获取最新的抓取时间 cursor.execute(""" SELECT crawl_time FROM rss_crawl_records ORDER BY crawl_time DESC LIMIT 1 """) time_row = cursor.fetchone() if not time_row: return None latest_time = time_row[0] # 获取该时间的 RSS 数据 cursor.execute(""" SELECT i.id, i.title, i.feed_id, f.name as feed_name, i.url, i.published_at, i.summary, i.author, i.first_crawl_time, i.last_crawl_time, i.crawl_count FROM rss_items i 
LEFT JOIN rss_feeds f ON i.feed_id = f.id WHERE i.last_crawl_time = ? ORDER BY i.published_at DESC """, (latest_time,)) rows = cursor.fetchall() if not rows: return None items: Dict[str, List[RSSItem]] = {} id_to_name: Dict[str, str] = {} crawl_date = self._format_date_folder(date) for row in rows: feed_id = row[2] feed_name = row[3] or feed_id id_to_name[feed_id] = feed_name if feed_id not in items: items[feed_id] = [] items[feed_id].append(RSSItem( title=row[1], feed_id=feed_id, feed_name=feed_name, url=row[4] or "", published_at=row[5] or "", summary=row[6] or "", author=row[7] or "", crawl_time=row[9], first_time=row[8], last_time=row[9], count=row[10], )) # 获取失败的源(针对最新一次抓取) cursor.execute(""" SELECT cs.feed_id FROM rss_crawl_status cs JOIN rss_crawl_records cr ON cs.crawl_record_id = cr.id WHERE cr.crawl_time = ? AND cs.status = 'failed' """, (latest_time,)) failed_ids = [row[0] for row in cursor.fetchall()] return RSSData( date=crawl_date, crawl_time=latest_time, items=items, id_to_name=id_to_name, failed_ids=failed_ids, ) except Exception as e: print(f"[存储] 获取最新 RSS 数据失败: {e}") return None # ======================================== # AI 智能筛选 - 标签管理 # ======================================== def _get_active_tags_impl(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> List[Dict[str, Any]]: """获取指定兴趣文件的 active 标签列表""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT id, tag, description, version, prompt_hash, priority FROM ai_filter_tags WHERE status = 'active' AND interests_file = ? 
ORDER BY priority ASC, id ASC """, (interests_file,)) return [ { "id": row[0], "tag": row[1], "description": row[2], "version": row[3], "prompt_hash": row[4], "priority": row[5], } for row in cursor.fetchall() ] except Exception as e: print(f"[AI筛选] 获取标签失败: {e}") return [] def _get_latest_prompt_hash_impl(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> Optional[str]: """获取指定兴趣文件最新版本标签的 prompt_hash""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT prompt_hash FROM ai_filter_tags WHERE status = 'active' AND interests_file = ? ORDER BY version DESC LIMIT 1 """, (interests_file,)) row = cursor.fetchone() return row[0] if row else None except Exception as e: print(f"[AI筛选] 获取 prompt_hash 失败: {e}") return None def _get_latest_tag_version_impl(self, date: Optional[str] = None) -> int: """获取最新版本号""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT MAX(version) FROM ai_filter_tags """) row = cursor.fetchone() return row[0] if row and row[0] is not None else 0 except Exception as e: print(f"[AI筛选] 获取版本号失败: {e}") return 0 def _deprecate_all_tags_impl(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> int: """将指定兴趣文件的 active 标签和关联的分类结果标记为 deprecated""" try: conn = self._get_connection(date) cursor = conn.cursor() now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") # 获取该兴趣文件的 active 标签 id cursor.execute( "SELECT id FROM ai_filter_tags WHERE status = 'active' AND interests_file = ?", (interests_file,) ) tag_ids = [row[0] for row in cursor.fetchall()] if not tag_ids: return 0 # 废弃标签 placeholders = ",".join("?" * len(tag_ids)) cursor.execute(f""" UPDATE ai_filter_tags SET status = 'deprecated', deprecated_at = ? WHERE id IN ({placeholders}) """, [now_str] + tag_ids) tag_count = cursor.rowcount # 废弃关联的分类结果 placeholders = ",".join("?" 
* len(tag_ids)) cursor.execute(f""" UPDATE ai_filter_results SET status = 'deprecated', deprecated_at = ? WHERE tag_id IN ({placeholders}) AND status = 'active' """, [now_str] + tag_ids) conn.commit() print(f"[AI筛选] 已废弃 {tag_count} 个标签及关联分类结果") return tag_count except Exception as e: print(f"[AI筛选] 废弃标签失败: {e}") return 0 def _save_tags_impl( self, date: Optional[str], tags: List[Dict], version: int, prompt_hash: str, interests_file: str = "ai_interests.txt" ) -> int: """保存新提取的标签""" try: conn = self._get_connection(date) cursor = conn.cursor() now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") count = 0 for idx, tag_data in enumerate(tags, start=1): priority = tag_data.get("priority", idx) try: priority = int(priority) except (TypeError, ValueError): priority = idx cursor.execute(""" INSERT INTO ai_filter_tags (tag, description, priority, version, prompt_hash, interests_file, created_at) VALUES (?, ?, ?, ?, ?, ?, ?) """, ( tag_data["tag"], tag_data.get("description", ""), priority, version, prompt_hash, interests_file, now_str, )) count += 1 conn.commit() return count except Exception as e: print(f"[AI筛选] 保存标签失败: {e}") return 0 def _deprecate_specific_tags_impl( self, date: Optional[str], tag_ids: List[int] ) -> int: """废弃指定 ID 的标签及其关联分类结果(增量更新时使用)""" if not tag_ids: return 0 try: conn = self._get_connection(date) cursor = conn.cursor() now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") placeholders = ",".join("?" * len(tag_ids)) cursor.execute(f""" UPDATE ai_filter_tags SET status = 'deprecated', deprecated_at = ? WHERE id IN ({placeholders}) """, [now_str] + tag_ids) tag_count = cursor.rowcount cursor.execute(f""" UPDATE ai_filter_results SET status = 'deprecated', deprecated_at = ? 
WHERE tag_id IN ({placeholders}) AND status = 'active' """, [now_str] + tag_ids) conn.commit() return tag_count except Exception as e: print(f"[AI筛选] 废弃指定标签失败: {e}") return 0 def _update_tags_hash_impl( self, date: Optional[str], interests_file: str, new_hash: str ) -> int: """更新指定兴趣文件所有 active 标签的 prompt_hash(增量更新时使用)""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" UPDATE ai_filter_tags SET prompt_hash = ? WHERE interests_file = ? AND status = 'active' """, (new_hash, interests_file)) count = cursor.rowcount conn.commit() return count except Exception as e: print(f"[AI筛选] 更新标签 hash 失败: {e}") return 0 # ======================================== # AI 智能筛选 - 分类结果管理 # ======================================== def _update_tag_descriptions_impl( self, date: Optional[str], tag_updates: List[Dict], interests_file: str = "ai_interests.txt" ) -> int: """按 tag 名匹配,更新 active 标签的 description 字段""" try: conn = self._get_connection(date) cursor = conn.cursor() count = 0 for t in tag_updates: tag_name = t.get("tag", "") description = t.get("description", "") if not tag_name: continue cursor.execute(""" UPDATE ai_filter_tags SET description = ? WHERE tag = ? AND interests_file = ? AND status = 'active' """, (description, tag_name, interests_file)) count += cursor.rowcount conn.commit() return count except Exception as e: print(f"[AI筛选] 更新标签描述失败: {e}") return 0 def _update_tag_priorities_impl( self, date: Optional[str], tag_priorities: List[Dict], interests_file: str = "ai_interests.txt" ) -> int: """按 tag 名匹配,更新 active 标签的 priority 字段""" try: conn = self._get_connection(date) cursor = conn.cursor() count = 0 for t in tag_priorities: tag_name = t.get("tag", "") priority = t.get("priority") if not tag_name: continue try: priority = int(priority) except (TypeError, ValueError): continue cursor.execute(""" UPDATE ai_filter_tags SET priority = ? WHERE tag = ? AND interests_file = ? 
AND status = 'active' """, (priority, tag_name, interests_file)) count += cursor.rowcount conn.commit() return count except Exception as e: print(f"[AI筛选] 更新标签优先级失败: {e}") return 0 # ======================================== # AI 智能筛选 - 已分析新闻追踪 # ======================================== def _save_analyzed_news_impl( self, date: Optional[str], news_ids: List[int], source_type: str, interests_file: str, prompt_hash: str, matched_ids: set ) -> int: """批量记录已分析的新闻(匹配与不匹配都记录)""" try: conn = self._get_connection(date) cursor = conn.cursor() now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") count = 0 for nid in news_ids: try: cursor.execute(""" INSERT OR REPLACE INTO ai_filter_analyzed_news (news_item_id, source_type, interests_file, prompt_hash, matched, created_at) VALUES (?, ?, ?, ?, ?, ?) """, ( nid, source_type, interests_file, prompt_hash, 1 if nid in matched_ids else 0, now_str, )) count += 1 except Exception: pass conn.commit() return count except Exception as e: print(f"[AI筛选] 保存已分析记录失败: {e}") return 0 def _get_analyzed_news_ids_impl( self, date: Optional[str] = None, source_type: str = "hotlist", interests_file: str = "ai_interests.txt" ) -> set: """获取已分析过的新闻 ID 集合(用于去重)""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT news_item_id FROM ai_filter_analyzed_news WHERE source_type = ? AND interests_file = ? """, (source_type, interests_file)) return {row[0] for row in cursor.fetchall()} except Exception as e: print(f"[AI筛选] 获取已分析ID失败: {e}") return set() def _clear_analyzed_news_impl( self, date: Optional[str] = None, interests_file: str = "ai_interests.txt" ) -> int: """清除指定兴趣文件的所有已分析记录(全量重分类时使用)""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" DELETE FROM ai_filter_analyzed_news WHERE interests_file = ? 
""", (interests_file,)) count = cursor.rowcount conn.commit() return count except Exception as e: print(f"[AI筛选] 清除已分析记录失败: {e}") return 0 def _clear_unmatched_analyzed_news_impl( self, date: Optional[str] = None, interests_file: str = "ai_interests.txt" ) -> int: """清除不匹配的已分析记录,让这些新闻有机会被新标签重新分析""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" DELETE FROM ai_filter_analyzed_news WHERE interests_file = ? AND matched = 0 """, (interests_file,)) count = cursor.rowcount conn.commit() return count except Exception as e: print(f"[AI筛选] 清除不匹配记录失败: {e}") return 0 # ======================================== # AI 智能筛选 - 分类结果管理(原有) # ======================================== def _save_filter_results_impl( self, date: Optional[str], results: List[Dict] ) -> int: """批量保存分类结果""" try: conn = self._get_connection(date) cursor = conn.cursor() now_str = self._get_configured_time().strftime("%Y-%m-%d %H:%M:%S") count = 0 for r in results: try: cursor.execute(""" INSERT INTO ai_filter_results (news_item_id, source_type, tag_id, relevance_score, created_at) VALUES (?, ?, ?, ?, ?) 
""", ( r["news_item_id"], r.get("source_type", "hotlist"), r["tag_id"], r.get("relevance_score", 0.0), now_str, )) count += 1 except sqlite3.IntegrityError: pass # 重复记录,跳过 conn.commit() return count except Exception as e: print(f"[AI筛选] 保存分类结果失败: {e}") return 0 def _get_active_filter_results_impl(self, date: Optional[str] = None, interests_file: str = "ai_interests.txt") -> List[Dict[str, Any]]: """获取指定兴趣文件的 active 分类结果,JOIN news_items 获取新闻详情""" try: conn = self._get_connection(date) cursor = conn.cursor() # 热榜结果 cursor.execute(""" SELECT r.news_item_id, r.source_type, r.tag_id, r.relevance_score, t.tag, t.description as tag_description, t.priority, n.title, n.platform_id as source_id, p.name as source_name, n.url, n.mobile_url, n.rank, n.first_crawl_time, n.last_crawl_time, n.crawl_count FROM ai_filter_results r JOIN ai_filter_tags t ON r.tag_id = t.id JOIN news_items n ON r.news_item_id = n.id LEFT JOIN platforms p ON n.platform_id = p.id WHERE r.status = 'active' AND r.source_type = 'hotlist' AND t.status = 'active' AND t.interests_file = ? ORDER BY t.priority ASC, t.id ASC, r.relevance_score DESC """, (interests_file,)) results = [] hotlist_news_ids = [] for row in cursor.fetchall(): results.append({ "news_item_id": row[0], "source_type": row[1], "tag_id": row[2], "relevance_score": row[3], "tag": row[4], "tag_description": row[5], "tag_priority": row[6], "title": row[7], "source_id": row[8], "source_name": row[9] or row[8], "url": row[10] or "", "mobile_url": row[11] or "", "rank": row[12], "first_time": row[13], "last_time": row[14], "count": row[15], }) hotlist_news_ids.append(row[0]) # 批量查排名历史(热榜) ranks_map: Dict[int, List[int]] = {} if hotlist_news_ids: unique_ids = list(set(hotlist_news_ids)) placeholders = ",".join("?" 
* len(unique_ids)) cursor.execute(f""" SELECT news_item_id, rank FROM rank_history WHERE news_item_id IN ({placeholders}) AND rank != 0 """, unique_ids) for rh_row in cursor.fetchall(): nid, rank = rh_row[0], rh_row[1] if nid not in ranks_map: ranks_map[nid] = [] if rank not in ranks_map[nid]: ranks_map[nid].append(rank) for item in results: item["ranks"] = ranks_map.get(item["news_item_id"], [item["rank"]]) # RSS 结果(如果有 rss 库) try: rss_conn = self._get_connection(date, db_type="rss") rss_cursor = rss_conn.cursor() # 从 news 库获取 rss 类型的分类结果 ID cursor.execute(""" SELECT r.news_item_id, r.tag_id, r.relevance_score, t.tag, t.description, t.priority FROM ai_filter_results r JOIN ai_filter_tags t ON r.tag_id = t.id WHERE r.status = 'active' AND r.source_type = 'rss' AND t.status = 'active' AND t.interests_file = ? ORDER BY t.priority ASC, t.id ASC, r.relevance_score DESC """, (interests_file,)) rss_filter_rows = cursor.fetchall() if rss_filter_rows: rss_ids = [row[0] for row in rss_filter_rows] placeholders = ",".join("?" 
* len(rss_ids)) rss_cursor.execute(f""" SELECT i.id, i.title, i.feed_id, f.name as feed_name, i.url, i.published_at FROM rss_items i LEFT JOIN rss_feeds f ON i.feed_id = f.id WHERE i.id IN ({placeholders}) """, rss_ids) rss_info = {row[0]: row for row in rss_cursor.fetchall()} for fr_row in rss_filter_rows: rss_id = fr_row[0] info = rss_info.get(rss_id) if info: results.append({ "news_item_id": rss_id, "source_type": "rss", "tag_id": fr_row[1], "relevance_score": fr_row[2], "tag": fr_row[3], "tag_description": fr_row[4], "tag_priority": fr_row[5], "title": info[1], "source_id": info[2], "source_name": info[3] or info[2], "url": info[4] or "", "mobile_url": "", "rank": 0, "ranks": [], "first_time": info[5] or "", "last_time": info[5] or "", "count": 1, }) except Exception: pass # RSS 库不存在时静默跳过 return results except Exception as e: print(f"[AI筛选] 获取分类结果失败: {e}") return [] def _get_all_news_ids_impl(self, date: Optional[str] = None) -> List[Dict]: """获取当日所有新闻的 id 和标题(用于 AI 筛选分类)""" try: conn = self._get_connection(date) cursor = conn.cursor() cursor.execute(""" SELECT n.id, n.title, n.platform_id, p.name as platform_name FROM news_items n LEFT JOIN platforms p ON n.platform_id = p.id ORDER BY n.id """) return [ { "id": row[0], "title": row[1], "source_id": row[2], "source_name": row[3] or row[2], } for row in cursor.fetchall() ] except Exception as e: print(f"[AI筛选] 获取新闻列表失败: {e}") return [] def _get_all_rss_ids_impl(self, date: Optional[str] = None) -> List[Dict]: """获取当日所有 RSS 条目的 id 和标题(用于 AI 筛选分类)""" try: conn = self._get_connection(date, db_type="rss") cursor = conn.cursor() cursor.execute(""" SELECT i.id, i.title, i.feed_id, f.name as feed_name, i.published_at FROM rss_items i LEFT JOIN rss_feeds f ON i.feed_id = f.id ORDER BY i.id """) return [ { "id": row[0], "title": row[1], "source_id": row[2], "source_name": row[3] or row[2], "published_at": row[4] or "", } for row in cursor.fetchall() ] except Exception as e: print(f"[AI筛选] 获取 RSS 列表失败: {e}") return [] 
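The period-execution helpers in this file (`_has_period_executed_impl` and `_record_period_execution_impl`) rely on a `UNIQUE(execution_date, period_key, action)` constraint together with `INSERT OR IGNORE`, so recording the same period and action twice on one day is a no-op. The following is a minimal standalone sketch of that pattern, assuming an in-memory SQLite database. The free functions `record_period_execution` and `has_period_executed` are illustrative names, not part of the project's API, and `executed_at` here falls back to SQLite's `CURRENT_TIMESTAMP` instead of the configured timezone.

```python
import sqlite3


def record_period_execution(conn: sqlite3.Connection, date_str: str,
                            period_key: str, action: str) -> None:
    """Record that `action` ran for `period_key` on `date_str` (idempotent)."""
    # Create the table lazily, using the same columns as the
    # period_executions schema in _record_period_execution_impl.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS period_executions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            execution_date TEXT NOT NULL,
            period_key TEXT NOT NULL,
            action TEXT NOT NULL,
            executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            UNIQUE(execution_date, period_key, action)
        )
    """)
    # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint,
    # so a repeated call leaves the existing row untouched.
    conn.execute(
        "INSERT OR IGNORE INTO period_executions "
        "(execution_date, period_key, action) VALUES (?, ?, ?)",
        (date_str, period_key, action),
    )
    conn.commit()


def has_period_executed(conn: sqlite3.Connection, date_str: str,
                        period_key: str, action: str) -> bool:
    """Return True if `action` was already recorded for this period and date."""
    # A missing table means nothing has been recorded yet.
    table = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='table' AND name='period_executions'"
    ).fetchone()
    if table is None:
        return False
    row = conn.execute(
        "SELECT 1 FROM period_executions "
        "WHERE execution_date = ? AND period_key = ? AND action = ?",
        (date_str, period_key, action),
    ).fetchone()
    return row is not None
```

Calling `record_period_execution` twice with the same arguments leaves a single row. That property is what would let a once-only period skip a repeated push or analysis on the same day.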
================================================
FILE: trendradar/utils/__init__.py
================================================
# coding=utf-8
"""
Utilities package: shared helper functions
"""

from trendradar.utils.time import (
    get_configured_time,
    format_date_folder,
    format_time_filename,
    get_current_time_display,
    convert_time_for_display,
)
from trendradar.utils.url import normalize_url, get_url_signature

__all__ = [
    "get_configured_time",
    "format_date_folder",
    "format_time_filename",
    "get_current_time_display",
    "convert_time_for_display",
    "normalize_url",
    "get_url_signature",
]


================================================
FILE: trendradar/utils/time.py
================================================
# coding=utf-8
"""
Time utilities

This module provides unified time-handling functions. All timezone-related
operations should use the DEFAULT_TIMEZONE constant.
"""

from datetime import datetime
from typing import Optional, Tuple

import pytz

# Default timezone constant. Fallback only; during normal operation the value
# of app.timezone in config.yaml is used.
DEFAULT_TIMEZONE = "Asia/Shanghai"


def get_configured_time(timezone: str = DEFAULT_TIMEZONE) -> datetime:
    """
    Get the current time in the configured timezone

    Args:
        timezone: Timezone name, e.g. 'Asia/Shanghai', 'America/Los_Angeles'

    Returns:
        The current time as a timezone-aware datetime
    """
    try:
        tz = pytz.timezone(timezone)
    except pytz.UnknownTimeZoneError:
        print(f"[警告] 未知时区 '{timezone}',使用默认时区 {DEFAULT_TIMEZONE}")
        tz = pytz.timezone(DEFAULT_TIMEZONE)
    return datetime.now(tz)


def format_date_folder(
    date: Optional[str] = None, timezone: str = DEFAULT_TIMEZONE
) -> str:
    """
    Format the date folder name (ISO format: YYYY-MM-DD)

    Args:
        date: Date string to use; if None, the current date is used
        timezone: Timezone name

    Returns:
        The formatted date string, e.g. '2025-12-09'
    """
    if date:
        return date
    return get_configured_time(timezone).strftime("%Y-%m-%d")


def format_time_filename(timezone: str = DEFAULT_TIMEZONE) -> str:
    """
    Format the time for use in file names (format: HH-MM)

    Windows does not allow colons in file names, so hyphens are used instead.

    Args:
        timezone: Timezone name

    Returns:
        The formatted time string, e.g. '15-30'
    """
    return get_configured_time(timezone).strftime("%H-%M")


def get_current_time_display(timezone: str = DEFAULT_TIMEZONE) -> str:
    """
    Get the current time for display (format: HH:MM)

    Args:
        timezone: Timezone name

    Returns:
        The formatted time string, e.g. '15:30'
    """
    return get_configured_time(timezone).strftime("%H:%M")


def convert_time_for_display(time_str: str) -> str:
    """
    Convert an HH-MM time string to HH:MM for display

    Args:
        time_str: Input time string, e.g. '15-30'

    Returns:
        The converted time string, e.g. '15:30'
    """
    if time_str and "-" in time_str and len(time_str) == 5:
        return time_str.replace("-", ":")
    return time_str


def format_iso_time_friendly(
    iso_time: str,
    timezone: str = DEFAULT_TIMEZONE,
    include_date: bool = True,
) -> str:
    """
    Convert an ISO-format time into a friendly display format in the user's timezone

    Args:
        iso_time: ISO-format time string, e.g. '2025-12-29T00:20:00' or '2025-12-29T00:20:00+00:00'
        timezone: Target timezone name
        include_date: Whether to include the date part

    Returns:
        The friendly time string, e.g. '12-29 08:20' or '08:20'
    """
    if not iso_time:
        return ""

    try:
        # Try to parse the various ISO formats
        dt = None

        # Try the timezone-aware format
        if "+" in iso_time or iso_time.endswith("Z"):
            iso_time = iso_time.replace("Z", "+00:00")
            try:
                dt = datetime.fromisoformat(iso_time)
            except ValueError:
                pass

        # Try the timezone-naive format (assumed to be UTC)
        if dt is None:
            try:
                # Handle the T separator
                if "T" in iso_time:
                    dt = datetime.fromisoformat(iso_time.replace("T", " ").split(".")[0])
                else:
                    dt = datetime.fromisoformat(iso_time.split(".")[0])
                # Assume the time is UTC
                dt = pytz.UTC.localize(dt)
            except ValueError:
                pass

        if dt is None:
            # Cannot parse; return a simplified version of the original string
            if "T" in iso_time:
                parts = iso_time.split("T")
                if len(parts) == 2:
                    date_part = parts[0][5:]  # MM-DD
                    time_part = parts[1][:5]  # HH:MM
                    return f"{date_part} {time_part}" if include_date else time_part
            return iso_time

        # Convert to the target timezone
        try:
            target_tz = pytz.timezone(timezone)
        except pytz.UnknownTimeZoneError:
            target_tz = pytz.timezone(DEFAULT_TIMEZONE)

        dt_local = dt.astimezone(target_tz)

        # Format the output
        if include_date:
            return dt_local.strftime("%m-%d %H:%M")
        else:
            return dt_local.strftime("%H:%M")

    except Exception:
        # On error, return a simplified version of the original string
        if "T" in iso_time:
            parts = iso_time.split("T")
            if len(parts) == 2:
                date_part = parts[0][5:]  # MM-DD
                time_part = parts[1][:5]  # HH:MM
                return f"{date_part} {time_part}" if include_date else time_part
        return iso_time


def is_within_days(
    iso_time: str,
    max_days: int,
    timezone: str = DEFAULT_TIMEZONE,
) -> bool:
    """
    Check whether an ISO
    format time is within the given number of days

    Used for RSS article freshness filtering: determines whether an article's
    publication time is older than the given number of days.

    Args:
        iso_time: ISO-format time string (e.g. '2025-12-29T00:20:00', or with a timezone)
        max_days: Maximum age in days (returns True if the article was published
            no more than this many days ago)
            - max_days > 0: normal filtering, keep articles from the last N days
            - max_days <= 0: filtering disabled, keep all articles
        timezone: Timezone name (used to get the current time)

    Returns:
        True if the time is within the given number of days (keep the article),
        False if it is older (filter it out).
        If the time cannot be parsed, returns True (keep the article).
    """
    # Keep the article when there is no timestamp or filtering is disabled
    if not iso_time:
        return True

    if max_days <= 0:
        return True  # max_days <= 0 disables filtering

    try:
        dt = None

        # Try the timezone-aware format
        if "+" in iso_time or iso_time.endswith("Z"):
            iso_time_normalized = iso_time.replace("Z", "+00:00")
            try:
                dt = datetime.fromisoformat(iso_time_normalized)
            except ValueError:
                pass

        # Try the timezone-naive format (assumed to be UTC)
        if dt is None:
            try:
                if "T" in iso_time:
                    dt = datetime.fromisoformat(iso_time.replace("T", " ").split(".")[0])
                else:
                    dt = datetime.fromisoformat(iso_time.split(".")[0])
                dt = pytz.UTC.localize(dt)
            except ValueError:
                pass

        if dt is None:
            # Cannot parse the time; keep the article
            return True

        # Get the current time (configured timezone, timezone-aware)
        now = get_configured_time(timezone)

        # Compute the difference (subtracting two aware datetimes accounts for
        # the timezone offset automatically)
        diff = now - dt
        days_diff = diff.total_seconds() / (24 * 60 * 60)

        return days_diff <= max_days

    except Exception:
        # On error, keep the article
        return True


def calculate_days_old(iso_time: str, timezone: str = DEFAULT_TIMEZONE) -> Optional[float]:
    """
    Compute how many days ago an ISO-format time was

    Args:
        iso_time: ISO-format time string
        timezone: Timezone name

    Returns:
        The age in days (float), or None if the time cannot be parsed
    """
    if not iso_time:
        return None

    try:
        dt = None

        # Try the timezone-aware format
        if "+" in iso_time or iso_time.endswith("Z"):
            iso_time_normalized = iso_time.replace("Z", "+00:00")
            try:
                dt = datetime.fromisoformat(iso_time_normalized)
            except ValueError:
                pass

        # Try the timezone-naive format (assumed to be UTC)
        if dt is None:
            try:
                if "T" in iso_time:
                    dt = datetime.fromisoformat(iso_time.replace("T", " ").split(".")[0])
                else:
                    dt = datetime.fromisoformat(iso_time.split(".")[0])
                dt = pytz.UTC.localize(dt)
            except ValueError:
                pass

        if dt is None:
            return None

        now = get_configured_time(timezone)
        diff = now - dt
        return diff.total_seconds() / (24 * 60 * 60)

    except Exception:
        return None


class TimeWindowChecker:
    """
    Time window checker

    Centralizes the time-window control logic. Supports:
    - Push window control (push_window)
    - AI analysis window control (analysis_window)
    - The once_per_day feature
    """

    def __init__(
        self,
        storage_backend,
        get_time_func=None,
        window_name: str = "时间窗口",
    ):
        """
        Initialize the time window checker

        Args:
            storage_backend: Storage backend instance
            get_time_func: Function that returns the current time
            window_name: Window name (used in log output)
        """
        self.storage_backend = storage_backend
        self.get_time_func = get_time_func or (lambda: get_configured_time(DEFAULT_TIMEZONE))
        self.window_name = window_name

    def is_in_time_range(self, start_time: str, end_time: str) -> bool:
        """
        Check whether the current time is within the given time range

        Cross-midnight windows are supported, for example:
        - Normal window: 09:00-21:00 (09:00 to 21:00 on the same day)
        - Cross-midnight window: 22:00-02:00 (22:00 to 02:00 the next day)

        Args:
            start_time: Start time (format: HH:MM)
            end_time: End time (format: HH:MM)

        Returns:
            Whether the current time is within the range
        """
        now = self.get_time_func()
        current_time = now.strftime("%H:%M")

        normalized_start = self._normalize_time(start_time)
        normalized_end = self._normalize_time(end_time)
        normalized_current = self._normalize_time(current_time)

        # Detect a cross-midnight window (start > end means the window crosses
        # midnight, e.g. 22:00-02:00)
        if normalized_start <= normalized_end:
            # Normal window: 09:00-21:00
            result = normalized_start <= normalized_current <= normalized_end
        else:
            # Cross-midnight window: 22:00-02:00
            # current >= start (e.g. 23:00 >= 22:00) or current <= end (e.g. 01:00 <= 02:00)
            result = normalized_current >= normalized_start or normalized_current <= normalized_end

        if not result:
            print(f"[{self.window_name}] 当前 {normalized_current},窗口 {normalized_start}-{normalized_end}")

        return result

    def _normalize_time(self, time_str: str) -> str:
        """Normalize a time string to HH:MM format"""
        try:
            parts = time_str.strip().split(":")
            if len(parts) != 2:
                raise ValueError(f"时间格式错误: {time_str}")

            hour = int(parts[0])
            minute = int(parts[1])

            if not (0 <= hour <= 23 and 0 <= minute <= 59):
                raise ValueError(f"时间范围错误: {time_str}")

            return f"{hour:02d}:{minute:02d}"
        except Exception as e:
            print(f"[{self.window_name}] 时间格式化错误 '{time_str}': {e}")
            return time_str

    def check_window(
        self,
        window_config: dict,
        check_once_per_day_func=None,
        record_func=None,
    ) -> Tuple[bool, str]:
        """
        Unified time-window check

        Args:
            window_config: Window configuration dict, containing:
                - ENABLED: whether window control is enabled
                - TIME_RANGE: {"START": "HH:MM", "END": "HH:MM"}
                - ONCE_PER_DAY: whether to run only once per day
            check_once_per_day_func: Function that checks whether it has already run today
            record_func: Function that records the execution (called after success)

        Returns:
            A (should_proceed, reason) tuple:
            - should_proceed: whether execution should continue
            - reason: explanation of the decision
        """
        if not window_config.get("ENABLED", False):
            return True, "窗口控制未启用"

        time_range = window_config.get("TIME_RANGE", {})
        start_time = time_range.get("START", "00:00")
        end_time = time_range.get("END", "23:59")

        # Check the time range
        if not self.is_in_time_range(start_time, end_time):
            now = self.get_time_func()
            return False, f"当前时间 {now.strftime('%H:%M')} 不在窗口 {start_time}-{end_time} 内"

        # Check once_per_day
        if window_config.get("ONCE_PER_DAY", False) and check_once_per_day_func:
            if check_once_per_day_func():
                return False, "今天已执行过"
            else:
                print(f"[{self.window_name}] 今天首次执行")

        return True, "在窗口内"

    def get_status(self, window_config: dict, check_once_per_day_func=None) -> dict:
        """
        Get window status information

        Args:
            window_config: Window configuration
            check_once_per_day_func: Function that checks whether it has already run today

        Returns:
            Status information dict
        """
        now = self.get_time_func()
        status = {
            "enabled": window_config.get("ENABLED", False),
            "current_time": now.strftime("%H:%M:%S"),
            "current_date": now.strftime("%Y-%m-%d"),
            "timezone": str(now.tzinfo),
        }

        if status["enabled"]:
            time_range = window_config.get("TIME_RANGE", {})
            status["window_start"] = time_range.get("START", "00:00")
            status["window_end"] = time_range.get("END", "23:59")
            status["in_window"] = self.is_in_time_range(
                status["window_start"], status["window_end"]
            )
            status["once_per_day"] = window_config.get("ONCE_PER_DAY", False)

            if status["once_per_day"] and check_once_per_day_func:
                status["executed_today"] = check_once_per_day_func()

        return status


================================================
FILE: trendradar/utils/url.py
================================================
# coding=utf-8
"""
URL utilities

Provides URL normalization, used during deduplication to eliminate the effect
of dynamic parameters:
- normalize_url: normalize a URL by removing dynamic parameters
"""

from urllib.parse import urlparse, urlunparse, parse_qs, urlencode
from typing import Dict, Set

# Platform-specific parameters to remove
# - weibo: has the dynamic parameters band_rank (rank) and
#   Refer (source)
# - other platforms: URLs are path-based or simple keyword queries, so no processing is needed
PLATFORM_PARAMS_TO_REMOVE: Dict[str, Set[str]] = {
    # Weibo: band_rank is a dynamic ranking parameter, Refer is a source parameter,
    # t is a time-range parameter
    # Example: https://s.weibo.com/weibo?q=xxx&t=31&band_rank=1&Refer=top
    # Keep: q (keyword)
    # Remove: band_rank, Refer, t
    "weibo": {"band_rank", "Refer", "t"},
}

# Common tracking parameters (apply to all platforms)
# These are usually added by share links or ad tracking and do not affect
# content identification
COMMON_TRACKING_PARAMS: Set[str] = {
    # UTM tracking parameters
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    # Common tracking parameters
    "ref", "referrer", "source", "channel",
    # Timestamp and random parameters
    "_t", "timestamp", "_", "random",
    # Share-related parameters
    "share_token", "share_id", "share_from",
}


def normalize_url(url: str, platform_id: str = "") -> str:
    """
    Normalize a URL by removing dynamic parameters

    Used for database deduplication, so that different URL variants of the same
    news item are recognized as the same item.

    Rules:
    1. Remove platform-specific dynamic parameters (e.g. Weibo's band_rank)
    2. Remove common tracking parameters (e.g. utm_*)
    3. Keep core query parameters (e.g. search keywords q=, wd=, keyword=)
    4. Sort query parameters alphabetically (for consistency)

    Args:
        url: Original URL
        platform_id: Platform ID, used to apply platform-specific rules

    Returns:
        The normalized URL

    Examples:
        >>> normalize_url("https://s.weibo.com/weibo?q=test&band_rank=6&Refer=top", "weibo")
        'https://s.weibo.com/weibo?q=test'

        >>> normalize_url("https://example.com/page?id=1&utm_source=twitter", "")
        'https://example.com/page?id=1'
    """
    if not url:
        return url

    try:
        # Parse the URL
        parsed = urlparse(url)

        # No query parameters: return the URL unchanged
        if not parsed.query:
            return url

        # Parse the query parameters
        params = parse_qs(parsed.query, keep_blank_values=True)

        # Collect the parameters to remove (compared in lowercase)
        params_to_remove: Set[str] = set()

        # Add the common tracking parameters
        params_to_remove.update(COMMON_TRACKING_PARAMS)

        # Add the platform-specific parameters
        if platform_id and platform_id in PLATFORM_PARAMS_TO_REMOVE:
            params_to_remove.update(PLATFORM_PARAMS_TO_REMOVE[platform_id])

        # Filter the parameters (names are lowercased for comparison)
        filtered_params = {
            key: values
            for key, values in params.items()
            if key.lower() not in {p.lower() for p in params_to_remove}
        }

        # If no parameters remain, return the URL without a query string
        if not filtered_params:
            return urlunparse((
                parsed.scheme,
                parsed.netloc,
                parsed.path,
                parsed.params,
                "",  # empty query string
                ""   # drop the fragment
            ))

        # Rebuild the query string (sorted alphabetically for consistency)
        sorted_params = []
        for key in sorted(filtered_params.keys()):
            for value in filtered_params[key]:
                sorted_params.append((key, value))

        new_query = urlencode(sorted_params)

        # Rebuild the URL (dropping the fragment)
        normalized = urlunparse((
            parsed.scheme,
            parsed.netloc,
            parsed.path,
            parsed.params,
            new_query,
            ""  # drop the fragment
        ))

        return normalized

    except Exception:
        # If parsing fails, return the original URL
        return url


def get_url_signature(url: str, platform_id: str = "") -> str:
    """
    Get a signature for a URL (for fast comparison).

    The signature is based on the normalized URL and can be used to:
    - Quickly determine whether two URLs point to the same content
    - Serve as a cache key

    Args:
        url: original URL
        platform_id: platform ID

    Returns:
        URL signature string
    """
    return normalize_url(url, platform_id)

================================================
FILE: version
================================================
6.5.0

================================================
FILE: version_configs
================================================
config.yaml=2.2.0
timeline.yaml=1.2.0
frequency_words.txt=1.1.0
ai_interests.txt=1.0.0
ai_analysis_prompt.txt=2.0.0
ai_translation_prompt.txt=1.2.0

================================================
FILE: version_mcp
================================================
4.0.0

RSS Subscription Content Subscription items """
    html += f"{total_count} items"
    html += """Generated at """

    # Use the provided time function, or fall back to datetime.now
    if get_time_func:
        now = get_time_func()
    else:
        now = datetime.now()
    html += now.strftime("%m-%d %H:%M")
    html += """"""

    # Group by feed_id
    feeds_map: Dict[str, List[Dict]] = {}
    for item in rss_items:
        feed_id = item.get("feed_id", "unknown")
        if feed_id not in feeds_map:
            feeds_map[feed_id] = []
        feeds_map[feed_id].append(item)

    # Render the content of each RSS feed
    for feed_id, items in feeds_map.items():
        feed_name = items[0].get("feed_name", feed_id) if items else feed_id
        if feeds_info and feed_id in feeds_info:
            feed_name = feeds_info[feed_id]
        escaped_feed_name = html_escape(feed_name)
        html += f""""""
        html += """"""
        for item in items:
            escaped_title = html_escape(item.get("title", ""))
            url = item.get("url", "")
            published_at = item.get("published_at", "")
            author = item.get("author", "")
            summary = item.get("summary", "")
            html += """{escaped_feed_name}{len(items)} items"""
            html += """"""
            if url:
                escaped_url = html_escape(url)
                html += f'{escaped_title}'
            else:
                html += escaped_title
            html += """"""
            if summary:
                escaped_summary = html_escape(summary)
                html += f"""{escaped_summary}
"""
            html += """