Repository: DemonDamon/Listed-company-news-crawl-and-text-analysis
Branch: main
Commit: d7a20a1f7ee8
Files: 293
Total size: 2.5 MB
Directory structure:
gitextract_w0u594fz/
├── .deepsource.toml
├── .gitignore
├── LICENSE
├── README.md
├── README_zn.md
├── backend/
│ ├── .gitignore
│ ├── README.md
│ ├── README_zn.md
│ ├── add_raw_html_column.py
│ ├── app/
│ │ ├── __init__.py
│ │ ├── agents/
│ │ │ ├── __init__.py
│ │ │ ├── data_collector.py
│ │ │ ├── data_collector_v2.py
│ │ │ ├── debate_agents.py
│ │ │ ├── news_analyst.py
│ │ │ ├── orchestrator.py
│ │ │ ├── quantitative_agent.py
│ │ │ └── search_analyst.py
│ │ ├── alpha_mining/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── backtest/
│ │ │ │ ├── __init__.py
│ │ │ │ └── evaluator.py
│ │ │ ├── config.py
│ │ │ ├── dsl/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── ops.py
│ │ │ │ └── vocab.py
│ │ │ ├── features/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── market.py
│ │ │ │ └── sentiment.py
│ │ │ ├── model/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── alpha_generator.py
│ │ │ │ └── trainer.py
│ │ │ ├── tools/
│ │ │ │ ├── __init__.py
│ │ │ │ └── alpha_mining_tool.py
│ │ │ ├── utils.py
│ │ │ └── vm/
│ │ │ ├── __init__.py
│ │ │ └── factor_vm.py
│ │ ├── api/
│ │ │ ├── __init__.py
│ │ │ └── v1/
│ │ │ ├── __init__.py
│ │ │ ├── agents.py
│ │ │ ├── alpha_mining.py
│ │ │ ├── analysis.py
│ │ │ ├── debug.py
│ │ │ ├── knowledge_graph.py
│ │ │ ├── llm_config.py
│ │ │ ├── news.py
│ │ │ ├── news_v2.py
│ │ │ ├── stocks.py
│ │ │ └── tasks.py
│ │ ├── config/
│ │ │ ├── __init__.py
│ │ │ └── debate_modes.yaml
│ │ ├── core/
│ │ │ ├── __init__.py
│ │ │ ├── celery_app.py
│ │ │ ├── config.py
│ │ │ ├── database.py
│ │ │ ├── neo4j_client.py
│ │ │ └── redis_client.py
│ │ ├── financial/
│ │ │ ├── __init__.py
│ │ │ ├── models/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── news.py
│ │ │ │ └── stock.py
│ │ │ ├── providers/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── base.py
│ │ │ │ ├── eastmoney/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── fetchers/
│ │ │ │ │ │ ├── __init__.py
│ │ │ │ │ │ └── news.py
│ │ │ │ │ └── provider.py
│ │ │ │ ├── nbd/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── fetchers/
│ │ │ │ │ │ ├── __init__.py
│ │ │ │ │ │ └── news.py
│ │ │ │ │ └── provider.py
│ │ │ │ ├── netease/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── fetchers/
│ │ │ │ │ │ ├── __init__.py
│ │ │ │ │ │ └── news.py
│ │ │ │ │ └── provider.py
│ │ │ │ ├── sina/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── fetchers/
│ │ │ │ │ │ ├── __init__.py
│ │ │ │ │ │ └── news.py
│ │ │ │ │ └── provider.py
│ │ │ │ ├── tencent/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── fetchers/
│ │ │ │ │ │ ├── __init__.py
│ │ │ │ │ │ └── news.py
│ │ │ │ │ └── provider.py
│ │ │ │ └── yicai/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── fetchers/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ └── news.py
│ │ │ │ └── provider.py
│ │ │ ├── registry.py
│ │ │ └── tools.py
│ │ ├── knowledge/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── graph_models.py
│ │ │ ├── graph_service.py
│ │ │ ├── knowledge_extractor.py
│ │ │ └── parallel_search.py
│ │ ├── main.py
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ ├── analysis.py
│ │ │ ├── crawl_task.py
│ │ │ ├── database.py
│ │ │ ├── debate_history.py
│ │ │ ├── news.py
│ │ │ └── stock.py
│ │ ├── scripts/
│ │ │ └── init_stocks.py
│ │ ├── services/
│ │ │ ├── __init__.py
│ │ │ ├── analysis_service.py
│ │ │ ├── embedding_service.py
│ │ │ ├── llm_service.py
│ │ │ └── stock_data_service.py
│ │ ├── storage/
│ │ │ ├── __init__.py
│ │ │ └── vector_storage.py
│ │ ├── tasks/
│ │ │ ├── __init__.py
│ │ │ └── crawl_tasks.py
│ │ └── tools/
│ │ ├── __init__.py
│ │ ├── bochaai_search.py
│ │ ├── caijing_crawler.py
│ │ ├── crawler_base.py
│ │ ├── crawler_enhanced.py
│ │ ├── dynamic_crawler_example.py
│ │ ├── eastmoney_crawler.py
│ │ ├── eeo_crawler.py
│ │ ├── interactive_crawler.py
│ │ ├── jingji21_crawler.py
│ │ ├── jwview_crawler.py
│ │ ├── nbd_crawler.py
│ │ ├── netease163_crawler.py
│ │ ├── search_engine_crawler.py
│ │ ├── sina_crawler.py
│ │ ├── tencent_crawler.py
│ │ ├── text_cleaner.py
│ │ └── yicai_crawler.py
│ ├── clear_news_data.py
│ ├── env.example
│ ├── init_db.py
│ ├── init_knowledge_graph.py
│ ├── requirements.txt
│ ├── reset_database.py
│ ├── setup_env.sh
│ ├── start.sh
│ ├── start_celery.sh
│ └── tests/
│ ├── __init__.py
│ ├── check_milvus_data.py
│ ├── check_news_embedding_status.py
│ ├── financial/
│ │ ├── __init__.py
│ │ ├── test_smoke_openbb_models.py
│ │ ├── test_smoke_openbb_provider.py
│ │ └── test_smoke_openbb_tools.py
│ ├── manual_vectorize.py
│ ├── test_alpha_mining/
│ │ ├── __init__.py
│ │ ├── test_integration_p2.py
│ │ ├── test_smoke_p0.py
│ │ └── test_smoke_p1.py
│ └── test_smoke_alpha_mining.py
├── deploy/
│ ├── Dockerfile.celery
│ ├── celery-entrypoint.sh
│ └── docker-compose.dev.yml
├── docs/
│ ├── BochaAI_Web_Search_API_20251222_121535.md
│ └── 天眼查MCP服务_20260104_171528.md
├── frontend/
│ ├── .gitignore
│ ├── QUICKSTART.md
│ ├── README.md
│ ├── index.html
│ ├── package.json
│ ├── postcss.config.js
│ ├── src/
│ │ ├── App.tsx
│ │ ├── components/
│ │ │ ├── DebateChatRoom.tsx
│ │ │ ├── DebateConfig.tsx
│ │ │ ├── DebateHistorySidebar.tsx
│ │ │ ├── HighlightText.tsx
│ │ │ ├── KLineChart.tsx
│ │ │ ├── MentionInput.tsx
│ │ │ ├── ModelSelector.tsx
│ │ │ ├── NewsDetailDrawer.tsx
│ │ │ ├── StockSearch.tsx
│ │ │ ├── alpha-mining/
│ │ │ │ ├── AgentDemo.tsx
│ │ │ │ ├── MetricsDashboard.tsx
│ │ │ │ ├── OperatorGrid.tsx
│ │ │ │ ├── SentimentCompare.tsx
│ │ │ │ ├── TrainingMonitor.tsx
│ │ │ │ └── index.ts
│ │ │ └── ui/
│ │ │ ├── badge.tsx
│ │ │ ├── button.tsx
│ │ │ ├── card.tsx
│ │ │ ├── dropdown-menu.tsx
│ │ │ ├── sheet.tsx
│ │ │ └── tabs.tsx
│ │ ├── context/
│ │ │ └── NewsToolbarContext.tsx
│ │ ├── hooks/
│ │ │ └── useDebounce.ts
│ │ ├── index.css
│ │ ├── layout/
│ │ │ └── MainLayout.tsx
│ │ ├── lib/
│ │ │ ├── api-client.ts
│ │ │ └── utils.ts
│ │ ├── main.tsx
│ │ ├── pages/
│ │ │ ├── AgentMonitorPage.tsx
│ │ │ ├── AlphaMiningPage.tsx
│ │ │ ├── Dashboard.tsx
│ │ │ ├── NewsListPage.tsx
│ │ │ ├── StockAnalysisPage.tsx
│ │ │ ├── StockSearchPage.tsx
│ │ │ └── TaskManagerPage.tsx
│ │ ├── store/
│ │ │ ├── useDebateStore.ts
│ │ │ ├── useLanguageStore.ts
│ │ │ ├── useNewsStore.ts
│ │ │ └── useTaskStore.ts
│ │ └── types/
│ │ └── api.ts
│ ├── tailwind.config.js
│ ├── tsconfig.json
│ ├── tsconfig.node.json
│ └── vite.config.ts
├── legacy_v1/
│ ├── .deepsource.toml
│ ├── Chinese_Stop_Words.txt
│ ├── Crawler/
│ │ ├── __init__.py
│ │ ├── crawler_cnstock.py
│ │ ├── crawler_jrj.py
│ │ ├── crawler_nbd.py
│ │ ├── crawler_sina.py
│ │ ├── crawler_stcn.py
│ │ └── crawler_tushare.py
│ ├── README_OLD.md
│ ├── Text_Analysis/
│ │ ├── __init__.py
│ │ ├── text_mining.py
│ │ └── text_processing.py
│ ├── finance_dict.txt
│ ├── run_crawler_cnstock.py
│ ├── run_crawler_jrj.py
│ ├── run_crawler_nbd.py
│ ├── run_crawler_sina.py
│ ├── run_crawler_stcn.py
│ ├── run_crawler_tushare.py
│ ├── run_main.py
│ └── src/
│ ├── Gon/
│ │ ├── __init__.py
│ │ ├── cnstockspyder.py
│ │ ├── history_starter_cnstock.py
│ │ ├── history_starter_jrj.py
│ │ ├── history_starter_nbd.py
│ │ ├── history_starter_stock_price.py
│ │ ├── ifengspyder.py
│ │ ├── jrjspyder.py
│ │ ├── kill_realtime_spyder_tasks.py
│ │ ├── money163spyder.py
│ │ ├── nbdspyder.py
│ │ ├── realtime_starter_cnstock.py
│ │ ├── realtime_starter_jrj.py
│ │ ├── realtime_starter_nbd.py
│ │ ├── realtime_starter_redis_queue.py
│ │ ├── realtime_starter_stock_price.py
│ │ ├── sinaspyder.py
│ │ ├── spyder.py
│ │ └── stockinfospyder.py
│ ├── Hisoka/
│ │ └── classifier.py
│ ├── Killua/
│ │ ├── __init__.py
│ │ ├── buildstocknewsdb.py
│ │ ├── deduplication.py
│ │ └── denull.py
│ ├── Kite/
│ │ ├── __init__.py
│ │ ├── config.py
│ │ ├── database.py
│ │ ├── log.py
│ │ ├── utils.py
│ │ └── webserver.py
│ ├── Leorio/
│ │ ├── __init__.py
│ │ ├── chnstopwords.txt
│ │ ├── financedict.txt
│ │ ├── tokenization.py
│ │ └── topicmodelling.py
│ ├── __init__.py
│ ├── history_spyder_startup.bat
│ ├── main.py
│ ├── realtime_spyder_startup.bat
│ └── realtime_spyder_stopall.bat
├── reset_all_data.sh
└── thirdparty/
├── DISC-FinLLM.md
├── ElegantRL.md
├── FinCast-fts.md
├── FinGPT.md
├── FinGenius.md
├── FinRL-Meta.md
├── FinRL.md
├── FinRobot.md
├── FinceptTerminal.md
├── Kronos.md
├── Lean.md
├── README.md
├── TradingAgents-CN.md
├── TradingAgents.md
├── TrendRadar.md
├── agentic-trading.md
├── awesome-quant.md
├── backtrader.md
├── investor-agent.md
├── panda_quantflow.md
├── qlib.md
└── vnpy.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .deepsource.toml
================================================
version = 1
[[analyzers]]
name = "python"
[analyzers.meta]
runtime_version = "3.x.x"
================================================
FILE: .gitignore
================================================
# Development documentation (local only, not for Git)
devlogs/
conclusions/
researches/
# Python
__pycache__/
*.py[cod]
*$py.class
# Virtual environments
venv/
env/
ENV/
# IDE
.vscode/
.idea/
*.swp
# OS
.DS_Store
node_modules/
**/node_modules/backend/celerybeat-schedule*
backend/.crawl_cache/
backend/celerybeat-schedule
backend/reproduce_sina.py
backend/checkpoints/
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2025 Ziran Li
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
# FinnewsHunter: Multi-Agent Investment Decision Platform Driven by Financial News
)
}
================================================
FILE: frontend/src/store/useDebateStore.ts
================================================
import { create } from 'zustand'
import { persist } from 'zustand/middleware'
// 聊天消息类型(与 DebateChatRoom 一致)
export type ChatRole = 'user' | 'bull' | 'bear' | 'manager' | 'system' | 'data_collector' | 'search'
export interface ChatMessage {
id: string
role: ChatRole
content: string
timestamp: Date
round?: number
isStreaming?: boolean
mentions?: string[] // 消息中的 @ 提及
searchPlan?: any // 搜索计划
searchStatus?: 'pending' | 'executing' | 'completed' | 'cancelled'
}
// 分析结果(用于保存并行/快速分析模式的结果)
export interface AnalysisResult {
bull?: string
bear?: string
manager?: string
quick?: string
finalDecision?: {
rating?: string
decision?: string
}
executionTime?: number
}
// 辩论会话
export interface DebateSession {
id: string
stockCode: string
stockName: string
messages: ChatMessage[]
mode: string
createdAt: Date
updatedAt: Date
// 新增:并行/快速分析模式的结果
analysisResult?: AnalysisResult
// 新增:会话状态
status?: 'in_progress' | 'completed' | 'interrupted'
}
// 本地存储的会话格式(日期需要序列化)
interface SerializedSession {
id: string
stockCode: string
stockName: string
messages: Array & { timestamp: string }>
mode: string
createdAt: string
updatedAt: string
}
interface DebateStore {
// 当前会话
currentSession: DebateSession | null
// 历史会话列表(按股票代码索引)
sessions: Record
// 操作方法
startSession: (stockCode: string, stockName: string, mode: string) => string
addMessage: (message: ChatMessage) => void
updateMessage: (messageId: string, updates: Partial) => void
clearCurrentSession: () => void
// 批量同步消息(用于辩论完成时一次性同步所有消息)
syncMessages: (messages: ChatMessage[]) => void
// 新增:保存分析结果(用于并行/快速分析模式)
saveAnalysisResult: (result: AnalysisResult) => void
// 新增:更新会话状态
updateSessionStatus: (status: 'in_progress' | 'completed' | 'interrupted') => void
// 新增:恢复会话到页面状态
restoreSession: (sessionId: string) => DebateSession | null
// 新增:获取最近未完成的会话
getLatestInProgressSession: (stockCode: string) => DebateSession | null
// 历史管理
loadSession: (stockCode: string, sessionId?: string) => DebateSession | null
getStockSessions: (stockCode: string) => DebateSession[]
deleteSession: (stockCode: string, sessionId: string) => void
clearStockHistory: (stockCode: string) => Promise
// 同步到后端(可选)
syncToBackend: (stockCode: string) => Promise
loadFromBackend: (stockCode: string) => Promise
}
// 序列化会话(用于持久化)
// Convert a session's Date fields to ISO-8601 strings so it can be persisted.
const serializeSession = (session: DebateSession): SerializedSession => {
  const toIso = (d: Date) => d.toISOString()
  return {
    ...session,
    messages: session.messages.map(msg => ({
      ...msg,
      timestamp: toIso(msg.timestamp)
    })),
    createdAt: toIso(session.createdAt),
    updatedAt: toIso(session.updatedAt)
  }
}
// 反序列化会话(从持久化恢复)
// Revive a persisted session: parse ISO timestamp strings back into Date objects.
const deserializeSession = (session: SerializedSession): DebateSession => {
  const revive = (iso: string) => new Date(iso)
  return {
    ...session,
    messages: session.messages.map(msg => ({
      ...msg,
      timestamp: revive(msg.timestamp)
    })),
    createdAt: revive(session.createdAt),
    updatedAt: revive(session.updatedAt)
  }
}
export const useDebateStore = create()(
persist(
(set, get) => ({
currentSession: null,
sessions: {},
startSession: (stockCode, stockName, mode) => {
const sessionId = `debate-${stockCode}-${Date.now()}`
const newSession: DebateSession = {
id: sessionId,
stockCode,
stockName,
messages: [],
mode,
createdAt: new Date(),
updatedAt: new Date(),
status: 'in_progress'
}
set(state => ({
currentSession: newSession,
sessions: {
...state.sessions,
[stockCode]: [
newSession,
...(state.sessions[stockCode] || []).slice(0, 9) // 最多保留10个历史会话
]
}
}))
return sessionId
},
addMessage: (message) => {
set(state => {
if (!state.currentSession) return state
const updatedSession = {
...state.currentSession,
messages: [...state.currentSession.messages, message],
updatedAt: new Date()
}
// 同时更新 sessions 中的记录
const stockCode = updatedSession.stockCode
const updatedSessions = (state.sessions[stockCode] || []).map(s =>
s.id === updatedSession.id ? updatedSession : s
)
return {
currentSession: updatedSession,
sessions: {
...state.sessions,
[stockCode]: updatedSessions
}
}
})
},
// 批量同步消息(替换当前会话的所有消息)
syncMessages: (messages) => {
set(state => {
if (!state.currentSession) return state
// 优化过滤逻辑:只要有内容就保存,并强制标记为非流式
const validMessages = messages
.filter(m => m.content || m.searchPlan || m.role === 'system')
.map(m => ({
...m,
isStreaming: false // 强制标记为已完成
}))
const updatedSession = {
...state.currentSession,
messages: validMessages,
updatedAt: new Date()
}
const stockCode = updatedSession.stockCode
const updatedSessions = (state.sessions[stockCode] || []).map(s =>
s.id === updatedSession.id ? updatedSession : s
)
return {
currentSession: updatedSession,
sessions: {
...state.sessions,
[stockCode]: updatedSessions
}
}
})
},
updateMessage: (messageId, updates) => {
set(state => {
if (!state.currentSession) return state
const updatedMessages = state.currentSession.messages.map(m =>
m.id === messageId ? { ...m, ...updates } : m
)
const updatedSession = {
...state.currentSession,
messages: updatedMessages,
updatedAt: new Date()
}
const stockCode = updatedSession.stockCode
const updatedSessions = (state.sessions[stockCode] || []).map(s =>
s.id === updatedSession.id ? updatedSession : s
)
return {
currentSession: updatedSession,
sessions: {
...state.sessions,
[stockCode]: updatedSessions
}
}
})
},
clearCurrentSession: () => {
set({ currentSession: null })
},
// 保存分析结果(用于并行/快速分析模式)
saveAnalysisResult: (result) => {
set(state => {
if (!state.currentSession) return state
const updatedSession = {
...state.currentSession,
analysisResult: result,
updatedAt: new Date()
}
const stockCode = updatedSession.stockCode
const updatedSessions = (state.sessions[stockCode] || []).map(s =>
s.id === updatedSession.id ? updatedSession : s
)
return {
currentSession: updatedSession,
sessions: {
...state.sessions,
[stockCode]: updatedSessions
}
}
})
},
// 更新会话状态
updateSessionStatus: (status) => {
set(state => {
if (!state.currentSession) return state
const updatedSession = {
...state.currentSession,
status,
updatedAt: new Date()
}
const stockCode = updatedSession.stockCode
const updatedSessions = (state.sessions[stockCode] || []).map(s =>
s.id === updatedSession.id ? updatedSession : s
)
return {
currentSession: updatedSession,
sessions: {
...state.sessions,
[stockCode]: updatedSessions
}
}
})
},
// 恢复会话
restoreSession: (sessionId) => {
const state = get()
for (const stockCode of Object.keys(state.sessions)) {
const session = state.sessions[stockCode].find(s => s.id === sessionId)
if (session) {
set({ currentSession: session })
return session
}
}
return null
},
// 获取最近未完成的会话
getLatestInProgressSession: (stockCode) => {
const state = get()
const stockSessions = state.sessions[stockCode] || []
return stockSessions.find(s => s.status === 'in_progress') || null
},
loadSession: (stockCode, sessionId) => {
const state = get()
const stockSessions = state.sessions[stockCode] || []
if (sessionId) {
const session = stockSessions.find(s => s.id === sessionId)
if (session) {
set({ currentSession: session })
return session
}
}
// 如果没有指定 sessionId,返回最新的会话
if (stockSessions.length > 0) {
const latestSession = stockSessions[0]
set({ currentSession: latestSession })
return latestSession
}
return null
},
getStockSessions: (stockCode) => {
return get().sessions[stockCode] || []
},
deleteSession: (stockCode, sessionId) => {
set(state => {
const updatedSessions = (state.sessions[stockCode] || []).filter(
s => s.id !== sessionId
)
return {
sessions: {
...state.sessions,
[stockCode]: updatedSessions
},
// 如果删除的是当前会话,清空当前会话
currentSession: state.currentSession?.id === sessionId
? null
: state.currentSession
}
})
},
clearStockHistory: async (stockCode) => {
// 1. 先清除本地 Store
set(state => {
const { [stockCode]: _, ...rest } = state.sessions
return {
sessions: rest,
currentSession: state.currentSession?.stockCode === stockCode
? null
: state.currentSession
}
})
// 2. 同时清除后端数据库中的历史
try {
const response = await fetch(`/api/v1/agents/debate/history/${stockCode}`, {
method: 'DELETE'
})
if (response.ok) {
console.log('✅ 已清除后端历史记录')
} else {
console.error('❌ 清除后端历史失败')
}
} catch (error) {
console.error('❌ 清除后端历史出错:', error)
}
},
// 同步到后端
syncToBackend: async (stockCode) => {
const state = get()
const sessions = state.sessions[stockCode]
console.log('💾 syncToBackend called for:', stockCode)
console.log('💾 Sessions count:', sessions?.length || 0)
if (!sessions || sessions.length === 0) {
console.warn('⚠️ syncToBackend: no sessions to sync')
return
}
// 打印每个会话的消息数量
sessions.forEach((s, i) => {
console.log(`💾 Session ${i}: ${s.id}, messages: ${s.messages.length}`)
console.log(`💾 Session ${i} roles:`, s.messages.map(m => m.role))
})
try {
const serialized = sessions.map(serializeSession)
console.log('💾 Sending to backend:', JSON.stringify(serialized).slice(0, 500) + '...')
const response = await fetch(`/api/v1/agents/debate/history`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
stock_code: stockCode,
sessions: serialized
})
})
if (!response.ok) {
console.error('Failed to sync debate history to backend')
} else {
console.log('✅ Synced to backend successfully')
}
} catch (error) {
console.error('Error syncing debate history:', error)
}
},
// 从后端加载
loadFromBackend: async (stockCode) => {
console.log('📥 loadFromBackend called for:', stockCode)
try {
const response = await fetch(`/api/v1/agents/debate/history/${stockCode}`)
if (response.ok) {
const data = await response.json()
console.log('📥 Loaded from backend:', data)
if (data.sessions && data.sessions.length > 0) {
const sessions = data.sessions.map(deserializeSession)
console.log('📥 Deserialized sessions:', sessions.length)
sessions.forEach((s: any, i: number) => {
console.log(`📥 Session ${i}: ${s.id}, messages: ${s.messages.length}`)
console.log(`📥 Session ${i} roles:`, s.messages.map((m: any) => m.role))
})
set(state => ({
sessions: {
...state.sessions,
[stockCode]: sessions
}
}))
} else {
console.log('📥 No sessions in response')
}
} else {
console.error('📥 Failed to load:', response.status)
}
} catch (error) {
console.error('Error loading debate history from backend:', error)
}
}
}),
{
name: 'finnews-debate-history',
// 自定义序列化
serialize: (state) => {
const serialized = {
...state,
state: {
...state.state,
currentSession: state.state.currentSession
? serializeSession(state.state.currentSession)
: null,
sessions: Object.fromEntries(
Object.entries(state.state.sessions).map(([k, v]) => [
k,
(v as DebateSession[]).map(serializeSession)
])
)
}
}
return JSON.stringify(serialized)
},
// 自定义反序列化
deserialize: (str) => {
const parsed = JSON.parse(str)
return {
...parsed,
state: {
...parsed.state,
currentSession: parsed.state.currentSession
? deserializeSession(parsed.state.currentSession)
: null,
sessions: Object.fromEntries(
Object.entries(parsed.state.sessions).map(([k, v]) => [
k,
(v as SerializedSession[]).map(deserializeSession)
])
)
}
}
}
}
)
)
================================================
FILE: frontend/src/store/useLanguageStore.ts
================================================
/**
* 全局语言状态管理
*/
import { create } from 'zustand';
import { persist } from 'zustand/middleware';
// Supported UI languages.
export type Lang = 'zh' | 'en';

// Language slice: current language plus its mutation helpers.
interface LanguageState {
  lang: Lang;
  setLang: (lang: Lang) => void;
  toggleLang: () => void;
}

/**
 * Global language store, persisted to localStorage under the key
 * "finnews-language". Defaults to Chinese ('zh').
 */
export const useLanguageStore = create<LanguageState>()(
  persist(
    (set, get) => ({
      lang: 'zh', // default language
      setLang: (lang) => set({ lang }),
      // Flip between Chinese and English.
      toggleLang: () => set({ lang: get().lang === 'zh' ? 'en' : 'zh' }),
    }),
    {
      name: 'finnews-language', // localStorage key
    }
  )
);
// 全局国际化文案
export const globalI18n = {
zh: {
nav: {
home: '首页',
news: '新闻流',
stock: '个股分析',
alphaMining: 'Alpha因子挖掘',
agents: '智能体监控',
tasks: '任务管理',
},
header: {
title: 'FinnewsHunter',
poweredBy: 'Powered by',
},
dashboard: {
title: '仪表盘',
subtitle: '金融新闻智能分析平台 - Powered by AgenticX',
totalNews: '总新闻数',
savedToDb: '已保存到数据库',
totalTasks: '总任务数',
recentCompleted: '最近完成',
units: '个',
crawlRate: '爬取成功率',
liveMonitor: '实时监控',
running: '运行中',
autoInterval: '每1分钟自动爬取',
newsStats: '新闻来源统计',
newsStatsDesc: '各新闻源的内容数量分布',
latestNews: '最新新闻',
latestNewsDesc: '最近爬取的新闻动态',
allSources: '全部来源',
noNews: '暂无新闻数据,请先爬取新闻',
noNewsFrom: '暂无来自该来源的新闻',
},
news: {
  search: '搜索新闻、股票代码...',
  all: '全部',
  pending: '待分析',
  positive: '利好',
  negative: '利空',
  neutral: '中性',
  items: '条新闻',
  source: '来源',
  analyzing: '分析中...',
  reanalyze: '重新分析',
  analyze: '分析',
  analysisFailed: '分析失败',
  crawling: '正在爬取中,请稍候...',
  refreshNow: '立即刷新',
  crawlingProgress: '爬取中...(约2分钟)',
  collapse: '收起',
  expandMore: '展开更多',
  stocks: '只股票',
  noNews: '暂无新闻',
  noNewsFound: '没有找到与',
  relatedNews: '相关的新闻',
  tryOtherKeywords: '试试其他关键词,如股票代码或公司名称',
  pleaseCrawl: '请先爬取新闻',
  selectedItems: '已选择 {count} 项',
  cancelSelection: '取消选择',
  deleteNews: '删除新闻',
  deleteSelected: '删除选中',
  confirmDelete: '确定要删除选中的 {count} 条新闻吗?此操作不可恢复。',
  selectAll: '全选',
  deselectAll: '取消全选',
  analyzeAll: '全部分析',
  reanalyzeAll: '重新分析',
  analyzingSelected: '正在分析选中的 {count} 条新闻...',
  // NOTE: `analysisComplete` was declared twice in this object; the
  // earlier plain entry ('分析完成!') was silently shadowed at runtime,
  // so it has been removed and this batch-format string (the value that
  // actually took effect) is kept.
  analysisComplete: '分析完成!成功 {success} 条,失败 {failed} 条',
},
stock: {
title: '个股智能分析',
subtitle: '输入股票代码或名称,开启 AI 驱动的投资洞察',
searchPlaceholder: '搜索股票代码或名称...',
searching: '搜索中...',
notFound: '未找到匹配的股票',
tryInput: '尝试输入股票代码或名称',
emptyDb: '股票数据库为空',
initTip: '点击下方按钮初始化股票数据',
initBtn: '初始化股票数据',
importing: '正在导入股票数据...',
hotStocks: '热门股票',
kline: 'K线分析',
klineDesc: '多周期行情数据',
aiSentiment: 'AI 情感分析',
aiSentimentDesc: '新闻舆情智能解读',
debate: '多空辩论',
debateDesc: 'Bull vs Bear 对决',
nav: '导航',
select: '选择',
close: '关闭',
},
agents: {
title: '智能体监控台',
subtitle: '实时查看智能体执行状态、性能指标和思考链',
autoRefreshing: '自动刷新中',
refresh: '手动刷新',
clearLogs: '清空日志',
totalExec: '总执行次数',
successExec: '成功执行',
successRate: '成功率',
failedExec: '失败次数',
avgTime: '平均耗时',
availableAgents: '可用智能体',
availableAgentsDesc: '系统中已注册的智能体和工作流',
agents: '智能体',
workflows: '工作流',
active: '活跃',
inactive: '未激活',
execLogs: '执行日志',
execLogsDesc: '实时智能体执行日志和状态追踪',
records: '条记录',
noLogs: '暂无执行日志',
noLogsHint: '执行分析任务或辩论后,日志将在此显示',
execTimes: '执行',
times: '次',
success: '成功',
avg: '平均',
recentActivity: '最近活动',
confirmClearLogs: '确定要清空所有执行日志吗?此操作不可恢复。',
},
tasks: {
title: '任务管理',
subtitle: '爬取任务监控和管理',
task: '任务',
completed: '已完成',
running: '运行中',
pending: '待执行',
failed: '失败',
realtime: '实时',
coldStart: '冷启动',
crawled: '爬取数',
saved: '保存数',
duration: '耗时',
createdAt: '创建时间',
progress: '进度',
noTasks: '暂无任务记录',
loading: '加载中...',
},
common: {
loading: '加载中...',
noData: '暂无数据',
confirm: '确定',
cancel: '取消',
},
time: {
justNow: '刚刚',
minutesAgo: '分钟前',
hoursAgo: '小时前',
daysAgo: '天前',
},
model: {
loading: '加载中...',
notConfigured: '未配置LLM',
selectModel: '选择模型',
selectTip: '选择模型 · 兼顾质量与成本',
noApiKey: '未配置API Key',
current: '当前',
},
debateRoom: {
title: '投资辩论',
titlePlaceholder: '多空辩论室',
subtitle: '多方 vs 空方 · 投资经理主持',
roundPrefix: '第',
roundSuffix: '轮',
typing: '正在输入...',
thinking: '思考中...',
noMessages: '尚无消息',
clickStartDebate: '点击「开始辩论」启动多空对决',
canSpeakDuringDebate: '您也可以在辩论过程中发言提问',
debateInProgress: '辩论进行中,输入 @提及智能体...',
mentionTip: '提示:使用@多方辩手@空方辩手可以指定角色回答',
roundStarted: '轮辩论开始',
debateEnded: '辩论结束,投资经理已做出最终决策',
debateStarted: '辩论开始,数据专员正在准备资料...',
searchPlanConfirm: '搜索计划确认',
searchPlanExecuting: '正在搜索中...',
searchPlanCompleted: '执行完成',
searchPlanCancel: '取消',
searchPlanConfirmBtn: '确认执行',
estimatedTime: '预计耗时',
seconds: '秒',
},
mentionInput: {
placeholder: '输入消息,使用 @ 提及智能体或数据源...',
agents: '智能体',
sources: '数据源',
stocks: '股票',
},
debateHistory: {
history: '历史',
noMessages: '尚无消息',
messages: '条消息',
justNow: '刚刚',
minutesAgo: '分钟前',
hoursAgo: '小时前',
daysAgo: '天前',
today: '今天',
yesterday: '昨天',
thisWeek: '本周',
older: '更早',
expandHistory: '展开历史记录',
continueDebate: '继续辩论',
delete: '删除',
searchPlaceholder: '搜索历史记录...',
noMatchingRecords: '未找到匹配的记录',
noHistoryYet: '暂无历史记录',
tryOtherKeywords: '尝试其他关键词',
historyAutoSave: '开始辩论后会自动保存',
roleNames: {
user: '我',
bull: '多方',
bear: '空方',
manager: '经理',
data_collector: '数据专员',
},
},
stockDetail: {
title: '个股分析 · 智能体驱动的投资决策',
relatedNews: '关联新闻',
analyzed: '已分析',
items: '条',
overallSentiment: '整体情感',
recent7d: '近7天情感',
unknown: '未知',
trend: '趋势',
up: '上升',
down: '下降',
stable: '稳定',
latestNews: '最新新闻',
none: '暂无',
kline: 'K线图 · 真实行情',
dataSource: '数据来源',
supportZoom: '支持缩放拖拽',
close: '收盘',
change: '涨跌',
volume: '成交额',
billion: '亿',
period: '周期',
adjust: '复权',
daily: '日K',
dailyK: '日K',
min60: '60分',
min30: '30分',
min15: '15分',
min5: '5分',
min1: '1分',
qfq: '前复权',
qfqTip: '消除除权缺口,保持走势连续(推荐)',
noAdjust: '不复权',
noAdjustTip: '显示真实交易价格,会有除权缺口',
hfq: '后复权',
hfqTip: '以上市首日为基准,价格可能很高',
recommendLabel: 'Recommend',
timeLabel: '时间',
openLabel: '开',
highLabel: '高',
lowLabel: '低',
closeLabel: '收',
volumeLabel: '量',
parallelAnalysis: '并行分析',
parallelAnalysisDesc: 'Bull/Bear并行分析,投资经理汇总决策',
realtimeDebate: '实时辩论',
realtimeDebateDesc: '四人实时对话,投资经理主持,多空双方交替发言',
quickAnalysis: '快速分析',
quickAnalysisDesc: '单一分析师快速给出建议,适合时间紧迫场景',
result: '结果',
historySessionLoaded: '已加载历史会话',
detectIncompleteSession: '检测到有未完成的',
session: '会话',
messages: '条消息',
restore: '是否恢复',
analysis: '分析',
analysisModeConfig: '分析模式配置',
default: '默认',
parallelExecution: '并行执行',
about2to3min: '约2-3分钟',
realtimeDialogue: '实时对话',
fourAgents: '4位智能体',
about5to10min: '约5-10分钟',
singleAgent: '单智能体',
about1min: '约1分钟',
advancedConfig: '高级配置',
maxExecutionTime: '最大执行时间',
seconds: '秒',
maxDebateRounds: '最大辩论回合数',
rounds: '轮',
managerCanInterrupt: '投资经理可打断辩论',
collectDataBeforeDebate: '辩论前搜集数据',
executionTime: '耗时',
news: '关联新闻',
newsContain: '包含',
newsTotal: '条',
fold: '折叠',
expand: '展开',
clearData: '清除数据',
clearing: '清除中...',
crawlComplete: '爬取完成',
crawlFailed: '爬取失败',
crawling: '爬取中...',
stop: '停止',
updateCrawl: '更新爬取',
targetCrawl: '定向爬取',
noRelatedNews: '暂无关联新闻',
clickCrawl: '点击「定向爬取」获取该股票的相关新闻',
loadMore: '继续扩展',
remaining: '还有',
showAll: '已显示全部',
newsFolded: '新闻已折叠,点击"展开"查看',
sentimentTrend: '新闻情感趋势',
sentimentDesc: '近30天新闻情感分布与平均值',
positive: '利好',
negative: '利空',
neutral: '中性',
avgSentiment: '平均情感',
bullBear: 'Bull vs Bear 智能体辩论',
bullBearDesc: '看多研究员 vs 看空研究员,投资经理综合裁决',
startDebate: '开始辩论',
debating: '辩论中...',
analysisMode: '分析模式',
bullView: '看多观点',
bearView: '看空观点',
managerDecision: '投资经理决策',
waitingAnalysis: '等待分析...',
waitingDecision: '等待多空分析完成后进行决策...',
clickDebate: '点击"开始辩论"启动智能体分析',
debateDesc: '系统将自动调用 Bull/Bear 研究员进行多角度分析,并由投资经理给出综合决策',
backToSearch: '返回搜索',
history: '历史',
copy: '复制',
export: '导出',
regenerate: '重新生成',
stronglyRec: '强烈推荐',
recommend: '推荐',
avoid: '回避',
caution: '谨慎',
strongBull: '强烈利好',
strongBear: '强烈利空',
noKline: '暂无K线数据',
checkCode: '请检查股票代码是否正确',
sessionRestored: '已恢复上次会话',
debateComplete: '辩论分析完成!',
outputting: '输出中...',
deciding: '决策中...',
analysisComplete: '分析完成',
analysisGenerating: '分析生成中...',
decisionGenerating: '决策生成中...',
debateFailed: '辩论分析失败',
sessionDeleted: '已删除会话',
allHistoryCleared: '已清除所有历史记录',
searchCancelled: '已取消搜索任务',
crawlTaskStarted: '定向爬取任务已启动',
crawlingInProgress: '正在爬取中...',
crawlTaskExists: '该股票已有正在进行的爬取任务,正在同步状态...',
crawlTaskStopped: '已停止爬取任务',
crawlTaskStopFailed: '停止任务失败',
newsCleared: '已清除',
newsItems: '条新闻',
clearNewsConfirm: '确定要清除「',
clearNewsConfirmEnd: '」的所有新闻吗?此操作不可恢复!',
stopCrawlConfirm: '确定要停止当前的爬取任务吗?',
knowledgeGraph: '知识图谱 · 智能检索',
knowledgeGraphDesc: '基于多维度关键词并发检索,提升召回率',
nameVariants: '名称变体',
mainBusiness: '主营业务',
relatedConcepts: '关联概念',
concurrentQueries: '并发检索查询',
bullResearcher: '看多研究员',
bearResearcher: '看空研究员',
investmentManager: '投资经理',
generatingSearchPlan: '正在生成搜索计划...',
deleteSessionConfirm: '确定要删除这条记录吗?',
clearAllHistoryConfirm: '确定要清除所有历史记录吗?此操作不可恢复!',
clearAllRecords: '清除所有记录',
crawlSuccess: '定向爬取完成!新增',
unknownError: '未知错误',
taskCreated: '任务已创建,等待执行...',
},
alphaMining: {
training: {
title: 'RL 训练监控',
desc: 'Transformer + REINFORCE 算法实时训练进度',
ready: '就绪',
running: '训练中',
completed: '完成',
error: '错误',
steps: '训练步数',
useSentiment: '使用情感特征',
stop: '停止',
start: '开始训练',
progress: '训练进度',
bestFactor: '当前最优因子',
convergence: '收敛曲线',
trainingFailed: '训练失败',
},
metrics: {
noData: '暂无评估数据',
hint: '请先评估一个因子表达式',
currentFactor: '当前因子',
multiDim: '多维度评估',
riskMetrics: '风险指标',
maxDrawdown: '最大回撤',
safe: '安全',
danger: '危险',
dailyTurnover: '日均换手率',
winRate: '胜率',
totalReturn: '累计收益',
returnsCurve: '收益曲线',
returnsDesc: '策略累计收益 vs 基准',
strategy: '策略',
benchmark: '基准',
metricDesc: '指标说明',
sortinoDesc: 'Sortino: 越高越好,>1优秀',
sharpeDesc: 'Sharpe: 越高越好,>0.5良好',
icDesc: 'IC: 绝对值>0.03有效',
maxDDDesc: 'Max DD: <20%安全',
excellent: '优秀',
good: '良好',
average: '一般',
poor: '较差',
lowTurnover: '低换手',
},
sentiment: {
title: '情感融合效果对比',
desc: '对比纯技术因子 vs 情感增强因子的挖掘效果',
steps: '训练步数',
comparing: '对比中...',
start: '开始对比',
techOnly: '纯技术因子',
techDesc: '个特征(RET, VOL, VOLUME_CHG, TURNOVER)',
enhanced: '情感增强因子',
enhancedDesc: '个特征(+SENTIMENT, NEWS_COUNT)',
bestFactor: '最优因子',
none: '无',
improvement: '改进幅度',
improved: '情感特征提升了因子效果',
degraded: '情感特征降低了因子效果',
scoreDiff: 'Score 差异',
comparison: 'Score 对比',
techOnlyBar: '纯技术',
enhancedBar: '情感增强',
conclusion: '结论:',
conclusionPositive: '情感特征(SENTIMENT, NEWS_COUNT)对因子挖掘有正向贡献,建议在实际应用中开启情感融合功能。',
conclusionNegative: '本次实验中情感特征未能提升效果,可能原因包括:样本量不足、情感数据噪音、训练步数过少等。建议增加训练步数后重试。',
comparingText: '正在进行对比实验...',
comparingHint: '分别训练纯技术因子和情感增强因子,每种',
stepsText: '步',
startHint: '点击"开始对比"运行情感融合实验',
startDesc: '将分别训练纯技术因子和情感增强因子进行效果对比',
comparisonFailed: '对比失败',
},
agent: {
title: 'AgenticX Agent 调用演示',
desc: '展示 Agent 如何调用 AlphaMiningTool 进行因子挖掘',
success: '成功',
failed: '失败',
toolParams: 'Tool 参数',
stockCode: '股票代码(可选)',
stockPlaceholder: '如 SH600519',
steps: '训练步数',
useSentiment: '使用情感特征',
executing: '执行中...',
execute: '执行 Agent 调用',
inputParams: '输入参数',
output: '输出结果',
executionTime: '耗时',
bestFactor: '最优因子',
logs: '执行日志',
codeExample: 'Python 调用示例',
executeFailed: '执行失败',
startHint: '配置参数后点击"执行 Agent 调用"',
startDesc: '将演示 QuantitativeAgent 如何通过 AlphaMiningTool 进行因子挖掘',
miningTask: '为 {code} 挖掘量化因子',
createAgent: '创建 Agent',
registerTool: '注册 Tool',
executeMining: '执行因子挖掘',
},
operators: {
all: '全部',
availableFeatures: '可用特征',
techFeature: '技术特征',
sentimentFeature: '情感特征',
totalOperators: '共 {count} 个操作符',
totalFeatures: '{count} 个特征',
params: '参',
categoryArithmetic: '算术运算',
categoryUnary: '一元运算',
categoryTimeseries: '时序运算',
categoryConditional: '条件运算',
categorySpecial: '特殊运算',
add: '加法',
sub: '减法',
mul: '乘法',
div: '除法(安全)',
neg: '取负',
abs: '绝对值',
sign: '符号函数',
gate: '条件选择',
max: '取最大',
min: '取最小',
delay1: '延迟1期',
delay5: '延迟5期',
delta1: '1期差分',
delta5: '5期差分',
ma5: '5期均线',
ma10: '10期均线',
std5: '5期标准差',
std10: '10期标准差',
jump: '跳跃检测',
jumpExample: '检测>3σ异常值',
decay: '衰减加权',
max3: '3期最大',
},
},
},
en: {
nav: {
home: 'Home',
news: 'News Feed',
stock: 'Stock Analysis',
alphaMining: 'Alpha Mining',
agents: 'Agent Monitor',
tasks: 'Task Manager',
},
header: {
title: 'FinnewsHunter',
poweredBy: 'Powered by',
},
dashboard: {
title: 'Dashboard',
subtitle: 'Financial News AI Analytics Platform - Powered by AgenticX',
totalNews: 'Total News',
savedToDb: 'Saved to database',
totalTasks: 'Total Tasks',
recentCompleted: 'Recently completed',
units: '',
crawlRate: 'Crawl Success Rate',
liveMonitor: 'Live Monitor',
running: 'Running',
autoInterval: 'Auto crawl every minute',
newsStats: 'News Source Stats',
newsStatsDesc: 'Content distribution by news source',
latestNews: 'Latest News',
latestNewsDesc: 'Recently crawled news',
allSources: 'All Sources',
noNews: 'No news data, please crawl news first',
noNewsFrom: 'No news from this source',
},
news: {
  search: 'Search news, stock codes...',
  all: 'All',
  pending: 'Pending',
  positive: 'Positive',
  negative: 'Negative',
  neutral: 'Neutral',
  items: 'items',
  source: 'Source',
  analyzing: 'Analyzing...',
  reanalyze: 'Re-analyze',
  analyze: 'Analyze',
  analysisFailed: 'Analysis failed',
  crawling: 'Crawling in progress, please wait...',
  refreshNow: 'Refresh Now',
  crawlingProgress: 'Crawling... (~2 min)',
  collapse: 'Collapse',
  expandMore: 'Expand More',
  stocks: 'stocks',
  noNews: 'No news',
  noNewsFound: 'No news found for',
  relatedNews: '',
  tryOtherKeywords: 'Try other keywords like stock codes or company names',
  pleaseCrawl: 'Please crawl news first',
  selectedItems: 'Selected {count} items',
  cancelSelection: 'Cancel Selection',
  deleteNews: 'Delete News',
  deleteSelected: 'Delete Selected',
  confirmDelete: 'Are you sure you want to delete {count} selected news? This action cannot be undone.',
  selectAll: 'Select All',
  deselectAll: 'Deselect All',
  analyzeAll: 'Analyze All',
  reanalyzeAll: 'Re-analyze All',
  analyzingSelected: 'Analyzing {count} selected news...',
  // NOTE: `analysisComplete` was declared twice in this object; the
  // earlier plain entry ('Analysis complete!') was silently shadowed at
  // runtime, so it has been removed and this batch-format string (the
  // value that actually took effect) is kept.
  analysisComplete: 'Analysis complete! {success} succeeded, {failed} failed',
},
stock: {
title: 'Stock Intelligence',
subtitle: 'Enter stock code or name for AI-powered investment insights',
searchPlaceholder: 'Search stock code or name...',
searching: 'Searching...',
notFound: 'No matching stocks found',
tryInput: 'Try entering stock code or name',
emptyDb: 'Stock database is empty',
initTip: 'Click below to initialize stock data',
initBtn: 'Initialize Stock Data',
importing: 'Importing stock data...',
hotStocks: 'Popular Stocks',
kline: 'K-Line Analysis',
klineDesc: 'Multi-period market data',
aiSentiment: 'AI Sentiment',
aiSentimentDesc: 'News sentiment analysis',
debate: 'Bull vs Bear',
debateDesc: 'Bull vs Bear debate',
nav: 'Navigate',
select: 'Select',
close: 'Close',
},
agents: {
title: 'Agent Monitor',
subtitle: 'Real-time agent execution status, metrics and reasoning chain',
autoRefreshing: 'Auto-refreshing',
refresh: 'Refresh',
clearLogs: 'Clear Logs',
totalExec: 'Total Executions',
successExec: 'Successful',
successRate: 'Success Rate',
failedExec: 'Failed',
avgTime: 'Avg Time',
availableAgents: 'Available Agents',
availableAgentsDesc: 'Registered agents and workflows',
agents: 'Agents',
workflows: 'Workflows',
active: 'Active',
inactive: 'Inactive',
execLogs: 'Execution Logs',
execLogsDesc: 'Real-time agent execution logs and status',
records: 'records',
noLogs: 'No execution logs',
noLogsHint: 'Logs will appear here after running analysis or debates',
execTimes: 'Executions',
times: '',
success: 'Success',
avg: 'Avg',
recentActivity: 'Recent Activity',
confirmClearLogs: 'Are you sure you want to clear all execution logs? This action cannot be undone.',
},
tasks: {
title: 'Task Manager',
subtitle: 'Crawl task monitoring and management',
task: 'Task',
completed: 'Completed',
running: 'Running',
pending: 'Pending',
failed: 'Failed',
realtime: 'Realtime',
coldStart: 'Cold Start',
crawled: 'Crawled',
saved: 'Saved',
duration: 'Duration',
createdAt: 'Created',
progress: 'Progress',
noTasks: 'No tasks',
loading: 'Loading...',
},
common: {
loading: 'Loading...',
noData: 'No data',
confirm: 'Confirm',
cancel: 'Cancel',
},
time: {
justNow: 'just now',
minutesAgo: ' min ago',
hoursAgo: ' hours ago',
daysAgo: ' days ago',
},
model: {
loading: 'Loading...',
notConfigured: 'LLM not configured',
selectModel: 'Select Model',
selectTip: 'Select Model - Balance quality & cost',
noApiKey: 'API Key not configured',
current: 'Current',
},
debateRoom: {
title: 'Investment Debate',
titlePlaceholder: 'Bull vs Bear Debate Room',
subtitle: 'Bull vs Bear · Investment Manager moderates',
roundPrefix: 'Round',
roundSuffix: '',
typing: 'is typing...',
thinking: 'Thinking...',
noMessages: 'No messages yet',
clickStartDebate: 'Click "Start Debate" to initiate bull-bear confrontation',
canSpeakDuringDebate: 'You can also speak and ask questions during the debate',
debateInProgress: 'Debate in progress, enter @ to mention agents...',
mentionTip: 'Tip: Use @BullDebater @BearDebater to specify a role for replies',
roundStarted: 'round debate started',
debateEnded: 'Debate ended, Investment Manager has made final decision',
debateStarted: 'Debate started, Data Collector is preparing materials...',
searchPlanConfirm: 'Search Plan Confirmation',
searchPlanExecuting: 'Searching...',
searchPlanCompleted: 'Execution completed',
searchPlanCancel: 'Cancel',
searchPlanConfirmBtn: 'Confirm Execution',
estimatedTime: 'Estimated time',
seconds: 's',
},
mentionInput: {
placeholder: 'Enter message, use @ to mention agents or data sources...',
agents: 'Agents',
sources: 'Data Sources',
stocks: 'Stocks',
},
debateHistory: {
history: 'History',
noMessages: 'No messages yet',
messages: 'messages',
justNow: 'just now',
minutesAgo: 'min ago',
hoursAgo: 'hours ago',
daysAgo: 'days ago',
today: 'Today',
yesterday: 'Yesterday',
thisWeek: 'This Week',
older: 'Older',
expandHistory: 'Expand history',
continueDebate: 'Continue debate',
delete: 'Delete',
searchPlaceholder: 'Search history...',
noMatchingRecords: 'No matching records',
noHistoryYet: 'No history yet',
tryOtherKeywords: 'Try other keywords',
historyAutoSave: 'History will be saved after starting debate',
roleNames: {
user: 'Me',
bull: 'Bull',
bear: 'Bear',
manager: 'Manager',
data_collector: 'Data Collector',
},
},
stockDetail: {
title: 'Stock Analysis - Agent-driven Investment Decisions',
relatedNews: 'Related News',
analyzed: 'Analyzed',
items: '',
overallSentiment: 'Overall Sentiment',
recent7d: '7-Day Sentiment',
unknown: 'Unknown',
trend: 'Trend',
up: 'Rising',
down: 'Falling',
stable: 'Stable',
latestNews: 'Latest News',
none: 'None',
kline: 'K-Line Chart - Real Market Data',
dataSource: 'Data source',
supportZoom: 'Supports zoom & drag',
close: 'Close',
change: 'Change',
volume: 'Volume',
billion: 'B',
period: 'Period',
adjust: 'Adjust',
daily: 'Daily',
dailyK: 'Daily',
min60: '60min',
min30: '30min',
min15: '15min',
min5: '5min',
min1: '1min',
qfq: 'Forward Adjusted',
qfqTip: 'Eliminates ex-dividend gaps, maintains continuity (Recommended)',
noAdjust: 'No Adjustment',
noAdjustTip: 'Shows actual trading prices, may have ex-dividend gaps',
hfq: 'Backward Adjusted',
hfqTip: 'Based on IPO date, prices may be very high',
recommendLabel: 'Recommend',
timeLabel: 'Time',
openLabel: 'Open',
highLabel: 'High',
lowLabel: 'Low',
closeLabel: 'Close',
volumeLabel: 'Volume',
parallelAnalysis: 'Parallel Analysis',
parallelAnalysisDesc: 'Bull/Bear parallel analysis, Investment Manager summarizes decision',
realtimeDebate: 'Real-time Debate',
realtimeDebateDesc: 'Four agents real-time dialogue, Investment Manager moderates, Bull/Bear alternate',
quickAnalysis: 'Quick Analysis',
quickAnalysisDesc: 'Single analyst quick recommendation, suitable for time-sensitive scenarios',
result: 'Result',
historySessionLoaded: 'Loaded history session',
detectIncompleteSession: 'Detected incomplete',
session: 'session',
messages: 'messages',
restore: 'Restore?',
analysis: 'Analysis',
analysisModeConfig: 'Analysis Mode Config',
default: 'Default',
parallelExecution: 'Parallel Execution',
about2to3min: '~2-3 min',
realtimeDialogue: 'Real-time Dialogue',
fourAgents: '4 Agents',
about5to10min: '~5-10 min',
singleAgent: 'Single Agent',
about1min: '~1 min',
advancedConfig: 'Advanced Config',
maxExecutionTime: 'Max Execution Time',
seconds: 's',
maxDebateRounds: 'Max Debate Rounds',
rounds: 'rounds',
managerCanInterrupt: 'Manager Can Interrupt',
collectDataBeforeDebate: 'Collect Data Before Debate',
executionTime: 'Time',
news: 'Related News',
newsContain: 'Contains',
newsTotal: '',
fold: 'Collapse',
expand: 'Expand',
clearData: 'Clear Data',
clearing: 'Clearing...',
crawlComplete: 'Crawl Complete',
crawlFailed: 'Crawl Failed',
crawling: 'Crawling...',
stop: 'Stop',
updateCrawl: 'Update Crawl',
targetCrawl: 'Target Crawl',
noRelatedNews: 'No related news',
clickCrawl: 'Click "Target Crawl" to fetch news for this stock',
loadMore: 'Load More',
remaining: '',
showAll: 'Showing all',
newsFolded: 'News collapsed, click "Expand" to view',
sentimentTrend: 'News Sentiment Trend',
sentimentDesc: '30-day sentiment distribution and average',
positive: 'Positive',
negative: 'Negative',
neutral: 'Neutral',
avgSentiment: 'Avg Sentiment',
bullBear: 'Bull vs Bear Agent Debate',
bullBearDesc: 'Bull Researcher vs Bear Researcher, Investment Manager decides',
startDebate: 'Start Debate',
debating: 'Debating...',
analysisMode: 'Analysis Mode',
bullView: 'Bull View',
bearView: 'Bear View',
managerDecision: 'Manager Decision',
waitingAnalysis: 'Waiting for analysis...',
waitingDecision: 'Waiting for bull/bear analysis to complete...',
clickDebate: 'Click "Start Debate" to begin agent analysis',
debateDesc: 'System will call Bull/Bear researchers for multi-angle analysis, with Investment Manager making final decision',
backToSearch: 'Back to Search',
history: 'History',
copy: 'Copy',
export: 'Export',
regenerate: 'Regenerate',
stronglyRec: 'Strongly Recommend',
recommend: 'Recommend',
avoid: 'Avoid',
caution: 'Caution',
strongBull: 'Strong Positive',
strongBear: 'Strong Negative',
noKline: 'No K-line data',
checkCode: 'Please check if the stock code is correct',
sessionRestored: 'Session restored',
debateComplete: 'Debate analysis complete!',
outputting: 'Outputting...',
deciding: 'Deciding...',
analysisComplete: 'Analysis complete',
analysisGenerating: 'Analysis generating...',
decisionGenerating: 'Decision generating...',
debateFailed: 'Debate analysis failed',
sessionDeleted: 'Session deleted',
allHistoryCleared: 'All history cleared',
searchCancelled: 'Search task cancelled',
crawlTaskStarted: 'Targeted crawl task started',
crawlingInProgress: 'Crawling in progress...',
crawlTaskExists: 'This stock already has a crawl task in progress, syncing status...',
crawlTaskStopped: 'Crawl task stopped',
crawlTaskStopFailed: 'Failed to stop task',
newsCleared: 'Cleared',
newsItems: 'news items',
clearNewsConfirm: 'Are you sure you want to clear all news for "',
clearNewsConfirmEnd: '"? This action cannot be undone!',
stopCrawlConfirm: 'Are you sure you want to stop the current crawl task?',
knowledgeGraph: 'Knowledge Graph · Intelligent Retrieval',
knowledgeGraphDesc: 'Concurrent retrieval based on multi-dimensional keywords to improve recall',
nameVariants: 'Name Variants',
mainBusiness: 'Main Business',
relatedConcepts: 'Related Concepts',
concurrentQueries: 'Concurrent Retrieval Queries',
bullResearcher: 'Bull Researcher',
bearResearcher: 'Bear Researcher',
investmentManager: 'Investment Manager',
generatingSearchPlan: 'Generating search plan...',
deleteSessionConfirm: 'Are you sure you want to delete this record?',
clearAllHistoryConfirm: 'Are you sure you want to clear all history? This action cannot be undone!',
clearAllRecords: 'Clear All Records',
crawlSuccess: 'Targeted crawl complete! Added',
unknownError: 'Unknown error',
taskCreated: 'Task created, waiting for execution...',
},
alphaMining: {
training: {
title: 'RL Training Monitor',
desc: 'Transformer + REINFORCE algorithm real-time training progress',
ready: 'Ready',
running: 'Training',
completed: 'Completed',
error: 'Error',
steps: 'Training Steps',
useSentiment: 'Use Sentiment Features',
stop: 'Stop',
start: 'Start Training',
progress: 'Training Progress',
bestFactor: 'Current Best Factor',
convergence: 'Convergence Curve',
trainingFailed: 'Training failed',
},
metrics: {
noData: 'No evaluation data',
hint: 'Please evaluate a factor expression first',
currentFactor: 'Current Factor',
multiDim: 'Multi-dimensional Evaluation',
riskMetrics: 'Risk Metrics',
maxDrawdown: 'Max Drawdown',
safe: 'Safe',
danger: 'Danger',
dailyTurnover: 'Daily Turnover',
winRate: 'Win Rate',
totalReturn: 'Total Return',
returnsCurve: 'Returns Curve',
returnsDesc: 'Strategy cumulative returns vs benchmark',
strategy: 'Strategy',
benchmark: 'Benchmark',
metricDesc: 'Metric Description',
sortinoDesc: 'Sortino: Higher is better, >1 excellent',
sharpeDesc: 'Sharpe: Higher is better, >0.5 good',
icDesc: 'IC: |value|>0.03 effective',
maxDDDesc: 'Max DD: <20% safe',
excellent: 'Excellent',
good: 'Good',
average: 'Average',
poor: 'Poor',
lowTurnover: 'Low Turnover',
},
sentiment: {
title: 'Sentiment Fusion Comparison',
desc: 'Compare pure technical factors vs sentiment-enhanced factors',
steps: 'Training Steps',
comparing: 'Comparing...',
start: 'Start Comparison',
techOnly: 'Pure Technical Factors',
techDesc: ' features (RET, VOL, VOLUME_CHG, TURNOVER)',
enhanced: 'Sentiment-Enhanced Factors',
enhancedDesc: ' features (+SENTIMENT, NEWS_COUNT)',
bestFactor: 'Best Factor',
none: 'None',
improvement: 'Improvement',
improved: 'Sentiment features improved factor performance',
degraded: 'Sentiment features degraded factor performance',
scoreDiff: 'Score Difference',
comparison: 'Score Comparison',
techOnlyBar: 'Technical Only',
enhancedBar: 'With Sentiment',
conclusion: 'Conclusion:',
conclusionPositive: 'Sentiment features (SENTIMENT, NEWS_COUNT) contribute positively to factor mining. It is recommended to enable sentiment fusion in practical applications.',
conclusionNegative: 'In this experiment, sentiment features did not improve performance. Possible reasons include insufficient sample size, sentiment data noise, or too few training steps. It is recommended to increase training steps and retry.',
comparingText: 'Comparison experiment in progress...',
comparingHint: 'Training pure technical factors and sentiment-enhanced factors separately,',
stepsText: ' steps each',
startHint: 'Click "Start Comparison" to run sentiment fusion experiment',
startDesc: 'Will train pure technical factors and sentiment-enhanced factors separately for comparison',
comparisonFailed: 'Comparison failed',
},
agent: {
title: 'AgenticX Agent Call Demo',
desc: 'Demonstrates how Agent calls AlphaMiningTool for factor mining',
success: 'Success',
failed: 'Failed',
toolParams: 'Tool Parameters',
stockCode: 'Stock Code (Optional)',
stockPlaceholder: 'e.g. SH600519',
steps: 'Training Steps',
useSentiment: 'Use Sentiment Features',
executing: 'Executing...',
execute: 'Execute Agent Call',
inputParams: 'Input Parameters',
output: 'Output Result',
executionTime: 'Execution Time',
bestFactor: 'Best Factor',
logs: 'Execution Logs',
codeExample: 'Python Call Example',
executeFailed: 'Execution failed',
startHint: 'Configure parameters and click "Execute Agent Call"',
startDesc: 'Will demonstrate how QuantitativeAgent performs factor mining through AlphaMiningTool',
miningTask: 'Mine quantitative factors for {code}',
createAgent: 'Create Agent',
registerTool: 'Register Tool',
executeMining: 'Execute factor mining',
},
operators: {
all: 'All',
availableFeatures: 'Available Features',
techFeature: 'Technical Feature',
sentimentFeature: 'Sentiment Feature',
totalOperators: '{count} Operators',
totalFeatures: '{count} Features',
params: ' params',
categoryArithmetic: 'Arithmetic',
categoryUnary: 'Unary',
categoryTimeseries: 'Time Series',
categoryConditional: 'Conditional',
categorySpecial: 'Special',
add: 'Addition',
sub: 'Subtraction',
mul: 'Multiplication',
div: 'Division (Safe)',
neg: 'Negate',
abs: 'Absolute Value',
sign: 'Sign Function',
gate: 'Conditional Select',
max: 'Maximum',
min: 'Minimum',
delay1: 'Delay 1 Period',
delay5: 'Delay 5 Periods',
delta1: '1-Period Difference',
delta5: '5-Period Difference',
ma5: '5-Period Moving Average',
ma10: '10-Period Moving Average',
std5: '5-Period Standard Deviation',
std10: '10-Period Standard Deviation',
jump: 'Jump Detection',
jumpExample: 'Detect >3σ outliers',
decay: 'Decay Weighted',
max3: '3-Period Maximum',
},
},
},
};
// Hook: resolve the i18n message table for the currently active language.
export const useGlobalI18n = () => globalI18n[useLanguageStore().lang];
================================================
FILE: frontend/src/store/useNewsStore.ts
================================================
import { create } from 'zustand'
import type { News } from '@/types/api'
// Store contract for the news feed.
interface NewsStore {
  newsList: News[]
  selectedNews: News | null
  setNewsList: (news: News[]) => void
  setSelectedNews: (news: News | null) => void
  // Bug fix: bare `Partial` is invalid TS (the utility type requires a
  // type argument); Partial<News> lets callers patch any subset of fields.
  updateNews: (newsId: number, updates: Partial<News>) => void
}
export const useNewsStore = create((set) => ({
newsList: [],
selectedNews: null,
setNewsList: (news) => set({ newsList: news }),
setSelectedNews: (news) => set({ selectedNews: news }),
updateNews: (newsId, updates) =>
set((state) => ({
newsList: state.newsList.map((news) =>
news.id === newsId ? { ...news, ...updates } : news
),
})),
}))
================================================
FILE: frontend/src/store/useTaskStore.ts
================================================
import { create } from 'zustand'
import type { CrawlTask, TaskStats } from '@/types/api'
// Store contract for crawl-task monitoring.
interface TaskStore {
  tasks: CrawlTask[]
  taskStats: TaskStats | null
  setTasks: (tasks: CrawlTask[]) => void
  setTaskStats: (stats: TaskStats) => void
  addTask: (task: CrawlTask) => void
  // Bug fix: bare `Partial` is invalid TS (requires a type argument);
  // Partial<CrawlTask> lets callers patch any subset of task fields.
  updateTask: (taskId: number, updates: Partial<CrawlTask>) => void
}
export const useTaskStore = create((set) => ({
tasks: [],
taskStats: null,
setTasks: (tasks) => set({ tasks }),
setTaskStats: (stats) => set({ taskStats: stats }),
addTask: (task) =>
set((state) => ({
tasks: [task, ...state.tasks],
})),
updateTask: (taskId, updates) =>
set((state) => ({
tasks: state.tasks.map((task) =>
task.id === taskId ? { ...task, ...updates } : task
),
})),
}))
================================================
FILE: frontend/src/types/api.ts
================================================
/**
 * API type definitions.
 * Kept in sync with the backend API response structures.
 */
// A news article record as returned by the backend news API.
export interface News {
  id: number
  title: string
  content: string
  url: string
  source: string
  publish_time: string | null
  created_at: string
  stock_codes: string[] | null
  sentiment_score: number | null
  author: string | null
  keywords: string[] | null
}
// One agent's analysis result for a single news article.
export interface Analysis {
  id: number
  news_id: number
  agent_name: string
  agent_role: string | null
  analysis_result: string
  summary: string | null
  sentiment: 'positive' | 'negative' | 'neutral' | null
  sentiment_score: number | null
  confidence: number | null
  execution_time: number | null
  created_at: string
}
// A crawl task record with its status, progress and result counters.
export interface CrawlTask {
  id: number
  celery_task_id: string | null
  mode: 'cold_start' | 'realtime' | 'targeted'
  status: 'pending' | 'running' | 'completed' | 'failed' | 'cancelled'
  source: string
  // Bug fix: bare `Record` is invalid TS (requires two type arguments).
  // The config/result shapes are backend-defined, so they stay open.
  config: Record<string, any> | null
  progress: {
    current_page?: number
    total_pages?: number
    percentage?: number
  } | null
  current_page: number | null
  total_pages: number | null
  result: Record<string, any> | null
  crawled_count: number
  saved_count: number
  error_message: string | null
  execution_time: number | null
  created_at: string
  started_at: string | null
  completed_at: string | null
}
// Aggregated task statistics.
export interface TaskStats {
  total: number
  // Bug fix: bare `Record` is invalid TS (requires two type arguments).
  // Keyed by status/mode name; value shape is backend-defined, so it is
  // kept open — narrow to Record<string, number> once confirmed against
  // the API payload.
  by_status: Record<string, any>
  by_mode: Record<string, any>
  recent_completed: number
  total_news_crawled: number
  total_news_saved: number
}
// Request body for launching a crawl over a page range.
export interface CrawlRequest {
  source: string
  start_page: number
  end_page: number
}
// Response of a crawl request with result counters.
export interface CrawlResponse {
  success: boolean
  message: string
  crawled_count: number
  saved_count: number
  source: string
}
// Response of a single-news analysis call; optional fields depend on
// the outcome (`error` is present on failure).
export interface AnalysisResponse {
  success: boolean
  analysis_id?: number
  news_id: number
  sentiment?: string
  sentiment_score?: number
  confidence?: number
  summary?: string
  execution_time?: number
  error?: string
}
// ============ Phase 2: Stock analysis types ============
// Aggregated news/sentiment overview for one stock.
export interface StockOverview {
  code: string
  name: string | null
  total_news: number
  analyzed_news: number
  avg_sentiment: number | null
  recent_sentiment: number | null
  sentiment_trend: 'up' | 'down' | 'stable'
  last_news_time: string | null
}
// A news item in a stock's related-news list.
export interface StockNewsItem {
  id: number
  title: string
  content: string
  url: string
  source: string
  publish_time: string | null
  sentiment_score: number | null
  has_analysis: boolean
}
// One date bucket of aggregated sentiment counts for the trend chart.
export interface SentimentTrendPoint {
  date: string
  avg_sentiment: number
  news_count: number
  positive_count: number
  negative_count: number
  neutral_count: number
}
// One OHLCV candle for the K-line chart.
export interface KLineDataPoint {
  timestamp: number // timestamp in milliseconds
  date: string
  open: number
  high: number
  low: number
  close: number
  volume: number
  turnover?: number // turnover amount
  change_percent?: number // price change percentage
  change_amount?: number // price change amount
  amplitude?: number // amplitude
  turnover_rate?: number // turnover rate
}
export interface RealtimeQuote {
code: string
name: string
price: number
change_percent: number
change_amount: number
volume: number
turnover: number
high: number
low: number
open: number
prev_close: number
}
// ============ Phase 2: agent-debate types ============

/** Request to start a bull/bear agent debate about one stock. */
export interface DebateRequest {
  stock_code: string
  stock_name?: string
  context?: string
  provider?: string
  model?: string
  mode?: 'parallel' | 'realtime_debate' | 'quick_analysis' // debate mode
  language?: 'zh' | 'en' // language the AI should answer in
}

/** One side's analysis produced by a debate agent. */
export interface AgentAnalysis {
  success: boolean
  agent_name: string
  agent_role?: string
  stance: 'bull' | 'bear'
  analysis?: string
  error?: string
  timestamp?: string
}

/** Final verdict issued by the deciding agent. */
export interface FinalDecision {
  success: boolean
  agent_name: string
  agent_role?: string
  decision?: string
  rating?: string
  error?: string
  timestamp?: string
}

/** One recorded step of the debate workflow. */
export interface TrajectoryStep {
  step: string
  timestamp: string
  // FIX: bare `Record` does not compile (Record<K, V> needs type args);
  // the payload schema varies per step, so keep values as `unknown`.
  data: Record<string, unknown>
}

/** Result of the lightweight single-agent analysis mode. */
export interface QuickAnalysisResult {
  success: boolean
  analysis?: string
  timestamp?: string
  error?: string
}

/** One utterance in the debate transcript. */
export interface DebateHistoryItem {
  round: number
  agent: string
  type: string
  content: string
}

/** Full response for a debate request; populated fields depend on `mode`. */
export interface DebateResponse {
  success: boolean
  debate_id?: string
  stock_code: string
  stock_name?: string
  mode?: 'parallel' | 'realtime_debate' | 'quick_analysis'
  bull_analysis?: AgentAnalysis
  bear_analysis?: AgentAnalysis
  final_decision?: FinalDecision
  quick_analysis?: QuickAnalysisResult
  debate_history?: DebateHistoryItem[]
  trajectory?: TrajectoryStep[]
  execution_time?: number
  error?: string
}
// ============ Phase 2: agent-monitoring types ============

/** Single entry in the agent activity log. */
export interface AgentLogEntry {
  id: string
  timestamp: string
  agent_name: string
  agent_role?: string
  action: string
  status: 'started' | 'completed' | 'failed'
  // FIX: bare `Record` does not compile; the details payload is free-form.
  details?: Record<string, unknown>
  execution_time?: number
}

/** Aggregated execution metrics across all agents. */
export interface AgentMetrics {
  total_executions: number
  successful_executions: number
  failed_executions: number
  avg_execution_time: number
  // FIX: bare `Record` does not compile. Keyed by agent name; the per-agent
  // stats schema is not visible here, so values stay `unknown`.
  agent_stats: Record<string, unknown>
  recent_activity: Array<{
    timestamp: string
    agent_name: string
    action: string
    status: string
  }>
}

/** Static description of a registered agent. */
export interface AgentInfo {
  name: string
  role: string
  description: string
  status: 'active' | 'inactive'
}

/** Static description of a multi-agent workflow. */
export interface WorkflowInfo {
  name: string
  description: string
  agents: string[]
  status: 'active' | 'inactive'
}
================================================
FILE: frontend/tailwind.config.js
================================================
/** @type {import('tailwindcss').Config} */
// shadcn/ui-style Tailwind setup: theme colors resolve through CSS custom
// properties (hsl(var(--...))), so palettes are swapped by toggling the
// `dark` class on the document root.
export default {
  darkMode: ["class"],
  // Files scanned for class names during the JIT build.
  content: [
    './pages/**/*.{ts,tsx}',
    './components/**/*.{ts,tsx}',
    './app/**/*.{ts,tsx}',
    './src/**/*.{ts,tsx}',
  ],
  prefix: "",
  theme: {
    container: {
      center: true,
      padding: "2rem",
      screens: {
        "2xl": "1400px",
      },
    },
    extend: {
      // Semantic color tokens backed by CSS variables defined in the app's
      // global stylesheet.
      colors: {
        border: "hsl(var(--border))",
        input: "hsl(var(--input))",
        ring: "hsl(var(--ring))",
        background: "hsl(var(--background))",
        foreground: "hsl(var(--foreground))",
        primary: {
          DEFAULT: "hsl(var(--primary))",
          foreground: "hsl(var(--primary-foreground))",
        },
        secondary: {
          DEFAULT: "hsl(var(--secondary))",
          foreground: "hsl(var(--secondary-foreground))",
        },
        destructive: {
          DEFAULT: "hsl(var(--destructive))",
          foreground: "hsl(var(--destructive-foreground))",
        },
        muted: {
          DEFAULT: "hsl(var(--muted))",
          foreground: "hsl(var(--muted-foreground))",
        },
        accent: {
          DEFAULT: "hsl(var(--accent))",
          foreground: "hsl(var(--accent-foreground))",
        },
        popover: {
          DEFAULT: "hsl(var(--popover))",
          foreground: "hsl(var(--popover-foreground))",
        },
        card: {
          DEFAULT: "hsl(var(--card))",
          foreground: "hsl(var(--card-foreground))",
        },
      },
      // Radii derived from a single --radius variable for consistent corners.
      borderRadius: {
        lg: "var(--radius)",
        md: "calc(var(--radius) - 2px)",
        sm: "calc(var(--radius) - 4px)",
      },
      // Animations used by the Radix accordion primitives.
      keyframes: {
        "accordion-down": {
          from: { height: "0" },
          to: { height: "var(--radix-accordion-content-height)" },
        },
        "accordion-up": {
          from: { height: "var(--radix-accordion-content-height)" },
          to: { height: "0" },
        },
      },
      animation: {
        "accordion-down": "accordion-down 0.2s ease-out",
        "accordion-up": "accordion-up 0.2s ease-out",
      },
    },
  },
  plugins: [require("tailwindcss-animate")],
}
================================================
FILE: frontend/tsconfig.json
================================================
{
  "compilerOptions": {
    /* Language and environment */
    "target": "ES2020",
    "useDefineForClassFields": true,
    "lib": ["ES2020", "DOM", "DOM.Iterable"],
    "module": "ESNext",
    "skipLibCheck": true,
    /* Bundler mode */
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx",
    /* Linting */
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true,
    /* Path mapping (mirrors the '@' alias in vite.config.ts) */
    "baseUrl": ".",
    "paths": {
      "@/*": ["./src/*"]
    }
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
}
================================================
FILE: frontend/tsconfig.node.json
================================================
{
  /* Separate TS project for files executed by Node (the Vite config). */
  "compilerOptions": {
    "composite": true,
    "skipLibCheck": true,
    "module": "ESNext",
    "moduleResolution": "bundler",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
}
================================================
FILE: frontend/vite.config.ts
================================================
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react-swc'
import path from 'path'

// https://vitejs.dev/config/
export default defineConfig({
  plugins: [react()],
  resolve: {
    alias: {
      // '@' -> ./src, kept in sync with the "paths" entry in tsconfig.json.
      '@': path.resolve(__dirname, './src'),
    },
  },
  server: {
    port: 3000,
    proxy: {
      // Forward /api/* to the backend server during development.
      '/api': {
        target: 'http://localhost:8000',
        changeOrigin: true,
      },
    },
  },
})
================================================
FILE: legacy_v1/.deepsource.toml
================================================
# DeepSource static-analysis configuration.
version = 1
[[analyzers]]
name = "python"
[analyzers.meta]
# Accept any Python 3.x runtime.
runtime_version = "3.x.x"
================================================
FILE: legacy_v1/Chinese_Stop_Words.txt
================================================
ÿ
ǰ
ת
λ
֤ȯ
ο
Υ߱ؾ
£
:
&
*
һһ
~~~~
.
.һ
./
--
ۣ
ۢݣݣ
ۢ٣ģ
P
//
ۢڣ
ۢڣ
}
Ҳ
ۢ٢ޣ
ۢڣ£
ۢ٣
ۢܣ
ۢ٢ۣ
ۣۢ
ۣ
ۢڣ
ۢ٢
ۢݣ
ۢڣ
ۢܣ
ۢڢۣ
ۣۢ
ۢܣ
ۢ٢ݣ
ۢ٢ߣ
ۢ٣
ʣ
ۢ٢
ۢ٢ܣ
ۢ٣
ۢڣ
ۢڢ
ۢڢ٣
ۢ٣ã
ۣۢ
ۣۢ
ۢڢݣ
ۢڢڣ
һ.
ۢ٣
.
ۣ
ۢ٣£
/
ۢ٣
ۣۢ
ۢ٢٣
ۢܣ
ۢܣ
ۣۢ
ۢݣ
ۢ٣
ۢڢ
ۢڢߣ
ۢ٣
ۢڣ
ݣ
://
ۢڢ
ۢݣ
...
...................
ڣأƣɣԣ
ۣۢƣ
ۢ٣
ݡġ䣽
Ȧա
ڣ
ۢۢ٣
ң̣
ۢ٣ţ
ۣݣ
.
ۢڣ
ۢ
ۢڢߣ
ۢڢڣ
ۣۢ
ۢ٣
ۢ٣£
ۢ٣
ۢ٣
ۢ٣
ۢ٢ڣ
ۢڣ
ۢ
ۢ٣
ۢڣ
ۢڢޣ
ۣۢ
ۢڢ
Ԫ
ۢڢ
ۢ٣
::
ۢڣ
ۣۢ
ۢܣ
ۢݣ
ۢޣ
ۢߣ
ۢ
ۢ
?
,
'
?
?
<
>
[
]
(
)
-
+
/
"
;
#
@
գ
sub
exp
sup
sub
Lex
=
ۢݣ
ۢݣ
ۢڣ
ۢڣǣ
ۢ٣
̣
ۣ
......
ʵϰ
ѽ
Ӵ
ȷ
˴
˵
Ȼ
Ω
ֻ
֮
˼
Ӷ
Ļ
ȵ
˵
֮
ǵ
ͽ
µ
λ
ʴ
Ȼ
Ȼ
δ
ο
ʱ
仰˵
֮
ʹ
ʱ
Ȼ
̶
֮
ʹ
֮
˵
˵
˵
ʼ
ɼ
ͬ
һ
˵
˵
ð
ô
ÿ
ÿ
Ī
ij
ij
ijЩ
ı
Ķ
ĸ
Щ
DZ
Ƕ
Ǹ
ǻ
ô
ôЩ
ô
ʱ
Щ
Ը
Ŷ
Ż
ž
ƾ
ƾ
һ
ǡǡ෴
ǰ
ǰ
Ȼ
Ȼ
Ȼ
˼
κ
ƾ
ɶ
ʹ
ô
ʡ
ʱ
ʲô
ʲô
ʹ
ǵ
˭
˭֪
˳
˳
Ƶ
Ȼ
˵
Ȼ
Ȼ
ʹ
ͨ
ͬ
ͬʱ
һ
Ϊ
Ϊ
Ϊ
Ϊʲô
Ϊ
ι
غ
ں
Զ
ѽ
Ҫ
Ҫ
ҪȻ
Ҫ
Ҫô
Ҫ
Ҳ
Ҳ
Ҳ
һ
һ
һ
һ
һ
һ
һ
һ
Ա
Լ
ֻ
Ϊ
Ӵ
ɴ˿ɼ
е
й
Щ
Ǻ
ͬʱ
Խ
˵
ô
ô
ô
զ
˵
ô
ô
ôЩ
ô
ʱ
Щ
֨
֮
֮
֮
֮һ
ֻ
ֻ
ֻҪ
ֻ
λ
Դ
Ը
Ը
Լ
Լ
ܵ
ܵ˵
ܵ˵
֮ܶ
֮
Ȼ
ʹ
Ϊ
ѽ
Ӵ
Ұ
Ű
ʱ
˵
Ȼ
˳
װ
˵
Ͼ
ض
ؽ
û
û
Ȼ
ò
ɿ
ɿ
ܲ
ȻĻ
ʤ
ʱ
ͬ
Ҫ
ֺ
ɵ
ֶ
ô
֪
ֹ
ֹһ
Ե
һ
Ե
˵
˵ú
ȥ
˵
ҹ
ñ
û
˻
ʤ
϶
Ȼ
伫
ȥ
˶
ȥ
ȴ
Ϣ
˵
˺
ε
Ҵ
Ӳ
Ӵ
ӴԺ
ӹŵ
ӹ
ӽԺ
ӿ
ͷ
δ
С
絽
ﵩ
촰˵
Լ
Ը
ָ֮
ڶ
Ȼ
ͥ
ͷ
˵
˶
ĿǰΪֹ
ͷ
ͷ
ȷ
ȵ
Ȼ
Ȼ
ʱ
ǰ
˵
û˵
֮Ȼ
֮
dz
ǵ
ڷ
ͷ
Ȼ
¸
õ
Ͽ
粻
ղ
պ
ߵ
ҹ
ʽ
һ
Ϊ
Ȼ
Ƶ
ʶ
ֲ
߳
ޱ
α
γ
η
ο
ֶΪ
ֹ
ܶ
Ȼ
Ȼ
˵
Ȼ
Ȼ
ͬ
Ϊ
Ҵ
˵
...
֮
֮
֮
ֱ
Ҫ
ϱ
Ϊ
Կ
Ȼ
ʱ
ȥ
Ȼ
Ľ
ľ
Ȼ
ʹ
͵
Ȼ
ٷ
ݳ
ݴ
ʵ
˵
֪
Ϥ
˵
ȥ
ɺ
Ҫ
ü
ϴ
ʵʵ
۴
Ӧ
ʱ
ٵ
һ
·
Ŵ
Ŵ
ʶ
Ȼ
Լ
Ϊ
˵
û
û
ÿ
ÿÿ
ÿʱÿ
Ȼ
Ȼ
Ī
Ī
Ī
Ī
ĬĬ
ĬȻ
ĩ
ѵ
ѵ
ѹ
˵
긴һ
ż
ż
Ʃ
ƫƫ
ƹ
ƽ
ͨ
ʵ
ͷ
ֹ
ǡ
ǡ
ǡǡ
ǡ
ǡ
ǡ
ǧ
ǧ
ǧǧ
в
Ī
̼
֮
ȡ
ȥ
Ȩʱ
ȫ
ȫ
ȫ
ȫȻ
ȫ
Ȼ
Ծ
Ȼ
ոһ
ռ
ս
糣
˵ȵ
ǰ
ͷ
ɪɪ
ɳɳ
ȥ
һ.
һһ
һ
һ
һЩ
һ
һͨ
һ
һ
һʱ
һ
һƬ
һ
һֱ
һ
һ
һת
һ
һ
ȥ
һ
Ȼ
˵
ר
Ҳ˵
˵
ϸ
С
м
ḻ
Ϊ
Ϊʲ
Ϊֹ
Ϊ
Ҫ
֮ǰ
֮
֮
Ҳ˵
Ҳ
˽
ȡ
ƶ
Щ
ʲ
Ϊ
ǰ
Ժ
Թ
ͼ
ΰ
ƺ
ʹ
ʹ
ٽ
Ȼ
Ԫ
Ȳ
Ⱥ
ȫ
ȫ
ȫ
ͬ
֮
ٴ
˵
ֱ
ǰ
ǰ
ǰ
ǿ
ʮ
ȴ
ȴ
ԭ
ּ
ʱ
˫
Ӧ
ӳ
ȡ
ܵ
Ϥ
ֻ
ֻ
ֻ
ֻ
ٿ
ͬһ
ͬ
ʹ
Χ
Ǻ
Ψ
ॵ
ٺ
ô
ʧȥ
õ
ͬ
ʼ
֪
ǵ
ȫ
ȫ
ʵ
ʵ
Ӧ
Դ
Է
Ա
С
Ҫ
Ѿ
Ͱ
㷺
Ӧ
Ӧ
Ӧ
չ
ǿ
ǿ
ǰ
ʱ
γ
ʱ
ó
õ
Ȼ
Ҫ
ܽ
Ω
˼
Ը
Ϊ
ҵ
Ի
ս
ν
/
ȷ
Dz
Ƿ
Ȼ
ͨ
ձ
м
Ч
ʱ
е
е
ĩ##ĩ
˵
ijij
ӭ
ֵ
˵
˴
ʱ
˴
ÿ
ÿ
ÿ
ȼ
Ƚ
ûκ
ע
Ȼ
ر
ص
ִ
ɴ
Ŀǰ
ֱ
ֱ
෴
ͬ
Ӧ
൱
գ
Ӻ
֪
ȷ
ƶ
ͻ
ͻȻ
ڶ
ϰ
̺
ά
ϵ
ܷ
ܹ
Ժ
Դ
Χ
ĪȻ
Ϊ
ж
ʾ
Ҫ
涨
Ʃ
Ϊ
ʶ
˵
˵
˵˵
˭
˭
ת
ת
ת
ﵽ
Ѹ
ȥ
Ҫ
һ
Ӧ
ʵ
ͨ
ѭ
ǰ
ȡ
ش
Ҫ
ֹ
ʱ
ѵ˵
Ҫ
Ƕ
================================================
FILE: legacy_v1/Crawler/__init__.py
================================================
================================================
FILE: legacy_v1/Crawler/crawler_cnstock.py
================================================
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 3 13:41:50 2018
@author: Damon Li
"""
import time, re, requests
from concurrent import futures
from bs4 import BeautifulSoup
from pymongo import MongoClient
import Text_Analysis.text_mining as tm
import gevent
from gevent import monkey,pool
monkey.patch_all()
class WebCrawlFromcnstock(object):
    '''Crawl company news from the cnstock listing pages:
    'http://company.cnstock.com/company/scp_gsxw/1',
    'http://ggjd.cnstock.com/gglist/search/qmtbbdj/1',
    'http://ggjd.cnstock.com/gglist/search/ggkx/1'.

    # Arguments:
        ThreadsNum: Number of threads to start for history crawling.
        dbName: Name of the MongoDB database.
        collectionName: Name of the MongoDB collection.
        IP: MongoDB host address.
        PORT: MongoDB port number.
    '''

    def __init__(self, **kwarg):
        self.ThreadsNum = kwarg['ThreadsNum']
        self.dbName = kwarg['dbName']
        self.colName = kwarg['collectionName']
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        # Minimum Chinese-character ratio a <p> tag must reach to be
        # treated as article body text (relaxed on retries).
        self.Prob = .5
        # URLs already seen during realtime crawling (session-level dedup).
        self.realtimeNewsURL = []
        self.tm = tm.TextMining(IP="localhost", PORT=27017)

    def ConnDB(self):
        '''Connect to MongoDB and cache the target collection.'''
        Conn = MongoClient(self.IP, self.PORT)
        db = Conn[self.dbName]
        self._collection = db.get_collection(self.colName)

    def countchn(self, string):
        '''Count characters in the CJK-ish range and return
        (count, count / total_length).

        # Arguments:
            string: Each part of a crawled page stringified by BeautifulSoup.
        '''
        # NOTE(review): the trailing 'h' in the character class looks
        # accidental (it also counts the letter 'h'); kept as-is for
        # behavior compatibility -- confirm before changing.
        pattern = re.compile(u'[\u1100-\uFFFDh]+?')
        result = pattern.findall(string)
        chnnum = len(result)
        total = len(str(string))
        if total == 0:
            # BUGFIX: avoid ZeroDivisionError on empty input.
            return (0, 0.0)
        possible = chnnum / total
        return (chnnum, possible)

    def getUrlInfo(self, url):
        '''Fetch a news page and extract (date, article_text).'''
        respond = requests.get(url)
        # Let BeautifulSoup sniff the real encoding before re-decoding.
        respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
        bs = BeautifulSoup(respond.text, "lxml")
        span_list = bs.find_all('span')
        part = bs.find_all('p')
        article = ''
        date = ''
        # The release date lives in <span class="timer">.
        for span in span_list:
            if 'class' in span.attrs and span['class'] == ['timer']:
                date = span.text
                break
        # Keep only paragraphs that are mostly Chinese text.
        for paragraph in part:
            chnstatus = self.countchn(str(paragraph))
            possible = chnstatus[1]
            if possible > self.Prob:
                article += str(paragraph)
        # Strip residual '<...>' tag fragments and full-width spaces.
        while article.find('<') != -1 and article.find('>') != -1:
            string = article[article.find('<'):article.find('>')+1]
            article = article.replace(string, '')
        while article.find('\u3000') != -1:
            article = article.replace('\u3000', '')
        article = ' '.join(re.split(' +|\n+', article)).strip()
        return date, article

    def GenPagesLst(self, totalPages, Range, initPageID):
        '''Split page ids [initPageID, totalPages] into (start, end) chunks
        of size Range; the last chunk may be smaller.
        '''
        PageLst = []
        k = initPageID
        while k + Range - 1 <= totalPages:
            PageLst.append((k, k + Range - 1))
            k += Range
        # BUGFIX: the original tail condition (k+Range-1 < totalPages) is
        # always false after the loop, silently dropping the final partial
        # chunk; append the remainder whenever pages are left.
        if k <= totalPages:
            PageLst.append((k, totalPages))
        return PageLst

    def CrawlHistoryCompanyNews(self, startPage, endPage, url_Part_1):
        '''Crawl historical company news for pages [startPage, endPage].

        On a fresh collection every article is stored; otherwise only URLs
        absent from the stored 'Address' column are fetched.
        '''
        self.ConnDB()
        AddressLst = self.extractData(['Address'])[0]
        if AddressLst == []:
            urls = []
            for pageId in range(startPage, endPage + 1):
                urls.append(url_Part_1 + str(pageId))
            for url in urls:
                print(url)
                resp = requests.get(url)
                resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
                bs = BeautifulSoup(resp.text, "lxml")
                a_list = bs.find_all('a')
                for a in a_list:
                    if 'href' in a.attrs and 'target' in a.attrs and 'title' in a.attrs \
                            and a['href'].find('http://company.cnstock.com/company/') != -1 \
                            and a.parent.find('span'):
                        date, article = self.getUrlInfo(a['href'])
                        # Relax the Chinese-ratio threshold until body text is
                        # extracted, then restore the default.
                        while article == '' and self.Prob >= .1:
                            # BUGFIX: was '-= .193', inconsistent with every
                            # other retry loop and able to push Prob negative.
                            self.Prob -= .1
                            date, article = self.getUrlInfo(a['href'])
                        self.Prob = .5
                        if article != '':
                            data = {'Date': date,
                                    'Address': a['href'],
                                    'Title': a['title'],
                                    'Article': article}
                            self._collection.insert_one(data)
        else:
            urls = []
            for pageId in range(startPage, endPage + 1):
                urls.append(url_Part_1 + str(pageId))
            for url in urls:
                print(' ', url)
                resp = requests.get(url)
                resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
                bs = BeautifulSoup(resp.text, "lxml")
                a_list = bs.find_all('a')
                for a in a_list:
                    if 'href' in a.attrs and 'target' in a.attrs and 'title' in a.attrs \
                            and a['href'].find('http://company.cnstock.com/company/') != -1 \
                            and a.parent.find('span'):
                        # Skip articles already stored in MongoDB.
                        if a['href'] not in AddressLst:
                            date, article = self.getUrlInfo(a['href'])
                            while article == '' and self.Prob >= .1:
                                self.Prob -= .1
                                date, article = self.getUrlInfo(a['href'])
                            self.Prob = .5
                            if article != '':
                                data = {'Date': date,
                                        'Address': a['href'],
                                        'Title': a['title'],
                                        'Article': article}
                                self._collection.insert_one(data)

    def CrawlRealtimeCompanyNews(self, url_part_lst):
        '''Crawl the first page of each listed section, store articles not
        seen before, and return the new documents as 'title article' strings.
        '''
        doc_lst = []
        self.ConnDB()
        self._AddressLst = self.extractData(['Address'])[0]
        for url_Part in url_part_lst:
            url = url_Part + str(1)
            resp = requests.get(url)
            resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
            bs = BeautifulSoup(resp.text, "lxml")
            a_list = bs.find_all('a')
            if len(self.realtimeNewsURL) == 0:
                # First pass of this session: dedup against MongoDB only.
                for a in a_list:
                    if ('href' in a.attrs and 'target' in a.attrs and 'title' in a.attrs \
                            and a['href'].find('http://company.cnstock.com/company/') != -1 \
                            and a.parent.find('span')) or ('href' in a.attrs and 'target' in a.attrs \
                            and 'title' in a.attrs and a['href'].find('http://ggjd.cnstock.com/company/') != -1 \
                            and a.parent.find('span')):
                        if a['href'] not in self._AddressLst:
                            self.realtimeNewsURL.append(a['href'])
                            date, article = self.getUrlInfo(a['href'])
                            while article == '' and self.Prob >= .1:
                                self.Prob -= .1
                                date, article = self.getUrlInfo(a['href'])
                            self.Prob = .5
                            if article != '':
                                data = {'Date': date,
                                        'Address': a['href'],
                                        'Title': a['title'],
                                        'Article': article}
                                self._collection.insert_one(data)
                                doc_lst.append(a['title'] + ' ' + article)
                                print(' [' + date + '] ' + a['title'])
            else:
                # Later passes: also dedup against URLs seen this session.
                for a in a_list:
                    if ('href' in a.attrs and 'target' in a.attrs and 'title' in a.attrs \
                            and a['href'].find('http://company.cnstock.com/company/') != -1 \
                            and a.parent.find('span')) or ('href' in a.attrs and 'target' in a.attrs \
                            and 'title' in a.attrs and a['href'].find('http://ggjd.cnstock.com/company/') != -1 \
                            and a.parent.find('span')):
                        if a['href'] not in self.realtimeNewsURL and a['href'] not in self._AddressLst:
                            self.realtimeNewsURL.append(a['href'])
                            date, article = self.getUrlInfo(a['href'])
                            while article == '' and self.Prob >= .1:
                                self.Prob -= .1
                                date, article = self.getUrlInfo(a['href'])
                            self.Prob = .5
                            if article != '':
                                data = {'Date': date,
                                        'Address': a['href'],
                                        'Title': a['title'],
                                        'Article': article}
                                self._collection.insert_one(data)
                                doc_lst.append(a['title'] + ' ' + article)
                                print(' [' + date + '] ' + a['title'])
        return doc_lst

    def extractData(self, tag_list):
        '''Return, for each tag, the list of distinct values stored in the
        collection. (Replaces the original exec()-based implementation with
        a direct comprehension -- same results, no dynamic code execution.)
        '''
        return [self._collection.distinct(tag) for tag in tag_list]

    def coroutine_run(self, totalPages, Range, initPageID, **kwarg):
        '''Crawl history pages concurrently with gevent coroutines.'''
        jobs = []
        page_ranges_lst = self.GenPagesLst(totalPages, Range, initPageID)
        for page_range in page_ranges_lst:
            jobs.append(gevent.spawn(self.CrawlHistoryCompanyNews, page_range[0], page_range[1], kwarg['url_Part_1']))
        gevent.joinall(jobs)

    def multi_threads_run(self, **kwarg):
        '''Crawl history pages with a thread pool.

        Expects 'totalPages', 'Range', 'initPageID' and 'url_Part_1' in kwarg
        (mirroring coroutine_run).
        '''
        # BUGFIX: GenPagesLst() was called without its required arguments and
        # CrawlHistoryCompanyNews was submitted without url_Part_1, so this
        # method always raised TypeError.
        page_ranges_lst = self.GenPagesLst(kwarg['totalPages'], kwarg['Range'], kwarg['initPageID'])
        print(' Using ' + str(self.ThreadsNum) + ' threads for collecting news ... ')
        with futures.ThreadPoolExecutor(max_workers=self.ThreadsNum) as executor:
            future_to_url = {executor.submit(self.CrawlHistoryCompanyNews, page_range[0], page_range[1], kwarg['url_Part_1']): \
                             ind for ind, page_range in enumerate(page_ranges_lst)}

    def classifyRealtimeStockNews(self):
        '''Crawl and classify fresh news every 60 seconds (runs forever).'''
        while True:
            print(' * start crawling news from CNSTOCK ... ')
            doc_list = self.CrawlRealtimeCompanyNews(['http://company.cnstock.com/company/scp_gsxw/',\
                                                      'http://ggjd.cnstock.com/gglist/search/qmtbbdj/',\
                                                      'http://ggjd.cnstock.com/gglist/search/ggkx/'])
            print(' * finish crawling ... ')
            if len(doc_list) != 0:
                self.tm.classifyRealtimeStockNews(doc_list)
            time.sleep(60)
================================================
FILE: legacy_v1/Crawler/crawler_jrj.py
================================================
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 3 13:41:50 2018
@author: Damon Li
"""
import time, re, requests, datetime
from concurrent import futures
from bs4 import BeautifulSoup
from pymongo import MongoClient
import Text_Analysis.text_mining as tm
from bson.objectid import ObjectId
import gevent
from gevent import monkey,pool
monkey.patch_all()
class WebCrawlFromjrj(object):
    '''Crawl company news from the JRJ daily news index pages
    ('http://stock.jrj.com.cn/xwk/...').
    (The original docstring referenced a sina.com.cn URL, which this class
    never touches.)

    # Arguments:
        startDate: First calendar date (YYYY-MM-DD) to crawl.
        endDate: Last calendar date (YYYY-MM-DD) to crawl.
        Range: Number of dates handled per coroutine/thread.
        ThreadsNum: Number of threads to start for history crawling.
        dbName: Name of the MongoDB database.
        collectionName: Name of the MongoDB collection.
        IP: MongoDB host address.
        PORT: MongoDB port number.
    '''

    def __init__(self, *arg, **kwarg):
        self.startDate = arg[0]
        self.endDate = arg[1]
        self.Range = arg[2]
        self.ThreadsNum = kwarg['ThreadsNum']
        self.dbName = kwarg['dbName']
        self.colName = kwarg['collectionName']
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        # Minimum Chinese-character ratio a <p> tag must reach to be
        # treated as article body text (relaxed on retries).
        self.Prob = .5
        # URLs already seen during realtime crawling (session-level dedup).
        self.realtimeNewsURL = []
        self.tm = tm.TextMining(IP="localhost", PORT=27017)

    def getEveryDay(self, begin_date, end_date):
        '''Return every calendar date between begin_date and end_date
        (inclusive) as 'YYYY-MM-DD' strings.
        '''
        date_list = []
        begin_date = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
        end_date = datetime.datetime.strptime(end_date, "%Y-%m-%d")
        while begin_date <= end_date:
            date_list.append(begin_date.strftime("%Y-%m-%d"))
            begin_date += datetime.timedelta(days=1)
        return date_list

    def countchn(self, string):
        '''Count characters in the CJK-ish range and return
        (count, count / total_length).

        # Arguments:
            string: Each part of a crawled page stringified by BeautifulSoup.
        '''
        # NOTE(review): the trailing 'h' in the character class looks
        # accidental; kept as-is for behavior compatibility.
        pattern = re.compile(u'[\u1100-\uFFFDh]+?')
        result = pattern.findall(string)
        chnnum = len(result)
        total = len(str(string))
        if total == 0:
            # BUGFIX: avoid ZeroDivisionError on empty input.
            return (0, 0.0)
        possible = chnnum / total
        return (chnnum, possible)

    def getUrlInfo(self, url, specificDate):
        '''Fetch one article page and return (date, article, NotFoundPage).

        Falls back to specificDate when no release date is found; sets
        NotFoundPage when the page is a JRJ 404 placeholder.
        (Removed an unused 'meta_list' local from the original.)
        '''
        respond = requests.get(url)
        respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
        bs = BeautifulSoup(respond.text, "lxml")
        span_list = bs.find_all('span')
        part = bs.find_all('p')
        article = ''
        date = ''
        NotFoundPage = False
        # The release date is wrapped in a span whose first child is the
        # marker string 'jrj_final_date_start'.
        # NOTE(review): only the first child of each span is inspected --
        # confirm against the live page structure before changing.
        for span in span_list:
            for child in span.children:
                if child == 'jrj_final_date_start':
                    date = span.text.replace('\r', '').replace('\n', '')
                    if date.find('年') != -1:
                        date = date.replace('年', '-').replace('月', '-').replace('日', '')
                    break
                break
        if date == '':
            date = specificDate
        # JRJ serves a soft-404 page containing this marker text.
        for p in part:
            if p.text.find('页面没有找到') != -1:
                NotFoundPage = True
                break
        if not NotFoundPage:
            # Keep only paragraphs that are mostly Chinese text, then strip
            # residual '<...>' fragments and full-width spaces.
            for paragraph in part:
                chnstatus = self.countchn(str(paragraph))
                possible = chnstatus[1]
                if possible > self.Prob:
                    article += str(paragraph)
            while article.find('<') != -1 and article.find('>') != -1:
                string = article[article.find('<'):article.find('>')+1]
                article = article.replace(string, '')
            while article.find('\u3000') != -1:
                article = article.replace('\u3000', '')
            article = ' '.join(re.split(' +|\n+', article)).strip()
        return date, article, NotFoundPage

    def GenDatesLst(self):
        '''Split the [startDate, endDate] date list into sublists of at most
        Range dates each.
        '''
        DatesLst = self.getEveryDay(self.startDate, self.endDate)
        NewDatesLst = []
        k = 0
        while k < len(DatesLst):
            if k + self.Range >= len(DatesLst):
                break
            else:
                NewDatesLst.append(DatesLst[k:k+self.Range])
                k += self.Range
        # The remainder (always non-empty here) becomes the last sublist.
        NewDatesLst.append(DatesLst[k:])
        return NewDatesLst

    def findPagesOfSpecificDate(self, firstUrl, date):
        '''Return the number of index pages JRJ lists for a given date.

        # Arguments:
            firstUrl: The first index page of the date.
            date: Designated date ('YYYY-MM-DD').
        '''
        respond = requests.get(firstUrl)
        respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
        bs = BeautifulSoup(respond.text, "lxml")
        a_list = bs.find_all('a')
        Nums = 1
        # Pagination links embed the compact date followed by '_<page>'.
        for a in a_list:
            if 'href' in a.attrs and 'target' in a.attrs:
                if a['href'].find(date.replace('-', '') + '_') != -1 and a.text.isdigit():
                    Nums += 1
        return Nums

    def CrawlRealtimeCompanyNews(self, today_Date):
        '''Crawl today's index pages, store articles not seen before, and
        return the new documents as 'title article' strings.
        '''
        doc_lst = []
        if len(self.realtimeNewsURL) == 0:
            # First pass of this session: connect and load known addresses.
            self.ConnDB()
            self._AddressLst = self.extractData(['Address'])[0]
            urlsAndDates = []
            url_Part_1 = 'http://stock.jrj.com.cn/xwk/'
            url_Part_2 = '_1.shtml'
            firstUrl = url_Part_1 + today_Date.replace('-','')[0:6] + '/' + today_Date.replace('-','') + url_Part_2
            Nums = self.findPagesOfSpecificDate(firstUrl, today_Date)
            for num in range(1, Nums+1):
                urlsAndDates.append((url_Part_1 + today_Date.replace('-','')[0:6] + '/' + today_Date.replace('-','') \
                                     + '_' + str(num) + '.shtml', today_Date))
            for url, specificDate in urlsAndDates:
                resp = requests.get(url)
                resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
                bs = BeautifulSoup(resp.text, "lxml")
                a_list = bs.find_all('a')
                for a in a_list:
                    # Article links contain '/<yyyy>/<mm>/' for the date.
                    if 'href' in a.attrs and a.string and \
                            a['href'].find('/' + specificDate.replace('-','')[0:4] + '/' + specificDate.replace('-','')[4:6] + '/') != -1:
                        if a['href'] not in self._AddressLst:
                            self.realtimeNewsURL.append(a['href'])
                            date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                            while article == '' and self.Prob >= .1 and not NotFoundPage:
                                self.Prob -= .1
                                date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                            self.Prob = .5
                            if article != '':
                                data = {'Date': date,
                                        'Address': a['href'],
                                        'Title': a.string,
                                        'Article': article}
                                self._collection.insert_one(data)
                                doc_lst.append(a.string + ' ' + article)
                                print(' [' + date + '] ' + a.string)
        else:
            # Later passes: also dedup against URLs seen this session.
            urlsAndDates = []
            url_Part_1 = 'http://stock.jrj.com.cn/xwk/'
            url_Part_2 = '_1.shtml'
            firstUrl = url_Part_1 + today_Date.replace('-','')[0:6] + '/' + today_Date.replace('-','') + url_Part_2
            Nums = self.findPagesOfSpecificDate(firstUrl, today_Date)
            for num in range(1, Nums+1):
                urlsAndDates.append((url_Part_1 + today_Date.replace('-','')[0:6] + '/' + today_Date.replace('-','') \
                                     + '_' + str(num) + '.shtml', today_Date))
            for url, specificDate in urlsAndDates:
                resp = requests.get(url)
                resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
                bs = BeautifulSoup(resp.text, "lxml")
                a_list = bs.find_all('a')
                for a in a_list:
                    if 'href' in a.attrs and a.string and \
                            a['href'].find('/' + specificDate.replace('-','')[0:4] + '/' + specificDate.replace('-','')[4:6] + '/') != -1:
                        if a['href'] not in self._AddressLst and a['href'] not in self.realtimeNewsURL:
                            self.realtimeNewsURL.append(a['href'])
                            date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                            while article == '' and self.Prob >= .1 and not NotFoundPage:
                                self.Prob -= .1
                                date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                            self.Prob = .5
                            if article != '':
                                data = {'Date': date,
                                        'Address': a['href'],
                                        'Title': a.string,
                                        'Article': article}
                                self._collection.insert_one(data)
                                doc_lst.append(a.string + ' ' + article)
                                print(' [' + date + '] ' + a.string)
        return doc_lst

    def CrawlHistoryCompanyNews(self, datelst):
        '''Crawl historical company news for each date in datelst.

        On a fresh collection every article is stored; otherwise only URLs
        absent from the stored 'Address' column are fetched.
        '''
        self.ConnDB()
        AddressLst = self.extractData(['Address'])[0]
        if AddressLst == []:
            urlsAndDates = []
            url_Part_1 = 'http://stock.jrj.com.cn/xwk/'
            url_Part_2 = '_1.shtml'
            for date in datelst:
                firstUrl = url_Part_1 + date.replace('-','')[0:6] + '/' + date.replace('-','') + url_Part_2
                Nums = self.findPagesOfSpecificDate(firstUrl, date)
                for num in range(1, Nums+1):
                    urlsAndDates.append((url_Part_1 + date.replace('-','')[0:6] + '/' + date.replace('-','') \
                                         + '_' + str(num) + '.shtml', date))
            for url, specificDate in urlsAndDates:
                print(url)
                resp = requests.get(url)
                resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
                bs = BeautifulSoup(resp.text, "lxml")
                a_list = bs.find_all('a')
                for a in a_list:
                    if 'href' in a.attrs and a.string and \
                            a['href'].find('/' + specificDate.replace('-','')[0:4] + '/' + specificDate.replace('-','')[4:6] + '/') != -1:
                        date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                        while article == '' and self.Prob >= .1 and not NotFoundPage:
                            self.Prob -= .1
                            date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                        self.Prob = .5
                        if article != '':
                            data = {'Date': date,
                                    'Address': a['href'],
                                    'Title': a.string,
                                    'Article': article}
                            self._collection.insert_one(data)
        else:
            urlsAndDates = []
            url_Part_1 = 'http://stock.jrj.com.cn/xwk/'
            url_Part_2 = '_1.shtml'
            for date in datelst:
                firstUrl = url_Part_1 + date.replace('-','')[0:6] + '/' + date.replace('-','') + url_Part_2
                Nums = self.findPagesOfSpecificDate(firstUrl, date)
                for num in range(1, Nums+1):
                    urlsAndDates.append((url_Part_1 + date.replace('-','')[0:6] + '/' + date.replace('-','') \
                                         + '_' + str(num) + '.shtml', date))
            for url, specificDate in urlsAndDates:
                print(' ', url)
                resp = requests.get(url)
                resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
                bs = BeautifulSoup(resp.text, "lxml")
                a_list = bs.find_all('a')
                for a in a_list:
                    if 'href' in a.attrs and a.string and \
                            a['href'].find('/' + specificDate.replace('-','')[0:4] + '/' + specificDate.replace('-','')[4:6] + '/') != -1:
                        if a['href'] not in AddressLst:
                            date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                            while article == '' and self.Prob >= .1 and not NotFoundPage:
                                self.Prob -= .1
                                date, article, NotFoundPage = self.getUrlInfo(a['href'], specificDate)
                            self.Prob = .5
                            if article != '':
                                data = {'Date': date,
                                        'Address': a['href'],
                                        'Title': a.string,
                                        'Article': article}
                                self._collection.insert_one(data)

    def ConnDB(self):
        '''Connect to MongoDB and cache the target collection.'''
        Conn = MongoClient(self.IP, self.PORT)
        db = Conn[self.dbName]
        self._collection = db.get_collection(self.colName)

    def extractData(self, tag_list):
        '''Return, for each tag, the list of distinct values stored in the
        collection. (Replaces the original exec()-based implementation with
        a direct comprehension -- same results, no dynamic code execution.)
        '''
        return [self._collection.distinct(tag) for tag in tag_list]

    def StockCodeDuplicateRemoval(self):
        '''Discarded. One-off cleanup of duplicate stock codes.

        NOTE(review): uses the deprecated pymongo Collection.update(); kept
        verbatim since the method is explicitly discarded.
        '''
        Conn = MongoClient(self.IP, self.PORT)
        db = Conn[self.dbName]
        collection = db.get_collection(self.colName)
        idLst = collection.distinct('_id')
        relevantStockSeries = []
        for _id in idLst:
            data = collection.find_one({'_id': ObjectId(_id)})
            if 'relevantStock' in data.keys():
                relevantStock = collection.find_one({'_id': ObjectId(_id)})['relevantStock']
                if len(relevantStock) > 1:
                    relevantStockCodeDuplicateRemoval = list(set(relevantStock))
                    collection.update({"_id": _id}, {"$set": {"relevantStock": ' '.join(relevantStockCodeDuplicateRemoval)}})
                    print(relevantStockCodeDuplicateRemoval)
                    break
                if len(relevantStock) == 1:
                    print(relevantStock)
                    print(len(relevantStock))
                    break
        print('Duplicate Removal successfully ... ')

    def coroutine_run(self):
        '''Crawl history dates concurrently with gevent coroutines.'''
        jobs = []
        dateLst = self.GenDatesLst()
        for datelst in dateLst:
            jobs.append(gevent.spawn(self.CrawlHistoryCompanyNews, datelst))
        gevent.joinall(jobs)

    def multi_threads_run(self, **kwarg):
        '''Crawl history dates with a thread pool.'''
        dateLst = self.GenDatesLst()
        print(' Using ' + str(self.ThreadsNum) + ' threads for collecting news ... ')
        with futures.ThreadPoolExecutor(max_workers=self.ThreadsNum) as executor:
            future_to_url = {executor.submit(self.CrawlHistoryCompanyNews, datelst): \
                             ind for ind, datelst in enumerate(dateLst)}

    def classifyRealtimeStockNews(self):
        '''Crawl and classify fresh news every 60 seconds (runs forever).'''
        today_Date = datetime.datetime.now().strftime('%Y-%m-%d')
        while True:
            print(' * start crawling news from JRJ ... ')
            doc_list = self.CrawlRealtimeCompanyNews(today_Date)
            print(' * finish crawling ... ')
            if len(doc_list) != 0:
                self.tm.classifyRealtimeStockNews(doc_list)
            time.sleep(60)
================================================
FILE: legacy_v1/Crawler/crawler_nbd.py
================================================
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 23 17:19:50 2018
@author: Damon Li
"""
import re, os, time, requests
from bs4 import BeautifulSoup
import pymongo, threading, traceback
import gevent
from gevent import monkey,pool
monkey.patch_all()
class WebCrawlFromNBD(object):
'''Crawl company news from 'http://stocks.nbd.com.cn/columns/275' website.
# Arguments:
totalPages: Number of pages set to be crawled.
Range: Divide total web pages into totalPages/Range parts
for multi-threading processing.
ThreadsNum: Number of threads needed to be start.
dbName: Name of database.
colName: Name of collection.
IP: Local IP address.
PORT: Port number corresponding to IP address.
'''
def __init__(self,*arg,**kwarg):
self.totalPages = arg[0] #totalPages
self.Range = arg[1] #Range
self.ThreadsNum = kwarg['ThreadsNum']
self.dbName = kwarg['dbName']
self.colName = kwarg['collectionName']
self.IP = kwarg['IP']
self.PORT = kwarg['PORT']
self.url_lst_withoutArticles = []
self.title_lst_withoutArticles = []
self.url_lst_withoutNews = []
self.CrawledUrlsID = []
self.filePath = os.path.dirname(os.path.realpath(__file__))
def countchn(self,string):
'''Count Chinese numbers and calculate the frequency of Chinese occurrence.
# Arguments:
string: Each part of crawled website analyzed by BeautifulSoup.
'''
pattern = re.compile(u'[\u1100-\uFFFDh]+?')
result = pattern.findall(string)
chnnum = len(result)
possible = chnnum/len(str(string))
return (chnnum, possible)
def getUrlInfo(self,url):
    '''Fetch one article page and extract its body text and publish time.

    Returns (article, date) -- note the reversed order compared with the
    other crawlers in this package.
    '''
    respond = requests.get(url)
    # Let BeautifulSoup sniff the real encoding before re-decoding.
    respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
    bs = BeautifulSoup(respond.text, "lxml")
    span_list = bs.find_all('span')
    part = bs.find_all('p')
    article = ''
    date = ''
    # The publish time lives in <span class="time">; its tokens are split
    # into a date part (contains '-') and a clock part (contains ':').
    for span in span_list:
        if 'class' in span.attrs and span.text and span['class'] == ['time']:
            string = span.text.split()
            for dt in string:
                if dt.find('-') != -1:
                    date += dt + ' '
                elif dt.find(':') != -1:
                    date += dt
            break
    # Keep only paragraphs whose Chinese-character ratio exceeds 0.5
    # (hard-coded here, unlike the self.Prob threshold used elsewhere).
    for paragraph in part:
        chnstatus = self.countchn(str(paragraph))
        possible = chnstatus[1]
        if possible > 0.5:
            article += str(paragraph)
    # Strip residual '<...>' tag fragments and full-width spaces.
    while article.find('<') != -1 and article.find('>') != -1:
        string = article[article.find('<'):article.find('>')+1]
        article = article.replace(string,'')
    while article.find('\u3000') != -1:
        article = article.replace('\u3000','')
    article = ' '.join(re.split(' +|\n+', article)).strip()
    return article, date
def GenPagesLst(self):
'''Generate page number list using Range parameter.
'''
PageLst = []
k = 1
while k+self.Range-1 <= self.totalPages:
PageLst.append((k,k+self.Range-1))
k += self.Range
if k+self.Range-1 < self.totalPages:
PageLst.append((k,self.totalPages))
return PageLst
def ReCrawlNews(self,url_list):
    '''Retry crawling index pages that previously returned no links.

    Loops until url_list is drained; after 10 consecutive attempts on the
    same URL it sleeps 1s before retrying. Returns the accumulated lists
    of article URLs/titles that still lack a body.

    # Arguments:
        url_list: List of web pages that returned no values.
    '''
    try:
        nums = 1
        # Attempt history; repeated tail entries mean the same URL is
        # failing back-to-back.
        ulst = []
        while url_list != []:
            ulst.append(url_list[0])
            print(' ', url_list[0])
            if nums > 10:
                print(' wait 1s before request url again ...')
                time.sleep(1)
                nums = 1
            resp = requests.get(url_list[0])
            resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
            bs = BeautifulSoup(resp.text, "lxml")
            a_list = bs.find_all('a')
            if a_list != []:
                for a in a_list:
                    # Article anchors carry a click-statistic attribute and
                    # link into nbd.com.cn/articles/.
                    if 'click-statistic' in a.attrs and a.string \
                            and a['click-statistic'].find('Article_') != -1 \
                            and a['href'].find('http://www.nbd.com.cn/articles/') != -1:
                        article, date = self.getUrlInfo(a['href'])
                        if date == '' or article == '':
                            # Body extraction failed: queue for ReCrawlArticles.
                            self.url_lst_withoutArticles.append(a['href'])
                            self.title_lst_withoutArticles.append(a.string)
                        elif date != '' and article != '':
                            data = {'date' : date,
                                    'address' : a['href'],
                                    'title' : a.string,
                                    'Article' : article}
                            self.collection.insert_one(data)
                # NOTE(review): indentation reconstructed -- the page is
                # marked done and dequeued only after its links were parsed;
                # confirm against the original module.
                self.CrawledUrlsID.append(int(url_list[0].split('/')[-1]))
                url_list.remove(url_list[0])
            if len(ulst) >= 2 and ulst[-1] == ulst[-2]:
                nums += 1
        return self.url_lst_withoutArticles, self.title_lst_withoutArticles
    except Exception:
        traceback.print_exc()
def ReCrawlArticles(self,url_list,title_list):
    '''Retry individual article URLs whose main body was not extracted.

    # Arguments:
        url_list: List of urls without getting any articles(main body).
            Entries are removed in place as extraction succeeds.
        title_list: Titles parallel to url_list; removed in lock-step.

    Returns:
        None. Successful articles are inserted into self.collection.

    NOTE(review): a URL that never yields an article is retried
    indefinitely, throttled to a 1 s sleep every ~10 attempts —
    confirm this busy-retry is intended.
    '''
    nums = 1      # consecutive attempts of the same URL (drives back-off)
    ulst = []     # history of attempted URLs, used to detect repeats
    while url_list != []:
        ulst.append(url_list[0])
        print(' ', url_list[0])
        # Back off for one second after ~10 consecutive retries.
        if nums > 10:
            print(' wait 1s before request url again ...')
            time.sleep(1)
            nums = 1
        article, date = self.getUrlInfo(url_list[0])
        if date != '' and article != '':
            data = {'date' : date,
                    'address' : url_list[0],
                    'title' : title_list[0],
                    'Article' : article}
            print(' remove ' + url_list[0] + ' successfully ... ')
            # Drop the head of both parallel lists, then persist.
            url_list.remove(url_list[0])
            title_list.remove(title_list[0])
            self.collection.insert_one(data)
        # Same URL attempted twice in a row -> count towards back-off.
        if len(ulst) >= 2 and ulst[-1] == ulst[-2]:
            nums += 1
def CrawlCompanyNews(self,startPage,endPage):
    '''Crawl historical company news listing pages [startPage, endPage].

    Article URLs already stored in MongoDB are skipped. Pages that
    return no anchors are queued on self.url_lst_withoutNews; articles
    whose body/date could not be extracted are queued on
    self.url_lst_withoutArticles / self.title_lst_withoutArticles.

    # Arguments:
        startPage: first listing-page number (inclusive).
        endPage: last listing-page number (inclusive).
    '''
    self.ConnDB()
    AddressLst = self.extractData(['address'])[0]
    # Refactor: the original duplicated the entire crawl loop for the
    # "empty DB" and "non-empty DB" cases; with an empty AddressLst the
    # membership test below is a no-op, so one loop covers both.
    url_Part = 'http://stocks.nbd.com.cn/columns/275/page/'
    urls = [url_Part + str(pageId) for pageId in range(startPage, endPage + 1)]
    for url in urls:
        if AddressLst == []:
            print(url)
        else:
            print(' ', url)
        resp = requests.get(url)
        # Re-decode with the encoding BeautifulSoup detected from raw bytes.
        resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
        bs = BeautifulSoup(resp.text, "lxml")
        a_list = bs.find_all('a')
        if a_list == []:
            # Listing page came back empty; remember it for ReCrawlNews.
            self.url_lst_withoutNews.append(url)
            continue
        for a in a_list:
            # Keep only anchors that are nbd.com.cn article links.
            if 'click-statistic' in a.attrs and a.string \
                and a['click-statistic'].find('Article_') != -1 \
                and a['href'].find('http://www.nbd.com.cn/articles/') != -1:
                if a['href'] in AddressLst:
                    continue  # already stored in MongoDB
                article, date = self.getUrlInfo(a['href'])
                if date == '' or article == '':
                    # Extraction failed; queue for ReCrawlArticles.
                    self.url_lst_withoutArticles.append(a['href'])
                    self.title_lst_withoutArticles.append(a.string)
                else:
                    data = {'date' : date,
                            'address' : a['href'],
                            'title' : a.string,
                            'Article' : article}
                    self.collection.insert_one(data)
        self.CrawledUrlsID.append(int(url.split('/')[-1]))
def ConnDB(self):
    '''Open a MongoDB client and bind the target collection to self.collection.'''
    mongo_client = pymongo.MongoClient(self.IP, self.PORT)
    database = mongo_client[self.dbName]
    self.collection = database.get_collection(self.colName)
def extractData(self,tag_list):
    '''Return the distinct values of each named field in self.collection.

    # Arguments:
        tag_list: iterable of MongoDB field names.

    Returns:
        A list with one entry per tag: the list of distinct values of
        that field, in the same order as tag_list.
    '''
    # The original built local variable names with exec(), which is both
    # unsafe (the tag string is interpolated into source code) and
    # unnecessary -- a direct call per tag is equivalent.
    return [self.collection.distinct(tag) for tag in tag_list]
def single_run(self):
    '''Crawl every generated page range sequentially in the current thread.

    Returns the list of listing pages that produced no anchors.
    '''
    for start_page, end_page in self.GenPagesLst():
        self.CrawlCompanyNews(start_page, end_page)
    return self.url_lst_withoutNews
def multi_threads_run(self):
    '''Crawl page ranges concurrently, spawning one thread per range.

    Returns the list of listing pages that produced no anchors.
    '''
    workers = [
        threading.Thread(target=self.CrawlCompanyNews, args=page_range)
        for page_range in self.GenPagesLst()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return self.url_lst_withoutNews
def coroutine_run(self):
    '''Crawl page ranges concurrently using gevent greenlets.

    Returns the list of listing pages that produced no anchors.
    '''
    greenlets = [
        gevent.spawn(self.CrawlCompanyNews, page_range[0], page_range[1])
        for page_range in self.GenPagesLst()
    ]
    gevent.joinall(greenlets)
    return self.url_lst_withoutNews
================================================
FILE: legacy_v1/Crawler/crawler_sina.py
================================================
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 22 10:01:40 2018
@author: Damon Li
"""
import time, re, requests
from concurrent import futures
from bs4 import BeautifulSoup
from pymongo import MongoClient
import Text_Analysis.text_mining as tm
import gevent
from gevent import monkey,pool
monkey.patch_all()
class WebCrawlFromSina(object):
    '''Crawl company news from 'http://roll.finance.sina.com.cn/finance/zq1/ssgs/index.shtml' website.

    # Arguments:
        totalPages: Number of pages set to be crawled(int type).
        Range: Divide total web pages into totalPages/Range parts
               for multi-threading processing(int type).
        ThreadsNum: Number of threads needed to be start(int type).
        dbName: Name of database(string type).
        colName: Name of collection(string type).
        IP: Local IP address(string type).
        PORT: Port number corresponding to IP address(int type).
    '''
    def __init__(self,*arg,**kwarg):
        self.totalPages = arg[0]
        self.Range = arg[1]
        self.ThreadsNum = kwarg['ThreadsNum']
        self.dbName = kwarg['dbName']
        self.colName = kwarg['collectionName']
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        # Bug fix: this attribute was misspelled 'Porb' while every method
        # reads and writes 'self.Prob', so the threshold-lowering retry
        # loops raised AttributeError at runtime.
        self.Prob = .5  # min Chinese-character ratio for a <p> to count as body text
        self.realtimeNewsURL = []
        self.tm = tm.TextMining(IP="localhost",PORT=27017)

    def countchn(self,string):
        '''Count Chinese characters and their frequency within a string.

        # Arguments:
            string: Each part of crawled website analyzed by BeautifulSoup.

        Returns:
            (chnnum, possible): number of matched characters and their
            ratio to the total string length.
        '''
        pattern = re.compile(u'[\u1100-\uFFFDh]+?')
        result = pattern.findall(string)
        chnnum = len(result)
        possible = chnnum/len(str(string))
        return (chnnum, possible)

    def getUrlInfo(self,url):
        '''Fetch one article page and extract its useful information.

        Returns:
            (summary, keyWords, date, stockCodeLst, article); any part
            that could not be found is an empty string.
        '''
        respond = requests.get(url)
        # Re-decode with the encoding BeautifulSoup detected from raw bytes.
        respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
        bs = BeautifulSoup(respond.text, "lxml")
        meta_list = bs.find_all('meta')
        span_list = bs.find_all('span')
        part = bs.find_all('p')
        article = ''
        date = ''
        summary = ''
        keyWords = ''
        stockCodeLst = ''
        # Summary and keywords come from the page's <meta> tags.
        for meta in meta_list:
            if 'name' in meta.attrs and meta['name'] == 'description':
                summary = meta['content']
            elif 'name' in meta.attrs and meta['name'] == 'keywords':
                keyWords = meta['content']
            if summary != '' and keyWords != '':
                break
        # Publication date appears either in a date/time-source <span>
        # or in the span with id 'pub_date'; normalize 年/月/日 to '-'.
        for span in span_list:
            if 'class' in span.attrs:
                if span['class'] == ['date'] or span['class'] == ['time-source']:
                    string = span.text.split()
                    for dt in string:
                        if dt.find('年') != -1:
                            date += dt.replace('年','-').replace('月','-').replace('日',' ')
                        elif dt.find(':') != -1:
                            date += dt
                    break
            if 'id' in span.attrs and span['id'] == 'pub_date':
                string = span.text.split()
                for dt in string:
                    if dt.find('年') != -1:
                        date += dt.replace('年','-').replace('月','-').replace('日',' ')
                    elif dt.find(':') != -1:
                        date += dt
                break
        # Related stock codes are embedded in span ids like 'stock_sh600000'.
        for span in span_list:
            if 'id' in span.attrs and span['id'].find('stock_') != -1:
                stockCodeLst += span['id'][8:] + ' '
        # Prob is the minimum Chinese-character ratio a paragraph must
        # reach to be considered part of the article's main body.
        for paragraph in part:
            chnstatus = self.countchn(str(paragraph))
            possible = chnstatus[1]
            if possible > self.Prob:
                article += str(paragraph)
        # Strip residual HTML tags; bail out if the scan loops too long.
        time1 = time.time()
        while article.find('<') != -1 and article.find('>') != -1:
            string = article[article.find('<'):article.find('>')+1]
            article = article.replace(string,'')
            time2 = time.time()
            if time2 - time1 > 60:
                print(' [*] 循环超时60s,跳出循环 ... ')
                break
        # Remove full-width spaces, with the same 60 s safety valve.
        time1 = time.time()
        while article.find('\u3000') != -1:
            article = article.replace('\u3000','')
            time2 = time.time()
            if time2 - time1 > 60:
                print(' [*] 循环超时60s,跳出循环 ... ')
                break
        article = ' '.join(re.split(' +|\n+', article)).strip()
        return summary, keyWords, date, stockCodeLst, article

    def GenPagesLst(self):
        '''Split pages 1..totalPages into (start, end) ranges of width Range.'''
        PageLst = []
        k = 1
        while k+self.Range-1 <= self.totalPages:
            PageLst.append((k,k+self.Range-1))
            k += self.Range
        # Bug fix: the old guard `k+self.Range-1 < self.totalPages` was
        # always False after the loop, dropping the trailing partial range.
        if k <= self.totalPages:
            PageLst.append((k,self.totalPages))
        return PageLst

    def CrawlRealtimeCompanyNews(self,firstPage):
        '''Crawl the first listing page once and extract new articles.

        Extracts summary, keywords, release date, related stock codes
        and main body for each unseen article; stores them in MongoDB.

        Returns:
            List of 'title summary article' strings for the new items.
        '''
        doc_lst = []
        if len(self.realtimeNewsURL) == 0:
            # First call: connect and load already-stored article addresses.
            self.ConnDB()
            self._AddressLst = self.extractData(['Address'])[0]
        # Refactor: the original duplicated this loop for the first and
        # subsequent calls; the unified membership test below is
        # equivalent because realtimeNewsURL is empty on the first call.
        resp = requests.get(firstPage)
        resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
        bs = BeautifulSoup(resp.text, "lxml")
        a_list = bs.find_all('a')
        for a in a_list:
            if 'href' in a.attrs and a.string and \
                a['href'].find('http://finance.sina.com.cn/stock/s/') != -1:
                if a['href'] in self.realtimeNewsURL or a['href'] in self._AddressLst:
                    continue  # already seen this session or already stored
                self.realtimeNewsURL.append(a['href'])
                summary, keyWords, date, stockCodeLst, article = self.getUrlInfo(a['href'])
                # Retry with a progressively lower Chinese-ratio threshold
                # when no body text was extracted, then restore the default.
                while article == '' and self.Prob >= .1:
                    self.Prob -= .1
                    summary, keyWords, date, stockCodeLst, article = self.getUrlInfo(a['href'])
                self.Prob = .5
                if article != '':
                    data = {'Date' : date,
                            'Address' : a['href'],
                            'Title' : a.string,
                            'Keywords' : keyWords,
                            'Summary' : summary,
                            'Article' : article,
                            'RelevantStock' : stockCodeLst}
                    self._collection.insert_one(data)
                    doc_lst.append(a.string + ' ' + summary + ' ' + article)
                    print(' [' + date + '] ' + a.string)
        return doc_lst

    def CrawlHistoryCompanyNews(self,startPage,endPage):
        '''Crawl historical company news listing pages [startPage, endPage].

        Articles whose URL is already stored in MongoDB are skipped.
        '''
        self.ConnDB()
        AddressLst = self.extractData(['Address'])[0]
        # Refactor: one loop replaces the two duplicated branches; with an
        # empty AddressLst the membership test below is a no-op.
        urls = []
        url_Part_1 = 'http://roll.finance.sina.com.cn/finance/zq1/ssgs/index_'
        url_Part_2 = '.shtml'
        for pageId in range(startPage,endPage+1):
            urls.append(url_Part_1 + str(pageId) + url_Part_2)
        for url in urls:
            if AddressLst == []:
                print(url)
            else:
                print(' ', url)
            resp = requests.get(url)
            resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
            bs = BeautifulSoup(resp.text, "lxml")
            a_list = bs.find_all('a')
            for a in a_list:
                if 'href' in a.attrs and a.string and \
                    a['href'].find('http://finance.sina.com.cn/stock/s/') != -1:
                    if a['href'] in AddressLst:
                        continue  # already stored in MongoDB
                    summary, keyWords, date, stockCodeLst, article = self.getUrlInfo(a['href'])
                    # Lower the body-text threshold step by step on failure,
                    # then restore the default.
                    while article == '' and self.Prob >= .1:
                        self.Prob -= .1
                        summary, keyWords, date, stockCodeLst, article = self.getUrlInfo(a['href'])
                    self.Prob = .5
                    if article != '':
                        data = {'Date' : date,
                                'Address' : a['href'],
                                'Title' : a.string,
                                'Keywords' : keyWords,
                                'Summary' : summary,
                                'Article' : article,
                                'RelevantStock' : stockCodeLst}
                        self._collection.insert_one(data)

    def ConnDB(self):
        '''Open a MongoDB client and bind the target collection.'''
        Conn = MongoClient(self.IP, self.PORT)
        db = Conn[self.dbName]
        self._collection = db.get_collection(self.colName)

    def extractData(self,tag_list):
        '''Return the distinct values of each named field in the collection.'''
        # Replaces the original exec()-based variable construction, which
        # was unsafe and unnecessary.
        return [self._collection.distinct(tag) for tag in tag_list]

    def single_run(self):
        '''Crawl every page range sequentially in the current thread.'''
        for page_range in self.GenPagesLst():
            self.CrawlHistoryCompanyNews(page_range[0],page_range[1])

    def coroutine_run(self):
        '''Crawl page ranges concurrently using gevent greenlets.'''
        jobs = []
        page_ranges_lst = self.GenPagesLst()
        for page_range in page_ranges_lst:
            jobs.append(gevent.spawn(self.CrawlHistoryCompanyNews,page_range[0],page_range[1]))
        gevent.joinall(jobs)

    def multi_threads_run(self,**kwarg):
        '''Crawl page ranges concurrently with a thread pool of ThreadsNum workers.'''
        page_ranges_lst = self.GenPagesLst()
        print(' Using ' + str(self.ThreadsNum) + ' threads for collecting news ... ')
        # The `with` block waits for all submitted futures to finish.
        with futures.ThreadPoolExecutor(max_workers=self.ThreadsNum) as executor:
            future_to_url = {executor.submit(self.CrawlHistoryCompanyNews,page_range[0],page_range[1]) : \
                             ind for ind, page_range in enumerate(page_ranges_lst)}

    def classifyRealtimeStockNews(self):
        '''Crawl and classify fresh news every 60 seconds, forever.'''
        while True:
            print(' * start crawling news from SINA ... ')
            doc_list = self.CrawlRealtimeCompanyNews('http://roll.finance.sina.com.cn/finance/zq1/ssgs/index_1.shtml')
            print(' * finish crawling ... ')
            if len(doc_list) != 0:
                self.tm.classifyRealtimeStockNews(doc_list)
            time.sleep(60)
# Entry point: crawl up to 5000 listing pages in chunks of 100 via gevent
# coroutines and store the results in MongoDB (db "Sina_Stock",
# collection "sina_news_company") on localhost:27017.
if __name__ == '__main__':
    web_crawl_obj = WebCrawlFromSina(5000,100,ThreadsNum=4,IP="localhost",PORT=27017,\
                                     dbName="Sina_Stock",collectionName="sina_news_company")
    web_crawl_obj.coroutine_run() #web_crawl_obj.single_run() #web_crawl_obj.multi_threads_run()
================================================
FILE: legacy_v1/Crawler/crawler_stcn.py
================================================
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 3 13:41:50 2018
@author: Damon Li
"""
import time, re, requests, datetime
from concurrent import futures
from bs4 import BeautifulSoup
from pymongo import MongoClient
import Text_Analysis.text_mining as tm
import gevent
from gevent import monkey,pool
monkey.patch_all()
class WebCrawlFromstcn(object):
    '''Crawl company news from 'http://company.stcn.com/gsxw/1.shtml',
                               'http://stock.stcn.com/xingu/1.shtml',
                               'http://stock.stcn.com/zhuli/1.shtml',
                               'http://stock.stcn.com/bankuai/1.shtml',
                               'http://stock.stcn.com/dapan/1.shtml' website.

    # Arguments:
        ThreadsNum: Number of threads needed to be start.
        dbName: Name of database.
        colName: Name of collection.
        IP: Local IP address.
        PORT: Port number corresponding to IP address.
    '''
    def __init__(self,**kwarg):
        self.ThreadsNum = kwarg['ThreadsNum']
        self.dbName = kwarg['dbName']
        self.colName = kwarg['collectionName']
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        self.Prob = .5  # min Chinese-character ratio for a <p> to count as body text
        self.realtimeNewsURL = []
        self.tm = tm.TextMining(IP="localhost",PORT=27017)

    def countchn(self,string):
        '''Count Chinese characters and their frequency within a string.

        Returns:
            (chnnum, possible): number of matched characters and their
            ratio to the total string length.
        '''
        pattern = re.compile(u'[\u1100-\uFFFDh]+?')
        result = pattern.findall(string)
        chnnum = len(result)
        possible = chnnum/len(str(string))
        return (chnnum, possible)

    def getUrlInfo(self,url):
        '''Fetch one article page and extract its date and main body.

        Returns:
            (date, article); either may be '' when not found.
        '''
        respond = requests.get(url)
        # Re-decode with the encoding BeautifulSoup detected from raw bytes.
        respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
        bs = BeautifulSoup(respond.text, "lxml")
        div_list = bs.find_all('div')
        part = bs.find_all('p')
        article = ''
        date = ''
        # Publication date lives in the <div class="info"> block.
        for div in div_list:
            if 'class' in div.attrs and div['class'] == ['info']:
                date = div.text.split(' ')[0] + ' ' + div.text.split(' ')[1]
                break
        # Paragraphs whose Chinese-character ratio exceeds self.Prob are
        # treated as part of the article's main body.
        for paragraph in part:
            chnstatus = self.countchn(str(paragraph))
            possible = chnstatus[1]
            if possible > self.Prob:
                article += str(paragraph)
        # Strip residual HTML tags and full-width spaces.
        while article.find('<') != -1 and article.find('>') != -1:
            string = article[article.find('<'):article.find('>')+1]
            article = article.replace(string,'')
        while article.find('\u3000') != -1:
            article = article.replace('\u3000','')
        article = ' '.join(re.split(' +|\n+', article)).strip()
        return date, article

    def GenPagesLst(self,totalPages,Range,initPageID):
        '''Split pages initPageID..totalPages into (start, end) ranges of width Range.'''
        PageLst = []
        k = initPageID
        while k+Range-1 <= totalPages:
            PageLst.append((k,k+Range-1))
            k += Range
        # Bug fix: the old guard `k+Range-1 < totalPages` was always False
        # after the loop, dropping the trailing partial range.
        if k <= totalPages:
            PageLst.append((k,totalPages))
        return PageLst

    def _isNewsAnchor(self,a):
        '''Return truthy when anchor `a` looks like an stcn article link.

        Requires href/target/title attributes, a sibling <span> (the
        listing's date cell), and a company.stcn.com or stock.stcn.com URL.
        '''
        return ('href' in a.attrs and 'target' in a.attrs and 'title' in a.attrs
                and a.parent.find('span')
                and (a['href'].find('http://company.stcn.com/') != -1
                     or a['href'].find('http://stock.stcn.com/') != -1))

    def CrawlRealtimeCompanyNews(self,url_part_lst):
        '''Crawl the first page of each listing section and extract new articles.

        Returns:
            List of 'title article' strings for the newly stored items.
        '''
        doc_lst = []
        self.ConnDB()
        self._AddressLst = self.extractData(['Address'])[0]
        # Refactor: the original duplicated this loop for the first and
        # subsequent calls; the unified membership test below is
        # equivalent because realtimeNewsURL is empty on the first call.
        for url_Part in url_part_lst:
            url = url_Part + str(1) + '.shtml'
            resp = requests.get(url)
            resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
            bs = BeautifulSoup(resp.text, "lxml")
            a_list = bs.find_all('a')
            for a in a_list:
                if self._isNewsAnchor(a):
                    if a['href'] in self.realtimeNewsURL or a['href'] in self._AddressLst:
                        continue  # already seen this session or already stored
                    self.realtimeNewsURL.append(a['href'])
                    date, article = self.getUrlInfo(a['href'])
                    # Retry with a progressively lower threshold when no
                    # body text was extracted, then restore the default.
                    while article == '' and self.Prob >= .1:
                        self.Prob -= .1
                        date, article = self.getUrlInfo(a['href'])
                    self.Prob = .5
                    if article != '':
                        data = {'Date' : date,
                                'Address' : a['href'],
                                'Title' : a['title'],
                                'Article' : article}
                        self._collection.insert_one(data)
                        doc_lst.append(a['title'] + ' ' + article)
                        print(' [' + date + '] ' + a['title'])
        return doc_lst

    def CrawlCompanyNews(self,startPage,endPage,url_Part_1):
        '''Crawl historical listing pages [startPage, endPage] under url_Part_1.

        Articles already stored in MongoDB are skipped.
        '''
        self.ConnDB()
        AddressLst = self.extractData(['Address'])[0]
        # Refactor: one loop replaces the two duplicated branches; with an
        # empty AddressLst the membership test below is a no-op.
        # Consistency fix: the historical path previously accepted only
        # company.stcn.com links, silently dropping stock.stcn.com
        # articles that the realtime path stores; both now use the same
        # anchor filter.
        urls = []
        for pageId in range(startPage,endPage+1):
            urls.append(url_Part_1 + str(pageId) + '.shtml')
        for url in urls:
            if AddressLst == []:
                print(url)
            else:
                print(' ', url)
            resp = requests.get(url)
            resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
            bs = BeautifulSoup(resp.text, "lxml")
            a_list = bs.find_all('a')
            for a in a_list:
                if self._isNewsAnchor(a):
                    if a['href'] in AddressLst:
                        continue  # already stored in MongoDB
                    date, article = self.getUrlInfo(a['href'])
                    while article == '' and self.Prob >= .1:
                        self.Prob -= .1
                        date, article = self.getUrlInfo(a['href'])
                    self.Prob = .5
                    if article != '':
                        data = {'Date' : date,
                                'Address' : a['href'],
                                'Title' : a['title'],
                                'Article' : article}
                        self._collection.insert_one(data)

    def ConnDB(self):
        '''Open a MongoDB client and bind the target collection.'''
        Conn = MongoClient(self.IP, self.PORT)
        db = Conn[self.dbName]
        self._collection = db.get_collection(self.colName)

    def extractData(self,tag_list):
        '''Return the distinct values of each named field in the collection.'''
        # Replaces the original exec()-based variable construction, which
        # was unsafe and unnecessary.
        return [self._collection.distinct(tag) for tag in tag_list]

    def coroutine_run(self,totalPages,Range,initPageID,**kwarg):
        '''Crawl page ranges concurrently using gevent greenlets.

        kwarg must contain 'url_Part_1', the listing URL prefix.
        '''
        jobs = []
        page_ranges_lst = self.GenPagesLst(totalPages,Range,initPageID)
        for page_range in page_ranges_lst:
            jobs.append(gevent.spawn(self.CrawlCompanyNews,page_range[0],page_range[1],kwarg['url_Part_1']))
        gevent.joinall(jobs)

    def multi_threads_run(self,**kwarg):
        '''Crawl page ranges with a thread pool of ThreadsNum workers.

        kwarg must contain 'totalPages', 'Range', 'initPageID' and
        'url_Part_1' (mirroring coroutine_run).
        '''
        # Bug fix: this method previously called self.GenPagesLst() with no
        # arguments and submitted CrawlCompanyNews without url_Part_1, so
        # it always raised TypeError. It now takes the same inputs as
        # coroutine_run, passed through kwarg to keep the signature.
        page_ranges_lst = self.GenPagesLst(kwarg['totalPages'],kwarg['Range'],kwarg['initPageID'])
        print(' Using ' + str(self.ThreadsNum) + ' threads for collecting news ... ')
        # The `with` block waits for all submitted futures to finish.
        with futures.ThreadPoolExecutor(max_workers=self.ThreadsNum) as executor:
            future_to_url = {executor.submit(self.CrawlCompanyNews,page_range[0],page_range[1],kwarg['url_Part_1']) : \
                             ind for ind, page_range in enumerate(page_ranges_lst)}

    def classifyRealtimeStockNews(self):
        '''Crawl and classify fresh news every 60 seconds, forever.'''
        while True:
            print(' * start crawling news from STCN ... ')
            doc_list = self.CrawlRealtimeCompanyNews(['http://company.stcn.com/gsxw/',\
                                                      'http://stock.stcn.com/xingu/',\
                                                      'http://stock.stcn.com/zhuli/',\
                                                      'http://stock.stcn.com/bankuai/',\
                                                      'http://stock.stcn.com/dapan/'])
            print(' * finish crawling ... ')
            if len(doc_list) != 0:
                self.tm.classifyRealtimeStockNews(doc_list)
            time.sleep(60)
================================================
FILE: legacy_v1/Crawler/crawler_tushare.py
================================================
import pymongo
import tushare as ts
import datetime
import time
import math
import traceback
class CrawlStockData(object):
    '''Fetch stock basics, tick and daily data via tushare and store them in MongoDB.

    # Arguments (kwarg):
        IP: MongoDB host address.
        PORT: MongoDB port number.
        stockDailyPath: optional directory holding per-stock daily .txt
            files; defaults to the original hard-coded Windows path.
    '''
    def __init__(self,**kwarg):
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        self.ConnDB()
        # Generalized: the daily-data directory is now overridable via
        # kwarg while keeping the original default for existing callers.
        self.stockDailyPath = kwarg.get('stockDailyPath', 'D:\\stock_daliy')

    def ConnDB(self):
        '''Open the MongoDB client used by all other methods.'''
        self._Conn = pymongo.MongoClient(self.IP, self.PORT)

    def extractData(self,dbName,colName,tag_list):
        '''Return the distinct values of each named field in dbName.colName.'''
        db = self._Conn[dbName]
        collection = db.get_collection(colName)
        # Replaces the original exec()-based variable construction, which
        # was unsafe and unnecessary.
        return [collection.distinct(tag) for tag in tag_list]

    def getStockBasicFromTushare(self,dbName,colName):
        '''Store one document per stock with its basic info from tushare.'''
        db = self._Conn[dbName]
        collection = db.get_collection(colName)
        stock_basic_info = ts.get_stock_basics()
        # The 21 repeated data.update(...) calls collapsed into a field
        # list; document keys and their order are unchanged.
        fields = ['name', 'industry', 'area', 'pe', 'outstanding', 'totals',
                  'totalAssets', 'liquidAssets', 'fixedAssets', 'reserved',
                  'reservedPerShare', 'esp', 'bvps', 'pb', 'undp', 'perundp',
                  'rev', 'profit', 'gpr', 'npr', 'holders']
        for i in range(len(stock_basic_info)):
            data = {stock_basic_info.index.name : stock_basic_info.index[i]}
            for field in fields:
                data[field] = stock_basic_info[field][i]
            collection.insert_one(data)

    def renewStockBasic(self):
        # Not implemented yet.
        pass

    def getStockTickHistory(self,dbName,stockCode):
        '''Store tick-level history for stockCode, one document per tick.

        The start date is the earliest news date found in the
        NBD news collection; days with no data are skipped.
        Exceptions are printed, not re-raised.
        '''
        try:
            db = self._Conn[dbName]
            collection = db.get_collection(stockCode)
            date = self.extractData("NBD","nbd_news_company",['date'])[0]
            begin_date = min(date).split(' ')[0]
            date_list = self.getCalendar(begin_date)
            for dt in date_list:
                tickDataOfEachDate = ts.get_tick_data(stockCode,date=dt)
                # NaN price in the first row means no trades that day.
                if not math.isnan(tickDataOfEachDate['price'][0]):
                    data = {}
                    # Iterate in reverse so documents are inserted in
                    # chronological order.
                    for i in range(len(tickDataOfEachDate)-1,-1,-1):
                        data.update({'date' : dt})
                        data.update({'time' : tickDataOfEachDate['time'][i]})
                        data.update({'price' : tickDataOfEachDate['price'][i]})
                        data.update({'change' : tickDataOfEachDate['change'][i]})
                        data.update({'volume' : int(tickDataOfEachDate['volume'][i])})
                        data.update({'amount' : int(tickDataOfEachDate['amount'][i])})
                        data.update({'type' : tickDataOfEachDate['type'][i]})
                        collection.insert_one(data)
                        data = {}
                    print(dt + ' crawl finished ... ')
        except Exception:
            traceback.print_exc()

    def getStockDayHistory(self,dbName,stockCode):
        '''Load daily OHLCV rows for stockCode from a local text file into MongoDB.

        Expects whitespace-separated rows:
        date open high low close volume turnover.
        '''
        db = self._Conn[dbName]
        collection = db.get_collection(stockCode)
        Path = self.stockDailyPath + '\\' + stockCode + '.txt'
        data = []
        for row in open(Path,'r'):
            line = row.split()
            data.append(line)
        Dict = {}
        for i in range(len(data)):
            if len(data[i]) > 1:
                Dict.update({'date' : data[i][0]})
                Dict.update({'open' : data[i][1]})
                Dict.update({'high' : data[i][2]})
                Dict.update({'low' : data[i][3]})
                Dict.update({'close' : data[i][4]})
                Dict.update({'volume' : data[i][5]})
                Dict.update({'turnover' : data[i][6]})
                collection.insert_one(Dict)
                Dict = {}

    def getCalendar(self,begin_date):
        '''Return every calendar date from begin_date (YYYY-MM-DD) through today.'''
        date_list = []
        begin_date = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
        end_date = datetime.datetime.strptime(time.strftime('%Y-%m-%d',time.localtime(time.time())), "%Y-%m-%d")
        while begin_date <= end_date:
            date_str = begin_date.strftime("%Y-%m-%d")
            date_list.append(date_str)
            begin_date += datetime.timedelta(days=1)
        return date_list

    def isUnique(self, List):
        '''Return True when List contains no duplicate (hashable) items.'''
        # O(n) via a set instead of the original O(n^2) count() scan;
        # identical result for the hashable items (strings) this code uses.
        return len(set(List)) == len(List)

    def getStockTickRealtime(self):
        # Not implemented yet.
        pass
================================================
FILE: legacy_v1/README_OLD.md
================================================
# 上市公司新闻文本分析与分类预测

[![Star History Chart](https://api.star-history.com/svg?repos=DemonDamon/Listed-company-news-crawl-and-text-analysis&type=Date)](https://star-history.com/#DemonDamon/Listed-company-news-crawl-and-text-analysis&Date)
-------------------------------
## 简介
上市公司新闻文本分析与分类预测的基本步骤如下:
- 从新浪财经、每经网、金融界、中国证券网、证券时报网上,爬取上市公司(个股)的历史新闻文本数据(包括时间、网址、标题、正文)
- 从Tushare上获取沪深股票日线数据(开、高、低、收、成交量和持仓量)和基本信息(包括股票代码、股票名称、所属行业、所属地区、PE值、总资产、流动资产、固定资产、留存资产等)
- 对抓取的新闻文本按照,去停用词、加载新词、分词的顺序进行处理
- 利用前两步中所获取的股票名称和分词后的结果,抽取出每条新闻里所包含的(0支、1支或多支)股票名称,并将所对应的所有股票代码,组合成与该条新闻相关的股票代码列表,并在历史数据表中增加一列相关股票代码数据
- 从历史新闻数据库中抽取与某支股票相关的所有新闻文本,利用该支股票的日线数据(比如某一天发布的消息,在设定N天后如果价格上涨则认为是利好消息,反之则是利空消息)给每条新闻贴上“利好”和“利空”的标签,并存储到新的数据库中(或导出到CSV文件)
- 实时抓取新闻数据,判断与该新闻相关的股票有哪些,利用上一步的结果,对与某支股票相关的所有历史新闻文本(已贴标签)进行文本分析(构建新的特征集),然后利用SVM(或随机森林)分类器对文本分析结果进行训练(如果已保存训练模型,可选择重新训练或直接加载模型),最后利用训练模型对实时抓取的新闻数据进行分类预测
开发环境`Python-v3(3.6)`:
- gensim==3.2.0
- jieba==0.39
- scikit-learn==0.19.1
- pandas==0.20.0
- numpy==1.13.3+mkl
- scipy==0.19.0
- pymongo==3.6.0
- beautifulsoup4==4.6.0
- tushare==1.1.1
- requests==2.18.4
- gevent==1.2.1
## 文本处理 -> [text_processing.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Text_Analysis/text_processing.py)
- 文本处理包括去停用词处理、加载新词、中文分词、去掉出现次数少的分词
- 生成字典和Bow向量,并基于Gensim转化模型(LSI、LDA、TF-IDF)转化Bow向量
- 计算文本相似度
- 打印词云
## 文本挖掘 -> [text_mining.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Text_Analysis/text_mining.py)
- 从新闻文本中抽取特定信息,并贴上新的文本标签方便往后训练模型
- 从数据库中抽取与某支股票相关的所有新闻文本
- 将贴好标签的历史新闻进行分类训练,利用训练好的模型对实时抓取的新闻文本进行分类预测
## 新闻爬取 -> [crawler_cnstock.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_cnstock.py), [crawler_jrj.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_jrj.py), [crawler_nbd.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_nbd.py), [crawler_sina.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_sina.py), [crawler_stcn.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_stcn.py)
- 分析网站结构,多线程(或协程)爬取上市公司历史新闻数据
## Tushare数据提取 -> [crawler_tushare.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/run_crawler_tushare.py)
- 获取沪深所有股票的基本信息,包括股票代码、股票名称、所属行业、所属地区等
## 用法
- 配好运行环境以及安装MongoDB,最好再安装一个MongoDB的可视化管理工具Studio 3T
- 先运行[crawler_cnstock.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_cnstock.py), [crawler_jrj.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_jrj.py), [crawler_nbd.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_nbd.py), [crawler_sina.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_sina.py), [crawler_stcn.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_stcn.py)这5个py文件,而且可能因为对方服务器没有响应而重复多次运行这几个文件才能抓取大量的历史数据
- 接着运行[crawler_tushare.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/run_crawler_tushare.py)从Tushare获取基本信息和股票价格
- 最后运行[run_main.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/run_main.py)文件,其中有4个步骤,除了第1步初始化外,其他几步最好单独运行
- 注意:所有程序都必须在文件所在目录下运行
## 更新目标
由于之前的项目代码是在初学Python的时候写的,很多写法都是入门级别,因此为了提高整体项目的质量,除了优化代码细节和已有的功能模块之外,还加入了多个功能模块,来支撑未来更加智能化和个性化的金融分析与交易。
- 完成初步构想,重构该项目,将项目分成8大模块,分别是`数据获取模块`,`数据清洗与预处理模块`,`大数据可视化模块`,`基于机器学习的文本挖掘模块`,`金融知识图谱构建模块`,`任务导向多轮对话模块`,`金融交易模块`,`通用服务模块`
(备注:项目在完善之后会重新更名为`Finnews Hunter`,命名的来源是出于对`《全职猎人》`的喜爱,与项目本质的结合,其中`Finnews`是`Financial News`的简写。上面提到的8个模块,分别由`《全职猎人》`中的本人最喜爱的8位角色命名,分别是
- `数据获取模块` -> [Gon](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Gon) -> `网页爬虫、各种数据源API调用等`
- `数据清洗与预处理模块` -> [Killua](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Killua) -> `数据清洗、数据转换(数据采样、类型转换、归一化等)、数据描述(数据可视化)、特征选择与组合(熵增益和分支定界等)、特征抽取(主成分分析、线性判别分析等)`
- `大数据可视化模块` -> [Kurapika](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Kurapika) -> `基于多个可视化模块进行封装,包括提供Web可视化界面`
- `自然语言处理模块` -> [Leorio](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Leorio) -> `中文分词、词性标注、实体识别`
- `基于机器学习的文本挖掘模块` -> [Hisoka](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Hisoka) -> ``
- `金融知识图谱构建模块` -> [Chrollo](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Chrollo) -> ``
- `任务导向多轮对话模块` -> [Illumi](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Illumi) -> ``
- `金融交易模块` -> [Feitan](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Feitan) -> ``
- `基础与Web服务模块` -> [Kite](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src/Kite) -> `基础服务集,包括基本参数配置文件(.py)、数据库的构建与连接、日志打印与收集、多线程服务、Web服务框架搭建以及其他函数`)
## 更新日志
- 注意:
- 以下例子均需在代码根目录[src](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/tree/main/src)下执行
- 先安装好MongoDB用作存储数据库,以及Redis用做简单的消息队列
- 运行下面demo时,先要设置[config.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Kite/config.py)里面的参数
- 更新[crawler_tushare.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_tushare.py)代码为[stockinfospyder.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/stockinfospyder.py),直接运行即可获取股票历史价格数据,并在每天15:30分后更新数据(目前只采集天数据)
- example-1 调用[AkShare](https://www.akshare.xyz/zh_CN/latest/)接口获取股票历史价格数据,并开启实时更新
```
from Kite import config
from Gon.stockinfospyder import StockInfoSpyder
stock_info_spyder = StockInfoSpyder(config.STOCK_DATABASE_NAME, config.COLLECTION_NAME_STOCK_BASIC_INFO)
# 指定时间段,获取历史数据,如:stock_info_spyder.get_historical_news(start_date="20150101", end_date="20201204")
# 如果没有指定时间段,且数据库已存在部分数据,则从最新的数据时间开始获取直到现在,比如数据库里已有sh600000价格数据到
# 2020-12-03号,如不设定具体时间,则自动获取sh600000自2020-12-04至当前的价格数据
stock_info_spyder.get_historical_news()
```
- example-2 开启自动化更新所有股票价格数据(目前只支持在15:30分后更新日数据)
```
from Kite import config
from Gon.stockinfospyder import StockInfoSpyder
stock_info_spyder = StockInfoSpyder(config.STOCK_DATABASE_NAME, config.COLLECTION_NAME_STOCK_BASIC_INFO)
stock_info_spyder.get_realtime_news()
```
- 更新[crawler_cnstock.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_cnstock.py)代码为[cnstockspyder.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/cnstockspyder.py),直接运行即可获取中国证券网历史新闻数据,并可以实时更新采集
- example-1 爬取历史新闻数据,然后去重以及去NULL
```
import time
import logging
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.cnstockspyder import CnStockSpyder
cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
logging.info("start crawling {} ...".format(url_to_be_crawled))
cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn)
logging.info("finished ...")
time.sleep(30)
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
```
- example-2 实时更新新闻数据库,并且将新数据推进redis消息队列等待处理
```
import time, logging, threading
from Kite import config
from Kite.database import Database
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.cnstockspyder import CnStockSpyder
obj = Database()
df = obj.get_data(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK, keys=["Date", "Category"])
cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
# 先补充历史数据,比如已爬取数据到2020-12-01,但是启动实时爬取程序在2020-12-23,则先
# 自动补充爬取2020-12-02至2020-12-23的新闻数据
for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
# 查询type_chn的最近一条数据的时间
latest_date_in_db = max(df[df.Category == type_chn]["Date"].to_list())
cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn, start_date=latest_date_in_db)
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
# 开启多线程并行实时爬取
thread_list = []
for url, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
thread = threading.Thread(target=cnstock_spyder.get_realtime_news, args=(url, type_chn, 60))
thread_list.append(thread)
for thread in thread_list:
thread.start()
for thread in thread_list:
thread.join()
```
- 更新[crawler_jrj.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_jrj.py)代码为[jrjspyder.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/jrjspyder.py),直接运行即可获取金融界历史新闻数据,并可以实时更新采集
- example-1 爬取历史新闻数据,然后去重以及去NULL
```
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.jrjspyder import JrjSpyder
jrj_spyder = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
jrj_spyder.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ, start_date="2015-01-01")
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
```
- example-2 已爬取一定量的历史数据下,开启实时更新新闻数据库,并且将新数据推进redis消息队列等待处理
```
from Kite import config
from Gon.jrjspyder import JrjSpyder
jrj_spyder = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
jrj_spyder.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ) # 补充爬虫数据到最新日期
jrj_spyder.get_realtime_news()
```
- 更新[crawler_nbd.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_nbd.py)代码为[nbdspyder.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/nbdspyder.py),直接运行即可获取每经网历史新闻数据,并可以实时更新采集
- example-1 爬取历史新闻数据,然后去重以及去NULL
```
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.nbdspyder import NbdSpyder
nbd_spyder = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
nbd_spyder.get_historical_news(start_page=684)
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
```
- example-2 已爬取一定量的历史数据下,开启实时更新新闻数据库,并且将新数据推进redis消息队列等待处理
```
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.nbdspyder import NbdSpyder
# 如果没有历史数据从头爬取,如果已爬取历史数据,则从最新的时间开始爬取
# 如历史数据中最近的新闻时间是"2020-12-09 20:37:10",则从该时间开始爬取
nbd_spyder = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
nbd_spyder.get_historical_news()
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
nbd_spyder.get_realtime_news()
```
- 更新[crawler_sina.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/Crawler/crawler_sina.py)代码为[sinaspyder.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/sinaspyder.py),直接运行即可获取新浪财经历史新闻数据(未更新)
- 停止`证券时报网`爬虫代码的更新(旧代码已不可用),新增`网易财经`和`凤凰财经`的爬虫代码(未更新)
- 新增[buildstocknewsdb.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Killua/buildstocknewsdb.py)如果已经在每经网、中国证券网和金融界爬取了一定量新闻文本,接下来就是针对每支股票构建对应的新闻数据库,并根据股价贴上3/5/10/15/30/60天标签,具体判断条件查看[buildstocknewsdb.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Killua/buildstocknewsdb.py)第111-116行注释
- example-1 从历史新闻数据库中抽取、构建每支股票的新闻数据库,并贴上标签
```
from Kite import config
from Killua.buildstocknewsdb import GenStockNewsDB
gen_stock_news_db = GenStockNewsDB()
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
```
- example-2 监听redis消息队列,将新的数据分别存入与该新闻相关的所有股票新闻数据库中
```
from Kite import config
from Killua.buildstocknewsdb import GenStockNewsDB
gen_stock_news_db = GenStockNewsDB()
gen_stock_news_db.listen_redis_queue()
```
- 新增[realtime_spyder_startup.bat](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/realtime_spyder_startup.bat)同时启动以下程序
- 开启多个爬虫实例,包括[realtime_starter_cnstock.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/realtime_starter_cnstock.py)、[realtime_starter_jrj.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/realtime_starter_jrj.py)、[realtime_starter_nbd.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/realtime_starter_nbd.py)等
- 全股票数据更新代码[realtime_starter_stock_price.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/realtime_starter_stock_price.py)
- 监听redis消息队列[realtime_starter_redis_queue.py](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/Gon/realtime_starter_redis_queue.py)
- 新增[realtime_spyder_stopall.bat](https://github.com/DemonDamon/Listed-company-news-crawl-and-text-analysis/blob/main/src/realtime_spyder_stopall.bat)批量终止爬虫程序
- 更新前使用jieba分词系统,在实体识别上需要不断维护新词表来提高识别精度;更新后,使用基于BERT预训练的FinBERT对金融领域实体进行识别
# FinnewsHunter (Reborn)
基于 AgenticX 框架构建的企业级多智能体金融决策平台。
## 项目状态
🚧 **重构进行中** 🚧
本项目正在经历重大重构,从单一脚本集合升级为现代化的微服务架构。
- **旧版代码**:已归档至 `legacy_v1/` 目录。
- **重构规划**:详见 [planning.md](../../planning.md)。
## 技术架构
- **后端**: Python, FastAPI, AgenticX (Orchestrator, Debate, Tools)
- **前端**: TypeScript, React
- **算法**: sklearn, PyTorch, vllm
## 快速开始
### 后端开发
1. 进入后端目录:
```bash
cd backend
```
2. 安装依赖:
```bash
pip install -r requirements.txt
```
3. 启动服务:
```bash
uvicorn app.main:app --reload
```
## 目录结构
```
FinnewsHunter/
├── backend/ # FastAPI 后端服务
│ ├── app/ # 应用代码
│ └── tests/ # 测试用例
├── frontend/ # React 前端应用 (待初始化)
├── legacy_v1/ # 旧版代码归档
├── docs/ # 项目文档
└── README.md # 项目说明
```
### 快速开始
1. 进入后端目录:
```bash
cd backend
```
2. 安装依赖:
```bash
pip install -r requirements.txt
```
3. 启动服务:
```bash
uvicorn app.main:app --reload
```
## 目录结构
```
FinnewsHunter/
├── backend/ # FastAPI 后端服务
│ ├── app/ # 应用代码
│ └── tests/ # 测试用例
├── frontend/ # React 前端应用 (待初始化)
├── legacy_v1/ # 旧版代码归档
├── docs/ # 项目文档
└── README.md # 项目说明
```
================================================
FILE: legacy_v1/Text_Analysis/__init__.py
================================================
================================================
FILE: legacy_v1/Text_Analysis/text_mining.py
================================================
# -*- coding: UTF-8 -*-
"""
Created on Sat Jan 20 10:20:33 2018
@author: Damon Li
"""
import os, re, csv, time, warnings, threading
from pymongo import MongoClient
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from bson.objectid import ObjectId
import Text_Analysis.text_processing as tp
from gensim import corpora, utils
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import sklearn.exceptions
from sklearn.preprocessing import OneHotEncoder
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
warnings.filterwarnings("ignore", category=Warning, module='sklearn')
warnings.filterwarnings("ignore", category=UserWarning, module='gensim')
warnings.filterwarnings("ignore", category=RuntimeWarning, module='gensim')
class TextMining(object):
    '''Text analysis and prediction functions class.

    Wraps a MongoDB connection and provides helpers to extract news, match
    articles to listed companies, label them against price moves and train
    per-stock classifiers.

    # Arguments:
        IP: IP address of mongodb database.
        PORT: Port number corresponding to IP.
    '''
    def __init__(self,**kwarg):
        # Required keyword arguments: the MongoDB host and port.
        self.IP = kwarg['IP']
        self.PORT = kwarg['PORT']
        self.ConnDB()
        # Stop-word list and financial user dictionary are expected next to
        # the current working directory (Windows-style '\\' path joining).
        self.tp = tp.TextProcessing(os.getcwd() + '\\' + 'Chinese_Stop_Words.txt', \
            os.getcwd() + '\\' + 'finance_dict.txt')
        # Per-stock dictionaries/bow-vectors/models are cached under this dir.
        if not os.path.exists(os.getcwd() + '\\' + 'stock_dict_file'):
            os.makedirs(os.getcwd() + '\\' + 'stock_dict_file')
        self.DictPath = os.getcwd() + '\\' + 'stock_dict_file'
    def ConnDB(self):
        '''Connect to the mongodb and keep the client on self._Conn.
        '''
        self._Conn = MongoClient(self.IP, self.PORT)
def extractData(self,dbName,colName,tag_list):
'''Extract data from specific collection of specific database.
# Arguments:
dbName: Name of database.
colName: Name of collection.
tag_list: List of tags that need to be extracted.
'''
db = self._Conn[dbName]
collection = db.get_collection(colName)
data = []
Dict = {}
for tag in tag_list:
exec(tag + " = collection.distinct('" + tag + "')")
exec("data.append(" + tag + ")")
exec("Dict.update({'" + tag + "' : np.array(" + tag + ")})")
dataFrame = pd.DataFrame(Dict,columns=tag_list)
return dataFrame
def extractStockCodeFromArticle(self,dbName,colName):
    '''Extract the stocks mentioned by each news(articles/documents) and
    write the matched codes back into each article document.

    Title + body are tokenized; any token of length >= 3 equal to a listed
    company name counts as a mention. Matched codes are stored space-joined
    in the article's `relevantStock` field.

    # Arguments:
        dbName: Name of database.
        colName: Name of collection.
    '''
    db = self._Conn[dbName]
    collection = db.get_collection(colName)
    idLst = self.extractData(dbName,colName,['_id'])._id
    # Listed-company names/codes from the Stock/Basic_Info collection.
    data = self.extractData("Stock","Basic_Info",['name','code'])
    articles = []
    for _id in idLst:
        # NBD_Stock stores the headline under lower-case 'title'; the other
        # news databases use capitalised 'Title'.
        if dbName == 'NBD_Stock':
            title = collection.find_one({'_id':ObjectId(_id)})['title']
        else:
            title = collection.find_one({'_id':ObjectId(_id)})['Title']
        article = collection.find_one({'_id':ObjectId(_id)})['Article']
        articles.append(title + ' ' + article)
    # NOTE(review): passes only saveDict — relies on genDictionary tolerating
    # missing saveBowvec/returnValue keyword arguments; verify.
    token, _, _ = self.tp.genDictionary(articles,saveDict=False)
    j = 0
    for tk in token:
        relevantStockName = []
        relevantStockCode = []
        for k in range(len(tk)):
            # Only tokens of 3+ chars, to avoid spurious short-name matches.
            if len(tk[k]) >= 3 and tk[k] in list(data.name):
                relevantStockName.append(tk[k])
                relevantStockCode.append(list(data[(data.name == tk[k])].code)[0])
        if len(relevantStockCode) != 0:
            relevantStockCodeDuplicateRemoval = list(set(relevantStockCode))
            # NOTE(review): Collection.update() is deprecated in modern
            # PyMongo (use update_one) — confirm the installed version.
            collection.update({"_id":idLst[j]},{"$set":{"relevantStock":\
                ' '.join(relevantStockCodeDuplicateRemoval)}})
        # print(' [*] finished ' + str(j+1) + ' ... ')
        j += 1
def extractStockCodeFromRealtimeNews(self,documents):
'''Extract stocks mentioined by real-time crawled news(articles/documents),
and return the list of corresponding codes.
# Arguments:
documents: Real-time crawled news(articles/documents).
'''
stock_basic_info = self.extractData("Stock","Basic_Info",['name','code'])
token_list = self.tp.jieba_tokenize(documents)
relevant_stock_list = []
for tokens in token_list:
relevantStockCode = []
for tk in tokens:
if len(tk) >= 3 and tk in list(stock_basic_info.name):
relevantStockCode.append(list(stock_basic_info[(stock_basic_info.name == tk)].code)[0])
relevant_stock_list.append(list(set(relevantStockCode)))
return relevant_stock_list
def judgeGoodOrBadNews(self,stockCode,date,judgeTerm):
'''Label the historical news(articles/documents) with 'Bad', 'Good' or 'Neutral'.
# Arguments:
stockCode: Code of specific stock.
date: Date at which released the specific news.
judgeTerm: Interval after which compare the close price with that at the released date.
'''
db = self._Conn['Stock']
collection = db.get_collection(stockCode)
dateLst = self.extractData("Stock",stockCode,['date']).date
days = 0
CloseLst = []
for dt in dateLst:
if dt >= date:
CloseLst.append(float(collection.find_one({'date':dt})['close']))
if days >= judgeTerm:
break
days += 1
if CloseLst[-1] > CloseLst[0]:
character = '利好'
elif CloseLst[-1] < CloseLst[0]:
character = '利空'
else:
character = '中立'
return character
def getNewsOfSpecificStock(self,dbColLst,stockCode,**kwarg):
    '''Get news related to specific stock from historical database.

    Scans each (database, collection) pair, keeps documents whose
    relevantStock field contains stockCode, and exports them either to a
    CSV file or to a new MongoDB collection (with a good/bad/neutral label
    from judgeGoodOrBadNews).

    # Arguments:
        dbColLst: List of databases and collections, eg: [(db_1,col_1),(db_2,col_2),...,(db_N,col_N)].
        stockCode: Code of specific stock.
        export: List parameters deciding the ways of exporting('csv' or 'database')
            and file path of saving, eg: export=['csv','.\\file'].
            For 'database': export=['database', newDbName, newColName], and
            kwarg['judgeTerm'] is also required for labelling.
    '''
    if kwarg['export'][0] == 'csv':
        # Append mode: repeated runs will re-write the header row as well.
        with open(kwarg['export'][1] + '\\' + stockCode + '.csv', 'a+', newline='',encoding='utf-8') as file:
            fieldnames = ['date','address','title','article']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
            for dbName,colName in dbColLst:
                db = self._Conn[dbName]
                collection = db.get_collection(colName)
                idLst = self.extractData(dbName,colName,['_id'])._id
                # Sina_Stock uses capitalised keys ('RelevantStock', 'Title'...),
                # NBD uses lower-case keys; other sources are skipped here.
                if dbName == 'Sina_Stock':
                    for _id in idLst:
                        # Presence check via joined key names, then substring
                        # match of the stock code in the RelevantStock field.
                        keys = ' '.join([k for k in collection.find_one({'_id':ObjectId(_id)}).keys()])
                        if keys.find('RelevantStock') != -1:
                            if collection.find_one({'_id':ObjectId(_id)})['RelevantStock'].find(stockCode) != -1:
                                print(' ' + collection.find_one({'_id':ObjectId(_id)})['Title'])
                                writer.writerow({'date':collection.find_one({'_id':ObjectId(_id)})['Date'], \
                                    'address':collection.find_one({'_id':ObjectId(_id)})['Address'], \
                                    'title':collection.find_one({'_id':ObjectId(_id)})['Title'], \
                                    'article':collection.find_one({'_id':ObjectId(_id)})['Article']})
                elif dbName == 'NBD':
                    for _id in idLst:
                        keys = ' '.join([k for k in collection.find_one({'_id':ObjectId(_id)}).keys()])
                        if keys.find('relevantStock') != -1:
                            if collection.find_one({'_id':ObjectId(_id)})['relevantStock'].find(stockCode) != -1:
                                print(' ' + collection.find_one({'_id':ObjectId(_id)})['title'])
                                writer.writerow({'date':collection.find_one({'_id':ObjectId(_id)})['date'], \
                                    'address':collection.find_one({'_id':ObjectId(_id)})['address'], \
                                    'title':collection.find_one({'_id':ObjectId(_id)})['title'], \
                                    'article':collection.find_one({'_id':ObjectId(_id)})['Article']})
                print(' [*] extracting ' + stockCode + ' news from ' + dbName + ' database to CSV file successfully ... ')
    elif kwarg['export'][0] == 'database': #export into a new database/collection
        for dbName,colName in dbColLst:
            db = self._Conn[dbName]
            collection = db.get_collection(colName)
            idLst = self.extractData(dbName,colName,['_id'])._id
            if dbName == 'NBD_Stock':
                newdb = self._Conn[kwarg['export'][1]]
                newcollection = newdb.get_collection(kwarg['export'][2])
                for _id in idLst:
                    keys = ' '.join([k for k in collection.find_one({'_id':ObjectId(_id)}).keys()])
                    if keys.find('relevantStock') != -1:
                        if collection.find_one({'_id':ObjectId(_id)})['relevantStock'].find(stockCode) != -1:
                            # Label against price move over kwarg['judgeTerm'] days;
                            # date is normalised to 'YYYYMMDD' before the lookup.
                            character = self.judgeGoodOrBadNews(stockCode,\
                                collection.find_one({'_id':ObjectId(_id)})['date'].split(' ')[0].replace('-',''),kwarg['judgeTerm'])
                            # print(' ' + collection.find_one({'_id':ObjectId(_id)})['title'] + '(' + character + ')')
                            data = {'Date' : collection.find_one({'_id':ObjectId(_id)})['date'],
                                'Address' : collection.find_one({'_id':ObjectId(_id)})['address'],
                                'Title' : collection.find_one({'_id':ObjectId(_id)})['title'],
                                'Article' : collection.find_one({'_id':ObjectId(_id)})['Article'],
                                'Character' : character}
                            newcollection.insert_one(data)
            elif dbName == 'Sina_Stock':
                newdb = self._Conn[kwarg['export'][1]]
                newcollection = newdb.get_collection(kwarg['export'][2])
                for _id in idLst:
                    keys = ' '.join([k for k in collection.find_one({'_id':ObjectId(_id)}).keys()])
                    if keys.find('RelevantStock') != -1:
                        if collection.find_one({'_id':ObjectId(_id)})['RelevantStock'].find(stockCode) != -1:
                            character = self.judgeGoodOrBadNews(stockCode,\
                                collection.find_one({'_id':ObjectId(_id)})['Date'].split(' ')[0].replace('-',''),kwarg['judgeTerm'])
                            # print(' ' + collection.find_one({'_id':ObjectId(_id)})['Title'] + '(' + character + ')')
                            data = {'Date' : collection.find_one({'_id':ObjectId(_id)})['Date'],
                                'Address' : collection.find_one({'_id':ObjectId(_id)})['Address'],
                                'Title' : collection.find_one({'_id':ObjectId(_id)})['Title'],
                                'Article' : collection.find_one({'_id':ObjectId(_id)})['Article'],
                                'Character' : character}
                            newcollection.insert_one(data)
            else:
                newdb = self._Conn[kwarg['export'][1]]
                newcollection = newdb.get_collection(kwarg['export'][2])
                for _id in idLst:
                    keys = ' '.join([k for k in collection.find_one({'_id':ObjectId(_id)}).keys()])
                    # NOTE(review): this fallback branch checks the lower-case
                    # 'relevantStock' key but then reads capitalised
                    # 'Date'/'Address'/'Title' — confirm the key casing of the
                    # remaining source collections actually matches this mix.
                    if keys.find('relevantStock') != -1:
                        if collection.find_one({'_id':ObjectId(_id)})['relevantStock'].find(stockCode) != -1:
                            character = self.judgeGoodOrBadNews(stockCode,\
                                collection.find_one({'_id':ObjectId(_id)})['Date'].split(' ')[0].replace('-',''),kwarg['judgeTerm'])
                            # print(' ' + collection.find_one({'_id':ObjectId(_id)})['Title'] + '(' + character + ')')
                            data = {'Date' : collection.find_one({'_id':ObjectId(_id)})['Date'],
                                'Address' : collection.find_one({'_id':ObjectId(_id)})['Address'],
                                'Title' : collection.find_one({'_id':ObjectId(_id)})['Title'],
                                'Article' : collection.find_one({'_id':ObjectId(_id)})['Article'],
                                'Character' : character}
                            newcollection.insert_one(data)
            print(' [' + stockCode + '] ' + dbName + ' has been extracted successfully ... ')
def classifyHistoryStockNews(self,dbName,stockCode,**kwarg):
    '''Build classifier from historical news(articles/documents) of specific stock.

    Loads (or rebuilds) the per-stock dictionary and bow-vectors, transforms
    them with a gensim model, then trains/loads an SVM or RandomForest
    classifier and stores the test precision on self._precise.

    # Arguments:
        dbName: Name of database.
        stockCode: Code of specific stock.
        renewDict: Renew the dictionary created by historical news(articles/documents) of
            specific stock or not(bool type).
        modelType: Transformation model type, including 'lsi', 'lda' and 'None', 'None' means TF-IDF model.
        tfDim: The number of topics that will be extracted from each news(articles/documents).
        renewModel: Re-train the transformation models or not(bool type).
        Classifier: The name of classifier, including 'SVM' and 'RandomForest' so far.
        Params: The parameters of classifier, detail refer to the setting of classifier parameters of scikit-learn module.

    # Returns:
        Test-set precision of the trained/loaded classifier.
    '''
    # Branch 1: caller forces a rebuild of dictionary + bow-vectors.
    if kwarg['renewDict']:
        if not os.path.exists(self.DictPath+'\\'+stockCode):
            os.makedirs(self.DictPath+'\\'+stockCode)
        db = self._Conn[dbName]
        collection = db.get_collection(stockCode)
        idLst = self.extractData(dbName,stockCode,['_id'])._id
        articles = []
        characters = []
        for _id in idLst:
            articles.append(collection.find_one({'_id':ObjectId(_id)})['Article'])
            # Map the Chinese label to a numeric class: 利好(good)=1,
            # 利空(bad)=-1, anything else (中立/neutral)=0.
            if collection.find_one({'_id':ObjectId(_id)})['Character'] == "利好":
                characters.append(1)
            elif collection.find_one({'_id':ObjectId(_id)})['Character'] == "利空":
                characters.append(-1)
            else:
                characters.append(0)
        self.tp.genDictionary(articles,saveDict=True,saveDictPath=self.DictPath+'\\'+stockCode+'\\'+stockCode+'_dict.dict',\
            saveBowvec=True,saveBowvecPath=self.DictPath+'\\'+stockCode+'\\'+stockCode+'_bowvec.mm',returnValue=False)
        print(' [*] renew the dictionary and bow-vector successfully ... ')
    # Branch 2: no rebuild requested but the cached files are missing —
    # generate them once (same work as branch 1).
    elif not os.path.exists(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_dict.dict') \
        or not os.path.exists(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_bowvec.mm'):
        if not os.path.exists(self.DictPath+'\\'+stockCode):
            os.makedirs(self.DictPath+'\\'+stockCode)
        db = self._Conn[dbName]
        collection = db.get_collection(stockCode)
        idLst = self.extractData(dbName,stockCode,['_id'])._id
        articles = []
        characters = []
        for _id in idLst:
            articles.append(collection.find_one({'_id':ObjectId(_id)})['Article'])
            if collection.find_one({'_id':ObjectId(_id)})['Character'] == "利好":
                characters.append(1)
            elif collection.find_one({'_id':ObjectId(_id)})['Character'] == "利空":
                characters.append(-1)
            else:
                characters.append(0)
        self.tp.genDictionary(articles,saveDict=True,saveDictPath=self.DictPath+'\\'+stockCode+'\\'+stockCode+'_dict.dict',\
            saveBowvec=True,saveBowvecPath=self.DictPath+'\\'+stockCode+'\\'+stockCode+'_bowvec.mm',returnValue=False)
        print(' [*] generate and save the dictionary and bow-vector successfully ... ')
    # Branch 3: cached files exist — only the labels are re-read from Mongo.
    else:
        db = self._Conn[dbName]
        collection = db.get_collection(stockCode)
        idLst = self.extractData(dbName,stockCode,['_id'])._id
        characters = []
        for _id in idLst:
            if collection.find_one({'_id':ObjectId(_id)})['Character'] == "利好":
                characters.append(1)
            elif collection.find_one({'_id':ObjectId(_id)})['Character'] == "利空":
                characters.append(-1)
            else:
                characters.append(0)
    dictionary = corpora.Dictionary.load(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_dict.dict')
    bowvec = corpora.MmCorpus(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_bowvec.mm')
    print(' [*] load dictionary and bow-vector successfully ... ')
    # Transform bow-vectors (tfidf -> lsi/lda per modelType), then convert
    # the sparse gensim vectors into a dense matrix for sklearn.
    _, modelVec = self.tp.CallTransformationModel(dictionary,bowvec,modelType=kwarg['modelType'],\
        tfDim=kwarg['tfDim'],renewModel=kwarg['renewModel'],modelPath=self.DictPath+'\\'+stockCode+'\\')
    CSRMatrix = self.ConvertToCSRMatrix(modelVec)
    train_X, train_Y, test_X, test_Y = self.genTrainingSet(CSRMatrix,characters)
    # Both classifier helpers set self._precise as a side effect.
    if kwarg['Classifier'] == 'SVM':
        self.SVMClassifier(train_X,train_Y,test_X,test_Y,kwarg['Params'],['precision'],stockCode)
    if kwarg['Classifier'] == 'RandomForest':
        self.RdForestClassifier(train_X,train_Y,test_X,test_Y,kwarg['Params'],['precision'],stockCode)
    return self._precise
def classifyRealtimeStockNews(self,doc_list):
    '''Classify real-time news(articles/documents) of specific stock.

    For every stock mentioned in the freshly crawled documents: (re)build or
    load that stock's SVM model from historical news, project the new
    documents into the stock's LDA space and print a good/bad/neutral verdict.

    # Arguments:
        doc_list: List of real-time news(articles/documents) crawled from specific websites.
    '''
    print(' * extract relevant stock codes from latest crawled news ... ')
    relevant_stock_list = self.extractStockCodeFromRealtimeNews(doc_list)
    if len(relevant_stock_list) != 0:
        tfDim = 200
        for i, code_list in enumerate(relevant_stock_list):
            for code in code_list:
                print(' * load SVM parameters (gamma & C) ... ')
                # Grid-search candidate values for the RBF-kernel SVM.
                Params_svm = {'kernel': ['rbf'], 'gamma': [10, 20, 50, 100, 150, 200], \
                    'C': [10, 15, 20, 30, 50, 100]}
                print(' * use historical news to build SVM model of ' + code + ' ... ')
                self.classifyHistoryStockNews("Stock_News",code,modelType='lda',tfDim=tfDim,renewDict=False,\
                    renewModel=False,Classifier='SVM',Params=Params_svm) #code="600740"
                print(' * load historical dictionary of ' + code + ' ...')
                dictionary = corpora.Dictionary.load(os.getcwd() + '\\' + 'stock_dict_file\\' + code + '\\' + code + '_dict.dict')
                print(' * tokenize latest crawled news ... ')
                token = self.tp.jieba_tokenize(doc_list)
                print(' * create bow-vector of latest news of ' + code + ' ... ')
                bowvec_doc = [dictionary.doc2bow(text) for text in token]
                print(' * load bow-vector of historical news of ' + code + ' ... ')
                bowvec_all = list(corpora.MmCorpus(os.getcwd() + '\\' + 'stock_dict_file\\' + code + '\\' + code + '_bowvec.mm'))
                print(' * extend latest bow-vector to historical bow-vector of ' + code + ' ... ')
                bowvec_all.extend(bowvec_doc)
                print(' * create new lda model of ' + code + ' ... ')
                _, NewmodelVec = self.tp.CallTransformationModel(dictionary,bowvec_all,modelType='lda',\
                    tfDim=200,renewModel=False,modelPath=os.getcwd() + '\\' + 'stock_dict_file\\' + code + '\\')
                print(' * convert latest lda vector to CSR matrix of ' + code + ' ... ')
                NewCSRMatrix = self.ConvertToCSRMatrix(NewmodelVec)
                print(' * load SVM model of ' + code + ' ... ')
                clf = joblib.load(os.getcwd() + '\\' + 'stock_dict_file\\' + code + '\\' + code + '_svm.pkl')
                print(' * predicting ... ')
                # NOTE(review): row index i-2 into the combined (historical +
                # new) matrix looks fragile — the intent appears to be "the
                # row of the i-th freshly appended document"; verify offset.
                if clf.predict(NewCSRMatrix[i-2,:])[0] == 1:
                    print('   《' + doc_list[i].split(' ')[0] + "》" + '对' + code + '是利好消息 ...')
                elif clf.predict(NewCSRMatrix[i-2,:])[0] == -1:
                    print('   《' + doc_list[i].split(' ')[0] + "》" + '对' + code + '是利空消息 ...')
                else:
                    print('   《' + doc_list[i].split(' ')[0] + "》" + '对' + code + '是中立消息 ...')
    else:
        print(' * not any relevant stock ... ')
def SVMClassifier(self,train_X,train_Y,test_X,test_Y,tuned_parameters,scores,stockCode):
    '''SVM Classifier.

    Trains (or loads a cached) grid-searched SVM for the given stock, prints
    a classification report and stores the test accuracy on self._precise.

    # Arguments:
        train_X: Features train data.
        train_Y: Labels train data.
        test_X: Features test data.
        test_Y: Labels test data.
        tuned_parameters: The parameters of classifier, refer to the setting of classifier parameters of scikit-learn module.
        scores: Targets of optimization, detail refer to optimal targets setting of scikit-learn module.
        stockCode: Code of specific stock.
    '''
    for score in scores:
        if not os.path.exists(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_svm.pkl'):
            # Build the GridSearch classifier with 5-fold cross-validation.
            clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=5, scoring='%s_weighted' % score)
            # k-fold only on the training set, then keep the best parameters.
            clf.fit(train_X, train_Y)
            # Cache the fitted search object so later runs skip training.
            joblib.dump(clf, self.DictPath+'\\'+stockCode+'\\'+stockCode+'_svm.pkl')
            print(clf.best_params_)  # print the best model parameters
        else:
            clf = joblib.load(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_svm.pkl')
        # for params, mean_score, scores in clf.grid_scores_:
        #     print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
        train_pred = clf.predict(train_X)
        # Evaluate generalisation of the best model on the held-out test set.
        test_pred = clf.predict(test_X)
        print(classification_report(test_Y, test_pred))
        # Manual accuracy counts for train and test predictions.
        precise_train = 0
        for k in range(len(train_pred)):
            if train_pred[k] == train_Y[k]:
                precise_train += 1
        precise_test = 0
        for k in range(len(test_pred)):
            if test_pred[k] == test_Y[k]:
                precise_test += 1
        print(' [*] train_pred:', precise_train/len(train_Y), ', test_pred:', precise_test/len(test_pred))
        print(' ' + '-' * 50)
        # Test accuracy of the last score target is what callers read back.
        self._precise = precise_test/len(test_pred)
def RdForestClassifier(self,train_X,train_Y,test_X,test_Y,tuned_parameters,scores,stockCode):
    '''Random Forest Classifier.

    Trains (or loads a cached) grid-searched random forest for the given
    stock, prints a classification report and stores the test accuracy on
    self._precise. Mirrors SVMClassifier with a different estimator.

    # Arguments:
        train_X: Features train data.
        train_Y: Labels train data.
        test_X: Features test data.
        test_Y: Labels test data.
        tuned_parameters: The parameters of classifier, refer to the setting of classifier parameters of scikit-learn module.
        scores: Targets of optimization, detail refer to optimal targets setting of scikit-learn module.
        stockCode: Code of specific stock.
    '''
    for score in scores:
        if not os.path.exists(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_rdf.pkl'):
            # Build the GridSearch classifier with 5-fold cross-validation;
            # random_state fixed for reproducible forests.
            clf = GridSearchCV(RandomForestClassifier(random_state=14), tuned_parameters, cv=5, scoring='%s_weighted' % score)
            # k-fold only on the training set, then keep the best parameters.
            clf.fit(train_X, train_Y)
            # Cache the fitted search object so later runs skip training.
            joblib.dump(clf, self.DictPath+'\\'+stockCode+'\\'+stockCode+'_rdf.pkl')
            print(clf.best_params_)  # print the best model parameters
        else:
            clf = joblib.load(self.DictPath+'\\'+stockCode+'\\'+stockCode+'_rdf.pkl')
        # for params, mean_score, scores in clf.grid_scores_:
        #     print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
        train_pred = clf.predict(train_X)
        # Evaluate generalisation of the best model on the held-out test set.
        test_pred = clf.predict(test_X)
        print(classification_report(test_Y, test_pred))
        # Manual accuracy counts for train and test predictions.
        precise_train = 0
        for k in range(len(train_pred)):
            if train_pred[k] == train_Y[k]:
                precise_train += 1
        precise_test = 0
        for k in range(len(test_pred)):
            if test_pred[k] == test_Y[k]:
                precise_test += 1
        print(' [*] train_pred:', precise_train/len(train_Y), ', test_pred:', precise_test/len(test_pred))
        print(' ' + '-' * 50)
        # Test accuracy of the last score target is what callers read back.
        self._precise = precise_test/len(test_pred)
def ConvertToCSRMatrix(self, modelVec):
    '''Convert a gensim-style sparse vector sequence (LDA/LSI/tfidf model
    output) into a dense numpy matrix.

    Side effect: records the number of documents seen on self._line_count,
    which genTrainingSet later relies on.

    # Arguments:
        modelVec: Iterable of (column, value) pair lists, one per document.

    # Returns:
        Dense 2-D numpy array, one row per document.
    '''
    entries = []  # (row, col, value) triples collected from every document
    row_idx = 0
    for vector in modelVec:
        for col_idx, value in vector:
            entries.append((row_idx, col_idx, value))
        row_idx += 1
    self._line_count = row_idx
    if entries:
        rows, cols, vals = zip(*entries)
    else:
        rows, cols, vals = (), (), ()
    # Assemble via a CSR sparse matrix, then densify for sklearn.
    sparse = csr_matrix((list(vals), (list(rows), list(cols))))
    return sparse.toarray()
def genTrainingSet(self, X, Y):
    '''Randomly split features/labels into train and test sets (~80/20).

    Draws one uniform random number per row (row count taken from
    self._line_count, set by ConvertToCSRMatrix) and assigns rows below 0.8
    to the training split, the rest to the test split.

    # Arguments:
        X: Feature set (2-D array, indexed X[i, :]).
        Y: Label set (sequence indexed Y[i]).

    # Returns:
        (train_X, train_Y, test_X, test_Y) as four lists.
    '''
    draws = np.random.random(size=self._line_count)
    train_X, train_Y, test_X, test_Y = [], [], [], []
    for idx in range(self._line_count):
        if draws[idx] < 0.8:
            train_X.append(X[idx, :])
            train_Y.append(Y[idx])
        else:
            test_X.append(X[idx, :])
            test_Y.append(Y[idx])
    return train_X, train_Y, test_X, test_Y
================================================
FILE: legacy_v1/Text_Analysis/text_processing.py
================================================
# -*- coding: UTF-8 -*-
"""
Created on Fri Feb 23 12:37:46 2018
@author: Damon Li
"""
import numpy as np
import jieba, os
from gensim import corpora,similarities,models,matutils,utils
class TextProcessing(object):
    '''Text pre-processing functions class.

    Tokenization (jieba with a financial user dictionary), stop-word
    filtering, dictionary/bow-vector generation and gensim model wrappers.

    # Arguments
        chnSTWPath: chinese stop words txt file path.
        finance_dict: latest financial related words txt file path.
    '''
    def __init__(self,chnSTWPath,finance_dict):
        # Only the paths are stored; the files are read lazily by the
        # methods that need them.
        self.chnSTWPath = chnSTWPath
        self.finance_dict = finance_dict
def renewFinanceDict(self, new_Word_list):
    '''Append newly required financial terms to the user-dictionary file so
    that subsequent tokenization runs can recognise them.

    # Arguments:
        new_Word_list: New financial words list, eg: ["区块链","离岸金融"].
    '''
    with open(self.finance_dict, 'a', encoding='utf-8') as dict_file:
        dict_file.write(''.join(word + '\n' for word in new_Word_list))
def getchnSTW(self):
    '''Load the stop-words txt file.

    # Returns:
        List of stop words, one per line, stripped of whitespace.
    '''
    # Use a context manager so the file handle is closed deterministically
    # (the original left the handle open for the GC to collect).
    # NOTE(review): relies on the platform default text encoding, as the
    # original did — confirm the stop-word file matches it.
    with open(self.chnSTWPath, 'r') as stw_file:
        return [line.strip() for line in stw_file]
def jieba_tokenize(self, documents):
    '''Cut each document into a list of independent words, dropping stop
    words, tabs and single spaces.

    # Arguments:
        documents: List of news(articles).

    # Returns:
        One token list per input document.
    '''
    stop_words = self.getchnSTW()
    # Teach jieba the financial vocabulary before segmenting.
    jieba.load_userdict(self.finance_dict)
    tokenized = []
    for doc in documents:
        words = [w for w in jieba.cut(doc)
                 if w not in stop_words and w != '\t' and w != ' ']
        tokenized.append(words)
    return tokenized
def RemoveWordAppearOnce(self, corpora_documents):
    '''Remove the words that appear once among all the tokenized news(articles).

    # Arguments:
        corpora_documents: List of tokenized news(articles), i.e. token lists.

    # Returns:
        The same structure with all globally-unique tokens removed.
    '''
    # Count with a plain dict: the original referenced `defaultdict` without
    # importing it from collections, which raised NameError at runtime.
    frequency = {}
    for text in corpora_documents:
        for token in text:
            frequency[token] = frequency.get(token, 0) + 1
    return [[token for token in text if frequency[token] > 1]
            for text in corpora_documents]
def genDictionary(self, documents, **kwarg):
    '''Generate dictionary and bow-vector of all tokenized news(articles).

    # Arguments:
        documents: List of news(articles).
        saveDict: Save dictionary or not (bool, default False).
        saveDictPath: Target path, required when saveDict is True.
        saveBowvec: Save bow-vector or not (bool, default False).
        saveBowvecPath: Target path, required when saveBowvec is True.
        returnValue: Return values or not (bool, default True).

    # Returns (when returnValue is true):
        (token_lists, dictionary, bow_vectors)
    '''
    self._raw_documents = documents
    token = self.jieba_tokenize(documents)  # jieba tokenize
    # corpora_documents = self.RemoveWordAppearOnce(token)  # optionally drop words appearing once
    self._dictionary = corpora.Dictionary(token)  # generate dictionary from tokenized documents
    # Use .get() defaults so callers may omit flags: the original raised
    # KeyError when e.g. only saveDict=False was passed (as
    # extractStockCodeFromArticle does), and returnValue now defaults to
    # True so such callers still receive the 3-tuple they unpack.
    if kwarg.get('saveDict', False):
        self._dictionary.save(kwarg['saveDictPath'])  # store for future reference
    self._BowVecOfEachDoc = [self._dictionary.doc2bow(text) for text in token]  # documents -> bow vectors
    if kwarg.get('saveBowvec', False):
        corpora.MmCorpus.serialize(kwarg['saveBowvecPath'], self._BowVecOfEachDoc)  # store to disk for later use
    if kwarg.get('returnValue', True):
        return token, self._dictionary, self._BowVecOfEachDoc
def CallTransformationModel(self,Dict,Bowvec,**kwarg):
    '''Invoke specific transformation models of the Gensim module.

    # Arguments:
        Dict: Dictionary made by all tokenized news(articles/documents).
        Bowvec: Bow-vector created by all tokenized news(articles/documents).
        modelType: Transformation model type, including 'lsi', 'lda' and 'None', 'None' means TF-IDF model.
        tfDim: The number of topics that will be extracted from each news(articles/documents).
        renewModel: Re-train the transformation models or not(bool type).
        modelPath: The path of saving trained transformation models.
    '''
    if kwarg['renewModel']:
        # Re-train everything from scratch and persist to disk.
        tfidf = models.TfidfModel(Bowvec)  # initialize tfidf model
        tfidfVec = tfidf[Bowvec]  # use the model to transform whole corpus
        tfidf.save(kwarg['modelPath']+"tfidf_model.tfidf")
        if kwarg['modelType'] == 'lsi':
            model = models.LsiModel(tfidfVec, id2word=Dict, num_topics=kwarg['tfDim'])  # initialize an LSI transformation
            modelVec = model[tfidfVec]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
            # NOTE(review): saved to modelPath itself here, but the cached
            # branch below loads from modelPath+"lsi_model.lsi" — confirm the
            # intended file name.
            model.save(kwarg['modelPath'])  # same for tfidf, lda, ...
        elif kwarg['modelType'] == 'lda':
            model = models.LdaModel(tfidfVec, id2word=Dict, num_topics=kwarg['tfDim'])
            # Sparse LDA vector per document; element values are the weights
            # of membership in the corresponding topic.
            modelVec = model[tfidfVec]
            model.save(kwarg['modelPath'])  # same for tfidf, lda, ...
        elif kwarg['modelType'] == 'None':
            model = tfidf
            modelVec = tfidfVec
        # NOTE(review): any other modelType leaves model/modelVec unbound and
        # the return below raises NameError — confirm callers only pass
        # 'lsi', 'lda' or 'None'.
    else:
        # Reuse cached models where possible; train-and-save only on a miss.
        if not os.path.exists(kwarg['modelPath']+"tfidf_model.tfidf"):
            tfidf = models.TfidfModel(Bowvec)  # initialize tfidf model
            tfidfVec = tfidf[Bowvec]
            tfidf.save(kwarg['modelPath']+"tfidf_model.tfidf")
        else:
            tfidf = models.TfidfModel.load(kwarg['modelPath']+"tfidf_model.tfidf")
            tfidfVec = tfidf[Bowvec]  # use the model to transform whole corpus
        if kwarg['modelType'] == 'lsi':
            if not os.path.exists(kwarg['modelPath']+"lsi_model.lsi"):
                tfidf = models.TfidfModel.load(kwarg['modelPath']+"tfidf_model.tfidf")
                tfidfVec = tfidf[Bowvec]  # use the model to transform whole corpus
                model = models.LsiModel(tfidfVec, id2word=Dict, num_topics=kwarg['tfDim'])  # initialize an LSI transformation
                modelVec = model[tfidfVec]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
                model.save(kwarg['modelPath']+"lsi_model.lsi")  # same for tfidf, lda, ...
            else:
                model = models.LsiModel.load(kwarg['modelPath']+"lsi_model.lsi")
                modelVec = model[tfidfVec]
        elif kwarg['modelType'] == 'lda':
            if not os.path.exists(kwarg['modelPath']+"lda_model.lda"):
                tfidf = models.TfidfModel.load(kwarg['modelPath']+"tfidf_model.tfidf")
                tfidfVec = tfidf[Bowvec]  # use the model to transform whole corpus
                model = models.LdaModel(tfidfVec, id2word=Dict, num_topics=kwarg['tfDim'])
                # Sparse LDA vector per document; element values are the
                # weights of membership in the corresponding topic.
                modelVec = model[tfidfVec]
                model.save(kwarg['modelPath']+"lda_model.lda")  # same for tfidf, lda, ...
            else:
                model = models.LdaModel.load(kwarg['modelPath']+"lda_model.lda")
                modelVec = model[tfidfVec]
        elif kwarg['modelType'] == 'None':
            model = tfidf
            modelVec = tfidfVec
    return tfidfVec, modelVec
def CalSim(self,test_document,Type,best_num):
    '''Calculate similarities between the test document and all news(articles/documents).

    # Arguments:
        test_document: a raw document (string) to compare against the corpus
            (it is fed to jieba.cut, so a single string is expected).
        Type: similarity model, 'Similarity-tfidf-index' or 'Similarity-LSI-index'.
        best_num: refer to the 'num_best' parameter in the Gensim module.

    Returns:
        (IdLst, SimTxLst, SimRltLst): matched document ids, their raw texts
        and the similarity scores, as produced by the Gensim Similarity index.
    '''
    if Type == 'Similarity-tfidf-index':
        tfidf = models.TfidfModel(self._BowVecOfEachDoc)
        tfidfVec = tfidf[self._BowVecOfEachDoc]
        self._num_features = len(self._dictionary.token2id.keys())
        # NOTE(review): the Type string doubles as the on-disk shard prefix
        # of the Similarity index — confirm this is intentional.
        self._similarity = similarities.Similarity(Type, tfidfVec, \
            num_features=self._num_features,num_best=best_num)
        test_cut_raw = list(jieba.cut(test_document))
        test_BowVecOfEachDoc = self._dictionary.doc2bow(test_cut_raw)
        self._test_BowVecOfEachDoc = tfidf[test_BowVecOfEachDoc]
    elif Type == 'Similarity-LSI-index':
        lsi_model = models.LsiModel(self._BowVecOfEachDoc)
        corpus_lsi = lsi_model[self._BowVecOfEachDoc]
        self._num_features = len(self._dictionary.token2id.keys())
        self._similarity = similarities.Similarity(Type, corpus_lsi, \
            num_features=self._num_features,num_best=best_num)
        test_cut_raw = list(jieba.cut(test_document))
        test_BowVecOfEachDoc = self._dictionary.doc2bow(test_cut_raw)
        self._test_BowVecOfEachDoc = lsi_model[test_BowVecOfEachDoc]
    # NOTE(review): an unrecognized Type leaves self._similarity and
    # self._test_BowVecOfEachDoc unset (or stale from a previous call)
    # before they are used below — confirm callers only pass the two
    # supported values.
    self.Print_CalSim()
    IdLst = []
    SimRltLst = []
    SimTxLst = []
    # Collect the ids, similarity scores and raw texts of the best matches.
    for Id, Sim in self._similarity[self._test_BowVecOfEachDoc]:
        IdLst.append(Id)
        SimRltLst.append(Sim)
        SimTxLst.append(self._raw_documents[Id])
    return IdLst,SimTxLst,SimRltLst
def PrintWorfCloud(self, documents, backgroundImgPath, fontPath):
    """Render the word cloud of all news(articles/documents) onto the
    current matplotlib figure (caller is responsible for plt.show()).

    # Arguments:
        documents: Overall raw documents.
        backgroundImgPath: Background image path used as the cloud mask.
        fontPath: Path of a font file (needed for CJK glyphs), e.g.
            "C:\\Windows\\Fonts\\simhei.ttf".
    """
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    corpora_documents = self.jieba_tokenize(documents)  # tokenize
    # Flatten the token lists into one big space-separated string.
    text = ' '.join(' '.join(doc) for doc in corpora_documents)
    # BUG FIX: scipy.misc.imread was removed in SciPy 1.2; plt.imread loads
    # the mask image without the extra dependency.
    color_mask = plt.imread(backgroundImgPath)
    cloud = WordCloud(font_path=fontPath, mask=color_mask, background_color='white',
                      max_words=2000, max_font_size=40)
    word_cloud = cloud.generate(text)
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.axis("off")
if __name__ == '__main__':
    # Build the text-processing helper from the stop-word file and the
    # financial dictionary located in the current working directory
    # (Windows-style path separators).
    tp = TextProcessing(os.getcwd() + '\\' + 'Chinese_Stop_Words.txt', \
        os.getcwd() + '\\' + 'finance_dict.txt')
    # Two sample news articles (raw Chinese text) used to exercise
    # dictionary generation below.
    doc = ['中央、地方支持政策频出,煤炭行业站上了风口 券商研报浩如烟海,投资线索眼花缭乱,第一财经推出\
《一财研选》产品,挖掘研报精华,每期梳理5条投资线索,便于您短时间内获取有价值的信息。专业团队\
每周日至每周四晚8点准时“上新”,\
助您投资顺利!1.中央、地方支持政策频出,这个行业站上了风口!(信达证券)近年来,利好住房租赁\
市场发展的政策频频发布,顶层设计趋于完善。信达证券指出,2015年以来,住建部、国务院等机构相继出\
台政策支持住房租赁市场发展,地方积极跟进,试点城市全部出台相关方案支持当地住房租赁市场发展。除\
此之外,“租购同权”保障承租人享受公共服务的权益,稳定租赁关系,利好长租公寓发展。除政策利好长租\
公寓外,需求的逐步释放对长租公寓市场形成支撑。信达证券研究发现,人口向核心一、二线城市流动趋势不\
减,高房价刺激购房需求转向租房需求、首次置业年龄抬升、高校毕业生租房需求增加等因素将刺激长租公寓\
需求进一步释放。总体而言,住房租赁市场容量逾万亿且具备区域性特征。2017年8月,国土资源部、住房和城\
乡建设部联合印发《利用集体建设用地建设租赁住房试点方案》,选择13个试点城市推进利用集体建设用地建\
设租赁住房,各地“只租不售”地块频出,彰显政府发展住房租赁市场决心。类REITs产品盘活租赁资产,解决\
长租融资痛点,上述举措能够有效增加租赁住房供给。伴随政策利好,多主体纷纷进军住房租赁市场。信达证\
券指出,截至目前,房企、房地产中介、专业租赁机构、连锁酒店、金融机构和互联网公司均已涉足住宅租赁市\
场。其中,房企多采用自持物业的重资产运营方式,中介机构及其他公司多以轻资产运营方式为主,从房源获\
取的角度看,集中与分散并行。信达证券指出,当前我国租赁住房的发展还处于初步阶段,多主体参与、多模式\
并存。参与各方均凭借自身比较优势切入住房租赁领域。未来,房企、互联网公司、金融机构存在巨大的合作空间。\
在市场细分的前提下,增值服务的提供将成为住房租赁市场发展的关键。信达证券推荐关注招商蛇口(21.100, \
-1.43, -6.35%)(001979.SZ)、万科A(31.270, -1.48, -4.52%)(000002.SZ)、世联行(8.700, -0.87,\
-9.09%)(002285.SZ)、昆百大A(7.510, -0.05, -0.66%)(000560.SZ)、天健集团(9.330, -0.56, -5.66%)\
(000090.SZ)。2.煤炭库存创八年新低,缺煤升级,高煤价仍将持续(中银国际)截至1月30日,秦皇岛5500大\
卡山西优混动力煤报755元,跳涨2%,再超预期,并创近6年新高,此轮上涨持续了10周时间,累计涨幅达13%。煤炭\
行业是本栏重点追踪的行业板块,近期的大涨验证了此前选摘的多家研究机构的观点,今天我们再来看一下中银国际\
对板块未来表现的分析观点。中银国际指出,六大电厂日耗量周均81万吨,环比增加9%,库存天数由13天下降至10.9天\
,为近8年新低,库存下降至899万吨,为近7年新低。缺煤情况非常突出。经济的强韧性叠加寒冷冰雪天气推升需求超预\
期是主因,供应侧在年关生产积极性不高、运输不畅是辅因,且短期较难明显缓解,2月初地方矿也面临陆续放假,在\
这种情况下煤价有继续攀高的可能。中银国际认为此轮煤价上涨包含着较多非季节性因素:六大电厂日耗从2017年12月\
开始同比增幅都在10%以上,这还是在有工业限产的情况下,这是非常高的数字,在2017年7~8月旺季的同比增幅也只\
有15%左右。经济较好下的需求超预期历来是煤炭股最好的催化剂。尽管2月份由于春节因素可能价格会回落,但在2018\
年缺煤明显的情况下,幅度不会太大,高煤价还会继续维持。3月初两会召开,安全形势再度紧张,煤炭的供应仍然会偏\
紧,在叠加3月15日后限产解除,限产解除前后下游补库存,高煤价可能会贯穿整个一季度。中银国际指出,2017年1月秦\
皇岛煤价均价只有602元,2018年1月的均价为726元,同比增长21%,动力煤公司一季度的业绩大概率会上调。尽管后续煤\
价调控的压力在加大,但近期效果可能不明显,中期有待观察。煤炭板块2018年市盈率15倍,估值不贵,且存在继续上调\
盈利预测和估值下行的可能,股价仍有空间。继续推荐动力煤龙头陕西煤业(8.340, -0.77, -8.45%)(601225.SH)、\
兖州煤业(15.150, -1.24, -7.57%)(600803.SH)、中国神华(24.290, -1.16, -4.56%)(601088.SH),以及优质\
的国企改革兼并重组题材股潞安环能(11.590, -1.11, -8.74%)(601699.SH)、山西焦化(12.420, -1.38, -10.00%\
)(600740.SH)、山煤国际(4.520, -0.50, -9.96%)(600546.SH)、阳泉煤业(7.780, -0.86, -9.95%)(600348.SH)\
。',\
'郭文仓到重点工程项目督导检查 2月2日,公司党委书记、董事长、总经理郭文仓,公司董事,股份公司副总经理、总工程师、\
郭毅民,股份公司副总经理张国富、柴高贵及相关单位负责人到焦化厂煤场全封闭和1#—4#干熄焦等重点工程项目建设工地\
督导检查施工进度和安全工作情况。郭文仓一行实地查看并详细了解了现场施工情况,询问了施工队伍人员状况,他说,\
煤场全封闭项目和1#—4#干熄焦项目是公司的重点环保项目,一定要力争将重点工程项目建成精品工程、一流环保标杆项目\
。近日天气寒冷,又临近春节,煤场全封闭项目进入收尾的关键阶段,施工负责人要紧绷安全弦,加强现场安全管理,从细节抓\
起,消除隐患,确保收尾工作安全稳定顺利。1#—4#干熄焦项目在大面积开工的重要时期,一定要统筹安排项目进度和质量\
管理,落实好冬季防护措施,管控好每一道施工环节,目前尤其要注重人员的思想状况,做到不安全不施工,保证施工安全和人\
员人身安全,确保项目“安全无事故、质量全达标、进度按计划、投资不超概、投产即达效、竣工不留尾、审计无问题、廉政建\
设好”,为公司打造成全国独立焦化旗舰企业奠定坚实的基础。']
    # Per-stock output directory layout: stock_dict_file/<code>/<code>_dict.dict
    # and stock_dict_file/<code>/<code>_bowvec.mm
    DictPath = os.getcwd() + '\\' + 'stock_dict_file'
    stockCode = '600740'
    print(DictPath)
    print(DictPath+'\\'+stockCode+'\\'+stockCode+'_dict.dict')
    print(DictPath+'\\'+stockCode+'\\'+stockCode+'_bowvec.mm')
    # Create the per-stock output directory on first run.
    if not os.path.exists(DictPath+'\\'+stockCode):
        os.makedirs(DictPath+'\\'+stockCode)
    # Build and persist the dictionary and bow-vectors for the sample docs.
    tp.genDictionary(doc,saveDict=True,saveDictPath=DictPath+'\\'+stockCode+'\\'+stockCode+'_dict.dict',\
        saveBowvec=True,saveBowvecPath=DictPath+'\\'+stockCode+'\\'+stockCode+'_bowvec.mm',returnValue=False)
================================================
FILE: legacy_v1/finance_dict.txt
================================================
================================================
FILE: legacy_v1/run_crawler_cnstock.py
================================================
from Crawler.crawler_cnstock import WebCrawlFromcnstock
if __name__ == '__main__':
    # Crawl three cnstock channels into MongoDB (Cnstock_Stock database).
    crawler = WebCrawlFromcnstock(IP="localhost", PORT=27017, ThreadsNum=4,
                                  dbName="Cnstock_Stock",
                                  collectionName="cnstock_news_company")
    channels = [
        (621, 10, 1, 'http://company.cnstock.com/company/scp_gsxw/'),
        (112, 10, 0, 'http://ggjd.cnstock.com/gglist/search/qmtbbdj/'),
        (116, 10, 0, 'http://ggjd.cnstock.com/gglist/search/ggkx/'),
    ]
    for pages, arg2, arg3, channel_url in channels:
        crawler.coroutine_run(pages, arg2, arg3, url_Part_1=channel_url)  # Obj.multi_threads_run()
================================================
FILE: legacy_v1/run_crawler_jrj.py
================================================
from Crawler.crawler_jrj import WebCrawlFromjrj
if __name__ == '__main__':
    # Crawl jrj.com company news for the given date range into MongoDB.
    jrj_crawler = WebCrawlFromjrj("2009-01-05", "2018-02-03", 100,
                                  ThreadsNum=4, IP="localhost", PORT=27017,
                                  dbName="Jrj_Stock",
                                  collectionName="jrj_news_company")
    jrj_crawler.coroutine_run()  # alternatives: single_run() / multi_threads_run()
================================================
FILE: legacy_v1/run_crawler_nbd.py
================================================
from Crawler.crawler_nbd import WebCrawlFromNBD
if __name__ == '__main__':
    # Crawl NBD company news; re-crawl pages that came back empty.
    nbd_crawler = WebCrawlFromNBD(2871, 10, ThreadsNum=4, IP="localhost",
                                  PORT=27017, dbName='NBD_Stock',
                                  collectionName="nbd_news_company")
    pending_news_urls = nbd_crawler.coroutine_run()  # alternatives: single_run() / multi_threads_run()
    if pending_news_urls != []:
        print(' -------------------- Re-Crawl News List Pages -------------------- ')
        pending_article_urls, pending_article_titles = nbd_crawler.ReCrawlNews(pending_news_urls)
        if pending_article_urls != [] or pending_article_titles != []:
            print(' -------------------- Re-Crawl Article Pages -------------------- ')
            nbd_crawler.ReCrawlArticles(pending_article_urls, pending_article_titles)
================================================
FILE: legacy_v1/run_crawler_sina.py
================================================
from Crawler.crawler_sina import WebCrawlFromSina
if __name__ == '__main__':
    # Crawl Sina stock news into MongoDB (Sina_Stock database).
    sina_crawler = WebCrawlFromSina(5000, 100, ThreadsNum=4,
                                    IP="localhost", PORT=27017,
                                    dbName="Sina_Stock",
                                    collectionName="sina_news_company")
    sina_crawler.coroutine_run()  # alternatives: single_run() / multi_threads_run()
================================================
FILE: legacy_v1/run_crawler_stcn.py
================================================
from Crawler.crawler_stcn import WebCrawlFromstcn
if __name__ == '__main__':
    # Crawl five stcn.com channels into MongoDB (Stcn_Stock database).
    stcn_crawler = WebCrawlFromstcn(IP="localhost", PORT=27017, ThreadsNum=4,
                                    dbName="Stcn_Stock",
                                    collectionName="stcn_news_company")
    for channel_url in ('http://company.stcn.com/gsxw/',
                        'http://stock.stcn.com/xingu/',
                        'http://stock.stcn.com/zhuli/',
                        'http://stock.stcn.com/bankuai/',
                        'http://stock.stcn.com/dapan/'):
        stcn_crawler.coroutine_run(20, 1, 1, url_Part_1=channel_url)
================================================
FILE: legacy_v1/run_crawler_tushare.py
================================================
from Crawler.crawler_tushare import CrawlStockData
if __name__ == '__main__':
    import time  # BUG FIX: time.time() was used below but time was never imported

    t1 = time.time()
    # Initiate the crawler backed by the local MongoDB instance.
    Obj = CrawlStockData(IP="localhost", PORT=27017)
    # Get basic infos of stocks into Stock/Basic_Info.
    Obj.getStockBasicFromTushare("Stock", "Basic_Info")
    # Extract stocks' code.
    Code = Obj.extractData('Stock', 'Basic_Info', ['code'])[0]
    # Get stock price history from Tushare, one collection per code.
    for stockcode in Code:
        Obj.getStockDayHistory('Stock', stockcode)
        print(' [*] ' + stockcode + ' has finished storing ... ')
    t2 = time.time()
    print(' running time:', t2 - t1)
================================================
FILE: legacy_v1/run_main.py
================================================
import time, datetime, threading
from concurrent import futures
from Crawler.crawler_sina import WebCrawlFromSina
from Crawler.crawler_jrj import WebCrawlFromjrj
from Crawler.crawler_cnstock import WebCrawlFromcnstock
from Crawler.crawler_stcn import WebCrawlFromstcn
import Text_Analysis.text_mining as tm
def crawlers(web):
    """Start the real-time news classifier for one source site.

    # Arguments:
        web: one of 'sina', 'jrj', 'cnstock', 'stcn'; anything else is a no-op.
    """
    if web == 'sina':
        spyder = WebCrawlFromSina(5000, 100, ThreadsNum=4, IP="localhost", PORT=27017,
                                  dbName="Sina_Stock", collectionName="sina_news_company")
    elif web == 'jrj':
        spyder = WebCrawlFromjrj("2009-01-05", "2018-02-03", 100, ThreadsNum=4,
                                 IP="localhost", PORT=27017,
                                 dbName="Jrj_Stock", collectionName="jrj_news_company")
    elif web == 'cnstock':
        spyder = WebCrawlFromcnstock(IP="localhost", PORT=27017, ThreadsNum=4,
                                     dbName="Cnstock_Stock", collectionName="cnstock_news_company")
    elif web == 'stcn':
        spyder = WebCrawlFromstcn(IP="localhost", PORT=27017, ThreadsNum=4,
                                  dbName="Stcn_Stock", collectionName="stcn_news_company")
    else:
        return
    spyder.classifyRealtimeStockNews()
if __name__ == '__main__':
    # Step 1. Initiate
    text_mining_obj = tm.TextMining(IP="localhost", PORT=27017)
    # Step 2. Extract relevant stock codes of news(articles) from each database
    text_mining_obj.extractStockCodeFromArticle("NBD_Stock", "nbd_news_company")  # NBD news
    text_mining_obj.extractStockCodeFromArticle("Cnstock_Stock", "cnstock_news_company")  # cnstock news
    text_mining_obj.extractStockCodeFromArticle("Stcn_Stock", "stcn_news_company")  # stcn news
    text_mining_obj.extractStockCodeFromArticle("Jrj_Stock", "jrj_news_company")  # jrj news
    # Step 3. Extract all news related to each stock into Stock_News
    # (this step will take a long time)
    codeLst = text_mining_obj.extractData("Stock", "Basic_Info", ['code']).code

    news_sources = [("NBD_Stock", "nbd_news_company"), ("Sina_Stock", "sina_news_company"),
                    ("Cnstock_Stock", "cnstock_news_company"), ("Stcn_Stock", "stcn_news_company"),
                    ("Jrj_Stock", "jrj_news_company")]

    def _extract_batch(codes):
        # One worker thread per stock code; join before the next batch starts.
        thread_lst = []
        for stockcode in codes:
            thread = threading.Thread(target=text_mining_obj.getNewsOfSpecificStock,
                                      args=(news_sources, stockcode),
                                      kwargs={"export": ['database', 'Stock_News', stockcode],
                                              "judgeTerm": 3})
            thread_lst.append(thread)
        for thread in thread_lst:
            thread.start()
        for thread in thread_lst:
            thread.join()
        # BUG FIX: the original concatenated a str with a sequence slice,
        # which raises TypeError; stringify the batch before printing.
        print(' [*] have extracted ' + str(list(codes)))

    Range = 10
    Idx = 0
    while Idx < len(codeLst):
        _extract_batch(codeLst[Idx:Idx+Range])
        Idx += Range
    # NOTE(review): the original repeated the batch logic once more for
    # codeLst[Idx:], but after the loop Idx >= len(codeLst) so the slice is
    # empty; the helper treats an empty batch as a no-op, preserving that.
    _extract_batch(codeLst[Idx:])
    # Step 4. Crawl real-time news from 'web_list' and make classification
    web_list = ['sina', 'jrj', 'cnstock', 'stcn']
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        future_to_url = {executor.submit(crawlers, param):
                         ind for ind, param in enumerate(web_list)}
================================================
FILE: legacy_v1/src/Gon/__init__.py
================================================
import os
import sys
def add_path(path):
    """Prepend *path* to sys.path unless it is already present."""
    if path in sys.path:
        return
    sys.path.insert(0, path)
# Make the package importable regardless of how the scripts are launched.
# add `./src` dir to system path
# NOTE(review): src_dir_1 is derived from the current working directory, not
# from this file's location — confirm callers always run from `src/Gon`.
src_dir_1 = os.path.abspath(os.path.join(os.getcwd(), "../"))
# add `./src/Gon` dir to system path
src_dir_2 = os.path.dirname(__file__)
add_path(src_dir_1)
add_path(src_dir_2)
================================================
FILE: legacy_v1/src/Gon/cnstockspyder.py
================================================
"""
中国证券网:https://www.cnstock.com
公司聚焦:https://company.cnstock.com/company/scp_gsxw
公告解读:https://ggjd.cnstock.com/gglist/search/qmtbbdj
公告快讯:https://ggjd.cnstock.com/gglist/search/ggkx
利好公告:https://ggjd.cnstock.com/company/scp_ggjd/tjd_sdlh
"""
import __init__
from spyder import Spyder
from Kite import utils
from Kite import config
from Kite.database import Database
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Leorio.tokenization import Tokenization
import re
import time
import json
import redis
import random
import logging
import threading
from bs4 import BeautifulSoup
from selenium import webdriver
# Module-level logging setup: INFO level with timestamp, file name and line
# number in every record.
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S')
class CnStockSpyder(Spyder):
    """Crawler for cnstock.com news channels.

    Crawled items are stored in MongoDB via ``Database``; real-time items are
    additionally pushed onto a Redis list for downstream consumers.
    Indentation below is reconstructed from the flattened dump — NOTE(review):
    verify nesting against the original repository file.
    """

    def __init__(self, database_name, collection_name):
        super(CnStockSpyder, self).__init__()
        self.db_obj = Database()
        self.col = self.db_obj.conn[database_name].get_collection(collection_name)
        # Counts consecutive failed requests; drives the linear back-off below.
        self.terminated_amount = 0
        self.db_name = database_name
        self.col_name = collection_name
        self.tokenization = Tokenization(import_module="jieba", user_dict=config.USER_DEFINED_DICT_PATH)
        self.redis_client = redis.StrictRedis(host=config.REDIS_IP,
                                              port=config.REDIS_PORT,
                                              db=config.CACHE_NEWS_REDIS_DB_ID)

    def get_url_info(self, url):
        """Fetch one article page and extract its date and plain-text body.

        Returns ``[date, article]`` on success, or ``False`` when the page
        could not be fetched.  ``article`` may be an empty string when no
        paragraph passes the ``is_article_prob`` threshold.
        """
        try:
            bs = utils.html_parser(url)
        except Exception:
            return False
        span_list = bs.find_all("span")
        part = bs.find_all("p")
        article = ""
        date = ""
        # The publication date is carried by the first <span class="timer">.
        for span in span_list:
            if "class" in span.attrs and span["class"] == ["timer"]:
                date = span.text
                break
        # Keep only paragraphs whose Chinese-character ratio exceeds the
        # threshold (is_article_prob presumably set on the Spyder base class
        # — confirm).
        for paragraph in part:
            chn_status = utils.count_chn(str(paragraph))
            possible = chn_status[1]
            if possible > self.is_article_prob:
                article += str(paragraph)
        # Strip every remaining <...> tag from the concatenated paragraphs.
        while article.find("<") != -1 and article.find(">") != -1:
            string = article[article.find("<"):article.find(">")+1]
            article = article.replace(string, "")
        # Drop ideographic (full-width) spaces, then collapse whitespace runs.
        while article.find("\u3000") != -1:
            article = article.replace("\u3000", "")
        article = " ".join(re.split(" +|\n+", article)).strip()
        return [date, article]

    def get_historical_news(self, url, category_chn=None, start_date=None):
        """
        :param url: listing page to crawl
        :param category_chn: Chinese category name, one of '公司聚焦',
            '公告解读', '公告快讯', '利好公告'
        :param start_date: date of the most recent item of this category
            already in the database; ``None`` means crawl everything
        """
        assert category_chn is not None
        driver = webdriver.Chrome(executable_path=config.CHROME_DRIVER)
        btn_more_text = ""
        crawled_urls_list = self.extract_data(["Url"])[0]
        logging.info("historical data length -> {} ... ".format(len(crawled_urls_list)))
        # crawled_urls_list = []
        driver.get(url)
        name_code_df = self.db_obj.get_data(config.STOCK_DATABASE_NAME,
                                            config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                            keys=["name", "code"])
        # Mapping of stock name -> stock code, used for article tagging.
        name_code_dict = dict(name_code_df.values)
        if start_date is None:
            # Full crawl: keep clicking the "load more" button until the page
            # reports "没有更多" (no more), then harvest every listed item.
            while btn_more_text != "没有更多":
                more_btn = driver.find_element_by_id('j_more_btn')
                btn_more_text = more_btn.text
                logging.info("1-{}".format(more_btn.text))
                if btn_more_text == "加载更多":
                    more_btn.click()
                    time.sleep(random.random())  # sleep random time less 1s
                elif btn_more_text == "加载中...":
                    # Page is still loading: wait a bit longer and re-check.
                    time.sleep(random.random()+2)
                    more_btn = driver.find_element_by_id('j_more_btn')
                    btn_more_text = more_btn.text
                    logging.info("2-{}".format(more_btn.text))
                    if btn_more_text == "加载更多":
                        more_btn.click()
                else:
                    more_btn.click()
                    break
            bs = BeautifulSoup(driver.page_source, "html.parser")
            for li in bs.find_all("li", attrs={"class": ["newslist"]}):
                a = li.find_all("h2")[0].find("a")
                if a["href"] not in crawled_urls_list:
                    result = self.get_url_info(a["href"])
                    # Retry with linearly growing back-off until the server
                    # answers or the retry budget is exhausted.
                    while not result:
                        self.terminated_amount += 1
                        if self.terminated_amount > config.CNSTOCK_MAX_REJECTED_AMOUNTS:
                            # Persist URLs that keep failing.
                            with open(config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                file.write("{}\n".format(a["href"]))
                            logging.info("rejected by remote server longer than {} minutes, "
                                         "and the failed url has been written in path {}"
                                         .format(config.CNSTOCK_MAX_REJECTED_AMOUNTS,
                                                 config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH))
                            break
                        logging.info("rejected by remote server, request {} again after "
                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                        time.sleep(60 * self.terminated_amount)
                        result = self.get_url_info(a["href"])
                    if not result:
                        # Crawl failed even after retries.
                        logging.info("[FAILED] {} {}".format(a["title"], a["href"]))
                    else:
                        # Fetched, but the body may still be empty: relax the
                        # paragraph threshold step by step and retry.
                        date, article = result
                        while article == "" and self.is_article_prob >= .1:
                            self.is_article_prob -= .1
                            result = self.get_url_info(a["href"])
                            while not result:
                                self.terminated_amount += 1
                                if self.terminated_amount > config.CNSTOCK_MAX_REJECTED_AMOUNTS:
                                    # Persist URLs that keep failing.
                                    with open(config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                        file.write("{}\n".format(a["href"]))
                                    logging.info("rejected by remote server longer than {} minutes, "
                                                 "and the failed url has been written in path {}"
                                                 .format(config.CNSTOCK_MAX_REJECTED_AMOUNTS,
                                                         config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH))
                                    break
                                logging.info("rejected by remote server, request {} again after "
                                             "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                time.sleep(60 * self.terminated_amount)
                                result = self.get_url_info(a["href"])
                            # NOTE(review): if the retry loop exits via break,
                            # result is still False and this unpack raises
                            # TypeError — confirm intended.
                            date, article = result
                        # Restore the default paragraph threshold.
                        self.is_article_prob = .5
                        if article != "":
                            related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                              name_code_dict)
                            data = {"Date": date,
                                    "Category": category_chn,
                                    "Url": a["href"],
                                    "Title": a["title"],
                                    "Article": article,
                                    "RelatedStockCodes": " ".join(related_stock_codes_list)}
                            # self.col.insert_one(data)
                            self.db_obj.insert_data(self.db_name, self.col_name, data)
                            logging.info("[SUCCESS] {} {} {}".format(date, a["title"], a["href"]))
        else:
            # start_date given: back-fill only the items newer than start_date.
            is_click_button = True
            start_get_url_info = False
            tmp_a = None
            while is_click_button:
                bs = BeautifulSoup(driver.page_source, "html.parser")
                for li in bs.find_all("li", attrs={"class": ["newslist"]}):
                    a = li.find_all("h2")[0].find("a")
                    # Skip items already inspected in the previous round;
                    # tmp_a marks where the previous round stopped.
                    if tmp_a is not None and a["href"] != tmp_a:
                        continue
                    elif tmp_a is not None and a["href"] == tmp_a:
                        start_get_url_info = True
                    if start_get_url_info:
                        # NOTE(review): get_url_info may return False, which
                        # would make this unpack raise — confirm.
                        date, _ = self.get_url_info(a["href"])
                        if date <= start_date:
                            is_click_button = False
                            break
                tmp_a = a["href"]
                if is_click_button:
                    more_btn = driver.find_element_by_id('j_more_btn')
                    more_btn.click()
            # Everything from the top of the list down to tmp_a (exclusive)
            # is new and gets crawled now.
            bs = BeautifulSoup(driver.page_source, "html.parser")
            for li in bs.find_all("li", attrs={"class": ["newslist"]}):
                a = li.find_all("h2")[0].find("a")
                if a["href"] != tmp_a:
                    result = self.get_url_info(a["href"])
                    while not result:
                        self.terminated_amount += 1
                        if self.terminated_amount > config.CNSTOCK_MAX_REJECTED_AMOUNTS:
                            # Persist URLs that keep failing.
                            with open(config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                file.write("{}\n".format(a["href"]))
                            logging.info("rejected by remote server longer than {} minutes, "
                                         "and the failed url has been written in path {}"
                                         .format(config.CNSTOCK_MAX_REJECTED_AMOUNTS,
                                                 config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH))
                            break
                        logging.info("rejected by remote server, request {} again after "
                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                        time.sleep(60 * self.terminated_amount)
                        result = self.get_url_info(a["href"])
                    if not result:
                        # Crawl failed even after retries.
                        logging.info("[FAILED] {} {}".format(a["title"], a["href"]))
                    else:
                        # Fetched, but the body may still be empty: relax the
                        # paragraph threshold step by step and retry.
                        date, article = result
                        while article == "" and self.is_article_prob >= .1:
                            self.is_article_prob -= .1
                            result = self.get_url_info(a["href"])
                            while not result:
                                self.terminated_amount += 1
                                if self.terminated_amount > config.CNSTOCK_MAX_REJECTED_AMOUNTS:
                                    # Persist URLs that keep failing.
                                    with open(config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                        file.write("{}\n".format(a["href"]))
                                    logging.info("rejected by remote server longer than {} minutes, "
                                                 "and the failed url has been written in path {}"
                                                 .format(config.CNSTOCK_MAX_REJECTED_AMOUNTS,
                                                         config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH))
                                    break
                                logging.info("rejected by remote server, request {} again after "
                                             "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                time.sleep(60 * self.terminated_amount)
                                result = self.get_url_info(a["href"])
                            # NOTE(review): same potential False unpack as above.
                            date, article = result
                        self.is_article_prob = .5
                        if article != "":
                            related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                              name_code_dict)
                            data = {"Date": date,
                                    "Category": category_chn,
                                    "Url": a["href"],
                                    "Title": a["title"],
                                    "Article": article,
                                    "RelatedStockCodes": " ".join(related_stock_codes_list)}
                            # self.col.insert_one(data)
                            self.db_obj.insert_data(self.db_name, self.col_name, data)
                            logging.info("[SUCCESS] {} {} {}".format(date, a["title"], a["href"]))
                else:
                    break
        driver.quit()

    def get_realtime_news(self, url, category_chn=None, interval=60):
        """Poll *url* every *interval* seconds and store/publish unseen items.

        New items are written to MongoDB and pushed onto the Redis list
        ``config.CACHE_NEWS_LIST_NAME`` as JSON.  Runs forever.
        """
        logging.info("start real-time crawling of URL -> {}, request every {} secs ... ".format(url, interval))
        assert category_chn is not None
        # TODO: cnstock volume is small, so for now the whole history is
        # loaded for de-duplication; the strategy will be revised later.
        name_code_df = self.db_obj.get_data(config.STOCK_DATABASE_NAME,
                                            config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                            keys=["name", "code"])
        name_code_dict = dict(name_code_df.values)
        crawled_urls = self.db_obj.get_data(self.db_name,
                                            self.col_name,
                                            keys=["Url"])["Url"].to_list()
        while True:
            # Poll the listing page at a fixed interval.
            bs = utils.html_parser(url)
            for li in bs.find_all("li", attrs={"class": ["newslist"]}):
                a = li.find_all("h2")[0].find("a")
                if a["href"] not in crawled_urls:  # latest_3_days_crawled_href
                    result = self.get_url_info(a["href"])
                    while not result:
                        self.terminated_amount += 1
                        if self.terminated_amount > config.CNSTOCK_MAX_REJECTED_AMOUNTS:
                            # Persist URLs that keep failing.
                            with open(config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                file.write("{}\n".format(a["href"]))
                            logging.info("rejected by remote server longer than {} minutes, "
                                         "and the failed url has been written in path {}"
                                         .format(config.CNSTOCK_MAX_REJECTED_AMOUNTS,
                                                 config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH))
                            break
                        logging.info("rejected by remote server, request {} again after "
                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                        time.sleep(60 * self.terminated_amount)
                        result = self.get_url_info(a["href"])
                    if not result:
                        # Crawl failed even after retries.
                        logging.info("[FAILED] {} {}".format(a["title"], a["href"]))
                    else:
                        # Fetched, but the body may still be empty: relax the
                        # paragraph threshold step by step and retry.
                        date, article = result
                        while article == "" and self.is_article_prob >= .1:
                            self.is_article_prob -= .1
                            result = self.get_url_info(a["href"])
                            while not result:
                                self.terminated_amount += 1
                                if self.terminated_amount > config.CNSTOCK_MAX_REJECTED_AMOUNTS:
                                    # Persist URLs that keep failing.
                                    with open(config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                        file.write("{}\n".format(a["href"]))
                                    logging.info("rejected by remote server longer than {} minutes, "
                                                 "and the failed url has been written in path {}"
                                                 .format(config.CNSTOCK_MAX_REJECTED_AMOUNTS,
                                                         config.RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH))
                                    break
                                logging.info("rejected by remote server, request {} again after "
                                             "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                time.sleep(60 * self.terminated_amount)
                                result = self.get_url_info(a["href"])
                            # NOTE(review): same potential False unpack as in
                            # get_historical_news — confirm.
                            date, article = result
                        self.is_article_prob = .5
                        if article != "":
                            related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                              name_code_dict)
                            self.db_obj.insert_data(self.db_name, self.col_name,
                                                    {"Date": date,
                                                     "Category": category_chn,
                                                     "Url": a["href"],
                                                     "Title": a["title"],
                                                     "Article": article,
                                                     "RelatedStockCodes": " ".join(related_stock_codes_list)})
                            # Publish the same payload to Redis for consumers.
                            self.redis_client.lpush(config.CACHE_NEWS_LIST_NAME, json.dumps(
                                {"Date": date,
                                 "Category": category_chn,
                                 "Url": a["href"],
                                 "Title": a["title"],
                                 "Article": article,
                                 "RelatedStockCodes": " ".join(related_stock_codes_list),
                                 "OriDB": config.DATABASE_NAME,
                                 "OriCOL": config.COLLECTION_NAME_CNSTOCK
                                 }
                            ))
                            logging.info("[SUCCESS] {} {} {}".format(date, a["title"], a["href"]))
                    # Mark as seen so the URL is not reprocessed on the next
                    # poll — NOTE(review): reconstructed placement; confirm it
                    # sits at this nesting level in the original file.
                    crawled_urls.append(a["href"])
            # logging.info("sleep {} secs then request {} again ... ".format(interval, url))
            time.sleep(interval)
# """
# Example-1:
# 爬取历史新闻数据
# """
# if __name__ == '__main__':
# import time
# import logging
# from Kite import config
# from Killua.denull import DeNull
# from Killua.deduplication import Deduplication
# from Gon.cnstockspyder import CnStockSpyder
#
# cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
# for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
# logging.info("start crawling {} ...".format(url_to_be_crawled))
# cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn)
# logging.info("finished ...")
# time.sleep(30)
#
# Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
# DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
# """
# Example-2:
# 爬取实时新闻数据
# """
# if __name__ == '__main__':
# import time, logging, threading
# from Kite import config
# from Kite.database import Database
# from Killua.denull import DeNull
# from Killua.deduplication import Deduplication
# from Gon.cnstockspyder import CnStockSpyder
#
# obj = Database()
# df = obj.get_data(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK, keys=["Date", "Category"])
#
# cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
# # 先补充历史数据,比如已爬取数据到2020-12-01,但是启动实时爬取程序在2020-12-23,则先
# # 自动补充爬取2020-12-02至2020-12-23的新闻数据
# for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
# # 查询type_chn的最近一条数据的时间
# latets_date_in_db = max(df[df.Category == type_chn]["Date"].to_list())
# cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn, start_date=latets_date_in_db)
#
# Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
# DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
#
# # 开启多线程并行实时爬取
# thread_list = []
# for url, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
# thread = threading.Thread(target=cnstock_spyder.get_realtime_news, args=(url, type_chn, 60))
# thread_list.append(thread)
# for thread in thread_list:
# thread.start()
# for thread in thread_list:
# thread.join()
================================================
FILE: legacy_v1/src/Gon/history_starter_cnstock.py
================================================
import __init__
import time
import logging
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Killua.buildstocknewsdb import GenStockNewsDB
from Gon.cnstockspyder import CnStockSpyder
# 1. Crawl historical news for every configured cnstock channel.
cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
    logging.info("start crawling {} ...".format(url_to_be_crawled))
    cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn)
    logging.info("finished ...")
    time.sleep(30)  # wait 30s between channels
# 2. De-duplicate the crawled history.
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
# 3. Drop rows containing null values from the history.
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
# 4. Build a new per-stock database: collect all news mentioning each stock
#    and label each item as positive ("利好"), negative ("利空") or
#    neutral ("中性").
gen_stock_news_db = GenStockNewsDB()
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
================================================
FILE: legacy_v1/src/Gon/history_starter_jrj.py
================================================
import __init__
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Killua.buildstocknewsdb import GenStockNewsDB
from Gon.jrjspyder import JrjSpyder
# Step 1: crawl the JRJ news history starting from 2015-01-01.
spider = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
spider.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ, start_date="2015-01-01")
# Step 2: de-duplicate the crawled history.
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
# Step 3: drop rows that contain null values.
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
# Step 4: build the per-stock news database, collecting every article that
# mentions each stock and tagging it 利好 / 利空 / 中性 (positive/negative/neutral).
stock_news_builder = GenStockNewsDB()
stock_news_builder.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
================================================
FILE: legacy_v1/src/Gon/history_starter_nbd.py
================================================
import __init__
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Killua.buildstocknewsdb import GenStockNewsDB
from Gon.nbdspyder import NbdSpyder
# Step 1: crawl the NBD news history, walking index pages from 684 down to 1.
spider = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
spider.get_historical_news(start_page=684)
# Step 2: de-duplicate the crawled history.
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
# Step 3: drop rows that contain null values.
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
# Step 4: build the per-stock news database, collecting every article that
# mentions each stock and tagging it 利好 / 利空 / 中性 (positive/negative/neutral).
stock_news_builder = GenStockNewsDB()
stock_news_builder.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
================================================
FILE: legacy_v1/src/Gon/history_starter_stock_price.py
================================================
import __init__
from Kite import config
from Gon.stockinfospyder import StockInfoSpyder
price_spider = StockInfoSpyder(config.STOCK_DATABASE_NAME, config.COLLECTION_NAME_STOCK_BASIC_INFO)
# An explicit range can be given, e.g.:
#     price_spider.get_historical_news(start_date="20150101", end_date="20201204")
# With no range, and with some data already in the database, crawling resumes
# from the day after the newest stored record up to today — e.g. if sh600000
# prices exist through 2020-12-03, data from 2020-12-04 onward is fetched.
price_spider.get_historical_news()
================================================
FILE: legacy_v1/src/Gon/ifengspyder.py
================================================
"""
凤凰财经网:https://finance.ifeng.com
上市公司:https://finance.ifeng.com/shanklist/1-62-83-
大盘评述:https://finance.ifeng.com/shanklist/1-62-85-
证券要闻:https://finance.ifeng.com/shanklist/1-62-84-
"""
================================================
FILE: legacy_v1/src/Gon/jrjspyder.py
================================================
"""
金融界:http://www.jrj.com.cn
股票频道全部新闻:http://stock.jrj.com.cn/xwk/202012/20201203_1.shtml
"""
import __init__
from spyder import Spyder
from Kite import utils
from Kite import config
from Kite.database import Database
from Leorio.tokenization import Tokenization
import time
import json
import redis
import datetime
import logging
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class JrjSpyder(Spyder):
    """Spider for JRJ (金融界, http://www.jrj.com.cn) stock-channel news.

    Articles are stored in MongoDB with fields Date / Url / Title / Article /
    RelatedStockCodes.  URLs already saved today are cached in a redis list so
    that the realtime poller does not re-insert them after a restart.
    """

    def __init__(self, database_name, collection_name):
        super(JrjSpyder, self).__init__()
        self.db_obj = Database()
        self.col = self.db_obj.conn[database_name].get_collection(collection_name)
        # Counts consecutive failed requests for the URL currently being retried.
        self.terminated_amount = 0
        self.db_name = database_name
        self.col_name = collection_name
        self.tokenization = Tokenization(import_module="jieba", user_dict=config.USER_DEFINED_DICT_PATH)
        self.redis_client = redis.StrictRedis(host=config.REDIS_IP,
                                              port=config.REDIS_PORT,
                                              db=config.CACHE_NEWS_REDIS_DB_ID)

    def get_url_info(self, url, specific_date):
        """Fetch one article page and return ``[date, article_text]``.

        Returns ``False`` when the page cannot be fetched or parsed.  If the
        page carries no date marker span, ``specific_date`` is used instead.
        """
        try:
            bs = utils.html_parser(url)
        except Exception:
            return False
        date = ""
        for span in bs.find_all("span"):
            # The publication date is wrapped in a marker span on JRJ pages.
            if span.contents[0] == "jrj_final_date_start":
                date = span.text.replace("\r", "").replace("\n", "")
                break
        if date == "":
            date = specific_date
        article = ""
        for p in bs.find_all("p"):
            # Keep only plain body paragraphs: no navigation markers, form
            # inputs, highlighted ("red") links, icons or nested spans.
            if not p.find_all("jrj_final_daohang_start") and p.attrs == {} and \
                    not p.find_all("input") and not p.find_all("a", attrs={"class": "red"}) and not p.find_all("i") and not p.find_all("span"):
                # if p.contents[0] != "jrj_final_daohang_start1" and p.attrs == {} and \
                #         not p.find_all("input") and not p.find_all("a", attrs={"class": "red"}) and not p.find_all("i"):
                article += p.text.replace("\r", "").replace("\n", "").replace("\u3000", "")
        return [date, article]

    def get_historical_news(self, url, start_date=None, end_date=None):
        """Crawl historical JRJ news between ``start_date`` and ``end_date``.

        Dates are "YYYY-MM-DD" strings.  When ``start_date`` is None, crawling
        resumes from the day after the newest Date stored in the collection
        (or ``config.JRJ_REQUEST_DEFAULT_DATE`` for an empty collection); when
        ``end_date`` is None it defaults to today.
        """
        name_code_df = self.db_obj.get_data(config.STOCK_DATABASE_NAME,
                                            config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                            keys=["name", "code"])
        name_code_dict = dict(name_code_df.values)
        crawled_urls_list = []
        if end_date is None:
            end_date = datetime.datetime.now().strftime("%Y-%m-%d")
        if start_date is None:
            # Resume right after the newest stored date, e.g.
            #   history_latest_date_str -> "2020-12-08"  =>  start_date -> "2020-12-09"
            history_latest_date_list = self.db_obj.get_data(self.db_name,
                                                            self.col_name,
                                                            keys=["Date"])["Date"].to_list()
            if len(history_latest_date_list) != 0:
                history_latest_date_str = max(history_latest_date_list).split(" ")[0]
                history_latest_date_dt = datetime.datetime.strptime(history_latest_date_str, "%Y-%m-%d").date()
                offset = datetime.timedelta(days=1)
                start_date = (history_latest_date_dt + offset).strftime('%Y-%m-%d')
            else:
                start_date = config.JRJ_REQUEST_DEFAULT_DATE
        dates_list = utils.get_date_list_from_range(start_date, end_date)
        dates_separated_into_ranges_list = utils.gen_dates_list(dates_list, config.JRJ_DATE_RANGE)
        for dates_range in dates_separated_into_ranges_list:
            for date in dates_range:
                # Index pages are named <url>/YYYYMM/YYYYMMDD_<page>.shtml.
                first_url = "{}/{}/{}_1.shtml".format(url, date.replace("-", "")[0:6], date.replace("-", ""))
                max_pages_num = utils.search_max_pages_num(first_url, date)
                for num in range(1, max_pages_num + 1):
                    _url = "{}/{}/{}_{}.shtml".format(url, date.replace("-", "")[0:6], date.replace("-", ""), str(num))
                    bs = utils.html_parser(_url)
                    a_list = bs.find_all("a")
                    for a in a_list:
                        if "href" in a.attrs and a.string and \
                                a["href"].find("/{}/{}/".format(date.replace("-", "")[:4],
                                                                date.replace("-", "")[4:6])) != -1:
                            if a["href"] not in crawled_urls_list:
                                # Skip titles containing "收盘" (market close), "报于" (quoted at)
                                # or "新三板挂牌上市": such items are mostly machine-generated.
                                if a.string.find("收盘") == -1 and a.string.find("报于") == -1 and \
                                        a.string.find("新三板挂牌上市") == -1:
                                    result = self.get_url_info(a["href"], date)
                                    while not result:
                                        self.terminated_amount += 1
                                        if self.terminated_amount > config.JRJ_MAX_REJECTED_AMOUNTS:
                                            # Persist URLs that keep failing so they can be retried later.
                                            with open(config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                                file.write("{}\n".format(a["href"]))
                                            logging.info("rejected by remote server longer than {} minutes, "
                                                         "and the failed url has been written in path {}"
                                                         .format(config.JRJ_MAX_REJECTED_AMOUNTS,
                                                                 config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH))
                                            break
                                        logging.info("rejected by remote server, request {} again after "
                                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                        time.sleep(60 * self.terminated_amount)
                                        result = self.get_url_info(a["href"], date)
                                    if not result:
                                        # Gave up on this URL entirely.
                                        logging.info("[FAILED] {} {}".format(a.string, a["href"]))
                                    else:
                                        # Request succeeded but the body may be empty: lower the
                                        # is_article_prob threshold step by step and refetch.
                                        article_specific_date, article = result
                                        while article == "" and self.is_article_prob >= .1:
                                            self.is_article_prob -= .1
                                            result = self.get_url_info(a["href"], date)
                                            while not result:
                                                self.terminated_amount += 1
                                                if self.terminated_amount > config.JRJ_MAX_REJECTED_AMOUNTS:
                                                    # Persist URLs that keep failing so they can be retried later.
                                                    with open(config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                                        file.write("{}\n".format(a["href"]))
                                                    logging.info("rejected by remote server longer than {} minutes, "
                                                                 "and the failed url has been written in path {}"
                                                                 .format(config.JRJ_MAX_REJECTED_AMOUNTS,
                                                                         config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH))
                                                    break
                                                logging.info("rejected by remote server, request {} again after "
                                                             "{} seconds...".format(a["href"],
                                                                                    60 * self.terminated_amount))
                                                time.sleep(60 * self.terminated_amount)
                                                result = self.get_url_info(a["href"], date)
                                            # NOTE(review): if the retry loop above exits via `break`,
                                            # `result` is still False and this unpacking raises TypeError
                                            # — confirm whether that path can occur in practice.
                                            article_specific_date, article = result
                                        self.is_article_prob = .5
                                        if article != "":
                                            related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                                              name_code_dict)
                                            data = {"Date": article_specific_date,
                                                    "Url": a["href"],
                                                    "Title": a.string,
                                                    "Article": article,
                                                    "RelatedStockCodes": " ".join(related_stock_codes_list)}
                                            # self.col.insert_one(data)
                                            self.db_obj.insert_data(self.db_name, self.col_name, data)
                                            logging.info("[SUCCESS] {} {} {}".format(article_specific_date,
                                                                                     a.string,
                                                                                     a["href"]))
                                            self.terminated_amount = 0  # reset the failure counter after a successful save
                                        else:
                                            logging.info("[QUIT] {}".format(a.string))

    def get_realtime_news(self, interval=60):
        """Poll today's JRJ index pages every ``interval`` seconds.

        New articles are written to MongoDB and also pushed onto the redis
        queue ``config.CACHE_NEWS_LIST_NAME`` for downstream consumers.  URLs
        already handled today are remembered in the redis list
        ``config.CACHE_SAVED_NEWS_JRJ_TODAY_VAR_NAME``, which is drained when
        the calendar date changes.  Runs forever.
        """
        name_code_df = self.db_obj.get_data(config.STOCK_DATABASE_NAME,
                                            config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                            keys=["name", "code"])
        name_code_dict = dict(name_code_df.values)
        # crawled_urls_list = []
        is_change_date = False
        last_date = datetime.datetime.now().strftime("%Y-%m-%d")
        while True:
            today_date = datetime.datetime.now().strftime("%Y-%m-%d")
            if today_date != last_date:
                is_change_date = True
                last_date = today_date
            if is_change_date:
                # A new day started: empty the redis cache of today's saved URLs.
                # crawled_urls_list = []
                utils.batch_lpop(self.redis_client,
                                 config.CACHE_SAVED_NEWS_JRJ_TODAY_VAR_NAME,
                                 self.redis_client.llen(config.CACHE_SAVED_NEWS_JRJ_TODAY_VAR_NAME))
                is_change_date = False
            _url = "{}/{}/{}_1.shtml".format(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ,
                                             today_date.replace("-", "")[0:6],
                                             today_date.replace("-", ""))
            max_pages_num = utils.search_max_pages_num(_url, today_date)
            for num in range(1, max_pages_num + 1):
                _url = "{}/{}/{}_{}.shtml".format(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ,
                                                  today_date.replace("-", "")[0:6],
                                                  today_date.replace("-", ""),
                                                  str(num))
                bs = utils.html_parser(_url)
                a_list = bs.find_all("a")
                for a in a_list:
                    if "href" in a.attrs and a.string and \
                            a["href"].find("/{}/{}/".format(today_date.replace("-", "")[:4],
                                                            today_date.replace("-", "")[4:6])) != -1:
                        # if a["href"] not in crawled_urls_list:
                        # NOTE(review): redis lrange returns a list of bytes while a["href"]
                        # is str, so this membership test appears to never be True unless the
                        # client uses decode_responses — verify against the redis configuration.
                        if a["href"] not in self.redis_client.lrange(config.CACHE_SAVED_NEWS_JRJ_TODAY_VAR_NAME, 0, -1):
                            # Skip titles containing "收盘" (market close), "报于" (quoted at)
                            # or "新三板挂牌上市": such items are mostly machine-generated.
                            if a.string.find("收盘") == -1 and a.string.find("报于") == -1 and \
                                    a.string.find("新三板挂牌上市") == -1:
                                result = self.get_url_info(a["href"], today_date)
                                while not result:
                                    self.terminated_amount += 1
                                    if self.terminated_amount > config.JRJ_MAX_REJECTED_AMOUNTS:
                                        # Persist URLs that keep failing so they can be retried later.
                                        with open(config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                            file.write("{}\n".format(a["href"]))
                                        logging.info("rejected by remote server longer than {} minutes, "
                                                     "and the failed url has been written in path {}"
                                                     .format(config.JRJ_MAX_REJECTED_AMOUNTS,
                                                             config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH))
                                        break
                                    logging.info("rejected by remote server, request {} again after "
                                                 "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                    time.sleep(60 * self.terminated_amount)
                                    result = self.get_url_info(a["href"], today_date)
                                if not result:
                                    # Gave up on this URL entirely.
                                    logging.info("[FAILED] {} {}".format(a.string, a["href"]))
                                else:
                                    # Request succeeded but the body may be empty: lower the
                                    # is_article_prob threshold step by step and refetch.
                                    article_specific_date, article = result
                                    while article == "" and self.is_article_prob >= .1:
                                        self.is_article_prob -= .1
                                        result = self.get_url_info(a["href"], today_date)
                                        while not result:
                                            self.terminated_amount += 1
                                            if self.terminated_amount > config.JRJ_MAX_REJECTED_AMOUNTS:
                                                # Persist URLs that keep failing so they can be retried later.
                                                with open(config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                                    file.write("{}\n".format(a["href"]))
                                                logging.info("rejected by remote server longer than {} minutes, "
                                                             "and the failed url has been written in path {}"
                                                             .format(config.JRJ_MAX_REJECTED_AMOUNTS,
                                                                     config.RECORD_JRJ_FAILED_URL_TXT_FILE_PATH))
                                                break
                                            logging.info("rejected by remote server, request {} again after "
                                                         "{} seconds...".format(a["href"],
                                                                                60 * self.terminated_amount))
                                            time.sleep(60 * self.terminated_amount)
                                            result = self.get_url_info(a["href"], today_date)
                                        # NOTE(review): as in get_historical_news, a `break` above
                                        # leaves `result` False and this unpacking raises TypeError.
                                        article_specific_date, article = result
                                    self.is_article_prob = .5
                                    if article != "":
                                        related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                                          name_code_dict)
                                        self.db_obj.insert_data(self.db_name, self.col_name,
                                                                {"Date": article_specific_date,
                                                                 "Url": a["href"],
                                                                 "Title": a.string,
                                                                 "Article": article,
                                                                 "RelatedStockCodes": " ".join(related_stock_codes_list)})
                                        # Also publish to the shared realtime queue, tagged with origin DB/collection.
                                        self.redis_client.lpush(config.CACHE_NEWS_LIST_NAME, json.dumps(
                                            {"Date": article_specific_date,
                                             "Url": a["href"],
                                             "Title": a.string,
                                             "Article": article,
                                             "RelatedStockCodes": " ".join(related_stock_codes_list),
                                             "OriDB": config.DATABASE_NAME,
                                             "OriCOL": config.COLLECTION_NAME_JRJ
                                             }
                                        ))
                                        logging.info("[SUCCESS] {} {} {}".format(article_specific_date,
                                                                                 a.string,
                                                                                 a["href"]))
                                        self.terminated_amount = 0  # reset the failure counter after a successful save
                                    else:
                                        logging.info("[QUIT] {}".format(a.string))
                            # Remember this URL for the rest of the day, whether saved or skipped.
                            # crawled_urls_list.append(a["href"])
                            self.redis_client.lpush(config.CACHE_SAVED_NEWS_JRJ_TODAY_VAR_NAME, a["href"])
            # logging.info("sleep {} secs then request again ... ".format(interval))
            time.sleep(interval)
# """
# Example-1:
# 爬取历史新闻数据
# """
# if __name__ == "__main__":
# jrj_spyder = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
# jrj_spyder.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ, start_date="2015-01-01")
#
# Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
# DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
# """
# Example-2:
# 爬取实时新闻数据
# """
# if __name__ == '__main__':
# from Kite import config
# from Gon.jrjspyder import JrjSpyder
#
# jrj_spyder = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
# jrj_spyder.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ) # 补充爬虫数据到最新日期
# jrj_spyder.get_realtime_news()
================================================
FILE: legacy_v1/src/Gon/kill_realtime_spyder_tasks.py
================================================
import __init__
import os
import wmi
import redis
import logging
from Kite import config
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class KillPyTasks(object):
    """Kill every python process whose command line matches a script name
    recorded in the redis bookkeeping list, then clear that list.
    """

    def __init__(self):
        self.redis_client = redis.StrictRedis(config.REDIS_IP,
                                              port=config.REDIS_PORT,
                                              db=config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID)
        recorded_count = self.redis_client.llen(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR)
        for _id in range(recorded_count):
            script_name = self.redis_client.lindex(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR, _id).decode()
            for process in self.get_python_process(param=script_name):
                self.killtask(process.Handle)
                self.print_pid_info(process)
        # Drain the bookkeeping list now that the recorded tasks are gone.
        for _ in range(self.redis_client.llen(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR)):
            self.redis_client.lpop(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR)

    @staticmethod
    def killtask(pid):
        # Force-kill the process (and its children via -t) on Windows.
        os.system(f"taskkill /F /pid {pid} -t")

    @staticmethod
    def get_python_process(prop="python.exe", param=None):
        """Return WMI process objects named `prop`; when `param` is given,
        keep only those whose command line contains it."""
        matches = []
        for process in wmi.WMI().Win32_Process(name=prop):
            if param is None or str(process.CommandLine).find(param) >= 0:
                matches.append(process)
        return matches

    @staticmethod
    def print_pid_info(process):
        # Log handle, caption and full command line of a killed process.
        logging.info("{} | {} | {} -> killed ... ".format(process.Handle, process.Caption, process.CommandLine))


if __name__ == "__main__":
    KillPyTasks()
================================================
FILE: legacy_v1/src/Gon/money163spyder.py
================================================
"""
网易财经网:https://money.163.com
个股资讯:http://money.163.com/special/g/00251LR5/gptj.html
市场资讯:http://money.163.com/special/00251LR5/cpznList.html
行业板块:http://money.163.com/special/00251LJV/hyyj.html
"""
================================================
FILE: legacy_v1/src/Gon/nbdspyder.py
================================================
"""
每经网:http://www.nbd.com.cn
A股动态:http://stocks.nbd.com.cn/columns/275/page/1
"""
import __init__
from spyder import Spyder
from Kite import utils
from Kite import config
from Kite.database import Database
from Leorio.tokenization import Tokenization
import re
import time
import json
import redis
import logging
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class NbdSpyder(Spyder):
    """Spider for NBD (每经网, http://www.nbd.com.cn) A-share news.

    Articles are stored in MongoDB with fields Date / Url / Title / Article /
    RelatedStockCodes; URLs already handled by the realtime poller are cached
    in a redis list so a restart does not re-insert them.
    """

    def __init__(self, database_name, collection_name):
        super(NbdSpyder, self).__init__()
        self.db_obj = Database()
        self.col = self.db_obj.conn[database_name].get_collection(collection_name)
        # Counts consecutive failed requests for the URL currently being retried.
        self.terminated_amount = 0
        self.db_name = database_name
        self.col_name = collection_name
        self.tokenization = Tokenization(import_module="jieba", user_dict=config.USER_DEFINED_DICT_PATH)
        self.redis_client = redis.StrictRedis(host=config.REDIS_IP,
                                              port=config.REDIS_PORT,
                                              db=config.CACHE_NEWS_REDIS_DB_ID)

    def get_url_info(self, url):
        """Fetch one article page and return ``[date, article_text]``.

        Returns ``False`` when the page cannot be fetched or parsed.  The date
        is assembled from the span with class "time"; paragraphs are kept when
        their Chinese-character ratio (utils.count_chn) exceeds
        ``self.is_article_prob``, then residual HTML tags and full-width
        spaces are stripped.
        """
        try:
            bs = utils.html_parser(url)
        except Exception:
            return False
        span_list = bs.find_all("span")
        part = bs.find_all("p")
        article = ""
        date = ""
        for span in span_list:
            if "class" in span.attrs and span.text and span["class"] == ["time"]:
                string = span.text.split()
                for dt in string:
                    # Date fragment looks like "YYYY-MM-DD", time fragment like "HH:MM".
                    if dt.find("-") != -1:
                        date += dt + " "
                    elif dt.find(":") != -1:
                        date += dt
                break
        for paragraph in part:
            chn_status = utils.count_chn(str(paragraph))
            possible = chn_status[1]
            if possible > self.is_article_prob:
                article += str(paragraph)
        # Strip any remaining <...> tag fragments from the concatenated markup.
        while article.find("<") != -1 and article.find(">") != -1:
            string = article[article.find("<"):article.find(">")+1]
            article = article.replace(string, "")
        while article.find("\u3000") != -1:
            article = article.replace("\u3000", "")
        article = " ".join(re.split(" +|\n+", article)).strip()
        return [date, article]

    def get_historical_news(self, start_page=684):
        """Crawl NBD historical news.

        With an empty collection, index pages ``start_page`` down to 1 are
        crawled in full.  Otherwise, pages are walked from 1 upward and
        crawling stops at the first article no newer than the latest stored
        Date (incremental catch-up).
        """
        date_list = self.db_obj.get_data(self.db_name, self.col_name, keys=["Date"])["Date"].to_list()
        name_code_df = self.db_obj.get_data(config.STOCK_DATABASE_NAME,
                                            config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                            keys=["name", "code"])
        name_code_dict = dict(name_code_df.values)
        if len(date_list) == 0:
            # No history yet: crawl everything from start_page down to page 1.
            crawled_urls_list = []
            page_urls = ["{}/{}".format(config.WEBSITES_LIST_TO_BE_CRAWLED_NBD, page_id)
                         for page_id in range(start_page, 0, -1)]
            for page_url in page_urls:
                bs = utils.html_parser(page_url)
                a_list = bs.find_all("a")
                for a in a_list:
                    if "click-statistic" in a.attrs and a.string \
                            and a["click-statistic"].find("Article_") != -1 \
                            and a["href"].find("http://www.nbd.com.cn/articles/") != -1:
                        if a["href"] not in crawled_urls_list:
                            result = self.get_url_info(a["href"])
                            while not result:
                                self.terminated_amount += 1
                                if self.terminated_amount > config.NBD_MAX_REJECTED_AMOUNTS:
                                    # Persist URLs that keep failing so they can be retried later.
                                    with open(config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                        file.write("{}\n".format(a["href"]))
                                    logging.info("rejected by remote server longer than {} minutes, "
                                                 "and the failed url has been written in path {}"
                                                 .format(config.NBD_MAX_REJECTED_AMOUNTS,
                                                         config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH))
                                    break
                                logging.info("rejected by remote server, request {} again after "
                                             "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                time.sleep(60 * self.terminated_amount)
                                result = self.get_url_info(a["href"])
                            if not result:
                                # Gave up on this URL entirely.
                                logging.info("[FAILED] {} {}".format(a.string, a["href"]))
                            else:
                                # Request succeeded but the body may be empty: lower the
                                # is_article_prob threshold step by step and refetch.
                                date, article = result
                                while article == "" and self.is_article_prob >= .1:
                                    self.is_article_prob -= .1
                                    result = self.get_url_info(a["href"])
                                    while not result:
                                        self.terminated_amount += 1
                                        if self.terminated_amount > config.NBD_MAX_REJECTED_AMOUNTS:
                                            # Persist URLs that keep failing so they can be retried later.
                                            with open(config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                                file.write("{}\n".format(a["href"]))
                                            logging.info("rejected by remote server longer than {} minutes, "
                                                         "and the failed url has been written in path {}"
                                                         .format(config.NBD_MAX_REJECTED_AMOUNTS,
                                                                 config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH))
                                            break
                                        logging.info("rejected by remote server, request {} again after "
                                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                        time.sleep(60 * self.terminated_amount)
                                        result = self.get_url_info(a["href"])
                                    # NOTE(review): if the retry loop above exits via `break`,
                                    # `result` is still False and this unpacking raises TypeError.
                                    date, article = result
                                self.is_article_prob = .5
                                if article != "":
                                    related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                                      name_code_dict)
                                    data = {"Date": date,
                                            # "PageId": page_url.split("/")[-1],
                                            "Url": a["href"],
                                            "Title": a.string,
                                            "Article": article,
                                            "RelatedStockCodes": " ".join(related_stock_codes_list)}
                                    # self.col.insert_one(data)
                                    self.db_obj.insert_data(self.db_name, self.col_name, data)
                                    logging.info("[SUCCESS] {} {} {}".format(date, a.string, a["href"]))
        else:
            # Incremental catch-up: walk pages from 1 upward until an article
            # no newer than the latest stored date is met.
            is_stop = False
            start_date = max(date_list)
            page_start_id = 1
            while not is_stop:
                page_url = "{}/{}".format(config.WEBSITES_LIST_TO_BE_CRAWLED_NBD, page_start_id)
                bs = utils.html_parser(page_url)
                a_list = bs.find_all("a")
                for a in a_list:
                    if "click-statistic" in a.attrs and a.string \
                            and a["click-statistic"].find("Article_") != -1 \
                            and a["href"].find("http://www.nbd.com.cn/articles/") != -1:
                        result = self.get_url_info(a["href"])
                        while not result:
                            self.terminated_amount += 1
                            if self.terminated_amount > config.NBD_MAX_REJECTED_AMOUNTS:
                                # Persist URLs that keep failing so they can be retried later.
                                with open(config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                    file.write("{}\n".format(a["href"]))
                                logging.info("rejected by remote server longer than {} minutes, "
                                             "and the failed url has been written in path {}"
                                             .format(config.NBD_MAX_REJECTED_AMOUNTS,
                                                     config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH))
                                break
                            logging.info("rejected by remote server, request {} again after "
                                         "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                            time.sleep(60 * self.terminated_amount)
                            result = self.get_url_info(a["href"])
                        if not result:
                            # Gave up on this URL entirely.
                            logging.info("[FAILED] {} {}".format(a.string, a["href"]))
                        else:
                            # Request succeeded but the body may be empty.
                            date, article = result
                            if date > start_date:
                                while article == "" and self.is_article_prob >= .1:
                                    self.is_article_prob -= .1
                                    result = self.get_url_info(a["href"])
                                    while not result:
                                        self.terminated_amount += 1
                                        if self.terminated_amount > config.NBD_MAX_REJECTED_AMOUNTS:
                                            # Persist URLs that keep failing so they can be retried later.
                                            with open(config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                                file.write("{}\n".format(a["href"]))
                                            logging.info("rejected by remote server longer than {} minutes, "
                                                         "and the failed url has been written in path {}"
                                                         .format(config.NBD_MAX_REJECTED_AMOUNTS,
                                                                 config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH))
                                            break
                                        logging.info("rejected by remote server, request {} again after "
                                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                        time.sleep(60 * self.terminated_amount)
                                        result = self.get_url_info(a["href"])
                                    date, article = result
                                self.is_article_prob = .5
                                if article != "":
                                    related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                                      name_code_dict)
                                    data = {"Date": date,
                                            "Url": a["href"],
                                            "Title": a.string,
                                            "Article": article,
                                            "RelatedStockCodes": " ".join(related_stock_codes_list)}
                                    self.db_obj.insert_data(self.db_name, self.col_name, data)
                                    logging.info("[SUCCESS] {} {} {}".format(date, a.string, a["href"]))
                            else:
                                # Reached already-stored history: stop paging.
                                is_stop = True
                                break
                if not is_stop:
                    page_start_id += 1

    def get_realtime_news(self, interval=60):
        """Poll the first NBD index page every ``interval`` seconds.

        Articles newer than the latest stored Date are written to MongoDB and
        pushed onto ``config.CACHE_NEWS_LIST_NAME``; handled URLs are cached
        in ``config.CACHE_SAVED_NEWS_NBD_TODAY_VAR_NAME`` (capped at 100
        entries).  Runs forever.  NOTE(review): ``latest_date`` is computed
        once before the loop and never refreshed — duplicates are only
        prevented by the redis URL cache afterwards; confirm this is intended.
        """
        page_url = "{}/1".format(config.WEBSITES_LIST_TO_BE_CRAWLED_NBD)
        logging.info("start real-time crawling of URL -> {}, request every {} secs ... ".format(page_url, interval))
        name_code_df = self.db_obj.get_data(config.STOCK_DATABASE_NAME,
                                            config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                            keys=["name", "code"])
        name_code_dict = dict(name_code_df.values)
        # crawled_urls = []
        date_list = self.db_obj.get_data(self.db_name, self.col_name, keys=["Date"])["Date"].to_list()
        latest_date = max(date_list)
        while True:
            # Poll the index page on a fixed interval.
            # if len(crawled_urls) > 100:
            #     # 防止list过长,内存消耗大,维持list在100条
            #     crawled_urls.pop(0)
            if self.redis_client.llen(config.CACHE_SAVED_NEWS_NBD_TODAY_VAR_NAME) > 100:
                # Keep the URL cache bounded at 100 entries to limit memory use.
                self.redis_client.rpop(config.CACHE_SAVED_NEWS_NBD_TODAY_VAR_NAME)
            bs = utils.html_parser(page_url)
            a_list = bs.find_all("a")
            for a in a_list:
                if "click-statistic" in a.attrs and a.string \
                        and a["click-statistic"].find("Article_") != -1 \
                        and a["href"].find("http://www.nbd.com.cn/articles/") != -1:
                    # if a["href"] not in crawled_urls:
                    # NOTE(review): redis lrange returns a list of bytes while a["href"]
                    # is str, so this membership test appears to never be True unless the
                    # client uses decode_responses — verify against the redis configuration.
                    if a["href"] not in self.redis_client.lrange(config.CACHE_SAVED_NEWS_NBD_TODAY_VAR_NAME, 0, -1):
                        result = self.get_url_info(a["href"])
                        while not result:
                            self.terminated_amount += 1
                            if self.terminated_amount > config.NBD_MAX_REJECTED_AMOUNTS:
                                # Persist URLs that keep failing so they can be retried later.
                                with open(config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                    file.write("{}\n".format(a["href"]))
                                logging.info("rejected by remote server longer than {} minutes, "
                                             "and the failed url has been written in path {}"
                                             .format(config.NBD_MAX_REJECTED_AMOUNTS,
                                                     config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH))
                                break
                            logging.info("rejected by remote server, request {} again after "
                                         "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                            time.sleep(60 * self.terminated_amount)
                            result = self.get_url_info(a["href"])
                        if not result:
                            # Gave up on this URL entirely.
                            logging.info("[FAILED] {} {}".format(a.string, a["href"]))
                        else:
                            # Request succeeded but the body may be empty.
                            date, article = result
                            if date > latest_date:
                                while article == "" and self.is_article_prob >= .1:
                                    self.is_article_prob -= .1
                                    result = self.get_url_info(a["href"])
                                    while not result:
                                        self.terminated_amount += 1
                                        if self.terminated_amount > config.NBD_MAX_REJECTED_AMOUNTS:
                                            # Persist URLs that keep failing so they can be retried later.
                                            with open(config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH, "a+") as file:
                                                file.write("{}\n".format(a["href"]))
                                            logging.info("rejected by remote server longer than {} minutes, "
                                                         "and the failed url has been written in path {}"
                                                         .format(config.NBD_MAX_REJECTED_AMOUNTS,
                                                                 config.RECORD_NBD_FAILED_URL_TXT_FILE_PATH))
                                            break
                                        logging.info("rejected by remote server, request {} again after "
                                                     "{} seconds...".format(a["href"], 60 * self.terminated_amount))
                                        time.sleep(60 * self.terminated_amount)
                                        result = self.get_url_info(a["href"])
                                    date, article = result
                                self.is_article_prob = .5
                                if article != "":
                                    related_stock_codes_list = self.tokenization.find_relevant_stock_codes_in_article(article,
                                                                                                                      name_code_dict)
                                    self.db_obj.insert_data(self.db_name, self.col_name,
                                                            {"Date": date,
                                                             # "PageId": page_url.split("/")[-1],
                                                             "Url": a["href"],
                                                             "Title": a.string,
                                                             "Article": article,
                                                             "RelatedStockCodes": " ".join(related_stock_codes_list)})
                                    # Also publish to the shared realtime queue, tagged with origin DB/collection.
                                    self.redis_client.lpush(config.CACHE_NEWS_LIST_NAME, json.dumps(
                                        {"Date": date,
                                         # "PageId": page_url.split("/")[-1],
                                         "Url": a["href"],
                                         "Title": a.string,
                                         "Article": article,
                                         "RelatedStockCodes": " ".join(related_stock_codes_list),
                                         "OriDB": config.DATABASE_NAME,
                                         "OriCOL": config.COLLECTION_NAME_NBD
                                         }
                                    ))
                                    # crawled_urls.append(a["href"])
                                    self.redis_client.lpush(config.CACHE_SAVED_NEWS_NBD_TODAY_VAR_NAME, a["href"])
                                    logging.info("[SUCCESS] {} {} {}".format(date, a.string, a["href"]))
            # logging.info("sleep {} secs then request again ... ".format(interval))
            time.sleep(interval)
# """
# Example-1:
# 爬取历史新闻数据
# """
# if __name__ == "__main__":
# nbd_spyder = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
# nbd_spyder.get_historical_news(start_page=684)
#
# Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
# DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
# """
# Example-2:
# 爬取实时新闻数据
# """
# if __name__ == '__main__':
# from Kite import config
#
# from Killua.denull import DeNull
# from Killua.deduplication import Deduplication
#
# from Gon.nbdspyder import NbdSpyder
#
# # 如果没有历史数据从头爬取,如果已爬取历史数据,则从最新的时间开始爬取
# # 如历史数据中最近的新闻时间是"2020-12-09 20:37:10",则从该时间开始爬取
# nbd_spyder = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
# nbd_spyder.get_historical_news()
#
# Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
# DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
#
# nbd_spyder.get_realtime_news()
================================================
FILE: legacy_v1/src/Gon/realtime_starter_cnstock.py
================================================
import __init__
import time
import redis
import logging
import threading
from Kite import config
from Kite.database import Database
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.cnstockspyder import CnStockSpyder
# Register this script in the redis bookkeeping list so that
# kill_realtime_spyder_tasks.py can locate and stop it later.
redis_client = redis.StrictRedis(config.REDIS_IP,
                                 port=config.REDIS_PORT,
                                 db=config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID)
redis_client.lpush(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR, "realtime_starter_cnstock.py")

obj = Database()
df = obj.get_data(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK, keys=["Date", "Category"])
cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
# Backfill first: e.g. if data was last crawled on 2020-12-01 and this program
# starts on 2020-12-23, the gap 2020-12-02 .. 2020-12-23 is crawled now.
for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
    # Latest stored date for this category.  An empty category would make
    # max() raise ValueError, so in that case let the spider use its own
    # default start date instead of crashing on startup.
    category_dates = df[df.Category == type_chn]["Date"].to_list()
    if category_dates:
        cnstock_spyder.get_historical_news(url_to_be_crawled,
                                           category_chn=type_chn,
                                           start_date=max(category_dates))
    else:
        cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn)

Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()

# Start one polling thread per category for real-time crawling.
thread_list = []
for url, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
    thread = threading.Thread(target=cnstock_spyder.get_realtime_news, args=(url, type_chn, 60))
    thread_list.append(thread)
for thread in thread_list:
    thread.start()
for thread in thread_list:
    thread.join()
================================================
FILE: legacy_v1/src/Gon/realtime_starter_jrj.py
================================================
import __init__
import redis
from Kite import config
from Gon.jrjspyder import JrjSpyder
# Register this script in the redis bookkeeping list so that
# kill_realtime_spyder_tasks.py can locate and stop it later.
bookkeeping_client = redis.StrictRedis(config.REDIS_IP,
                                       port=config.REDIS_PORT,
                                       db=config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID)
bookkeeping_client.lpush(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR, "realtime_starter_jrj.py")

spider = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
# Backfill the historical data up to the latest date, then poll forever.
spider.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ)
spider.get_realtime_news()
================================================
FILE: legacy_v1/src/Gon/realtime_starter_nbd.py
================================================
import __init__
import redis
from Kite import config
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Gon.nbdspyder import NbdSpyder
# Register this script in the redis bookkeeping list so that
# kill_realtime_spyder_tasks.py can locate and stop it later.
bookkeeping_client = redis.StrictRedis(config.REDIS_IP,
                                       port=config.REDIS_PORT,
                                       db=config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID)
bookkeeping_client.lpush(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR, "realtime_starter_nbd.py")

# Crawl from scratch when there is no history; otherwise resume from the
# newest stored timestamp (e.g. "2020-12-09 20:37:10") and catch up first.
spider = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
spider.get_historical_news()

# Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
# DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()

spider.get_realtime_news()
================================================
FILE: legacy_v1/src/Gon/realtime_starter_redis_queue.py
================================================
import __init__
import redis
from Kite import config
from Killua.buildstocknewsdb import GenStockNewsDB
# Register this script in the redis bookkeeping list so that
# kill_realtime_spyder_tasks.py can locate and stop it later.
bookkeeping_client = redis.StrictRedis(config.REDIS_IP,
                                       port=config.REDIS_PORT,
                                       db=config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID)
bookkeeping_client.lpush(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR, "realtime_starter_redis_queue.py")

# Consume the realtime news queue and route items into the per-stock database.
stock_news_builder = GenStockNewsDB()
stock_news_builder.listen_redis_queue()
================================================
FILE: legacy_v1/src/Gon/realtime_starter_stock_price.py
================================================
# Entry point: keep daily stock price data current (updates after market close).
import __init__
import redis
from Kite import config
from Gon.stockinfospyder import StockInfoSpyder

# Register this script's name in redis so running programs can be tracked.
redis_client = redis.StrictRedis(config.REDIS_IP,
                                 port=config.REDIS_PORT,
                                 db=config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID)
redis_client.lpush(config.CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR, "realtime_starter_stock_price.py")

stock_info_spyder = StockInfoSpyder(config.STOCK_DATABASE_NAME, config.COLLECTION_NAME_STOCK_BASIC_INFO)
stock_info_spyder.get_realtime_news()
================================================
FILE: legacy_v1/src/Gon/sinaspyder.py
================================================
"""
新浪财经网:https://finance.sina.com.cn
公司要闻:https://finance.sina.com.cn/roll/index.d.html?cid=56592&page=1
个股点评:https://finance.sina.com.cn/roll/index.d.html?cid=56588&page=1
大盘评述:https://finance.sina.com.cn/roll/index.d.html?cid=56589&page=1
公司研究:http://stock.finance.sina.com.cn/stock/go.php/vReport_List/kind/company/index.phtml?p=1
市场研究:https://finance.sina.com.cn/roll/index.d.html?cid=56605&page=1
主力动向:https://finance.sina.com.cn/roll/index.d.html?cid=56615&page=1
行业研究:http://stock.finance.sina.com.cn/stock/go.php/vReport_List/kind/industry/index.phtml?p=1
投资策略:http://stock.finance.sina.com.cn/stock/go.php/vReport_List/kind/strategy/index.phtml?p=1
"""
import __init__
from spyder import Spyder
================================================
FILE: legacy_v1/src/Gon/spyder.py
================================================
class Spyder(object):
    """Base class for the site-specific news crawlers.

    Subclasses are expected to assign ``self.col`` (a pymongo collection)
    before the query helpers below are used.
    """

    def __init__(self):
        # Minimum Chinese-character ratio for a page fragment to be treated
        # as article text by subclasses.
        self.is_article_prob = .5

    def extract_data(self, tag_list):
        """Return the distinct values stored under each field in tag_list.

        :param tag_list: list of document field names, e.g. ["Date", "Url"]
        :return: list of lists, one list of distinct values per field
        """
        # Previously built and exec()'d source strings per tag; a plain
        # comprehension does the same job without dynamic code execution.
        return [self.col.distinct(tag) for tag in tag_list]

    def query_news(self, _key, param):
        """Fuzzy query: find documents whose `_key` field contains `param`."""
        return self.col.find({_key: {'$regex': ".*{}.*".format(param)}})

    def get_url_info(self, url):
        # Implemented by subclasses.
        pass

    def get_historical_news(self, url):
        # Implemented by subclasses.
        pass

    def get_realtime_news(self, url):
        # Implemented by subclasses.
        pass
================================================
FILE: legacy_v1/src/Gon/stockinfospyder.py
================================================
"""
https://www.akshare.xyz/zh_CN/latest/
"""
import __init__
import os
import time
import redis
import logging
import datetime
from spyder import Spyder
from pandas._libs.tslibs.timestamps import Timestamp
from Kite.database import Database
from Kite import config
import akshare as ak
import tushare as ts
ts.set_token(config.TUSHARE_TOKEN)
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class StockInfoSpyder(Spyder):
    """Crawl A-share basic info and daily price data via akshare/tushare.

    Prices are written one Mongo collection per stock symbol (e.g. "sh600000");
    redis caches the latest stored date per symbol plus resume bookkeeping
    ("start_stock_code", "today_date", "is_today_updated").
    """

    def __init__(self, database_name, collection_name):
        super(StockInfoSpyder, self).__init__()
        self.db_obj = Database()
        self.col_basic_info = self.db_obj.get_collection(database_name, collection_name)
        self.database_name = database_name
        self.collection_name = collection_name
        self.start_program_date = datetime.datetime.now().strftime("%Y%m%d")
        # Fixed: use the shared config values instead of hard-coded
        # "localhost"/6379 so this class honours the same redis settings
        # as the rest of the codebase.
        self.redis_client = redis.StrictRedis(host=config.REDIS_IP,
                                              port=config.REDIS_PORT,
                                              db=config.REDIS_CLIENT_FOR_CACHING_STOCK_INFO_DB_ID)
        self.redis_client.set("today_date", datetime.datetime.now().strftime("%Y-%m-%d"))

    def get_stock_code_info(self):
        """Populate the basic-info collection with {symbol, code, name} docs.

        TODO: needs a refresh roughly every six months.
        """
        stock_info_df = ak.stock_info_a_code_name()  # all A-share codes and names
        stock_symbol_code = ak.stock_zh_a_spot().get(["symbol", "code"])  # symbol <-> code mapping
        for _id in range(stock_info_df.shape[0]):
            _symbol = stock_symbol_code[stock_symbol_code.code == stock_info_df.iloc[_id].code].symbol.values
            if len(_symbol) != 0:
                _dict = {"symbol": _symbol[0]}
                _dict.update(stock_info_df.iloc[_id].to_dict())
                self.col_basic_info.insert_one(_dict)

    def get_historical_news(self, start_date=None, end_date=None, freq="day"):
        """Download price history for every known symbol up to end_date.

        :param start_date: "yyyymmdd"; when None, resume per symbol from the
            day after the latest date cached in redis (or the configured
            default start date when nothing is cached)
        :param end_date: "yyyymmdd"; defaults to today
        :param freq: only "day" is implemented; other frequencies are stubs
        """
        if end_date is None:
            end_date = datetime.datetime.now().strftime("%Y%m%d")
        stock_symbol_list = self.col_basic_info.distinct("symbol")
        if len(stock_symbol_list) == 0:
            # Basic info collection is empty: bootstrap it first.
            self.get_stock_code_info()
            stock_symbol_list = self.col_basic_info.distinct("symbol")
        if freq == "day":
            # Resume point: numeric part of the last symbol fully processed.
            start_stock_code = 0 if self.redis_client.get("start_stock_code") is None \
                else int(self.redis_client.get("start_stock_code").decode())
            for symbol in stock_symbol_list:
                if int(symbol[2:]) > start_stock_code:
                    if start_date is None:
                        _latest_date = self.redis_client.get(symbol)
                        if _latest_date is None:
                            symbol_start_date = config.STOCK_PRICE_REQUEST_DEFAULT_DATE
                        else:
                            tmp_date_dt = datetime.datetime.strptime(_latest_date.decode(), "%Y-%m-%d").date()
                            offset = datetime.timedelta(days=1)
                            symbol_start_date = (tmp_date_dt + offset).strftime('%Y%m%d')
                    else:
                        # Fixed: an explicit start_date previously left
                        # symbol_start_date unassigned (NameError).
                        symbol_start_date = start_date
                    if symbol_start_date < end_date:
                        # NOTE(review): variable says "hfq" but adjust="qfq"
                        # (forward adjustment) is requested — confirm intent.
                        stock_zh_a_daily_hfq_df = ak.stock_zh_a_daily(symbol=symbol,
                                                                      start_date=symbol_start_date,
                                                                      end_date=end_date,
                                                                      adjust="qfq")
                        # Move the date index into a regular column.
                        stock_zh_a_daily_hfq_df.insert(0, 'date', stock_zh_a_daily_hfq_df.index.tolist())
                        stock_zh_a_daily_hfq_df.index = range(len(stock_zh_a_daily_hfq_df))
                        _col = self.db_obj.get_collection(self.database_name, symbol)
                        _tmp_dict = {}
                        for _id in range(stock_zh_a_daily_hfq_df.shape[0]):
                            _tmp_dict = stock_zh_a_daily_hfq_df.iloc[_id].to_dict()
                            _tmp_dict.pop("outstanding_share")
                            _tmp_dict.pop("turnover")
                            _col.insert_one(_tmp_dict)
                        if _tmp_dict:
                            # Cache the newest stored date for resuming later.
                            self.redis_client.set(symbol, str(_tmp_dict["date"]).split(" ")[0])
                        logging.info("{} finished saving from {} to {} ... ".format(symbol, symbol_start_date, end_date))
                    self.redis_client.set("start_stock_code", int(symbol[2:]))
            # All symbols done: reset the resume marker.
            self.redis_client.set("start_stock_code", 0)
        elif freq == "week":
            pass
        elif freq == "month":
            pass
        elif freq == "5mins":
            pass
        elif freq == "15mins":
            pass
        elif freq == "30mins":
            pass
        elif freq == "60mins":
            pass

    def get_realtime_news(self, freq="day"):
        """Keep daily price data current.

        Asks the operator whether today's snapshot is already stored, backfills
        history, then loops forever writing the daily bar after the 15:30
        market close.
        """
        while True:
            if_updated = input("Has the stock price dataset been updated today? (Y/N) \n")
            if if_updated == "Y":
                self.redis_client.set("is_today_updated", "1")
                break
            elif if_updated == "N":
                self.redis_client.set("is_today_updated", "")
                break
        self.get_historical_news()  # backfill all symbols to the latest date
        while True:
            if freq == "day":
                time_now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                if time_now.split(" ")[0] != self.redis_client.get("today_date").decode():
                    # Past midnight: clear the flag so today gets an update.
                    self.redis_client.set("today_date", time_now.split(" ")[0])
                    self.redis_client.set("is_today_updated", "")
                if not bool(self.redis_client.get("is_today_updated").decode()):
                    update_time = "{} {}".format(time_now.split(" ")[0], "15:30:00")
                    if time_now >= update_time:
                        stock_zh_a_spot_df = ak.stock_zh_a_spot()  # today's full snapshot
                        for _id, sym in enumerate(stock_zh_a_spot_df["symbol"]):
                            _col = self.db_obj.get_collection(self.database_name, sym)
                            _tmp_dict = {}
                            _tmp_dict.update({"date": Timestamp("{} 00:00:00".format(time_now.split(" ")[0]))})
                            _tmp_dict.update({"open": stock_zh_a_spot_df.iloc[_id].open})
                            _tmp_dict.update({"high": stock_zh_a_spot_df.iloc[_id].high})
                            _tmp_dict.update({"low": stock_zh_a_spot_df.iloc[_id].low})
                            _tmp_dict.update({"close": stock_zh_a_spot_df.iloc[_id].trade})
                            _tmp_dict.update({"volume": stock_zh_a_spot_df.iloc[_id].volume})
                            _col.insert_one(_tmp_dict)
                            self.redis_client.set(sym, time_now.split(" ")[0])
                            logging.info("finished updating {} price data of {} ... ".format(sym, time_now.split(" ")[0]))
                        self.redis_client.set("is_today_updated", "1")
                        # TODO: after prices update, refresh the per-stock news labels.
# if __name__ == "__main__":
# from Kite import config
# from Gon.stockinfospyder import StockInfoSpyder
#
# stock_info_spyder = StockInfoSpyder(config.STOCK_DATABASE_NAME, config.COLLECTION_NAME_STOCK_BASIC_INFO)
#
# # 指定时间段,获取历史数据,如:stock_info_spyder.get_historical_news(start_date="20150101", end_date="20201204")
# # 如果没有指定时间段,且数据库已存在部分数据,则从最新的数据时间开始获取直到现在,比如数据库里已有sh600000价格数据到
# # 2020-12-03号,如不设定具体时间,则从自动获取sh600000自2020-12-04至当前的价格数据
# # stock_info_spyder.get_historical_news()
#
# # 开启自动化更新所有股票价格数据(目前只支持在15:30分后更新日数据)
# stock_info_spyder.get_realtime_news()
================================================
FILE: legacy_v1/src/Hisoka/classifier.py
================================================
import __init__
import logging
import warnings
from Kite import config
import joblib
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import sklearn.exceptions
logging.basicConfig(level=logging.INFO,
format="%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s",
datefmt="%a, %d %b %Y %H:%M:%S")
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
warnings.filterwarnings("ignore", category=Warning, module='sklearn')
warnings.filterwarnings("ignore", category=UserWarning, module='gensim')
warnings.filterwarnings("ignore", category=RuntimeWarning, module='gensim')
class Classifier(object):
    """Grid-searched SVM / random-forest classifier over document vectors."""

    def __init__(self):
        # Scoring names to grid-search over, e.g. ["f1_weighted"].
        self.scores = config.CLASSIFIER_SCORE_LIST

    def train(self, train_x, train_y, test_x, test_y, model_type="svm", model_save_path=None):
        """Grid-search the chosen model on the training set and report test metrics.

        :param train_x, train_y: training samples and labels
        :param test_x, test_y: held-out samples and labels
        :param model_type: "svm" or "rdforest"
        :param model_save_path: when given, the fitted GridSearchCV is dumped there
        :return: the fitted GridSearchCV estimator (refit on the best params)
        :raises ValueError: for an unknown model_type
        """
        assert len(self.scores) != 0
        clf = None
        for score in self.scores:
            # 5-fold CV on the training set only; refit=True re-fits the best
            # parameter set on the full training data afterwards.
            # (The old refit="AUC" is invalid for single-metric scoring and
            # only worked by being truthy.)
            if model_type == "svm":
                # Fixed: the config constant is SVM_TUNED_PARAMTERS; the old
                # "SMV_" spelling raised AttributeError.
                tuned_parameters = config.SVM_TUNED_PARAMTERS
                estimator = svm.SVC()
            elif model_type == "rdforest":
                tuned_parameters = config.RDFOREST_TUNED_PARAMTERS
                estimator = RandomForestClassifier(random_state=10)
            else:
                raise ValueError("unsupported model_type: {}".format(model_type))
            clf = GridSearchCV(estimator,
                               tuned_parameters,
                               cv=5,
                               scoring=score,
                               refit=True)
            clf.fit(train_x, train_y)
            if model_save_path is not None:
                joblib.dump(clf, model_save_path)
            logging.info("the best params: {}".format(clf.best_params_))
            train_pred = clf.predict(train_x)
            test_pred = clf.predict(test_x)  # generalisation check on held-out data
            logging.info("\n{}".format(classification_report(test_y, test_pred)))
            precise_train = sum(1 for p, t in zip(train_pred, train_y) if p == t)
            precise_test = sum(1 for p, t in zip(test_pred, test_y) if p == t)
            logging.info('train_accuracy: {} test_accuracy: {}'
                         .format(str(round(precise_train / len(train_y), 4)),
                                 str(round(precise_test / len(test_pred), 4))))
            # Kept for callers that read the last test accuracy.
            self._precise = precise_test / len(test_pred)
        assert clf is not None
        return clf

    @staticmethod
    def model_load(classifier_save_path):
        """Load a previously dumped classifier from disk."""
        return joblib.load(classifier_save_path)
================================================
FILE: legacy_v1/src/Killua/__init__.py
================================================
import os
import sys


def add_path(path):
    """Put *path* at the front of sys.path unless it is already importable."""
    if path in sys.path:
        return
    sys.path.insert(0, path)


# Make the `./src` directory (parent of the current working dir) importable.
src_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
add_path(src_dir)
================================================
FILE: legacy_v1/src/Killua/buildstocknewsdb.py
================================================
import __init__
import json
import redis
import logging
import datetime
import akshare as ak
from Kite import config
from Kite.database import Database
from Leorio.tokenization import Tokenization
from Leorio.topicmodelling import TopicModelling
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class GenStockNewsDB(object):
    """Build per-stock news collections labelled with forward price moves.

    Every stock symbol (e.g. "sh600000") gets its own collection inside the
    `stocknews` database; each news document is labelled 利好 / 利空 / 中性
    based on the close-price change 3/5/10/15/30/60 days after publication.
    """

    def __init__(self):
        self.database = Database()
        # Trading-calendar dates (1990-12-19 .. 2020-12-31) from sina.
        self.trade_date = ak.tool_trade_date_hist_sina()["trade_date"].tolist()
        # horizon (days) -> name of the label field written per news document
        self.label_range = {3: "3DaysLabel",
                            5: "5DaysLabel",
                            10: "10DaysLabel",
                            15: "15DaysLabel",
                            30: "30DaysLabel",
                            60: "60DaysLabel"}
        self.redis_client = redis.StrictRedis(host=config.REDIS_IP,
                                              port=config.REDIS_PORT,
                                              db=config.CACHE_NEWS_REDIS_DB_ID)
        self.redis_client.set("today_date", datetime.datetime.now().strftime("%Y-%m-%d"))
        # Rebuild the "which stocks have enough news for ML" list from scratch.
        self.redis_client.delete("stock_news_num_over_{}".format(config.MINIMUM_STOCK_NEWS_NUM_FOR_ML))
        self._stock_news_nums_stat()

    def get_all_news_about_specific_stock(self, database_name, collection_name):
        """Fan one crawled collection out into per-stock news collections.

        :param database_name: source database holding crawled news
        :param collection_name: source collection (e.g. "cnstock")
        """
        # Check whether the source collection already has a RelatedStockCodes
        # field; build it by tokenizing articles when missing.
        _keys_list = list(next(self.database.get_collection(database_name, collection_name).find()).keys())
        if "RelatedStockCodes" not in _keys_list:
            tokenization = Tokenization(import_module="jieba", user_dict="./Leorio/financedict.txt")
            tokenization.update_news_database_rows(database_name, collection_name)
        # Create one collection named after each stock symbol that is missing.
        stock_symbol_list = self.database.get_data(config.STOCK_DATABASE_NAME,
                                                   config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                                   keys=["symbol"])["symbol"].to_list()
        col_names = self.database.connect_database(config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE).list_collection_names(session=None)
        for symbol in stock_symbol_list:
            if symbol not in col_names:
                # if int(symbol[2:]) > 837:
                _collection = self.database.get_collection(config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE, symbol)
                _tmp_num_stat = 0
                for row in self.database.get_collection(database_name, collection_name).find():  # cursor
                    if symbol[2:] in row["RelatedStockCodes"].split(" "):
                        # Label the news item for every horizon in label_range.
                        _tmp_dict = {}
                        for label_days, key_name in self.label_range.items():
                            _tmp_res = self._label_news(
                                datetime.datetime.strptime(row["Date"].split(" ")[0], "%Y-%m-%d"), symbol, label_days)
                            _tmp_dict.update({key_name: _tmp_res})
                        _data = {"Date": row["Date"],
                                 "Url": row["Url"],
                                 "Title": row["Title"],
                                 "Article": row["Article"],
                                 "OriDB": database_name,
                                 "OriCOL": collection_name}
                        _data.update(_tmp_dict)
                        _collection.insert_one(_data)
                        _tmp_num_stat += 1
                logging.info("there are {} news mentioned {} in {} collection need to be fetched ... "
                             .format(_tmp_num_stat, symbol, collection_name))
                # else:
                #     logging.info("{} has fetched all related news from {}...".format(symbol, collection_name))

    def listen_redis_queue(self):
        """Consume the realtime-news redis queue.

        Each queued item is stored into every per-stock collection named in its
        RelatedStockCodes field; e.g. "603386 603003 600111 603568" saves the
        item into those four stock collections.
        """
        crawled_url_today = set()
        while True:
            date_now = datetime.datetime.now().strftime("%Y-%m-%d")
            if date_now != self.redis_client.get("today_date").decode():
                # A new day started: reset the per-day URL de-duplication cache.
                crawled_url_today = set()
                self.redis_client.set("today_date", date_now)
            if self.redis_client.llen(config.CACHE_NEWS_LIST_NAME) != 0:
                data = json.loads(self.redis_client.lindex(config.CACHE_NEWS_LIST_NAME, -1))
                if data["Url"] not in crawled_url_today:  # skip duplicates seen today
                    crawled_url_today.update({data["Url"]})
                    if data["RelatedStockCodes"] != "":
                        for stock_code in data["RelatedStockCodes"].split(" "):
                            # 6xxxxx codes trade in Shanghai, the rest in Shenzhen.
                            symbol = "sh{}".format(stock_code) if stock_code[0] == "6" else "sz{}".format(stock_code)
                            _collection = self.database.get_collection(config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE, symbol)
                            _tmp_dict = {}
                            for label_days, key_name in self.label_range.items():
                                _tmp_res = self._label_news(
                                    datetime.datetime.strptime(data["Date"].split(" ")[0], "%Y-%m-%d"), symbol, label_days)
                                _tmp_dict.update({key_name: _tmp_res})
                            _data = {"Date": data["Date"],
                                     "Url": data["Url"],
                                     "Title": data["Title"],
                                     "Article": data["Article"],
                                     "OriDB": data["OriDB"],
                                     "OriCOL": data["OriCOL"]}
                            _data.update(_tmp_dict)
                            _collection.insert_one(_data)
                            logging.info("the real-time fetched news {}, which was saved in [DB:{} - COL:{}] ...".format(data["Title"],
                                                                                                                         config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE,
                                                                                                                         symbol))
                            # Planned online classification flow, kept for reference:
                            # if symbol.encode() in self.redis_client.lrange("stock_news_num_over_{}".format(config.MINIMUM_STOCK_NEWS_NUM_FOR_ML), 0, -1):
                            #     label_name = "3DaysLabel"
                            #     ori_dict_path = "{}_docs_dict.dict".format(symbol)
                            #     bowvec_save_path = "{}_bowvec.mm".format(symbol)
                            #     topicmodelling = TopicModelling()
                            #     chn_label = topicmodelling.classify_stock_news(data["Article"],
                            #                                                    config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE,
                            #                                                    symbol,
                            #                                                    label_name=label_name,
                            #                                                    topic_model_type="lsi",
                            #                                                    classifier_model="rdforest",  # rdforest / svm
                            #                                                    ori_dict_path=ori_dict_path,
                            #                                                    bowvec_save_path=bowvec_save_path)
                            #     logging.info(
                            #         "document '{}...' was classified with label '{}' for symbol {} ... ".format(
                            #             data["Article"][:20], chn_label, symbol))
                self.redis_client.rpop(config.CACHE_NEWS_LIST_NAME)
                logging.info("now pop {} from redis queue of [DB:{} - KEY:{}] ... ".format(data["Title"],
                                                                                           config.CACHE_NEWS_REDIS_DB_ID,
                                                                                           config.CACHE_NEWS_LIST_NAME))

    def _label_news(self, date, symbol, n_days):
        """Label one news item by the close-price move n_days after `date`.

        :param date: datetime.datetime, publication date (midnight, no time part)
        :param symbol: str, e.g. "sh600000"
        :param n_days: int, horizon in calendar days
        :return: "利好" / "利空" / "中性", or "" when price data is unavailable
        """
        # Close price on (or before) the publication date.
        this_date_data = self.database.get_data(config.STOCK_DATABASE_NAME,
                                                symbol,
                                                query={"date": date})
        # The publication day may be a non-trading day; walk backwards for a
        # close price (e.g. Sat 2020-12-12 falls back to Fri 2020-12-11).
        tmp_date = date
        if this_date_data is None:
            i = 1
            while this_date_data is None and i <= 10:
                # NOTE(review): the step grows each pass (1,2,3,... days), so
                # the lookback is cumulative (1,3,6,... days back); possibly
                # intended to be one day at a time — confirm before changing.
                tmp_date -= datetime.timedelta(days=i)
                # Only query the DB on trading days; a still-missing value
                # means the DB has no data for that day.
                if tmp_date.strftime("%Y-%m-%d") in self.trade_date:
                    this_date_data = self.database.get_data(config.STOCK_DATABASE_NAME,
                                                            symbol,
                                                            query={"date": tmp_date})
                i += 1
        try:
            close_price_this_date = this_date_data["close"][0]
        except Exception:
            close_price_this_date = None
        # Close price on (or after) date + n_days; walk forwards on
        # non-trading days (e.g. Sun 2020-12-13 rolls to Mon 2020-12-14).
        new_date = date + datetime.timedelta(days=n_days)
        n_days_later_data = self.database.get_data(config.STOCK_DATABASE_NAME,
                                                   symbol,
                                                   query={"date": new_date})
        if n_days_later_data is None:
            i = 1
            while n_days_later_data is None and i <= 10:
                new_date = date + datetime.timedelta(days=n_days + i)
                if new_date.strftime("%Y-%m-%d") in self.trade_date:
                    n_days_later_data = self.database.get_data(config.STOCK_DATABASE_NAME,
                                                               symbol,
                                                               query={"date": new_date})
                i += 1
        try:
            close_price_n_days_later = n_days_later_data["close"][0]
        except Exception:
            close_price_n_days_later = None
        # Decision rule (reconstructed — this span was corrupted in the
        # original source): a relative move beyond ±param is 利好/利空,
        # otherwise 中性; with price data missing on either side, return "".
        if close_price_this_date is not None and close_price_n_days_later is not None:
            # NOTE(review): thresholds reconstructed from the surviving
            # comment (±3% within 10 days, larger beyond); the original
            # constants were lost — confirm before relying on labels.
            param = 0.03 if n_days <= 10 else 0.05
            change = (close_price_n_days_later - close_price_this_date) / close_price_this_date
            if change > param:
                return "利好"
            elif change < -param:
                return "利空"
            else:
                return "中性"
        else:
            return ""

    def _stock_news_nums_stat(self):
        """Push symbols with more than MINIMUM_STOCK_NEWS_NUM_FOR_ML news into redis."""
        cols_list = self.database.connect_database(config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE).list_collection_names(session=None)
        for sym in cols_list:
            if self.database.get_collection(config.ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE, sym).estimated_document_count() > config.MINIMUM_STOCK_NEWS_NUM_FOR_ML:
                self.redis_client.lpush("stock_news_num_over_{}".format(config.MINIMUM_STOCK_NEWS_NUM_FOR_ML), sym)
if __name__ == "__main__":
    # Manual entry point: constructing GenStockNewsDB alone refreshes the
    # per-stock news statistics; uncomment the calls below to (re)build the
    # per-stock collections or to start consuming the realtime redis queue.
    from Kite import config
    from Killua.buildstocknewsdb import GenStockNewsDB

    gen_stock_news_db = GenStockNewsDB()
    # gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
    # gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
    # gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
    # gen_stock_news_db.listen_redis_queue()
================================================
FILE: legacy_v1/src/Killua/deduplication.py
================================================
import __init__
from Kite.database import Database
from Kite import utils
import logging
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class Deduplication(object):
    """Delete documents sharing the same Url within one collection, day by day."""

    def __init__(self, database_name, collection_name):
        self.database = Database()
        self.database_name = database_name
        self.collection_name = collection_name
        # Running count of removed duplicates across the whole run.
        self.delete_num = 0

    def run(self):
        """Scan every calendar day in the collection's date span and drop
        documents whose Url duplicates an earlier one on that day."""
        dates = self.database.get_data(self.database_name,
                                       self.collection_name,
                                       keys=["Date"])["Date"].tolist()
        collection = self.database.get_collection(self.database_name, self.collection_name)
        dates.sort()  # ascending
        # start_date, end_date = date_list[1].split(" ")[0], date_list[-1].split(" ")[0]
        first_day = min(dates).split(" ")[0]
        last_day = max(dates).split(" ")[0]
        for day in utils.get_date_list_from_range(first_day, last_day):
            # Fetch that day's rows and de-duplicate them by Url.
            try:
                day_df = self.database.get_data(self.database_name,
                                                self.collection_name,
                                                query={"Date": {"$regex": day}})
            except Exception:
                continue
            if day_df is None:
                continue
            kept = day_df.drop_duplicates(["Url"])
            for doc_id in set(day_df["_id"]) - set(kept["_id"]):
                collection.delete_one({'_id': doc_id})
                self.delete_num += 1
        logging.info("DB:{} - COL:{} had {} data length originally, now has deleted {} depulications ... "
                     .format(self.database_name, self.collection_name, str(len(dates)), self.delete_num))
if __name__ == "__main__":
    # Remove duplicated news (same Url on the same day) from every crawled collection.
    from Killua.deduplication import Deduplication
    from Kite import config

    Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
    Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
    Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
================================================
FILE: legacy_v1/src/Killua/denull.py
================================================
"""
删除数据库中含有null值的行
"""
import __init__
import logging
from Kite.database import Database
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
datefmt='%a, %d %b %Y %H:%M:%S')
class DeNull(object):
    """Delete documents containing empty-string field values from a collection."""

    def __init__(self, database_name, collection_name):
        self.database = Database()
        self.database_name = database_name
        self.collection_name = collection_name
        # Number of documents removed by the last run().
        self.delete_num = 0

    def run(self):
        """Drop every document with at least one empty field.

        The RelatedStockCodes field is exempt: an empty value there merely
        means the article mentions no stock codes.
        """
        collection = self.database.get_collection(self.database_name, self.collection_name)
        for doc in self.database.get_collection(self.database_name, self.collection_name).find():
            has_empty_field = any(
                _key != "RelatedStockCodes" and doc[_key] == ""
                for _key in list(doc.keys())
            )
            if has_empty_field:
                collection.delete_one({'_id': doc["_id"]})
                self.delete_num += 1
        logging.info("there are {} news contained NULL value in {} collection ... "
                     .format(self.delete_num, self.collection_name))
if __name__ == "__main__":
    # Purge rows containing empty values from every crawled collection.
    from Killua.denull import DeNull
    from Kite import config

    DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
    DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
    DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()
================================================
FILE: legacy_v1/src/Kite/__init__.py
================================================
import os
import sys


def add_path(path):
    """Insert *path* at the front of sys.path when not already present."""
    if path in sys.path:
        return
    sys.path.insert(0, path)


this_dir = os.path.dirname(__file__)
# Make the `./src/Kite` directory itself importable.
add_path(this_dir)
================================================
FILE: legacy_v1/src/Kite/config.py
================================================
# Central configuration for the legacy v1 crawler / text-analysis pipeline.

# --- connection settings (MongoDB stores news; redis caches runtime state) ---
MONGODB_IP = "localhost"
MONGODB_PORT = 27017
REDIS_IP = "localhost"
REDIS_PORT = 6379
THREAD_NUMS_FOR_SPYDER = 4

# --- cnstock.com (上海证券报) crawler ---
DATABASE_NAME = "finnewshunter"
COLLECTION_NAME_CNSTOCK = "cnstock"
CHROME_DRIVER = "./chromedriver.exe"
# Previous URL set, kept for reference:
# WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK = {"https://company.cnstock.com/company/scp_gsxw": "公司聚焦",
#                                        "https://ggjd.cnstock.com/gglist/search/qmtbbdj": "公告解读",
#                                        "https://ggjd.cnstock.com/gglist/search/ggkx": "公告快讯",
#                                        "https://ggjd.cnstock.com/company/scp_ggjd/tjd_sdlh": "利好公告"}
WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK = {"https://company.cnstock.com/company/scp_gsxw": "公司聚焦",
                                       "http://ggjd.cnstock.com/company/scp_ggjd/tjd_bbdj": "公告解读",
                                       "http://ggjd.cnstock.com/company/scp_ggjd/tjd_ggkx": "公告快讯",
                                       "https://ggjd.cnstock.com/company/scp_ggjd/tjd_sdlh": "利好公告"}
# NOTE(review): absolute Windows paths below are machine-specific — move them
# to environment variables or a local settings file.
RECORD_CNSTOCK_FAILED_URL_TXT_FILE_PATH = "D:/workfiles/gpu-cloud-backup/Listed-company-news-crawl-and-text-analysis/src/Gon/cnstock_failed_urls.txt"
CNSTOCK_MAX_REJECTED_AMOUNTS = 10

# --- jrj.com.cn (金融界) crawler ---
COLLECTION_NAME_JRJ = "jrj"
JRJ_DATE_RANGE = 100
WEBSITES_LIST_TO_BE_CRAWLED_JRJ = "http://stock.jrj.com.cn/xwk"
RECORD_JRJ_FAILED_URL_TXT_FILE_PATH = "D:/workfiles/gpu-cloud-backup/Listed-company-news-crawl-and-text-analysis/src/Gon/jrj_failed_urls.txt"
JRJ_MAX_REJECTED_AMOUNTS = 10
JRJ_REQUEST_DEFAULT_DATE = "2015-01-01"
CACHE_SAVED_NEWS_JRJ_TODAY_VAR_NAME = "cache_news_queue_jrj"

# --- nbd.com.cn (每日经济新闻) crawler ---
COLLECTION_NAME_NBD = "nbd"
WEBSITES_LIST_TO_BE_CRAWLED_NBD = "http://stocks.nbd.com.cn/columns/275/page"
RECORD_NBD_FAILED_URL_TXT_FILE_PATH = "D:/workfiles/gpu-cloud-backup/Listed-company-news-crawl-and-text-analysis/src/Gon/nbd_failed_urls.txt"
NBD_TOTAL_PAGES_NUM = 684
NBD_MAX_REJECTED_AMOUNTS = 10
CACHE_SAVED_NEWS_NBD_TODAY_VAR_NAME = "cache_news_queue_nbd"

# --- stock price data ---
# NOTE(review): API token committed to source control — rotate the token and
# load it from an environment variable instead.
TUSHARE_TOKEN = "97fbc4c73727b5d171ca6670cbc4af8b0a3de5fbab74b52f30b598cc"
STOCK_DATABASE_NAME = "stock"
COLLECTION_NAME_STOCK_BASIC_INFO = "basic_info"
STOCK_PRICE_REQUEST_DEFAULT_DATE = "20150101"
REDIS_CLIENT_FOR_CACHING_STOCK_INFO_DB_ID = 1
ALL_NEWS_OF_SPECIFIC_STOCK_DATABASE = "stocknews"

# --- topic modelling / classification ---
TOPIC_NUMBER = 200
SVM_TUNED_PARAMTERS = {"kernel": ["rbf"], "gamma": [10, 20, 50, 100, 150, 200], "C": [10, 15, 20, 30, 50, 100]}
RDFOREST_TUNED_PARAMTERS = {"n_estimators": [1, 2, 3, 4, 5, 10],
                            "criterion": ["gini", "entropy"],
                            "max_features": ["auto", "sqrt"]}
CLASSIFIER_SCORE_LIST = ["f1_weighted"]
USER_DEFINED_DICT_PATH = "D:/workfiles/gpu-cloud-backup/Listed-company-news-crawl-and-text-analysis/src/Leorio/financedict.txt"
CHN_STOP_WORDS_PATH = "D:/workfiles/gpu-cloud-backup/Listed-company-news-crawl-and-text-analysis/src/Leorio/chnstopwords.txt"

# --- redis cache keys / db ids shared by the realtime pipeline ---
CACHE_NEWS_REDIS_DB_ID = 0
CACHE_NEWS_LIST_NAME = "cache_news_waiting_for_classification"
CACHE_RECORED_OPENED_PYTHON_PROGRAM_DB_ID = 0
CACHE_RECORED_OPENED_PYTHON_PROGRAM_VAR = "opened_python_scripts"
MINIMUM_STOCK_NEWS_NUM_FOR_ML = 1000
================================================
FILE: legacy_v1/src/Kite/database.py
================================================
from pymongo import MongoClient
import pandas as pd
class Database(object):
    """Thin convenience wrapper around a pymongo MongoClient connection."""

    def __init__(self, ip="localhost", port=27017):
        self.ip = ip
        self.port = port
        # MongoClient connects lazily; no I/O happens here.
        self.conn = MongoClient(self.ip, self.port)

    def connect_database(self, database_name):
        """Return the database handle with the given name."""
        return self.conn[database_name]

    def get_collection(self, database_name, collection_name):
        """Return a collection handle inside the named database."""
        return self.connect_database(database_name).get_collection(collection_name)

    def insert_data(self, database_name, collection_name, data_dict):
        """Insert a single document into the named collection."""
        database = self.conn[database_name]
        collection = database.get_collection(collection_name)
        collection.insert_one(data_dict)

    def update_row(self, database_name, collection_name, query, new_values):
        """Apply a $set update to the first document matching `query`."""
        assert isinstance(query, dict)
        assert isinstance(new_values, dict)
        database = self.conn[database_name]
        collection = database.get_collection(collection_name)
        collection.update_one(query, {"$set": new_values})

    def get_data(self, database_name, collection_name, max_data_request=None, query=None, keys=None):
        """Fetch documents as a pandas DataFrame, or None on any failure.

        e.g.:
            ExampleObj = Database()
            ExampleObj.get_data("finnewshunter", "nbd", query={"Date": {"$regex": "2014"}}, keys=["Url", "Title"])

        :param max_data_request: cap on the number of rows fetched (0/None = unlimited)
        :param query: pymongo filter dict; empty/None fetches everything
        :param keys: columns to extract; when None, the first document's keys are used
        :return: DataFrame, or None when the collection is empty, a key is
            missing, or any other error occurs (historical best-effort contract)
        """
        database = self.conn[database_name]
        collection = database.get_collection(collection_name)
        if query:
            assert isinstance(query, dict)
        else:
            query = {}
        if keys:
            assert isinstance(keys, list)
        else:
            keys = []
        if max_data_request:
            assert isinstance(max_data_request, int)
        else:
            max_data_request = float("inf")

        def _find():
            # Single place for the query-vs-full-scan choice (was repeated
            # three times in the original implementation).
            return collection.find(query) if len(query) != 0 else collection.find()

        try:
            if len(keys) == 0:
                # Derive column names from the first matching document,
                # e.g. ['_id', 'Date', 'PageId', 'Url', 'Title', 'Article', 'RelevantStockCodes'].
                keys = list(next(_find()).keys())
            _dict = {_key: [] for _key in keys}
            for _id, row in enumerate(_find()):
                if _id + 1 > max_data_request:
                    break
                for _key in keys:
                    _dict[_key].append(row[_key])
            return pd.DataFrame(_dict)
        except Exception:
            # Best-effort API: empty collection (StopIteration), missing key,
            # or connection failure all yield None, matching historical behaviour.
            return None

    def drop_db(self, database):
        """Drop an entire database."""
        self.conn.drop_database(database)
'''
from database import Database
ExampleObj = Database()
db = ExampleObj.connect_database("cnstock")
col = ExampleObj.create_col(db, "cnstock_col")
ExampleObj.insert_data(col, {'name': 'sena', "id": 136})
ExampleObj.drop_db(db)
'''
================================================
FILE: legacy_v1/src/Kite/log.py
================================================
================================================
FILE: legacy_v1/src/Kite/utils.py
================================================
import re
import datetime
import requests
import numpy as np
from bs4 import BeautifulSoup
from scipy.sparse import csr_matrix
def generate_pages_list(total_pages, range, init_page_id):
    """Split page ids [init_page_id, total_pages] into (start, end) chunks of
    size `range`.

    (The parameter shadows the builtin `range`; the name is kept because it is
    part of the public signature. The builtin is not used inside.)

    Fixed: the old trailing condition `k + range - 1 < total_pages` could never
    hold after the loop exited, so the final partial chunk of pages was
    silently dropped; it is now emitted as (k, total_pages).
    """
    page_list = list()
    k = init_page_id
    while k + range - 1 <= total_pages:
        page_list.append((k, k + range - 1))
        k += range
    if k <= total_pages:
        # Remainder chunk smaller than `range`.
        page_list.append((k, total_pages))
    return page_list
def count_chn(string):
    '''Count Chinese characters and their share of the whole string.

    # Arguments:
        string: Each part of crawled website analyzed by BeautifulSoup.
    # Returns:
        (chn_num, possible): number of matched characters and their ratio to
        the total length; empty input now yields (0, 0.0) instead of raising
        ZeroDivisionError.
    '''
    # NOTE(review): the class [\u1100-\uFFFDh] is much wider than CJK and also
    # matches the letter 'h'; kept as-is because downstream thresholds
    # (is_article_prob) were tuned against it — confirm before narrowing.
    pattern = re.compile(u'[\u1100-\uFFFDh]+?')
    result = pattern.findall(string)
    chn_num = len(result)
    total = len(str(string))
    possible = chn_num / total if total else 0.0
    return chn_num, possible
def get_date_list_from_range(begin_date, end_date):
    '''Return every calendar date from begin_date to end_date inclusive,
    formatted as "YYYY-MM-DD" strings.
    '''
    start = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
    stop = datetime.datetime.strptime(end_date, "%Y-%m-%d")
    span_days = (stop - start).days
    return [(start + datetime.timedelta(days=offset)).strftime("%Y-%m-%d")
            for offset in range(span_days + 1)]
def gen_dates_list(date_list, date_range):
    """Split date_list into consecutive chunks of size date_range; the last
    chunk holds the remainder. An empty input yields [[]] (quirk preserved
    from the original implementation)."""
    chunks = [date_list[i:i + date_range] for i in range(0, len(date_list), date_range)]
    return chunks if chunks else [[]]
def get_date_before(n_days):
    """Return the calendar date n_days before today as "YYYY-MM-DD".

    e.g. on 2020-12-25, get_date_before(1) -> "2020-12-24".

    :param n_days: how many days to go back (1 means yesterday)
    """
    target = datetime.datetime.now() - datetime.timedelta(days=n_days)
    return target.strftime('%Y-%m-%d')
def search_max_pages_num(first_url, date):
    """Return how many result pages jrj.com.cn has for a given date.

    Searching news by date (e.g. 2020-01-01) yields
    http://stock.jrj.com.cn/xwk/202001/20200101_1.shtml as the first result
    page; the pagination anchors on that page reveal the total page count
    (e.g. 4 pages of news for that day).

    :param first_url: first result page, e.g. 'http://stock.jrj.com.cn/xwk/202001/20200101_1.shtml'
    :param date: date string such as '2020-01-01'
    """
    respond = requests.get(first_url)
    # Detect the real encoding from the raw bytes (site is not UTF-8).
    respond.encoding = BeautifulSoup(respond.content, "lxml").original_encoding
    bs = BeautifulSoup(respond.text, "lxml")
    a_list = bs.find_all("a")
    max_pages_num = 1
    for a in a_list:
        # Pagination anchors embed the date as "yyyymmdd_<page>" in the href
        # and show a bare page number as their text.
        if "href" in a.attrs and "target" in a.attrs:
            if a["href"].find(date.replace("-", "") + "_") != -1 \
                    and a.text.isdigit():
                max_pages_num += 1
    return max_pages_num
def html_parser(url):
    '''Fetch `url` and return a BeautifulSoup tree parsed with lxml.

    The response encoding is detected from the raw bytes first so that
    non-UTF-8 pages (e.g. GBK-encoded news sites) decode correctly.
    '''
    resp = requests.get(url)
    resp.encoding = BeautifulSoup(resp.content, "lxml").original_encoding
    bs = BeautifulSoup(resp.text, "lxml")
    return bs
def get_chn_stop_words(path):
    '''Load the stop-words txt file: one stripped word per line.

    Fixed: the file handle is now closed deterministically (the old version
    leaked it). Encoding is deliberately left at the platform default because
    the shipped stop-word file is not UTF-8.
    '''
    with open(path, 'r') as f:
        return [line.strip() for line in f]
def convert_to_csr_matrix(model_vector):
    """
    Convert LDA(LSI) model vector to CSR sparse matrix, that could be accepted by Scipy and Numpy.
    # Arguments:
        model_vector: iterable of rows, each row a list of (column_index, value) pairs
        (gensim-style sparse document vectors).
    # Returns:
        A dense numpy array built from the CSR representation.
    """
    data, rows, cols = [], [], []
    for row_idx, line in enumerate(model_vector):
        for col_idx, value in line:
            rows.append(row_idx)
            cols.append(col_idx)
            data.append(value)
    return csr_matrix((data, (rows, cols))).toarray()
def generate_training_set(x, y, split=0.8):
    """Randomly partition samples into train/test lists.

    Each row of `x` (and matching label in `y`) goes to the training set with
    probability `split`, otherwise to the test set.

    :param x: 2-D array-like of samples (rows)
    :param y: sequence of labels, one per row of x
    :param split: training-set probability threshold (default 0.8)
    :return: (train_x, train_y, test_x, test_y) as plain lists
    """
    draws = np.random.random(size=x.shape[0])
    train_x, train_y, test_x, test_y = [], [], [], []
    for idx, draw in enumerate(draws):
        if draw < split:
            train_x.append(x[idx, :])
            train_y.append(y[idx])
        else:
            test_x.append(x[idx, :])
            test_y.append(y[idx])
    return train_x, train_y, test_x, test_y
def is_contain_chn(word):
    """
    Check whether the given string contains Chinese characters.
    :param word: string to examine
    :return: True if at least one character in [\u4e00-\u9fa5], else False
    """
    zh_pattern = re.compile(u'[\u4e00-\u9fa5]+')
    return zh_pattern.search(word) is not None
def batch_lpop(client, key, n):
    """Atomically pop the first ``n`` items from a redis list.

    LRANGE and LTRIM are queued on one pipeline so they execute as a
    batch. The previous version discarded the pipeline results, so the
    popped values were irretrievably lost; now they are returned.

    :param client: redis client exposing ``pipeline()``
    :param key: list key
    :param n: number of items to pop from the head
    :return: list of the popped items (may be shorter than n)
    """
    p = client.pipeline()
    p.lrange(key, 0, n-1)
    p.ltrim(key, n, -1)
    results = p.execute()
    # results[0] is the LRANGE reply: the items that were just trimmed off.
    return results[0]
================================================
FILE: legacy_v1/src/Kite/webserver.py
================================================
================================================
FILE: legacy_v1/src/Leorio/__init__.py
================================================
import os
import sys
def add_path(path):
    """Prepend *path* to sys.path unless it is already present."""
    if path in sys.path:
        return
    sys.path.insert(0, path)
# add `./src` dir to system path
# NOTE(review): resolved relative to the process CWD, not this file's
# location — assumes scripts are launched from inside `src`; confirm.
src_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
add_path(src_dir)
================================================
FILE: legacy_v1/src/Leorio/chnstopwords.txt
================================================
ÿ
ǰ
ת
λ
֤ȯ
ο
Υ߱ؾ
£
:
&
*
һһ
~~~~
.
.һ
./
--
ۣ
ۢݣݣ
ۢ٣ģ
P
//
ۢڣ
ۢڣ
}
Ҳ
ۢ٢ޣ
ۢڣ£
ۢ٣
ۢܣ
ۢ٢ۣ
ۣۢ
ۣ
ۢڣ
ۢ٢
ۢݣ
ۢڣ
ۢܣ
ۢڢۣ
ۣۢ
ۢܣ
ۢ٢ݣ
ۢ٢ߣ
ۢ٣
ʣ
ۢ٢
ۢ٢ܣ
ۢ٣
ۢڣ
ۢڢ
ۢڢ٣
ۢ٣ã
ۣۢ
ۣۢ
ۢڢݣ
ۢڢڣ
һ.
ۢ٣
.
ۣ
ۢ٣£
/
ۢ٣
ۣۢ
ۢ٢٣
ۢܣ
ۢܣ
ۣۢ
ۢݣ
ۢ٣
ۢڢ
ۢڢߣ
ۢ٣
ۢڣ
ݣ
://
ۢڢ
ۢݣ
...
...................
ڣأƣɣԣ
ۣۢƣ
ۢ٣
ݡġ䣽
Ȧա
ڣ
ۢۢ٣
ң̣
ۢ٣ţ
ۣݣ
.
ۢڣ
ۢ
ۢڢߣ
ۢڢڣ
ۣۢ
ۢ٣
ۢ٣£
ۢ٣
ۢ٣
ۢ٣
ۢ٢ڣ
ۢڣ
ۢ
ۢ٣
ۢڣ
ۢڢޣ
ۣۢ
ۢڢ
Ԫ
ۢڢ
ۢ٣
::
ۢڣ
ۣۢ
ۢܣ
ۢݣ
ۢޣ
ۢߣ
ۢ
ۢ
?
,
'
?
?
<
>
[
]
(
)
-
+
/
"
;
#
@
գ
sub
exp
sup
sub
Lex
=
ۢݣ
ۢݣ
ۢڣ
ۢڣǣ
ۢ٣
̣
ۣ
......
ʵϰ
ѽ
Ӵ
ȷ
˴
˵
Ȼ
Ω
ֻ
֮
˼
Ӷ
Ļ
ȵ
˵
֮
ǵ
ͽ
µ
λ
ʴ
Ȼ
Ȼ
δ
ο
ʱ
仰˵
֮
ʹ
ʱ
Ȼ
̶
֮
ʹ
֮
˵
˵
˵
ʼ
ɼ
ͬ
һ
˵
˵
ð
ô
ÿ
ÿ
Ī
ij
ij
ijЩ
ı
Ķ
ĸ
Щ
DZ
Ƕ
Ǹ
ǻ
ô
ôЩ
ô
ʱ
Щ
Ը
Ŷ
Ż
ž
ƾ
ƾ
һ
ǡǡ෴
ǰ
ǰ
Ȼ
Ȼ
Ȼ
˼
κ
ƾ
ɶ
ʹ
ô
ʡ
ʱ
ʲô
ʲô
ʹ
ǵ
˭
˭֪
˳
˳
Ƶ
Ȼ
˵
Ȼ
Ȼ
ʹ
ͨ
ͬ
ͬʱ
һ
Ϊ
Ϊ
Ϊ
Ϊʲô
Ϊ
ι
غ
ں
Զ
ѽ
Ҫ
Ҫ
ҪȻ
Ҫ
Ҫô
Ҫ
Ҳ
Ҳ
Ҳ
һ
һ
һ
һ
һ
һ
һ
һ
Ա
Լ
ֻ
Ϊ
Ӵ
ɴ˿ɼ
е
й
Щ
Ǻ
ͬʱ
Խ
˵
ô
ô
ô
զ
˵
ô
ô
ôЩ
ô
ʱ
Щ
֨
֮
֮
֮
֮һ
ֻ
ֻ
ֻҪ
ֻ
λ
Դ
Ը
Ը
Լ
Լ
ܵ
ܵ˵
ܵ˵
֮ܶ
֮
Ȼ
ʹ
Ϊ
ѽ
Ӵ
Ұ
Ű
ʱ
˵
Ȼ
˳
װ
˵
Ͼ
ض
ؽ
û
û
Ȼ
ò
ɿ
ɿ
ܲ
ȻĻ
ʤ
ʱ
ͬ
Ҫ
ֺ
ɵ
ֶ
ô
֪
ֹ
ֹһ
Ե
һ
Ե
˵
˵ú
ȥ
˵
ҹ
ñ
û
˻
ʤ
϶
Ȼ
伫
ȥ
˶
ȥ
ȴ
Ϣ
˵
˺
ε
Ҵ
Ӳ
Ӵ
ӴԺ
ӹŵ
ӹ
ӽԺ
ӿ
ͷ
δ
С
絽
ﵩ
촰˵
Լ
Ը
ָ֮
ڶ
Ȼ
ͥ
ͷ
˵
˶
ĿǰΪֹ
ͷ
ͷ
ȷ
ȵ
Ȼ
Ȼ
ʱ
ǰ
˵
û˵
֮Ȼ
֮
dz
ǵ
ڷ
ͷ
Ȼ
¸
õ
Ͽ
粻
ղ
պ
ߵ
ҹ
ʽ
һ
Ϊ
Ȼ
Ƶ
ʶ
ֲ
߳
ޱ
α
γ
η
ο
ֶΪ
ֹ
ܶ
Ȼ
Ȼ
˵
Ȼ
Ȼ
ͬ
Ϊ
Ҵ
˵
...
֮
֮
֮
ֱ
Ҫ
ϱ
Ϊ
Կ
Ȼ
ʱ
ȥ
Ȼ
Ľ
ľ
Ȼ
ʹ
͵
Ȼ
ٷ
ݳ
ݴ
ʵ
˵
֪
Ϥ
˵
ȥ
ɺ
Ҫ
ü
ϴ
ʵʵ
۴
Ӧ
ʱ
ٵ
һ
·
Ŵ
Ŵ
ʶ
Ȼ
Լ
Ϊ
˵
û
û
ÿ
ÿÿ
ÿʱÿ
Ȼ
Ȼ
Ī
Ī
Ī
Ī
ĬĬ
ĬȻ
ĩ
ѵ
ѵ
ѹ
˵
긴һ
ż
ż
Ʃ
ƫƫ
ƹ
ƽ
ͨ
ʵ
ͷ
ֹ
ǡ
ǡ
ǡǡ
ǡ
ǡ
ǡ
ǧ
ǧ
ǧǧ
в
Ī
̼
֮
ȡ
ȥ
Ȩʱ
ȫ
ȫ
ȫ
ȫȻ
ȫ
Ȼ
Ծ
Ȼ
ոһ
ռ
ս
糣
˵ȵ
ǰ
ͷ
ɪɪ
ɳɳ
ȥ
һ.
һһ
һ
һ
һЩ
һ
һͨ
һ
һ
һʱ
һ
һƬ
һ
һֱ
һ
һ
һת
һ
һ
ȥ
һ
Ȼ
˵
ר
Ҳ˵
˵
ϸ
С
м
ḻ
Ϊ
Ϊʲ
Ϊֹ
Ϊ
Ҫ
֮ǰ
֮
֮
Ҳ˵
Ҳ
˽
ȡ
ƶ
Щ
ʲ
Ϊ
ǰ
Ժ
Թ
ͼ
ΰ
ƺ
ʹ
ʹ
ٽ
Ȼ
Ԫ
Ȳ
Ⱥ
ȫ
ȫ
ȫ
ͬ
֮
ٴ
˵
ֱ
ǰ
ǰ
ǰ
ǿ
ʮ
ȴ
ȴ
ԭ
ּ
ʱ
˫
Ӧ
ӳ
ȡ
ܵ
Ϥ
ֻ
ֻ
ֻ
ֻ
ٿ
ͬһ
ͬ
ʹ
Χ
Ǻ
Ψ
ॵ
ٺ
ô
ʧȥ
õ
ͬ
ʼ
֪
ǵ
ȫ
ȫ
ʵ
ʵ
Ӧ
Դ
Է
Ա
С
Ҫ
Ѿ
Ͱ
㷺
Ӧ
Ӧ
Ӧ
չ
ǿ
ǿ
ǰ
ʱ
γ
ʱ
ó
õ
Ȼ
Ҫ
ܽ
Ω
˼
Ը
Ϊ
ҵ
Ի
ս
ν
/
ȷ
Dz
Ƿ
Ȼ
ͨ
ձ
м
Ч
ʱ
е
е
ĩ##ĩ
˵
ijij
ӭ
ֵ
˵
˴
ʱ
˴
ÿ
ÿ
ÿ
ȼ
Ƚ
ûκ
ע
Ȼ
ر
ص
ִ
ɴ
Ŀǰ
ֱ
ֱ
෴
ͬ
Ӧ
൱
գ
Ӻ
֪
ȷ
ƶ
ͻ
ͻȻ
ڶ
ϰ
̺
ά
ϵ
ܷ
ܹ
Ժ
Դ
Χ
ĪȻ
Ϊ
ж
ʾ
Ҫ
涨
Ʃ
Ϊ
ʶ
˵
˵
˵˵
˭
˭
ת
ת
ת
ﵽ
Ѹ
ȥ
Ҫ
һ
Ӧ
ʵ
ͨ
ѭ
ǰ
ȡ
ش
Ҫ
ֹ
ʱ
ѵ˵
Ҫ
Ƕ
================================================
FILE: legacy_v1/src/Leorio/financedict.txt
================================================
备付金
余额宝
佣金宝
前海
C轮融资
区块链
数字货币
去中心化
正虹科技
千山药机
常山北明
华菱精工
蓝晓科技
兴化股份
红墙股份
世荣兆业
奥飞数据
万兴科技
德邦股份
海辰药业
宣亚国际
长亮科技
蓝色光标
翔港科技
永吉股份
天永智能
成飞集成
北特科技
科顺股份
三五互联
哈空调
新宁物流
湖南投资
华联控股
上海雅仕
海澜之家
富祥股份
药石科技
神雾环保
新城控股
上峰水泥
旗滨集团
久吾高科
天虹股份
横店影视
天泽信息
华发股份
四川双马
国发股份
中国国航
万年青
复旦复华
信达地产
光启技术
中设集团
山西焦化
象屿股份
南京银行
安迪苏
神雾节能
罗普斯金
展鹏科技
罗 牛 山
中石科技
真视通
金发拉比
葛洲坝
大唐电信
劲胜智能
*ST金宇
智飞生物
科力远
东方通信
英可瑞
*ST东海A
阳光股份
中房股份
南华仪器
顺网科技
天邦股份
先导智能
南方航空
华斯股份
森马服饰
尚品宅配
彩虹股份
珠江实业
中交地产
光华科技
云南城投
诚志股份
信息发展
泰格医药
飞乐音响
永悦科技
中国化学
宏昌电子
东北电气
南山控股
我武生物
天威视讯
康隆达
协鑫集成
中旗股份
海峡股份
古越龙山
爱建集团
阳 光 城
百合花
格力电器
楚江新材
瀛通通讯
*ST云网
天健集团
掌阅科技
中坚科技
中欣氟材
得利斯
海天味业
滨江集团
久其软件
当代明诚
吉比特
中源协和
华友钴业
格力地产
冠农股份
重庆啤酒
华英农业
珠海港
杭氧股份
海螺水泥
世茂股份
京山轻机
华联综超
威孚高科
井神股份
华鑫股份
华录百纳
生 意 宝
开山股份
华新水泥
飞利信
南大光电
众信旅游
重庆建工
奥马电器
雷曼股份
招商蛇口
一汽轿车
镇海股份
北新建材
世龙实业
中南文化
海汽集团
*ST匹凸
六国化工
掌趣科技
北大荒
中国建筑
健友股份
大晟文化
中远海特
首旅酒店
中国人寿
金牌厨柜
金地集团
风语筑
海大集团
精测电子
吉宏股份
中海油服
金自天正
湘潭电化
东方雨虹
新元科技
先达股份
烽火通信
唐人神
首开股份
创业软件
华鲁恒升
老板电器
欧普照明
新 希 望
金健米业
高鸿股份
恒大高新
九强生物
盛天网络
五洲交通
中国高科
哈工智能
科达洁能
新南洋
大商股份
东方财富
江河集团
大华股份
中青宝
天玑科技
高升控股
同仁堂
安德利
万方发展
田中精机
合盛硅业
通源石油
湖南海利
广州港
华西能源
蓝盾股份
聚灿光电
辉隆股份
未名医药
柯利达
傲农生物
塔牌集团
金 融 街
ST云维
山西证券
蓝思科技
中国长城
易见股份
新日股份
三诺生物
S佳通
吉艾科技
电工合金
山鹰纸业
金科股份
南 玻A
创新股份
华胜天成
ST景谷
三全食品
新钢股份
银座股份
新华保险
神马股份
沱牌舍得
中国武夷
云南锗业
国旅联合
元成股份
北陆药业
赫美集团
卧龙地产
上港集团
康得新
福建水泥
滨海能源
保龄宝
金冠电气
蓝光发展
梅雁吉祥
大连重工
当代东方
冀东装备
大秦铁路
福星股份
欧派家居
众应互联
绿景控股
华东重机
通达股份
波导股份
京汉股份
电子城
华伍股份
大连圣亚
皮阿诺
美利云
冀东水泥
三峡新材
奇精机械
海量数据
恒基达鑫
金杯电工
金陵体育
音飞储存
上海银行
振东制药
沙河股份
康跃科技
利尔化学
梦百合
凯伦股份
*ST昌九
会稽山
苏垦农发
汇洁股份
华菱星马
杰克股份
万达信息
华策影视
银亿股份
三毛派神
登海种业
盐 田 港
上工申贝
沃森生物
中国石化
中材国际
玲珑轮胎
天华超净
鸿博股份
吉峰农机
众源新材
志邦股份
光洋股份
柳 工
中南建设
博彦科技
光力科技
美亚柏科
兰州民百
宝鼎科技
东湖高新
美亚光电
华帝股份
智度股份
美丽生态
中远海控
东港股份
江阴银行
宝新能源
建发股份
众兴菌业
仟源医药
祁连山
*ST昌鱼
常山药业
贝达药业
建新股份
三六五网
宝色股份
龙马环卫
粤泰股份
钧达股份
天晟新材
晨鸣纸业
金 螳 螂
双鹭药业
中国太保
达威股份
光韵达
界龙实业
华泰股份
天创时尚
尖峰集团
迪马股份
探路者
强力新材
纳思达
立霸股份
创维数字
华谊集团
浙江交科
盐湖股份
广州发展
风神股份
新湖中宝
湖南发展
华夏幸福
片仔癀
中信银行
蓝英装备
万通地产
华讯方舟
奥佳华
捷成股份
山煤国际
海南橡胶
柘中股份
九阳股份
鱼跃医疗
全筑股份
新开源
香江控股
交大昂立
东方网力
元隆雅图
派思股份
沃施股份
唐德影视
天康生物
恒瑞医药
三安光电
东方时尚
冰川网络
华瑞股份
天山股份
海峡环保
长方集团
申通地铁
万和电气
电广传媒
航天长峰
中国海诚
梦舟股份
涪陵电力
铁流股份
青岛海尔
力源信息
金字火腿
梦洁股份
健康元
张 裕A
万盛股份
共达电声
贤丰控股
桂东电力
工大高新
雅戈尔
设研院
联美控股
南京高科
华天科技
奥飞娱乐
航天电子
荣盛发展
柳钢股份
暴风集团
爱迪尔
博雅生物
航天电器
道明光学
机器人
泛微网络
龙元建设
鼎捷软件
岱勒新材
华业资本
鸿特精密
中元股份
科伦药业
海南高速
中科曙光
科达股份
长信科技
海航创新
星光农机
美诺华
龙江交通
江泉实业
大亚圣象
中集集团
天源迪科
富安娜
佛山照明
财信发展
三维丝
美的集团
双汇发展
东方钽业
兰太实业
敦煌种业
国际实业
激智科技
凯龙股份
深科技
恒锋工具
兆日科技
青龙管业
时代万恒
洽洽食品
顺发恒业
美凯龙
银信科技
京投发展
兴发集团
梅花生物
川大智胜
云意电气
金枫酒业
利君股份
科泰电源
数据港
天地源
三维通信
上实发展
伟明环保
中国平安
信雅达
天广中茂
绿地控股
金逸影视
粤高速A
天神娱乐
香雪制药
九牧王
浙大网新
北京银行
贵州茅台
同力水泥
天目药业
隆平高科
三棵树
冠城大通
天能重工
华兰生物
陕西黑猫
厦门国贸
易联众
台基股份
永安行
老百姓
腾龙股份
用友网络
北京城建
再升科技
皖江物流
旺能环境
昆仑万维
江苏银行
国联水产
沙隆达A
爱乐达
广州浪奇
*ST准油
水井坊
聚隆科技
华谊兄弟
安妮股份
五 粮 液
博汇纸业
金洲慈航
苏 泊 尔
中国交建
亚宝药业
吉林化纤
金路集团
同洲电子
二三四五
凤形股份
东方通
齐峰新材
深圳华强
明星电缆
建设银行
安彩高科
北信源
海正药业
亚泰集团
鼎信通讯
木林森
万里石
家家悦
金陵饭店
华中数控
达 意 隆
万马股份
南风股份
卫宁健康
洋河股份
金晶科技
中国重汽
辉煌科技
东兴证券
多伦科技
太化股份
瑞斯康达
招商轮船
雏鹰农牧
恒生电子
巴安水务
宁夏建材
东莞控股
杭州银行
深圳机场
冠昊生物
瑞茂通
贵人鸟
招商证券
华侨城A
方正科技
华孚时尚
龙津药业
拓普集团
天原集团
东晶电子
江铃汽车
新澳股份
天坛生物
安正时尚
隆基股份
名雕股份
长盈精密
澳柯玛
网达软件
粤 水 电
华夏银行
现代制药
金科文化
润达医疗
赛摩电气
花园生物
福建高速
三友化工
无锡银行
长春经开
易尚展示
太极股份
京华激光
中毅达
滨化股份
一拖股份
银河生物
长航凤凰
科士达
全 聚 德
神州泰岳
华电重工
中农立华
上海家化
永艺股份
森特股份
中国铁建
顺鑫农业
紫鑫药业
中信海直
山东路桥
深物业A
上柴股份
克来机电
长城汽车
汉威科技
亚盛集团
福田汽车
申万宏源
广州酒家
埃斯顿
煌上煌
同花顺
鲁商置业
七 匹 狼
桐昆股份
绵石投资
易德龙
上海物贸
伊利股份
合锻智能
华贸物流
上海三毛
东阿阿胶
睿康股份
奋达科技
云南能投
游族网络
杰瑞股份
中国中铁
青岛啤酒
黑猫股份
梅安森
同方股份
绿盟科技
创意信息
浪潮软件
浙能电力
来伊份
华星创业
兰石重装
重庆路桥
西水股份
维维股份
新华百货
中直股份
莎普爱思
中国石油
康盛股份
中海达
哈高科
景兴纸业
众合科技
首钢股份
红旗连锁
川环科技
美尔雅
中远海能
赛为智能
三星医疗
银邦股份
爱施德
光大银行
浙江富润
西藏发展
荣科科技
万业企业
芭田股份
三一重工
银禧科技
广宇集团
神州高铁
常熟银行
证通电子
天瑞仪器
国祯环保
中国神华
洁美科技
中国汽研
兴业银行
法 尔 胜
金花股份
东吴证券
中洲控股
新 大 陆
海普瑞
*ST柳化
天富能源
昌红科技
海南瑞泽
*ST宝实
杰恩设计
铁龙物流
三湘印象
张家界
金禾实业
中远海发
阳光照明
新泉股份
歌力思
榕基软件
厦门港务
上海机电
泸州老窖
澄星股份
靖远煤电
白云机场
宁波港
正丹股份
物产中大
襄阳轴承
天夏智慧
浙江美大
恒立液压
顾家家居
华润双鹤
中航光电
千金药业
圣农发展
佳讯飞鸿
宇通客车
继峰股份
保利地产
天润曲轴
广誉远
深纺织A
南方汇通
奥特佳
利安隆
北京文化
长江润发
新五丰
华舟应急
鲁阳节能
拓尔思
国药一致
徐家汇
科新机电
印纪传媒
千禾味业
汇川技术
雪榕生物
华远地产
上海临港
元力股份
欢瑞世纪
汉鼎宇佑
金新农
透景生命
振华重工
理工光科
新乡化纤
世纪星源
云煤能源
海兴电力
天茂集团
莱美药业
同有科技
福耀玻璃
中钨高新
索菲亚
宋城演艺
交运股份
中体产业
星星科技
鹏博士
乐凯新材
广发证券
歌华有线
三维股份
一汽夏利
上海机场
新农开发
希努尔
乐普医疗
浙数文化
东方新星
闽发铝业
深南电路
豪迈科技
陆家嘴
海鸥卫浴
东富龙
中国银行
东北证券
中国国旅
交通银行
通富微电
四维图新
厦门空港
永和智控
易华录
广弘控股
山东海化
亿晶光电
周大生
重庆百货
棒杰股份
益丰药房
新华龙
鸿利智汇
拓日新能
齐心集团
思创医惠
小康股份
艾比森
山推股份
王府井
晶方科技
雪 莱 特
振静股份
华纺股份
*ST坊展
宏大爆破
二六三
龙净环保
承德露露
迎驾贡酒
丰林集团
粤宏远A
大众交通
锡业股份
骆驼股份
科大智能
燕京啤酒
大港股份
四创电子
獐子岛
龙头股份
海利生物
炬华科技
迪安诊断
光线传媒
锦江股份
齐翔腾达
鞍重股份
汇通能源
凯恩股份
汉邦高科
新 海 宜
四川金顶
华域汽车
利欧股份
苏常柴A
太极实业
海欣股份
大连港
杭齿前进
航民股份
广东甘化
人民网
日盈电子
世联行
天润数娱
贵绳股份
云南白药
中新赛克
远方信息
融钰集团
锦江投资
易成新能
中水渔业
沈阳化工
江海股份
楚天科技
华联股份
东材科技
兴源环境
澳洋科技
民生银行
江苏阳光
洪城水业
华宏科技
神州长城
ST常林
农发种业
美芝股份
旋极信息
首航节能
通鼎互联
凯美特气
渤海轮渡
山河药辅
王子新材
新界泵业
汉缆股份
星辉娱乐
重庆水务
三维工程
美好置业
健帆生物
兆驰股份
通化东宝
乐山电力
天鹅股份
渝 开 发
欣龙控股
长江投资
丽珠集团
青海华鼎
湖北广电
东南网架
黑牡丹
上汽集团
东方明珠
实丰文化
康恩贝
宜宾纸业
海默科技
海油工程
中科金财
东华科技
国投电力
太平鸟
合众思壮
天津港
*ST新城
星宇股份
工商银行
弘宇股份
光明乳业
西藏城投
申科股份
延华智能
露天煤业
岭南控股
*ST青松
华金资本
永太科技
中国电建
国药股份
星源材质
西安旅游
佳隆股份
金力泰
金盾股份
四方股份
上海建工
云投生态
怡达股份
宝信软件
广电电气
日照港
海南椰岛
大龙地产
富春股份
*ST 中绒
新亚制程
建投能源
浙江震元
华懋科技
广电网络
锦州港
金证股份
太安堂
今世缘
商赢环球
多喜爱
冠豪高新
凯利泰
永高股份
东方精工
黔轮胎A
文投控股
高伟达
中原传媒
北京科锐
黄山旅游
菲达环保
博信股份
长城影视
华闻传媒
通策医疗
小天鹅A
徐工机械
陕西煤业
天地科技
合金投资
济民制药
亚星客车
御银股份
海欣食品
韩建河山
联创电子
宁波精达
合诚股份
力生制药
京运通
润邦股份
亚通股份
新华医疗
东诚药业
世纪瑞尔
普邦股份
万润股份
招商银行
中国国贸
华宇软件
锦龙股份
沧州大化
强生控股
兖州煤业
浙商证券
阳光电源
摩恩电气
旷达科技
*ST丹科
中远海科
轻纺城
申能股份
南京医药
中国中车
长久物流
南卫股份
中华企业
德威新材
飞荣达
茂业通信
览海投资
鹿港文化
酒鬼酒
长电科技
龙泉股份
沃特股份
金河生物
大元泵业
天房发展
利亚德
金鹰股份
*ST爱富
史丹利
福建金森
安徽水利
亚太实业
扬子新材
初灵信息
航天机电
中衡设计
福能股份
华东医药
万孚生物
威帝股份
仙琚制药
亚邦股份
东方航空
南京化纤
桂林旅游
苏交科
珠江控股
同达创业
白云电器
浪潮信息
飞科电器
国民技术
金莱特
丰元股份
华鹏飞
西藏旅游
环能科技
神思电子
白云山
山东章鼓
川投能源
上海莱士
北部湾港
中航地产
国投中鲁
莱宝高科
欣旺达
中航机电
古井贡酒
大豪科技
润和软件
乐凯胶片
微光股份
安硕信息
海立股份
三圣股份
科林电气
*ST宏盛
博敏电子
新文化
方直科技
金固股份
安记食品
山东出版
帝龙文化
创新医疗
三聚环保
博思软件
新华文轩
百川能源
瑞康医药
正平股份
长荣股份
海通证券
应流股份
神开股份
津膜科技
国机通用
西部黄金
中泰化学
贵阳银行
凤凰光学
金利华电
三特索道
华东电脑
萃华珠宝
浙江仙通
南洋股份
德尔股份
上海沪工
乐心医疗
中信证券
四方冷链
卫 士 通
九鼎投资
必康股份
麦趣尔
宜华健康
巨人网络
平治信息
科达利
兆易创新
城地股份
步长制药
嘉澳环保
朗迪集团
五洲新春
科森科技
杭电股份
东方电缆
引力传媒
司太立
集友股份
维力医疗
圣达生物
德新交运
赛福天
山东华鹏
大唐发电
凤凰传媒
嘉泽新能
中国中冶
中国铝业
*ST锐电
陕鼓动力
君正集团
中国西电
晋亿实业
宁波热电
渤海活塞
江苏有线
*ST嘉陵
洛阳玻璃
石化油服
厦华电子
星湖科技
*ST京城
人民同泰
新华传媒
益民集团
中路股份
*ST厦工
华北制药
山西汾酒
天业股份
天津磁卡
宁波海运
保税科技
鲁银投资
汉商集团
天海投资
一汽富维
实达集团
S*ST前锋
绿庭投资
中船防务
奥瑞德
哈药股份
豫园股份
富控互动
申达股份
鹏起科技
惠泉啤酒
中珠医疗
国睿科技
老白干酒
时代出版
莫高股份
狮头股份
栖霞建设
宏达矿业
海航基础
腾达建设
驰宏锌锗
天药股份
信威集团
瑞贝卡
*ST海润
盘江股份
广东明珠
天科股份
三房巷
通葡股份
正源股份
亚星化学
营口港
XD万华化
广汇汽车
华仪电气
江苏舜天
重庆港九
亿利洁能
嘉化能源
航天信息
外运发展
赣粤高速
国电南自
大湖股份
广汇能源
ST成城
中昌数据
民丰特纸
赤天化
瀚叶股份
海航控股
江苏吴中
华资实业
国中水务
安通控股
太原重工
永泰能源
宝硕股份
中国船舶
*ST新亿
太极集团
西宁特钢
*ST天成
大名城
东方金钰
中葡股份
海泰发展
东风科技
宋都股份
康欣新材
宁波联合
四川路桥
东风汽车
朗新科技
隆盛科技
中孚信息
民德电子
南京聚隆
新雷能
贝斯特
会畅通讯
朗科智能
辰安科技
山鼎设计
迈克生物
康拓红外
双杰电气
鲍斯股份
航新科技
中光防雷
迦南科技
三环集团
腾信股份
飞天诚信
光环新网
光一科技
麦捷科技
邦讯技术
聚飞光电
吴通控股
华昌达
海联讯
新莱应材
飞力达
纳川股份
福安药业
佳士科技
通裕重工
智慧松德
迪威迅
新研股份
科融环境
量子高科
星普医科
大富科技
锦富技术
锐奇股份
易世达
坚瑞沃能
盛运环保
康芝药业
华谊嘉信
世纪鼎利
福瑞股份
华力创通
回天新材
上海凯宝
梅泰诺
金龙机电
宝德股份
立思辰
盈趣科技
香山股份
麦格米特
凯中精密
普路通
南兴装备
万达电影
中矿资源
葵花药业
燕塘乳业
奥瑞金
美盛文化
顾地科技
猛狮科技
德联集团
万润科技
民盛金科
三垒股份
瑞和股份
艾格拉斯
亚夏汽车
ST龙力
八菱科技
圣阳股份
中京电子
雷柏科技
群兴玩具
顺灏股份
三七互娱
千红制药
东方铁塔
鸿路钢构
云图控股
林州重机
海源机械
光正集团
天桥起重
日发精机
恺英网络
达华智能
涪陵榨菜
科林环保
金正大
益生股份
天马精化
壹桥股份
龙星化工
江苏神通
尤夫股份
胜利精密
凯撒文化
中原特钢
达实智能
爱仕达
建研集团
信邦制药
南洋科技
东山精密
千方科技
亚太药业
台海核电
神剑股份
森源电气
富临运业
顺丰控股
漫步者
高乐股份
潮宏基
海宁皮城
人人乐
*ST三泰
博云新材
大 东 南
德奥通航
升达林业
步 步 高
合兴包装
恒康医疗
特 尔 佳
利达光电
巴士在线
深圳惠程
中航三鑫
常铝股份
新光圆成
恒星科技
天马股份
三变科技
广博股份
浔兴股份
山河智能
万邦德
沙钢股份
凯瑞德
云南旅游
轴研科技
久联发展
丽江旅游
华信国际
东信和平
霞客环保
德豪润达
华邦健康
华润三九
中弘股份
中通客车
凯迪生态
中粮生化
山大华特
*ST天化
云内动力
现代投资
东凌国际
云南铜业
吉电股份
陕西金叶
冰轮环境
云铝股份
凯撒旅游
长江证券
*ST平能
通化金马
浩物股份
新华制药
南风化工
苏宁环球
恒逸石化
厦门信达
*ST华泽
建新矿业
东方电子
海航投资
平潭发展
太阳能
海南海药
供销大集
航天发展
中天金融
粤电力A
万泽股份
万 家 乐
美菱电器
荣安地产
国际医学
华塑控股
鄂武商A
渤海金控
胜利股份
华数传媒
广聚能源
皇庭国际
泛海控股
中国天楹
神州数码
中粮地产
深深房A
深赤湾A
深深宝A
深中华A
全新好
深振业A
华测导航
ST生化
和仁科技
牧原股份
传艺科技
庄园牧场
浩云科技
华钰矿业
元祖股份
万邦达
曲江文旅
贵航股份
汉森制药
长江电力
吉祥航空
华仁药业
金通灵
红蜻蜓
万东医疗
新日恒力
光大证券
伊力特
张江高科
中南传媒
捷顺科技
瀚蓝环境
维宏股份
精锻科技
深华发A
曲美家居
中威电子
景嘉微
安信信托
赢时胜
天翔环境
永利股份
中金环境
达志科技
东方日升
金明精机
金龙汽车
兰州黄河
湘电股份
国机汽车
奇信股份
龙大肉食
中山公用
杭锅股份
视觉中国
恒信东方
南天信息
福成股份
特变电工
江苏国信
深天地A
北京城乡
广日股份
宏图高科
中兴商业
宜华生活
潍柴重机
文山电力
尚荣医疗
羚锐制药
围海股份
好利来
优博讯
远达环保
精伦电子
慈文传媒
安井食品
隧道股份
恒丰纸业
黑牛食品
雄韬股份
东阳光科
兄弟科技
华铁股份
农 产 品
雷鸣科化
翠微股份
山东威达
ST南化
百利科技
*ST沪科
博深工具
清水源
新天然气
信捷电气
哈森股份
钱江生化
杭钢股份
奥克股份
马应龙
丰乐种业
登云股份
三角轮胎
新开普
永鼎股份
奥拓电子
嘉欣丝绸
华自科技
新朋股份
文科园林
四川九洲
美联新材
三元股份
柏堡龙
茂业商业
正邦科技
新力金融
深圳能源
悦达投资
四方达
川化股份
南京公用
朗姿股份
招商公路
广汽集团
小商品城
金石东方
上海环境
中核钛白
雪峰科技
光电股份
集智股份
国元证券
本钢板材
名家汇
鲁 泰A
西安饮食
南京新百
华扬联众
数字政通
新大洲A
北辰实业
仁和药业
南威软件
德尔未来
奥维通信
博实股份
凌云股份
东江环保
中环股份
青青稞酒
华统股份
皖能电力
天龙股份
荃银高科
新世界
越秀金控
龙韵股份
利源精制
英飞拓
奇正藏药
金亚科技
丽鹏股份
超图软件
金安国纪
晨光文具
新疆浩源
卓郎智能
东风股份
洪涛股份
南都电源
上海九百
江南高纤
吴江银行
航发科技
浦东建设
科大国创
汇中股份
林海股份
永贵电器
*ST智慧
比亚迪
泰达股份
华茂股份
蓝科高新
深高速
宁波富邦
和而泰
银轮股份
昆药集团
力星股份
双环传动
兰花科创
城投控股
哈尔斯
路畅科技
上海电力
人福医药
汉得信息
数码科技
潍柴动力
联环药业
三 力 士
启明星辰
四川成渝
杭州解百
科锐国际
共进股份
三峡水利
北大医药
东土科技
神奇制药
丰原药业
读者传媒
中粮糖业
雪人股份
富奥股份
凤竹纺织
桂林三金
天沃科技
鹏翎股份
福达股份
龙宇燃油
广东鸿图
兴业证券
神州信息
浙江广厦
春兴精工
恒力股份
姚记扑克
同济堂
双箭股份
漳州发展
紫光股份
裕兴股份
天龙光电
九 芝 堂
三鑫医疗
秀强股份
兴业股份
天银机电
石基信息
大东方
安控科技
恒泰实达
华昌化工
吉林高速
津滨发展
远东传动
常青股份
宜通世纪
宝鹰股份
中国联通
德美化工
民生控股
第一创业
北方国际
惠而浦
道恩股份
加加食品
西昌电力
中新科技
皖新传媒
金一文化
汉王科技
*ST沈机
鲁信创投
广汇物流
快克股份
国投资本
诺 普 信
幸福蓝海
中航电子
浦东金桥
科远股份
舒泰神
乔治白
京威股份
兴民智通
惠发股份
闰土股份
泰胜风能
皇氏集团
国金证券
瑞尔特
科力尔
吉林敖东
天喻信息
新华联
ST慧球
宜安科技
西部证券
中色股份
苏州高新
平高电气
智云股份
宝钢股份
际华集团
晋西车轴
山东高速
津劝业
新纶科技
丰华股份
大禹节水
欧亚集团
东音股份
金徽酒
华能国际
*ST上普
博闻科技
精准信息
天壕环境
江化微
雪浪环境
利德曼
东华软件
昆百大A
中电广通
*ST运盛
摩登大道
亿利达
长白山
上海医药
中航重机
中电鑫龙
思源电气
杭萧钢构
佳发安泰
金隅集团
远兴能源
安居宝
精艺股份
江苏国泰
山东金泰
天业通联
康达尔
三超新材
中原环保
安车检测
中持股份
西部矿业
通润装备
铜陵有色
开润股份
诚迈科技
大西洋
克明面业
首商股份
武汉控股
巨轮智能
珠江啤酒
华安证券
美康生物
乐金健康
精华制药
九洲电气
菲林格尔
华达科技
中装建设
游久游戏
健民集团
北部湾旅
申华控股
宝光股份
大康农业
春兰股份
风范股份
以岭药业
百隆东方
软控股份
金智科技
海螺型材
百联股份
中原高速
商业城
国海证券
中国软件
闽东电力
富春环保
恒银金融
吉林森工
莱茵体育
哈投股份
楚天高速
金运激光
西南证券
川仪股份
欧浦智网
皖天然气
爱康科技
西藏矿业
方大化工
文化长城
万 科A
郴电国际
南宁百货
开元股份
联明股份
宝莱特
雄塑科技
创力集团
联发股份
国统股份
华东科技
成都路桥
紫金矿业
祥源文化
泰合健康
中飞股份
仙坛股份
宁波高发
中原证券
西藏药业
广晟有色
宝胜股份
朗源股份
华峰超纤
奥康国际
国轩高科
汤臣倍健
盛通股份
新华网
力帆股份
天圣制药
环旭电子
通宝能源
恒立实业
山东药玻
云赛智联
华映科技
贵糖股份
旭光股份
新 华 都
兔 宝 宝
宜昌交运
广信材料
广泽股份
开创国际
长青集团
南宁糖业
大洋电机
上海电气
林洋能源
任子行
四环生物
黔源电力
中国动力
三雄极光
纽威股份
双星新材
绿城水务
民和股份
东睦股份
诚意药业
大恒科技
绿茵生态
安利股份
和邦生物
日上集团
中化国际
隆基机械
青岛双星
东安动力
中视传媒
开开实业
卧龙电气
中恒集团
天宸股份
中信重工
益佰制药
东方海洋
如意集团
银鸽投资
富森美
中国医药
圆通速递
开滦股份
慈星股份
中煤能源
宁沪高速
泰豪科技
浙江世宝
中际旭创
迪森股份
长城动漫
烽火电子
万向德农
双良节能
佛塑科技
双成药业
海格通信
双象股份
南岭民爆
合肥百货
寒锐钴业
江南化工
杭叉集团
特 力A
万顺股份
上海电影
金种子酒
中电环保
苏州固锝
中炬高新
爱普股份
合康新能
科斯伍德
友阿股份
华海药业
中泰股份
先河环保
博世科
亚厦股份
嘉应制药
海康威视
*ST河化
中文在线
惠达卫浴
青海春天
南方传媒
国新能源
新集能源
长园集团
第一医药
新美星
欣天科技
福鞍股份
太平洋
中航高科
长源电力
鲁西化工
宏创控股
光迅科技
东易日盛
贵州百灵
宁波富达
绿康生化
国泰君安
龙源技术
新野纺织
长缆科技
江南水务
安源煤业
长安汽车
华电国际
华建集团
美达股份
申通快递
豫能控股
聚龙股份
恩华药业
晓程科技
中工国际
亚太科技
方正证券
中牧股份
珠江钢琴
神宇股份
红阳能源
天音控股
航发控制
浙江鼎力
北纬科技
奥联电子
中铁工业
徕木股份
吉鑫科技
明星电力
国农科技
花王股份
华微电子
九州通
天目湖
拓斯达
鸿达兴业
广生堂
今飞凯达
广深铁路
北玻股份
恒宝股份
赛升药业
恒为科技
江淮汽车
达安基因
海越股份
唐山港
向日葵
汇源通信
莱茵生物
道道全
四川长虹
智光电气
融捷股份
健盛集团
灵康药业
长生生物
万丰奥威
五矿资本
外高桥
启迪古汉
凤凰股份
鑫茂科技
赛轮金宇
节能风电
华虹计通
浙江医药
毅昌股份
百花村
康缘药业
梦网集团
岳阳林纸
济川药业
海信科龙
朗玛信息
银泰资源
苏利股份
西藏天路
永新股份
报 喜 鸟
嘉寓股份
京泉华
新时达
汇冠股份
国瓷材料
九洲药业
浙江东方
上海梅林
江苏雷利
科隆股份
西部创业
大同煤业
海虹控股
*ST郑煤
国电电力
盾安环境
我乐家居
时代新材
瑞凌股份
明家联合
东方电气
中成股份
沪电股份
深圳燃气
中国重工
湖北能源
东方集团
圣邦股份
西部牧业
航天通信
安琪酵母
东北制药
好当家
日月股份
华明装备
海亮股份
星云股份
金山股份
赛托生物
安诺其
积成电子
西王食品
长高集团
桃李面包
海印股份
佳沃股份
京蓝科技
百大集团
九安医疗
通程控股
四川美丰
九有股份
怡 亚 通
京天利
普利制药
深天马A
吉视传媒
辽宁成大
泰尔股份
中国电影
阳泉煤业
联络互动
万林股份
金鸿控股
日出东方
东旭光电
中国银河
理邦仪器
北斗星通
峨眉山A
红 宝 丽
漳泽电力
复星医药
五矿发展
太空板业
文一科技
兴业科技
内蒙华电
博济医药
生物股份
清新环境
新北洋
福斯特
道氏技术
特发信息
长江传媒
浙江众成
国美通讯
崇达技术
中富通
维尔利
弘业股份
春秋航空
汇鸿集团
友好集团
江西铜业
苏试试验
太阳纸业
德宏股份
艾华集团
裕同科技
海德股份
乾照光电
卫信康
康斯特
众业达
国风塑业
鹭燕医药
众泰汽车
麦达数字
弘讯科技
大连电瓷
亿帆医药
新洋丰
五洋科技
智慧能源
华西股份
康尼机电
中 关 村
特锐德
中国核建
豫光金铅
艾迪精密
新兴铸管
上海石化
理工环科
雅本化学
中超控股
河钢股份
四通股份
石大胜华
黑芝麻
中能电气
浩丰科技
远大智能
内蒙一机
苏大维格
南京熊猫
兴蓉环境
中化岩土
中钢国际
黄河旋风
康美药业
邦宝益智
凯乐科技
文峰股份
广百股份
武汉中商
数字认证
西部建设
*ST华菱
佳都科技
*ST中基
电科院
铜峰电子
飞马国际
华泰证券
航发动力
黄山胶囊
三元达
高能环境
中原内配
恒天海龙
宝钢包装
天润乳业
通产丽星
岷江水电
拉芳家化
赞宇科技
瑞特股份
三联虹普
宏润建设
金海环境
珈伟股份
航天工程
精达股份
蓝黛传动
中来股份
岭南园林
科华恒盛
南通锻压
银河电子
宝通科技
华立股份
庞大集团
中国核电
腾邦国际
建艺集团
康强电子
青岛金王
荣泰健康
凯盛科技
北京利尔
盈峰环境
奥 特 迅
福日电子
宗申动力
京东方A
濮耐股份
中潜股份
*ST三维
中亚股份
*ST一重
*ST松江
京能电力
江山股份
综艺股份
巨化股份
华媒控股
洪都航空
红宇新材
海思科
北方华创
宝泰隆
中科创达
思维列控
安靠智电
思特奇
司尔特
山东矿机
高德红外
华脉科技
凌霄泵业
新潮能源
柳州医药
中顺洁柔
华能水电
宏达新材
祥龙电业
启迪设计
南山铝业
惠伦晶体
银河磁体
华锦股份
中储股份
良信电器
中科三环
碧水源
红豆股份
火炬电子
玉龙股份
德赛电池
得邦照明
巨星科技
骅威文化
溢多利
久远银海
迪瑞医疗
国恩股份
润欣科技
同和药业
超华科技
茂化实华
钱江水利
亿通科技
奥普光电
联创互联
海洋王
海马汽车
通宇通讯
青松股份
曙光股份
中联重科
紫光国芯
陕天然气
惠威科技
国星光电
久之洋
金城医药
炼石有色
三川智慧
万讯自控
可立克
雪迪龙
三丰智能
合肥城建
启明信息
模塑科技
东方国信
海南矿业
桂冠电力
博晖创新
龙溪股份
宁波建工
全通教育
亚振家居
国信证券
钢研高纳
达刚路机
*ST重钢
山东钢铁
恒泰艾普
维科精华
经纬纺机
网宿科技
吉药控股
抚顺特钢
海利尔
出版传媒
亚太股份
荣之联
珍宝岛
宁波银行
星徽精密
全志科技
中闽能源
温州宏丰
大冷股份
蓝焰控股
华体科技
云天化
东宝生物
广济药业
拓维信息
科华控股
中再资环
泰禾集团
三德科技
宏发股份
运达科技
川润股份
博瑞传播
皖通科技
湘邮科技
汇顶科技
思美传媒
岱美股份
沃华医药
日播时尚
恒通股份
精工钢构
太龙药业
泰和新材
昊华能源
华电能源
瑞泰科技
华天酒店
新黄浦
许继电气
渝三峡A
广安爱众
安泰集团
永辉超市
天保基建
艾德生物
能科股份
东华测试
宝钛股份
贵广网络
盛路通信
永安药业
悦心健康
久立特材
中润资源
新联电子
好想你
长海股份
金陵药业
万集科技
秦川机床
佛慈制药
荣丰控股
广联达
诺邦股份
华灿光电
东方创业
坚朗五金
伟星股份
新天科技
金浦钛业
英特集团
东方电热
英洛华
华光股份
安科生物
东软载波
海王生物
跃岭股份
威华股份
高盟新材
汉钟精机
焦点科技
标准股份
仁智股份
海翔药业
南华生物
扬帆新材
瑞丰高材
乐惠国际
航天动力
起步股份
高新兴
秦安股份
特一药业
路通视信
诺力股份
延长化建
古鳌科技
中百集团
赛隆药业
中国宝安
南方轴承
西部材料
三花智控
惠博普
ST新梅
明牌珠宝
苏州科达
东方锆业
建设机械
天域生态
富临精工
龙建股份
海伦哲
安徽合力
中新药业
皖维高新
韵达股份
耀皮玻璃
海陆重工
牧高笛
英搏尔
数源科技
金圆股份
莲花健康
合力泰
安泰科技
中钢天源
平安银行
大众公用
三利谱
华平股份
宏达高科
万通智控
恒顺众昇
华铁科技
传化智联
东软集团
国光股份
同济科技
天山生物
晶盛机电
金信诺
百润股份
今天国际
金龙羽
天宝食品
刚泰控股
*ST普林
河北宣工
中航飞机
海伦钢琴
惠天热电
日海通讯
环球印务
顶点软件
中国卫星
中宠股份
世纪华通
方正电机
威 尔 泰
联建光电
比音勒芬
禾丰牧业
陕国投A
多氟多
海波重科
伟隆股份
创元科技
赛象科技
香溢融通
雅克科技
宏辉果蔬
新疆天业
华丽家族
长城电工
坤彩科技
和佳股份
山东地矿
中航电测
海能达
太钢不锈
东方网络
海鸥股份
全柴动力
洲际油气
德展健康
广田集团
科华生物
厦门钨业
城市传媒
利民股份
嘉麟杰
同大股份
联化科技
国检集团
中文传媒
诺德股份
科陆电子
天汽模
章源钨业
振华科技
ST明科
金刚玻璃
红日药业
沧州明珠
中鼎股份
金轮股份
东方银星
亚泰国际
弘亚数控
财通证券
松发股份
嘉诚国际
兰生股份
塞力斯
大北农
隆华节能
通用股份
GQY视讯
中电电机
*ST大控
苏州恒久
康泰生物
中科新材
凯文教育
天马科技
紫江企业
中恒电气
胜宏科技
华意压缩
平煤股份
宁波中百
联创光电
鲁亿通
恒通科技
同兴达
包钢股份
潞安环能
杰赛科技
大冶特钢
广西广电
法拉电子
茶花股份
道森股份
得润电子
粤 传 媒
清源股份
天铁股份
瑞普生物
三星新材
东方证券
银之杰
金牛化工
飞亚达A
蒙草生态
分众传媒
孚日股份
迅游科技
金麒麟
江山欧派
浙富控股
大金重工
顺络电子
隆鑫通用
中航资本
广电运通
华工科技
华鼎股份
温氏股份
科大讯飞
上海能源
长鹰信质
双塔食品
水星家纺
勤上股份
鸣志电器
方盛制药
大通燃气
宁波华翔
汇金通
青山纸业
湖南天雁
星期六
美邦服饰
艾艾精工
明泰铝业
星网锐捷
新宝股份
中马传动
宏盛股份
天顺风能
博士眼镜
禾望电气
至正股份
钱江摩托
富瀚微
天首发展
鼎龙股份
秦港股份
动力源
天通股份
甘肃电投
国盛金控
江特电机
远大控股
澳洋顺昌
首创股份
两面针
宁波东力
科信技术
*ST大有
远光软件
创兴资源
格林美
金钼股份
佩蒂股份
东珠景观
新 和 成
易事特
*ST紫学
置信电气
武进不锈
江西长运
神力股份
金贵银业
博通股份
北矿科技
安奈儿
科迪乳业
红 太 阳
*ST万里
硕贝德
康力电梯
航天晨光
冀中能源
荣华实业
中央商场
嘉事堂
英威腾
星帅尔
凯普生物
斯莱克
农业银行
常熟汽饰
龙蟒佰利
东方能源
万里马
万安科技
老凤祥
美锦能源
永创智能
一心堂
新疆众和
新安股份
桂发祥
智慧农业
松芝股份
奥翔药业
海兰信
高争民爆
郑煤机
远 望 谷
长春燃气
酒钢宏兴
世名科技
中航沈飞
乾景园林
正业科技
爱尔眼科
香梨股份
ST信通
英唐智控
大庆华科
中国科传
利群股份
上海凤凰
振华股份
博威合金
盛洋科技
美尚生态
华正新材
世运电路
圣龙股份
海特高新
冠福股份
键桥通讯
硅宝科技
罗顿发展
汇纳科技
海联金汇
株冶集团
苏宁云商
大连友谊
金岭矿业
华测检测
连云港
和科达
京新药业
国泰集团
合纵科技
通光线缆
方大炭素
安科瑞
怡球资源
国创高新
海 利 得
菲利华
银宝山新
北新路桥
电魂网络
威创股份
诚益通
世嘉科技
搜于特
威海广泰
市北高新
美晨生态
鼎汉技术
江南嘉捷
安 纳 达
通威股份
亚星锚链
迪生力
深 赛 格
*ST墨龙
园城黄金
雷迪克
浙江永强
兆丰股份
九华旅游
威龙股份
濮阳惠成
ST仰帆
渤海股份
普丽盛
蓝丰生化
卫星石化
天和防务
南 京 港
景峰医药
石化机械
天舟文化
金桥信息
盈方微
耐威科技
亿联网络
博创科技
南钢股份
超声电子
ST山水
中油资本
棕榈股份
正元智慧
日科化学
号百控股
华荣股份
劲拓股份
海信电器
天士力
电连技术
巨力索具
鞍钢股份
同益股份
泰晶科技
格尔软件
恒源煤电
北方导航
赛意信息
华银电力
横河模具
博腾股份
永清环保
英飞特
长青股份
德艺文创
三晖电气
劲嘉股份
联得装备
金诚信
保变电气
中信国安
昊志机电
凯众股份
纳尔股份
天宇股份
卓翼科技
京能置业
好莱客
新华锦
正泰电器
吉华集团
兴森科技
视源股份
神州易桥
同为股份
*ST圣莱
云海金属
泰山石油
沃尔核材
马钢股份
海天精工
沪宁股份
誉衡药业
正海磁材
恒润股份
美年健康
全信股份
康弘药业
高澜股份
正裕工业
辰欣药业
神农基因
大理药业
卫光生物
阳煤化工
赢合科技
金太阳
睿能科技
英派斯
氯碱化工
百川股份
韶能股份
启迪桑德
雷科防务
上海洗霸
世纪天鸿
先锋新材
光大嘉宝
中科电气
超讯通信
国电南瑞
快乐购
深大通
华升股份
优德精密
四通新材
富满电子
亚玛顿
依顿电子
碳元科技
三祥新材
百傲化学
九鼎新材
中利集团
杉杉股份
哈三联
基蛋生物
美克家居
新宏泰
西仪股份
华控赛格
航天科技
金财互联
杭州高新
斯太尔
友邦吊顶
荣晟环保
新奥股份
中孚实业
大参林
当升科技
中青旅
宝莫股份
太阳电缆
东华能源
如通股份
苏博特
浙江龙盛
信立泰
上海天洋
浦发银行
广宇发展
亚光科技
飞鹿股份
晨化股份
深南电A
聚光科技
法兰泰克
中公高科
新能泰山
三木集团
力盛赛车
*ST中安
海顺新材
联泰环保
大连热电
中国中期
鹏鹞环保
皖通高速
天奇股份
君禾股份
宁波韵升
益盛药业
新易盛
精功科技
贝因美
东方园林
西山煤电
光莆股份
焦作万方
佳创视讯
三夫户外
汇嘉时代
美盈森
鹏辉能源
绝味食品
博天环境
铁汉生态
百洋股份
通达动力
TCL 集团
兆新股份
中金黄金
美思德
伟星新材
拓邦股份
三江购物
东方市场
高新发展
寿仙谷
龙洲股份
金达威
永兴特钢
天华院
中兵红箭
农尚环境
宏达股份
海得控制
中材节能
维格娜丝
和晶科技
浙江东日
天龙集团
广信股份
大丰实业
岳阳兴长
恒锋信息
中核科技
泰禾光电
福晶科技
双林股份
先进数通
五矿稀土
均胜电子
富邦股份
东旭蓝天
厚普股份
开能环保
长春一东
中天科技
金域医学
威星智能
金能科技
华峰氨纶
合力科技
麦迪电气
欧比特
亚威股份
中金岭南
中国出版
丹邦科技
爱司凯
开立医疗
深桑达A
华阳集团
至纯科技
深圳新星
乐歌股份
朗博科技
阳普医疗
天孚通信
金风科技
金洲管道
康惠制药
熊猫金控
新光药业
盛屯矿业
太辰光
江中药业
秋林集团
富瑞特装
恒华科技
方大特钢
兴业矿业
八一钢铁
容大感光
宝馨科技
露笑科技
天海防务
晶瑞股份
川金诺
上海亚虹
亿纬锂能
罗莱生活
贵研铂业
百达精工
深冷股份
锌业股份
创业环保
振芯科技
尔康制药
鄂尔多斯
电光科技
新筑股份
雅百特
北方稀土
山东黄金
瑞丰光电
穗恒运A
新疆火炬
湘油泵
龙蟠科技
移为通信
康德莱
美力科技
辉丰股份
捷荣技术
金发科技
嘉凯城
安凯客车
藏格控股
万里扬
雄帝科技
诚邦股份
新通联
东尼电子
北巴传媒
醋化股份
万向钱潮
广东榕泰
奥士康
口子窖
景旺电子
创源文化
*ST弘高
西部资源
金卡智能
熙菱信息
佐力药业
飞凯材料
省广股份
天赐材料
普利特
四方精创
欧普康视
完美世界
创业黑马
赤峰黄金
蓝帆医疗
北方股份
普洛药业
天际股份
恒邦股份
石英股份
新宙邦
浪莎股份
上海贝岭
翰宇药业
韶钢松山
盐津铺子
设计总院
森霸股份
开尔新材
红星发展
乐通股份
重庆燃气
中广核技
新宏泽
戴维医疗
鹏欣资源
东方中科
晨光生物
麦迪科技
日机密封
德赛西威
上海钢联
有研新材
华通医药
凌钢股份
依米康
地尔汉宇
北讯集团
三钢闽光
帝王洁具
快意电梯
正海生物
中国巨石
大千生态
康达新材
恒顺醋业
经纬电材
中大力德
皇马科技
洪汇新材
横店东磁
超频三
新天药业
先锋电子
江粉磁材
大族激光
新坐标
南极电商
森远股份
安阳钢铁
台华新材
蓝海华腾
中材科技
朗科科技
金鸿顺
歌尔股份
通合科技
智能自控
纵横通信
华铭智能
中油工程
达安股份
银星能源
翔鹭钨业
大立科技
永东股份
凯发电气
永安林业
春风动力
空港股份
星网宇达
中捷资源
武汉凡谷
伊之密
长江通信
南国置业
常宝股份
江龙船艇
鲁北化工
盛讯达
丝路视觉
美格智能
新劲刚
阿石创
银江股份
金银河
国脉科技
蒙娜丽莎
豪能股份
必创科技
辅仁药业
国科微
泰嘉股份
中船科技
北化股份
大烨智能
赣能股份
中通国脉
中设股份
梅轮电梯
天顺股份
勘设股份
富煌钢构
西陇科学
华大基因
英 力 特
宝利国际
恒林股份
新凤鸣
海川智能
联诚精密
天齐锂业
金雷风电
*ST新赛
光威复材
中环装备
大博医疗
金溢科技
正川股份
华源控股
雅化集团
康旗股份
罗平锌电
华锋股份
德创环保
红相电力
双环科技
晨丰科技
浙商中拓
宇顺电子
神火股份
中兴通讯
珀莱雅
中颖电子
捷捷微电
生益科技
昭衍新药
中天能源
广哈通信
兴齐眼药
汇金股份
广和通
长春高新
春秋电子
联合光电
亨通光电
延江股份
光明地产
金瑞矿业
智动力
长盛轴承
昇兴股份
洲明科技
友讯达
中广天择
*ST东数
荣盛石化
东宏股份
华森制药
索通发展
英维克
西泵股份
宏达电子
闻泰科技
东方嘉盛
湖南黄金
安洁科技
莱绅通灵
杭州园林
贝瑞基因
银龙股份
华凯创意
一品红
国光电器
中环环保
欧菲科技
高科石化
意华股份
威唐工业
新国都
茂硕电源
光库科技
澄天伟业
精研科技
剑桥科技
璞泰来
韦尔股份
跨境通
天成自控
水晶光电
喜临门
博迈科
天安新材
信隆健康
江丰电子
高斯贝尔
美都能源
立讯精密
普莱柯
东杰智能
盛达矿业
新经典
江苏索普
金辰股份
扬农化工
新晨科技
和顺电气
旭升股份
和胜股份
润禾材料
北京君正
莱克电气
建研院
金石资源
东百集团
金杯汽车
同德化工
英联股份
伊戈尔
光弘科技
拉夏贝尔
盛弘股份
苏奥传感
迪贝电气
赛腾股份
佳力图
爱柯迪
赣锋锂业
广东骏亚
丽岛新材
东方材料
泰瑞机器
大业股份
上海新阳
国芳集团
盘龙药业
润都股份
长川科技
科创信息
冀凯股份
吉大通信
湖北宜化
铭普光磁
安图生物
银都股份
九典制药
亚士创能
万隆光电
振江股份
晨曦航空
西藏珠峰
祥和实业
华信新材
凯莱英
立昂技术
陇神戎发
鲁抗医药
亚翔集成
科创新源
维业股份
潜能恒信
贝肯能源
阳谷华泰
畅联股份
众生药业
百利电气
宇环数控
阿科力
白银有色
士兰微
易明医药
*ST众和
方大集团
中科信息
张家港行
双一科技
好太太
索菱股份
集泰股份
川恒股份
洛阳钼业
汇金科技
原尚股份
晶华新材
佛燃股份
百华悦邦
英科医疗
洛凯股份
*ST佳电
三孚股份
中曼石油
*ST德力
建科院
康普顿
*ST中富
香飘飘
ST保千里
安达维尔
盛和资源
德生科技
永福股份
海特生物
金奥博
新余国科
信维通信
深康佳A
国立科技
科恒股份
风华高科
万马科技
华通热力
扬杰科技
弘信电子
西菱动力
名臣健康
科蓝软件
山东赫达
保隆科技
贵州燃气
皇台酒业
南纺股份
顺威股份
乐视网
豫金刚石
太龙照明
海达股份
步森股份
成都银行
*ST昆机
*ST吉恩
御家汇
明阳电路
华西证券
*ST建峰
*ST钒钛
*ST烯碳
嘉友国际
中源家居
淳中科技
南都物业
养元饮品
ST网力
天风证券
沪硅产业
新乳业
山鹰国际
湘佳股份
明德生物
新强联
东阳光
中建环能
东方盛虹
河钢资源
达刚控股
青松建化
*ST熊猫
宁德时代
*ST宏图
上海凯鑫
科拓生物
贝仕达克
时空科技
华峰铝业
泰禾智能
聚合顺
首航高科
江苏租赁
鼎胜新材
蔚蓝生物
*ST联络
双林生物
欧菲光
天味食品
吉翔股份
长虹华意
长源东谷
天润工业
*ST梦舟
*ST中南
中贝通信
瀚川智能
弘高创意
中国电研
海晨股份
普元信息
京粮控股
米奥会展
苏州龙杰
安道麦A
成都燃气
*ST金正
硕世生物
上海瀚讯
公牛集团
凯赛生物
森麒麟
雷曼光电
*ST大晟
帝尔激光
*ST济堂
红相股份
凯迪退
城地香江
南兴股份
妙可蓝多
宏和科技
圣济堂
中盐化工
*ST藏格
华软科技
南 玻A
长城证券
帅丰电器
上机数控
品渥食品
协和电子
顺利办
奥特维
*ST界龙
三美股份
广联航空
爱美客
华特气体
联创股份
青农商行
钢研纳克
倍加洁
丰山集团
中国通号
中粮资本
睿创微纳
宇晶股份
奥海科技
杭可科技
东岳硅材
锦江酒店
罗博特科
银泰黄金
易天股份
百亚股份
*ST劝业
传音控股
苏农银行
ST华嵘
罗欣药业
ST冠福
佳禾智能
*ST众泰
中科软
青岛银行
甬金股份
众望布艺
瑞联新材
浙江力诺
海信视像
爱旭股份
福光股份
京沪高铁
申昊科技
美畅股份
甘源食品
天箭科技
国新健康
国茂股份
竞业达
今创集团
科瑞技术
甘咨询
ST浩源
久量股份
创世纪
*ST奋达
ST新海
*ST天娱
锦泓集团
阿拉丁
良信股份
*ST赫美
伟思医疗
睿智医药
若羽臣
蓝盾光电
中铁装配
ST安泰
每日互动
科达制造
华铁应急
金宏气体
麦克奥迪
帝科股份
汉嘉设计
*ST东电
永新光学
天融信
奥来德
阿尔特
我爱我家
*ST江泉
*ST湘电
飞亚达
五方光电
鸿合科技
*ST同洲
*ST安通
保力新
国华网安
海星股份
智莱科技
ST宇顺
ST沪科
中微公司
*ST宜生
龙腾光电
*ST华塑
天智航
和远气体
ST通葡
ST厦华
中天火箭
ST地矿
*ST鼎龙
中船应急
祥鑫科技
ST中捷
ST中安
迈得医疗
金科环境
奥普家居
冠盛股份
昊海生科
微芯生物
城建发展
德恩精工
天准科技
当虹科技
中国一重
石头科技
天山铝业
侨银环保
凯撒旅业
凯迪股份
福莱特
*ST西发
*ST力帆
思瑞浦
山大地纬
欣锐科技
海目星
孚能科技
力合科技
长阳科技
科思股份
光正眼科
中国广核
光峰科技
ST摩登
安克创新
爱朋医疗
ST安凯
运达股份
*ST华仪
广大特材
大洋生物
*ST胜利
绿的谐波
迈瑞医疗
安恒信息
晨光新材
长城科技
朝阳科技
太空智造
金春股份
*ST金洲
渤海租赁
交大思诺
吉贝尔
华丰股份
百邦科技
南京证券
ST中基
昂立教育
亿华通
三泰控股
仙乐健康
雷赛智能
电声股份
科威尔
*ST麦趣
*ST海华
ST巴士
广电计量
福然德
*ST中天
中泰证券
华夏航空
大智慧
红塔证券
*ST中昌
威胜信息
晶丰明源
奥福环保
国联股份
国网英大
沪光股份
日久光电
ST昌鱼
ST瑞德
福蓉科技
映翰通
汉宇集团
康辰药业
首都在线
三盛教育
惠程科技
先惠技术
龙磁科技
科德教育
捷佳伟创
雪龙集团
天合光能
卓胜微
*ST林重
ST柳化
郑州银行
立昂微
*ST聚力
宝丽迪
贵州轮胎
华神科技
ST华鼎
姚记科技
固德威
*ST盐湖
亚联发展
*ST天润
*ST东科
山东玻纤
*ST中新
博汇股份
ST游久
嘉元科技
恒银科技
谱尼测试
派克新材
*ST经开
ST宏盛
铁岭新城
*ST环球
万德斯
筑博设计
申联生物
中天精装
ST德豪
天元股份
*ST时万
万泰生物
国瑞科技
岭南股份
淮河能源
晶澳科技
新产业
锦浪科技
*ST华映
*ST友谊
特宝生物
中信出版
华东数控
长飞光纤
药明康德
晶晨股份
优彩资源
旭光电子
豪悦护理
天宜上佳
路德环境
中达安
利通电子
迈为股份
ST圣莱
锦和商业
中国外运
捷强装备
冰山冷热
锐科激光
地铁设计
新媒股份
数知科技
上能电气
克劳斯
迪普科技
金博股份
祥生医疗
卡倍亿
卓越新能
五洲特纸
迪威尔
ST金刚
国林科技
仲景食品
柏楚电子
*ST新光
长沙银行
威派格
天正电气
ST凯瑞
*ST飞乐
嘉必优
宏柏新材
豪美新材
当代文体
张 裕A
大胜达
百奥泰
指南针
奥美医疗
澜起科技
ST云投
七一二
C海融
亚普股份
越博动力
华民股份
宸展光电
ST抚钢
中迪投资
飞龙股份
*ST升达
美瑞新材
仕佳光子
*ST大洲
中铝国际
通达电气
*ST海陆
佰奥智能
*ST金鸿
春光科技
南新制药
ST电能
淮北矿业
*ST金钰
爱博医疗
南大环境
海越能源
日月明
浙海德曼
东鹏控股
松霖科技
宇新股份
中电兴发
金力永磁
开普检测
ST罗普
*ST欧浦
国网信通
三友医疗
三角防务
C亿田
芯朋微
西麦食品
稳健医疗
中岩大地
*ST海创
宇信科技
容百科技
杰普特
锦鸡股份
小熊电器
八亿时空
华辰装备
振德医疗
中芯国际
国联证券
寒武纪
*ST刚泰
*ST拉夏
佰仁医疗
*ST美讯
宝丰能源
艾可蓝
锦盛新材
*ST皇台
博瑞医药
恒实科技
ST中葡
招商港口
*ST秦机
德力股份
鲁商发展
铁科轨道
天地数码
泰永长征
万华化学
恒力石化
*ST东洋
科翔股份
德方纳米
高测股份
芯原股份
敏芯股份
铜牛信息
帝欧家居
中科星图
ST禾盛
紫光国微
*ST中华A
明新旭腾
大宏立
松炀资源
鹏鼎控股
*ST成城
金山办公
仁东控股
乐鑫科技
领益智造
招商南油
拉卡拉
盛德鑫泰
三达膜
长鸿高科
交建股份
回盛生物
苏盐井神
*ST大港
福能东方
*ST安信
燕麦科技
柯力传感
*ST德奥
新智认知
ST猛狮
吉峰科技
华致酒行
巴比食品
深信服
ST椰岛
金石亚药
日海智能
ST天成
宏力达
中新集团
*ST雪莱
金富科技
*ST华讯
捷昌驱动
ST狮头
ST天龙
厦门象屿
八方股份
爱丽家居
均瑶健康
大为股份
泰和科技
麒盛科技
四会富仕
招商积余
瑞松科技
苑东生物
大地熊
航天宏图
*ST融捷
玉禾田
立华股份
龙利得
居然之家
天下秀
芯源微
致远互联
大东海A
贝斯美
君实生物
圣湘生物
渝农商行
安宁股份
珠海中富
华创阳安
ST金花
大东南
*ST永泰
ST威龙
日辰股份
四方科技
ST国重装
斯达半导
旗天科技
建龙微纳
洁特生物
心脉医疗
奇安信
ST坊展
英杰电气
复洁环保
*ST节能
豆神教育
锐新科技
泉阳泉
友发集团
健之佳
ST金泰
七彩化学
汇创达
北汽蓝谷
*ST银河
*ST天夏
*ST永林
和佳医疗
川能动力
派生科技
兴图新科
昂利康
新诺威
*ST富控
航天彩虹
攀钢钒钛
青岛中程
*ST交昂
*ST康得
开能健康
*ST联合
鲁 泰A
重药控股
直真科技
惠发食品
矩子科技
泛亚微透
图南股份
海能实业
*ST中珠
翔丰华
*ST群兴
瑞晟智能
*ST科陆
力鼎光电
中国中免
国光连锁
珈伟新能
海容冷链
ST人乐
中信特钢
法狮龙
澳弘电子
天臣医疗
奥赛康
慧辰资讯
北摩高科
华阳国际
ST仁智
ST索菱
景津环保
科安达
东方环宇
新洁能
恒铭达
中科海讯
瑞达期货
晶科科技
ST乐凯
海航科技
建霖家居
中胤时尚
亚世光电
国安达
国盛智科
爱克股份
中山金马
*ST博信
芒果超媒
长城军工
上纬新材
唐源电气
西部超导
苏宁易购
地素时尚
ST天圣
金雷股份
丹化科技
前沿生物
华润微
万顺新材
辽宁能源
中信建投
巨星农牧
ST中孚
万通发展
*ST科林
中国卫通
TCL科技
隆利科技
ST舍得
万胜智能
启迪环境
圣元环保
*ST雅博
赛摩智能
金冠股份
ST创兴
有友食品
安徽建工
耐普矿机
双飞股份
浩洋股份
北元集团
卧龙电驱
彤程新材
力合微
中密控股
*ST瀚叶
宏川智慧
奕瑞科技
迦南智能
华图山鼎
海象新材
文灿股份
*ST夏利
声迅股份
东来技术
ST庞大
江苏新能
安集科技
工业富联
联瑞新材
ST天雁
国新文化
ST长投
秦川物联
*ST胜尔
蒙泰高新
华兴源创
*ST贵人
松井股份
渤海汽车
中谷物流
柳 工
蓝特光学
新城市
金达莱
卓易信息
伯特利
浙矿股份
苏州银行
泽达易盛
五洋停车
航锦科技
*ST北能
尚纬股份
*ST商城
*ST银亿
长江健康
金现代
雪天盐业
贵州三力
科沃斯
松原股份
康平科技
湘财股份
天禾股份
锐明技术
瑞鹄模具
*ST北讯
顺钠股份
绿色动力
昇辉科技
德马科技
熊猫乳品
ST八菱
金龙鱼
中公教育
越剑智能
嘉美包装
中金公司
ST东网
赛轮轮胎
伟时电子
*ST晨鑫
紫天科技
中创环保
汇得科技
保利联合
财富趋势
博杰股份
盈康生命
三峰环境
壶化股份
普门科技
有方科技
北鼎股份
*ST蓝丰
因赛集团
佳云科技
豪森股份
国盾量子
兴瑞科技
泽璟制药
科前生物
天地在线
恒久科技
密尔克卫
中国人保
左江科技
良品铺子
三只松鼠
彩讯股份
新疆交建
华盛昌
*ST当代
天迈科技
昊华科技
京源环保
同庆楼
永兴材料
威尔药业
龙软科技
瑞玛工业
蠡湖股份
ST天首
*ST荣华
郑中设计
恩捷股份
华光环能
ST毅昌
德林海
芯海科技
大悦城
宝兰德
*ST信通
鸿远电子
亿嘉和
ST百花
震安科技
博汇科技
天奈科技
豪尔赛
江航装备
久日新材
亚钾国际
*ST中商
欧陆通
狄耐克
粤桂股份
长虹美菱
苏美达
南亚新材
芯能科技
顺博合金
光云科技
*ST九有
锋尚文化
中嘉博创
康希诺
康龙化成
*ST高升
顶固集创
ST运盛
济南高新
葫芦娃
新天绿能
天普股份
经纬辉开
沃格光电
特 力A
宝明科技
ST毅达
森霸传感
青岛港
维信诺
西安银行
科博达
永冠新材
博睿数据
鸿泉物联
*ST天马
青鸟消防
新化股份
赛特新材
三人行
正帆科技
佳发教育
神州细胞
深南股份
蓝黛科技
ST宜化
紫金银行
*ST长城
*ST康盛
奥园美谷
润建股份
欣贺股份
金田铜业
江苏北人
准油股份
甘李药业
埃夫特
优刻得
凌志软件
利扬芯片
荣联科技
威奥股份
佳电股份
康泰医学
德利股份
*ST飞马
华文食品
*ST利源
中粮科技
*ST恒康
长华股份
金时科技
大叶股份
中国海防
交控科技
华林证券
派瑞股份
美迪西
艾力斯
格林达
博深股份
热景生物
创源股份
神驰机电
新亚强
ST云网
佳华科技
*ST辉丰
ST生物
海南发展
惠云钛业
山东墨龙
维康药业
*ST江特
东珠生态
海信家电
博通集成
方邦股份
*ST海源
ST远程
美吉姆
丽江股份
国城矿业
海油发展
天阳科技
震有科技
新兴装备
朗进科技
万 科A
赛科希德
酷特智能
ST罗顿
华熙生物
建科机械
ST尤夫
万里股份
*ST斯太
惠城环保
重庆钢铁
雅运股份
华翔股份
安必平
兰剑智能
京北方
华菱钢铁
*ST敦种
华业香料
大有能源
京基智农
*ST目药
康华生物
海昌新材
中航西飞
南华期货
金海高科
福昕软件
维科技术
九洲集团
*ST实达
艾迪药业
华峰测控
上海沿浦
*ST亚振
世华科技
山科智能
值得买
华达新材
洪通燃气
道通科技
拱东医疗
泉峰汽车
测绘股份
丸美股份
中信博
天利科技
西域旅游
锋龙股份
ST科迪
科思科技
神农科技
铂科新材
捷安高科
共创草坪
胜蓝股份
紫晶存储
中船汉光
瑞丰新材
瑞芯微
日丰股份
丽人丽妆
壹网壹创
赛微电子
深粮控股
移远通信
聚辰股份
杰美特
延安必康
华培动力
起帆电缆
和顺石油
ST南风
*ST金贵
ST昌九
沃尔德
盟升电子
创业慧康
中光学
厦门银行
仙鹤股份
东方生物
融捷健康
步科股份
安博通
奥锐特
芯瑞达
邮储银行
中控技术
爱婴室
赛伍技术
三六零
华光新材
协鑫能科
*ST乐通
泰坦科技
志邦家居
山石网科
建业股份
ST步森
览海医疗
炼石航空
神马电力
盛达资源
联赢激光
ST亚邦
*ST辅仁
中创物流
ST亚星
柳药股份
福达合金
ST双环
垒知集团
省广集团
新金路
盛视科技
泸天化
虹软科技
*ST围海
一汽解放
ST岩石
铂力特
郑州煤电
紫光学大
金丹科技
科远智慧
ST康美
*ST盈方
斯迪克
ST海马
中银证券
中简科技
华宝股份
*ST勤上
聆达股份
会通股份
甘化科工
键凯科技
盛新锂能
天奥电子
元利科技
华闻集团
海尔智家
万林物流
华设集团
*ST兆新
莱伯泰科
神工股份
*ST六化
东亚药业
ST加加
泰林生物
协创数据
宇瞳光学
海鸥住工
南微医学
C朗特
赛诺医疗
澳洋健康
*ST长动
宁水集团
成都先导
云涌科技
城发环境
天津普林
海尔生物
复旦张江
皖仪科技
山西路桥
富祥药业
开普云
恒誉环保
隆华科技
聚杰微纤
浙商银行
新赛股份
佛燃能源
清溢光电
明阳智能
新大正
新光光电
富通鑫茂
天邑股份
三生国健
新农股份
海峡创新
新致软件
兆威机电
海融科技
确成股份
C兆龙
联泓新科
朗特智能
C凯龙
博迁新材
C润阳
同兴环保
西上海
C研奥
塞力医疗
特发服务
*ST中孚
*ST鑫科
派能科技
舒华体育
明微电子
启迪药业
蔚蓝锂芯
明冠新材
国机精工
健麾信息
鼎通科技
三旺通信
晋控电力
悦康药业
晋控煤业
东贝集团
伟创电气
C天秦
开元教育
中伟股份
一鸣食品
思进智能
华旺科技
欧科亿
振邦智能
杭华股份
彩虹集团
南山智尚
山西焦煤
亿田智能
科兴制药
恒玄科技
中晶科技
立方制药
南凌科技
吉大正元
航亚科技
森林包装
福立旺
汉马科技
通源环境
兆龙互连
星徽股份
凯龙高科
西大门
侨银股份
华峰化学
研奥股份
C法本
奥普特
润阳科技
C火星人
远东股份
天秦装备
鹏都农牧
天原股份
================================================
FILE: legacy_v1/src/Leorio/tokenization.py
================================================
import __init__
from Kite.database import Database
from Kite import config
from Kite import utils
import jieba
import pkuseg
import logging
# Module-wide logging: timestamped lines with file name, line number and level.
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S')
class Tokenization(object):
    """Chinese word segmentation helper backed by jieba or pkuseg.

    Optionally loads a user dictionary (finance terms / stock names) and a
    Chinese stop-word list, and offers helpers to map articles to the stock
    codes they mention and to backfill that mapping into news collections.
    """

    def __init__(self, import_module="jieba", user_dict=None, chn_stop_words_dir=None):
        """
        :param import_module: segmentation backend, "jieba" or "pkuseg"
        :param user_dict: path of the user dictionary file; when given it is
            first refreshed with stock names pulled from the database
        :param chn_stop_words_dir: path of the Chinese stop-word file
        """
        self.database = Database()
        self.import_module = import_module
        self.user_dict = user_dict
        # jieba.load_userdict is expensive; load it at most once per
        # instance instead of on every cut_words() call.
        self._jieba_dict_loaded = False
        if self.user_dict:
            self.update_user_dict(self.user_dict)
        if chn_stop_words_dir:
            self.stop_words_list = utils.get_chn_stop_words(chn_stop_words_dir)
        else:
            self.stop_words_list = list()

    def update_user_dict(self, old_user_dict_dir, new_user_dict_dir=None):
        """Merge stock names from the database into the user dictionary.

        Stock names (and other new finance words) found in the
        stock-basic-info collection but missing from the dictionary file are
        appended. When ``new_user_dict_dir`` is None the old file is
        overwritten in place.
        """
        with open(old_user_dict_dir, "r", encoding="utf-8") as file:
            word_list = [row.split("\n")[0] for row in file]
        seen = set(word_list)  # O(1) membership instead of scanning the list
        name_code_df = self.database.get_data(config.STOCK_DATABASE_NAME,
                                              config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                              keys=["name", "code"])
        for word in set(name_code_df["name"].tolist()):
            if word not in seen:
                word_list.append(word)
                seen.add(word)
        target_dir = old_user_dict_dir if not new_user_dict_dir else new_user_dict_dir
        with open(target_dir, "w", encoding="utf-8") as file:
            for word in word_list:
                file.write(word + "\n")

    def cut_words(self, text):
        """Segment ``text`` and filter the tokens.

        Keeps tokens that are not stop words, not whitespace, contain at
        least one Chinese character, and are longer than one character.
        Returns a list of tokens; an empty (falsy) list when nothing
        survives or the configured backend is unknown (the previous version
        returned ``False`` here — both values are falsy for callers).
        """
        tokens = None
        if self.import_module == "jieba":
            if self.user_dict and not self._jieba_dict_loaded:
                jieba.load_userdict(self.user_dict)
                self._jieba_dict_loaded = True
            tokens = list(jieba.cut(text))
        elif self.import_module == "pkuseg":
            # pkuseg takes the custom dictionary at construction time.
            seg = pkuseg.pkuseg(user_dict=self.user_dict)
            tokens = seg.cut(text)
        outstr = list()
        if tokens:
            for word in tokens:
                if word not in self.stop_words_list \
                        and word != "\t" \
                        and word != " " \
                        and utils.is_contain_chn(word) \
                        and len(word) > 1:
                    outstr.append(word)
        return outstr

    def find_relevant_stock_codes_in_article(self, article, stock_name_code_dict):
        """Return the deduplicated stock codes whose names occur in article.

        :param article: raw article text
        :param stock_name_code_dict: mapping of stock name -> stock code
        """
        stock_codes = list()
        cut_words_list = self.cut_words(article)
        if cut_words_list:
            for word in cut_words_list:
                code = stock_name_code_dict.get(word)
                if code is not None:
                    stock_codes.append(code)
        return list(set(stock_codes))

    def update_news_database_rows(self,
                                  database_name,
                                  collection_name,
                                  incremental_column_name="RelatedStockCodes"):
        """Backfill ``incremental_column_name`` on rows that lack it.

        Newly crawled rows do not carry the column while older rows already
        do, so each row is checked before its related stock codes are
        computed and written back as a space-joined string.
        """
        name_code_df = self.database.get_data(config.STOCK_DATABASE_NAME,
                                              config.COLLECTION_NAME_STOCK_BASIC_INFO,
                                              keys=["name", "code"])
        name_code_dict = dict(name_code_df.values)
        data = self.database.get_collection(database_name, collection_name).find()
        for row in data:
            if incremental_column_name not in row.keys():
                related_stock_codes_list = self.find_relevant_stock_codes_in_article(
                    row["Article"], name_code_dict)
                self.database.update_row(database_name,
                                         collection_name,
                                         {"_id": row["_id"]},
                                         {incremental_column_name: " ".join(related_stock_codes_list)}
                                         )
                logging.info("[{} -> {} -> {}] updated {} key value ... "
                             .format(database_name, collection_name, row["Date"], incremental_column_name))
            else:
                logging.info("[{} -> {} -> {}] has already existed {} key value ... "
                             .format(database_name, collection_name, row["Date"], incremental_column_name))
if __name__ == "__main__":
    # Manual smoke test: build a tokenizer with the finance user dictionary
    # and stop-word list (paths are relative to this directory).
    tokenization = Tokenization(import_module="jieba",
                                user_dict="financedict.txt",
                                chn_stop_words_dir="chnstopwords.txt")
    # documents_list = \
    # [
    # "中央、地方支持政策频出,煤炭行业站上了风口 券商研报浩如烟海,投资线索眼花缭乱,\
    # 第一财经推出《一财研选》产品,挖掘研报精华,每期梳理5条投资线索,便于您短时间内获\
    # 取有价值的信息。专业团队每周日至每周四晚8点准时“上新”,助您投资顺利!",
    # "郭文仓到重点工程项目督导检查 2月2日,公司党委书记、董事长、总经理郭文仓,公司董事,\
    # 股份公司副总经理、总工程师、郭毅民,股份公司副总经理张国富、柴高贵及相关单位负责人到\
    # 焦化厂煤场全封闭和干熄焦等重点工程项目建设工地督导检查施工进度和安全工作情况。"
    # ]
    # for text in documents_list:
    #     cut_words_list = tokenization.cut_words(text)
    #     print(cut_words_list)
    # tokenization.update_news_database_rows(config.DATABASE_NAME, "jrj")
================================================
FILE: legacy_v1/src/Leorio/topicmodelling.py
================================================
import __init__
import os
import time
from Kite import config
from Kite import utils
from Kite.database import Database
from Leorio.tokenization import Tokenization
from Hisoka.classifier import Classifier
from sklearn import preprocessing
from gensim import corpora
from gensim import models
from gensim.matutils import corpus2dense
import logging
# Module-wide logging: timestamped lines with file name, line number and level.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s",
                    datefmt="%a, %d %b %Y %H:%M:%S")
class TopicModelling(object):
def __init__(self):
    # Tokenizer configured with the project's finance user dictionary and
    # Chinese stop-word list (paths come from Kite.config).
    self.tokenization = Tokenization(import_module="jieba",
                                     user_dict=config.USER_DEFINED_DICT_PATH,
                                     chn_stop_words_dir=config.CHN_STOP_WORDS_PATH)
    # Shared database access layer (Kite.database.Database).
    self.database = Database()
    # Downstream classifier from Hisoka — presumably consumes the
    # topic-model vectors produced here; confirm against callers.
    self.classifier = Classifier()
def create_dictionary(self,
                      raw_documents_list,
                      save_path=None,
                      is_saved=False):
    """
    Associate every token with a unique id by building a vocabulary.

    Tokens that occur in only one document (hapaxes) are removed from both
    the dictionary and the tokenized documents; documents left empty by
    that removal are dropped.
    :param: raw_documents_list, raw corpus list, one text per element,
        e.g. ["洗尽铅华...", "风雨赶路人...", ...]
    :param: save_path, where to serialize the corpora.Dictionary object
    :param: is_saved, persist the dictionary to save_path when True
    :return: (corpora.Dictionary, cleaned documents_token_list)
    """
    # Guard with `or []`: cut_words may yield a falsy result for documents
    # where nothing survives filtering.
    documents_token_list = [self.tokenization.cut_words(doc) or []
                            for doc in raw_documents_list]
    _dict = corpora.Dictionary(documents_token_list)
    # Ids and surface forms of tokens occurring in exactly one document.
    once_ids = [tokenid for tokenid, docfreq in _dict.dfs.items() if docfreq == 1]
    # A set gives O(1) membership checks (previously an O(n) list scan
    # inside the per-document filter, i.e. accidentally quadratic).
    once_items = set(_dict[tokenid] for tokenid in once_ids)
    # Strip hapax tokens from every document ...
    for _id, token_list in enumerate(documents_token_list):
        documents_token_list[_id] = [token for token in token_list
                                     if token not in once_items]
    # ... and drop documents whose token list became empty as a result.
    documents_token_list = [token_list for token_list in documents_token_list
                            if len(token_list) != 0]
    # Remove hapax tokens from the dictionary, then close the id gaps the
    # removal left behind.
    _dict.filter_tokens(once_ids)
    _dict.compactify()
    if is_saved and save_path:
        _dict.save(save_path)
        logging.info("new generated dictionary saved in path -> {} ...".format(save_path))
    return _dict, documents_token_list
def renew_dictionary(self,
old_dict_path,
new_raw_documents_list,
new_dict_path=None,
is_saved=False):
documents_token_list = []
for doc in new_raw_documents_list:
documents_token_list.append(self.tokenization.cut_words(doc))
_dict = corpora.Dictionary.load(old_dict_path)
_dict.add_documents(documents_token_list)
if new_dict_path:
old_dict_path = new_dict_path
if is_saved:
_dict.save(old_dict_path)
logging.info("updated dictionary by another raw documents serialized in {} ... ".format(old_dict_path))
return _dict, documents_token_list
def create_bag_of_word_representation(self,
raw_documents_list,
old_dict_path=None,
new_dict_path=None,
bow_vector_save_path=None,
is_saved_dict=False):
if old_dict_path:
# 如果存在旧的语料词典,就在原先词典的基础上更新,增加未见过的词
corpora_dictionary, documents_token_list = self.renew_dictionary(old_dict_path,
raw_documents_list,
new_dict_path=new_dict_path)
else:
# 否则重新创建词典
start_time = time.time()
corpora_dictionary, documents_token_list = self.create_dictionary(raw_documents_list,
save_path=new_dict_path,
is_saved=is_saved_dict)
end_time = time.time()
logging.info("there are {} mins spent to create a new dictionary ... ".format((end_time-start_time)/60))
# 根据新词典对文档(或语料)生成对应的词袋向量
start_time = time.time()
bow_vector = [corpora_dictionary.doc2bow(doc_token) for doc_token in documents_token_list]
end_time = time.time()
logging.info("there are {} mins spent to calculate bow-vector ... ".format((end_time - start_time) / 60))
if bow_vector_save_path:
corpora.MmCorpus.serialize(bow_vector_save_path, bow_vector)
return documents_token_list, corpora_dictionary, bow_vector
@staticmethod
def transform_vectorized_corpus(corpora_dictionary,
bow_vector,
model_type="lda",
model_save_path=None):
# 如何没有保存任何模型,重新训练的情况下,可以选择该函数
model_vector = None
if model_type == "lsi":
# LSI(Latent Semantic Indexing)模型,将文本从词袋向量或者词频向量(更好),转为一个低维度的latent空间
# 对于现实语料,目标维度在200-500被认为是"黄金标准"
model_tfidf = models.TfidfModel(bow_vector)
# model_tfidf.save("model_tfidf.tfidf")
tfidf_vector = model_tfidf[bow_vector]
model = models.LsiModel(tfidf_vector,
id2word=corpora_dictionary,
num_topics=config.TOPIC_NUMBER) # 初始化模型
model_vector = model[tfidf_vector]
if model_save_path:
model.save(model_save_path)
elif model_type == "lda":
model = models.LdaModel(bow_vector,
id2word=corpora_dictionary,
num_topics=config.TOPIC_NUMBER) # 初始化模型
model_vector = model[bow_vector]
if model_save_path:
model.save(model_save_path)
elif model_type == "tfidf":
model = models.TfidfModel(bow_vector) # 初始化
# model = models.TfidfModel.load("model_tfidf.tfidf")
model_vector = model[bow_vector] # 将整个语料进行转换
if model_save_path:
model.save(model_save_path)
return model_vector
def classify_stock_news(self,
unseen_raw_document,
database_name,
collection_name,
label_name="60DaysLabel",
topic_model_type="lda",
classifier_model="svm",
ori_dict_path=None,
bowvec_save_path=None,
is_saved_bow_vector=False):
historical_raw_documents_list = []
Y = []
for row in self.database.get_collection(database_name, collection_name).find():
if label_name in row.keys():
if row[label_name] != "":
historical_raw_documents_list.append(row["Article"])
Y.append(row[label_name])
logging.info("fetch symbol '{}' historical news with label '{}' from [DB:'{}' - COL:'{}'] ... "
.format(collection_name, label_name, database_name, collection_name))
le = preprocessing.LabelEncoder()
Y = le.fit_transform(Y)
logging.info("encode historical label list by sklearn preprocessing for training ... ")
label_name_list = le.classes_ # ['中性' '利好' '利空'] -> [0, 1, 2]
# 根据历史新闻数据库创建词典,以及计算每个历史新闻的词袋向量;如果历史数据库创建的字典存在,则加载进内存
# 用未见过的新闻tokens去更新该词典
if not os.path.exists(ori_dict_path):
if not os.path.exists(bowvec_save_path):
_, _, historical_bow_vec = self.create_bag_of_word_representation(historical_raw_documents_list,
new_dict_path=ori_dict_path,
bow_vector_save_path=bowvec_save_path,
is_saved_dict=True)
logging.info("create dictionary of historical news, and serialized in path -> {} ... ".format(ori_dict_path))
logging.info("create bow-vector of historical news, and serialized in path -> {} ... ".format(bowvec_save_path))
else:
_, _, _ = self.create_bag_of_word_representation(historical_raw_documents_list,
new_dict_path=ori_dict_path,
is_saved_dict=True)
logging.info("create dictionary of historical news, and serialized in path -> {} ... ".format(ori_dict_path))
else:
if not os.path.exists(bowvec_save_path):
_, _, historical_bow_vec = self.create_bag_of_word_representation(historical_raw_documents_list,
new_dict_path=ori_dict_path,
bow_vector_save_path=bowvec_save_path,
is_saved_dict=True)
logging.info("historical news dictionary existed, which saved in path -> {}, but not the historical bow-vector"
" ... ".format(ori_dict_path))
else:
historical_bow_vec_mmcorpus = corpora.MmCorpus(bowvec_save_path) # type ->
historical_bow_vec = []
for _bow in historical_bow_vec_mmcorpus:
historical_bow_vec.append(_bow)
logging.info("both historical news dictionary and bow-vector existed, load historical bow-vector to memory ... ")
start_time = time.time()
updated_dictionary_with_old_and_unseen_news, unssen_documents_token_list = self.renew_dictionary(ori_dict_path,
[unseen_raw_document],
is_saved=True)
end_time = time.time()
logging.info("renew dictionary with unseen news tokens, and serialized in path -> {}, "
"which took {} mins ... ".format(ori_dict_path, (end_time-start_time)/60))
unseen_bow_vector = [updated_dictionary_with_old_and_unseen_news.doc2bow(doc_token) for doc_token in
unssen_documents_token_list]
updated_bow_vector_with_old_and_unseen_news = []
updated_bow_vector_with_old_and_unseen_news.extend(historical_bow_vec)
updated_bow_vector_with_old_and_unseen_news.extend(unseen_bow_vector)
# 原先updated_bow_vector_with_old_and_unseen_news是list类型,
# 但是经过下面序列化后重新加载进来的类型是gensim.corpora.mmcorpus.MmCorpus
if is_saved_bow_vector and bowvec_save_path:
corpora.MmCorpus.serialize(bowvec_save_path,
updated_bow_vector_with_old_and_unseen_news) # 保存更新后的bow向量,即包括新旧新闻的bow向量集
logging.info("combined bow vector(type -> 'list') generated by historical news with unseen bow "
"vector to create a new one ... ")
if topic_model_type == "lsi":
start_time = time.time()
updated_tfidf_model_vector = self.transform_vectorized_corpus(updated_dictionary_with_old_and_unseen_news,
updated_bow_vector_with_old_and_unseen_news,
model_type="tfidf") # type ->
end_time = time.time()
logging.info("regenerated TF-IDF model vector by updated dictionary and updated bow-vector, "
"which took {} mins ... ".format((end_time-start_time)/60))
start_time = time.time()
model = models.LsiModel(updated_tfidf_model_vector,
id2word=updated_dictionary_with_old_and_unseen_news,
num_topics=config.TOPIC_NUMBER) # 初始化模型
model_vector = model[updated_tfidf_model_vector] # type ->
end_time = time.time()
logging.info("regenerated LSI model vector space by updated TF-IDF model vector space, "
"which took {} mins ... ".format((end_time-start_time)/60))
elif topic_model_type == "lda":
start_time = time.time()
model_vector = self.transform_vectorized_corpus(updated_dictionary_with_old_and_unseen_news,
updated_bow_vector_with_old_and_unseen_news,
model_type="lda")
end_time = time.time()
logging.info("regenerated LDA model vector space by updated dictionary and bow-vector, "
"which took {} mins ... ".format((end_time-start_time)/60))
# 将gensim.interfaces.TransformedCorpus类型的lsi模型向量转为numpy矩阵
start_time = time.time()
latest_matrix = corpus2dense(model_vector,
num_terms=model_vector.obj.num_terms).T
end_time = time.time()
logging.info("transform {} model vector space to numpy.adarray, "
"which took {} mins ... ".format(topic_model_type.upper(), (end_time-start_time)/60))
# 利用历史数据的话题模型向量(或特征),进一步训练新闻分类器
start_time = time.time()
train_x, train_y, test_x, test_y = utils.generate_training_set(latest_matrix[:-1, :], Y)
clf = self.classifier.train(train_x, train_y, test_x, test_y, model_type=classifier_model)
end_time = time.time()
logging.info("finished training by sklearn {} using latest {} model vector space, which took {} mins ... "
.format(classifier_model.upper(), topic_model_type.upper(), (end_time-start_time)/60))
label_id = clf.predict(latest_matrix[-1, :].reshape(1, -1))[0]
return label_name_list[label_id]
if __name__ == "__main__":
    # Manual smoke test: classify a few unseen news articles for one stock
    # symbol against its historically labelled news.
    label_name = "3DaysLabel"
    database_name = "stocknews"
    # sh600004 has little data and is useful for quickly exercising the whole
    # code path; sz000001 has much more data (slower to process) and suits
    # later, fuller case studies.
    collection_name = "sz000001"
    # NOTE(review): classifier_save_path is defined but never used below.
    classifier_save_path = "{}_classifier.pkl".format(collection_name)
    ori_dict_path = "{}_docs_dict.dict".format(collection_name)
    bowvec_save_path = "{}_bowvec.mm".format(collection_name)
    # Classify (previously unseen) news articles.
    # The commented-out samples below are 白云机场 (600004.SH) announcements,
    # kept in the original Chinese as alternative test inputs for sh600004.
    # unseen_raw_documents_list = ["智通财经APP讯,白云机场(600004.SH)发布公告,公司2020年11月起降40278架次,\
    #                             同比下降2.47%;旅客吞吐量约501.4万人次,同比下降19.31%;货邮吞吐量约17.32万\
    #                             吨,同比下降1.27%。此外,公司2020年累计起降约33.2万架次,同比下降26.07%;旅\
    #                             客吞吐量约3890.14万人次,同比下降42.00%;货邮吞吐量约158.12万吨,同比下降9.14%。",
    #                             "格隆汇 9 月 1日丨白云机场(600004.SH)公布,公司收到中国证券监督管理委员会于2020\
    #                             年8月20日出具的《中国证监会行政许可项目审查一次反馈意见通知书》(202137号)。根据\
    #                             《反馈意见》的相关要求,白云机场控股股东广东省机场管理集团有限公司(“机场集团”)\
    #                             于2020年8月31日出具了《广东省机场管理集团有限公司关于不存在减持广州白云国际机场股\
    #                             份有限公司股票行为或减持计划的承诺函》,具体内容如下:鉴于机场集团拟以现金的方式参\
    #                             与认购本次白云机场非公开发行的A股股票。机场集团现作出如下承诺:1、自白云机场本次发\
    #                             行定价基准日(即2020年4月28日)前六个月至本承诺函出具之日,机场集团及机场集团控制的关\
    #                             联方未出售或以任何方式减持白云机场的任何股票。2、自本承诺函出具之日起至白云机场本次发\
    #                             行完成后六个月期间内,机场集团及机场集团控制的关联方将不会出售或以任何方式减持所持有的\
    #                             白云机场的任何股票,也不存在减持白云机场股票的计划。3、机场集团及机场集团控制的关联方\
    #                             不存在违反《中华人民共和国证券法》第四十四条的情形。如有违反,机场集团因减持股票所得收\
    #                             益将归白云机场所有。4、本承诺函自签署之日起对机场集团具有约束力,若机场集团或机场集团\
    #                             控制的关联方违反上述承诺发生减持情况,则减持所得全部收益归白云机场所有,机场集团依法\
    #                             承担由此产生的法律责任。",
    #                             "格隆汇11月27日丨白云机场(600004.SH)公布,为增强上市公司经营独立性、业务及资产完整性,\
    #                             提升公司盈利能力与运行保障能力,扩展白云机场物流业务发展空间,同时减少关联交易,确保上\
    #                             市公司利益最大化,公司拟实施如下交易:机场集团以所持有的航合公司100%的股权以及铂尔曼酒\
    #                             店、澳斯特酒店相应的经营性资产及负债与上市公司所持有的物流公司51%的股权进行资产置换,差\
    #                             额部分以现金补足。其中航合公司100%股权作价7.54亿元,铂尔曼酒店经营性资产及负债作价2.28\
    #                             亿元,澳斯特酒店经营性资产及负债作价3950.01万元,物流公司51%股权作价8.57亿元,上市公司\
    #                             需向机场集团以现金方式支付差额1.64亿元。本次交易完成后,公司将持有航合公司100%股权、铂\
    #                             尔曼酒店和澳斯特酒店经营性资产及负债、物流公司49%股权;机场集团将持有物流公司51%股权。\
    #                             本次交易除上述资产置换外,还包括:(1)上市公司与机场集团重新划分国内航空主业收入中旅客服\
    #                             务费(以下简称“旅客服务费”)的分成比例,由上市公司占85%、机场集团占15%,变更为上市公司\
    #                             占100%,机场集团不再享有旅客服务费分成,2018年15%旅客服务费对应金额为1.19亿元;及(2)上\
    #                             市公司将按物流公司年营业收入的4%向物流公司收取经营权使用费。2018年,模拟计算物流公司营\
    #                             业收入4%对应的经营权使用费为2536.07万元。本次资产置换交易完成后,上市公司2018年备考口径\
    #                             净利润、归母净利润、净资产、归母净资产和每股收益都将增厚约5%,2018年备考每股收益将从\
    #                             0.5457元每股增厚至0.5717元每股。为充分保障上市公司及中小股东利益,机场集团同意,自本次\
    #                             资产置换交割之日起五年内,上市公司享有一次回购物流公司股权的权利,即上市公司有权要求机\
    #                             场集团将本次交易取得的全部物流公司股权(对应同等金额的注册资本金额,包括在此基础上进行\
    #                             配股、转增、折股等所取得的股权)按届时评估值转让给上市公司。因此,上市公司在本次资产置\
    #                             换中拥有充分的主动权,可以选择重新取得物流公司的控制权。据悉,旅客服务费是公司主营航空\
    #                             性业务收入的重要组成部分,对业务完整性具有重要意义。旅客服务费全部由上市公司享有后,将\
    #                             较大幅度增加上市公司的收入、利润和现金流水平。受益于粤港澳大湾区规划及白云机场T2航站楼\
    #                             启用,旅客吞吐量逐年提升。未来随着白云机场的T3航站楼及新跑道的建设推进,旅客吞吐量还将\
    #                             进一步提升,15%旅客服务费对应收入将随之提升,并为公司贡献更多业绩增长空间。"]
    # Active test inputs: three 平安银行 (000001.SZ) announcements for sz000001.
    unseen_raw_documents_list = ["格隆汇6月23日丨平安银行(000001.SZ)公布,近日收到《中国银保监会关于平安银行变更注册资本\
                                 的批复》(银保监复〔2020〕342号),中国银行保险监督管理委员会同意本行将注册资本由人民币\
                                 17, 170, 411, 366元增加至19, 405, 918, 198元,并修改本行章程相应条款。",
                                 "平安银行(000001,股吧)(000001.SZ)公布,公司于2020年8月19日收到《中国银保监会关于平安理\
                                 财有限责任公司开业的批复》(银保监复〔2020〕513号),中国银行保险监督管理委员会(简称“中\
                                 国银保监会”)已批准公司全资子公司平安理财有限责任公司(简称“平安理财”)开业。根据中国银\
                                 保监会批复,平安理财注册资本为50亿元人民币,注册地为深圳市,主要从事发行公募理财产品、\
                                 发行私募理财产品、理财顾问和咨询等资产管理相关业务。 近年来,公司以打造“中国最卓越\
                                 、全球领先的智能化零售银行”为战略目标,坚持“科技引领、零售突破、对公做精”十二字策略\
                                 方针,强化“综合金融”、“科技赋能”两大核心优势,打造数字化银行、生态银行、平台银行三\
                                 张名片,推动发展迈向新台阶。在此基础上,稳步推进资产管理和理财业务转型,综合服务能力不\
                                 断提升,规模、质量、效益实现协调发展。设立平安理财是本行严格落实监管要求、促进理财业务\
                                 健康发展、推动理财业务回归本源的重要举措。平安理财将秉持“受人之托,代客理财”的服务宗\
                                 旨,深耕理财市场,为客户提供更优质的资管产品和财富管理服务,助力实体经济高质量发展。下\
                                 一步,公司将按照法律法规相关要求严格履行有关程序,推动平安理财尽快开业运营。",
                                 "格隆汇5月26日丨平安银行(000001.SZ)公布,经中国银行保险监督管理委员会和中国人民银行批准\
                                 ,公司于近日在全国银行间债券市场成功发行了总额为300亿元人民币的小型微型企业贷款专项金融\
                                 债券。该期债券发行总规模为人民币300亿元,为3年期固定利率债券,票面利率为2.30%,募集资金\
                                 将依据适用法律和监管部门的批准,专项用于发放小型微型企业贷款,其中部分将用于发放与新冠\
                                 肺炎疫情防控相关的小微企业贷款,加大对小型微型企业信贷支持力度,推动小型微型企业业务稳\
                                 健、健康发展。"]
    topicmodelling = TopicModelling()
    # Classify each unseen article and log the predicted label.
    for unseen_doc in unseen_raw_documents_list:
        chn_label = topicmodelling.classify_stock_news(unseen_doc,
                                                       database_name,
                                                       collection_name,
                                                       label_name=label_name,
                                                       topic_model_type="lsi",
                                                       classifier_model="rdforest",  # rdforest / svm
                                                       ori_dict_path=ori_dict_path,
                                                       bowvec_save_path=bowvec_save_path)
        logging.info("document '{}...' was classified with label '{}' for symbol {} ... ".format(unseen_doc[:20], chn_label, collection_name))
    # Benchmark log — train/test accuracy under different topic models and
    # preprocessing schemes (English translations of the original notes):
    # lsi Tue, 15 Dec 2020 14:54:08 classifier.py[line:54] INFO train_pred: 0.9829 test_pred: 0.703 (only stop words, tabs and spaces removed) 30DaysLabel
    # lsi Tue, 15 Dec 2020 17:00:58 classifier.py[line:54] INFO train_pred: 0.9852 test_pred: 0.7492 (also removed tokens without Chinese characters and single-character tokens) 30DaysLabel
    # lda Tue, 15 Dec 2020 17:29:56 classifier.py[line:54] INFO train_pred: 0.9498 test_pred: 0.7426 (also removed tokens without Chinese characters and single-character tokens) 30DaysLabel
    # lsi Wed, 16 Dec 2020 15:57:28 classifier.py[line:54] INFO train_pred: 0.9872 test_pred: 0.7478 (after revising create_dictionary) 30DaysLabel
    # lsi Wed, 16 Dec 2020 17:14:57 classifier.py[line:54] INFO train_pred: 0.9777 test_pred: 0.7247 (after revising create_dictionary) 3DaysLabel
    # lsi Wed, 16 Dec 2020 17:30:15 classifier.py[line:54] INFO train_pred: 0.9883 test_pred: 0.7123 (after revising create_dictionary) 60DaysLabel
================================================
FILE: legacy_v1/src/__init__.py
================================================
================================================
FILE: legacy_v1/src/history_spyder_startup.bat
================================================
REM Launch the historical-data crawlers that live in the Gon package.
cd ./Gon
REM Run the stock-price history crawler in this window (blocking call).
python ./history_starter_stock_price.py
REM "start" spawns each news-site crawler in its own console window, so the
REM three sites (cnstock, nbd, jrj) are crawled in parallel.
start python ./history_starter_cnstock.py
start python ./history_starter_nbd.py
start python ./history_starter_jrj.py
================================================
FILE: legacy_v1/src/main.py
================================================
"""Legacy v1 entry point: crawl historical data, clean it, and build the
per-stock labelled news database.

Pipeline:
  1. crawl historical stock prices and news (cnstock, jrj, nbd)
  2. deduplicate each news collection
  3. drop rows containing null values
  4. build one collection per stock, tagging each news item with a
     positive / negative / neutral label
  5. (next step, handled by the realtime startup scripts) start the
     realtime crawlers
"""
import time
import logging

from Kite import config
from Gon.jrjspyder import JrjSpyder
from Gon.nbdspyder import NbdSpyder
from Gon.cnstockspyder import CnStockSpyder
from Gon.stockinfospyder import StockInfoSpyder
from Killua.denull import DeNull
from Killua.deduplication import Deduplication
from Killua.buildstocknewsdb import GenStockNewsDB

# Bug fix: without a configured handler the root logger only emits WARNING
# and above, so the logging.info calls below were silently dropped.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s",
                    datefmt="%a, %d %b %Y %H:%M:%S")

# 1. Crawl historical data.
stock_info_spyder = StockInfoSpyder(config.STOCK_DATABASE_NAME, config.COLLECTION_NAME_STOCK_BASIC_INFO)
stock_info_spyder.get_historical_news(start_date="2020-01-01")

cnstock_spyder = CnStockSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
for url_to_be_crawled, type_chn in config.WEBSITES_LIST_TO_BE_CRAWLED_CNSTOCK.items():
    logging.info("start crawling {} ...".format(url_to_be_crawled))
    cnstock_spyder.get_historical_news(url_to_be_crawled, category_chn=type_chn)
    logging.info("finished ...")
    # Pause between category pages to avoid hammering the site.
    time.sleep(30)

jrj_spyder = JrjSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)
jrj_spyder.get_historical_news(config.WEBSITES_LIST_TO_BE_CRAWLED_JRJ, start_date="2020-01-01")

nbd_spyder = NbdSpyder(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
nbd_spyder.get_historical_news(60)

# 2. Deduplicate the crawled historical news collections.
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
Deduplication(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()

# 3. Drop rows containing null values from the historical data.
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_NBD).run()
DeNull(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ).run()

# 4. Build a new database: for every stock, gather all news mentioning it and
#    tag each item with a positive ("利好") / negative ("利空") /
#    neutral ("中性") label.
gen_stock_news_db = GenStockNewsDB()
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_CNSTOCK)
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_NBD)
gen_stock_news_db.get_all_news_about_specific_stock(config.DATABASE_NAME, config.COLLECTION_NAME_JRJ)

# 5. Start realtime news crawling (see realtime_spyder_startup scripts).
================================================
FILE: legacy_v1/src/realtime_spyder_startup.bat
================================================
@echo off
REM Interactive launcher for the realtime crawler processes.
REM Every choice first starts the shared Redis-queue worker, then the chosen
REM spider(s); "start" runs each script in its own console window.
:again
REM NOTE(review): the ":again" label above is never targeted by a
REM "goto again" anywhere in this script, so the menu does not loop after an
REM invalid choice - confirm whether a trailing goto was intended.
cls
echo =========================== Please select programs below to run ===========================
echo 1 ./Gon/realtime_starter_cnstock.py
echo 2 ./Gon/realtime_starter_jrj.py
echo 3 ./Gon/realtime_starter_nbd.py
echo 4 ./Gon/realtime_starter_stock_price.py
echo 5 run all
echo.
echo Please input number 1-5:
set /p num=
if "%num%"=="1" (
    cd ./Gon
    start python ./realtime_starter_redis_queue.py
    start python ./realtime_starter_cnstock.py
)
if "%num%"=="2" (
    cd ./Gon
    start python ./realtime_starter_redis_queue.py
    start python ./realtime_starter_jrj.py
)
if "%num%"=="3" (
    cd ./Gon
    start python ./realtime_starter_redis_queue.py
    start python ./realtime_starter_nbd.py
)
if "%num%"=="4" (
    cd ./Gon
    start python ./realtime_starter_redis_queue.py
    start python ./realtime_starter_stock_price.py
)
REM Option 5 launches the queue worker and all four spiders at once.
if "%num%"=="5" (
    cd ./Gon
    start python ./realtime_starter_redis_queue.py
    start python ./realtime_starter_cnstock.py
    start python ./realtime_starter_nbd.py
    start python ./realtime_starter_jrj.py
    start python ./realtime_starter_stock_price.py
)
================================================
FILE: legacy_v1/src/realtime_spyder_stopall.bat
================================================
REM Stop all running realtime crawler tasks by launching the kill script
REM from the Gon package in a separate console window.
cd ./Gon
start python ./kill_realtime_spyder_tasks.py
================================================
FILE: reset_all_data.sh
================================================
#!/bin/bash
# 一键清空所有数据并重新开始爬取
set -e
echo "=========================================="
echo " FinnewsHunter 数据重置脚本"
echo "=========================================="
echo ""
echo "⚠️ 警告:此操作将删除所有新闻和任务数据!"
echo "⚠️ 此操作不可恢复!"
echo ""
read -p "确认要清空所有数据吗?(yes/no): " confirm
if [ "$confirm" != "yes" ]; then
echo "❌ 操作已取消"
exit 0
fi
echo ""
echo "开始清空数据..."
echo ""
# 1. 清空PostgreSQL数据
echo "[1/4] 清空PostgreSQL数据..."
docker exec finnews_postgres psql -U finnews -d finnews_db < str` method.
* Concrete Classes: Implements wrappers for various models like `DISCVFINLLMChatGLM26B`, `DISCVFINLLMBaichuan13BChat`, etc. These classes handle model-specific loading (including **LoRA** fine-tuning via `peft.PeftModel`), tokenization, and the actual generation call.
* **`evaluate.py`**: **Evaluation Logic and Prompt Engineering**.
* Multiple `*Evaluator` Classes (e.g., `FinFEEvaluator`, `FinQAEvaluator`): Each class is responsible for a specific financial task (e.g., sentiment analysis, QA).
* `__init__`: Loads the task-specific evaluation data and few-shot instruction samples.
* `build_zero_shot_prompt` / `build_few_shot_prompt`: Implements prompt engineering by constructing the input text based on predefined templates and few-shot examples.
* `evaluate`: Calculates the final metric (e.g., accuracy for sentiment, F1 for QA) by comparing model predictions (`preds`) with ground truth (`golds`).
* `run_evaluation`: The main evaluation loop, iterating over all data samples, generating responses using the injected `llm.generate()` method, and calculating both zero-shot and few-shot metrics.
* **`autoeval.py`**: **Evaluation Orchestration**.
* `model_lists` and `Eval_datasets`: Dictionaries mapping string names to the respective model and evaluator classes, implementing a **Factory Pattern**.
* `main` block: Parses command-line arguments for model name, LoRA path, and dataset. It instantiates the chosen `llm` and `evaluator` and calls `evaluator().run_evaluation(llm)`.
* **`preprocess.py`**: **Data Preparation**.
* `BBTFinCUGE` class: Manages the downloading and processing of the raw BBT-FinCUGE datasets.
* `download_all()`: Uses `requests` to fetch raw JSON data from a GitHub repository.
* `process_*` methods (e.g., `process_finfe`): Converts the raw dataset format into a standardized list of instances with `id`, `input`, `gold_answer`, and `source` fields.
* **`utils.py`**: **Utility Functions**.
* `write_json`, `load_json`: Standardized JSON file I/O.
* `_mixed_segmentation`, `_remove_punctuation`: Text cleaning and tokenization utilities, crucial for Chinese NLP tasks, using `nltk.word_tokenize`.
* `_find_lcs`, `_compute_f1_score`: Implements the Longest Common Subsequence (LCS) algorithm and F1 score calculation, which is the core metric for generative tasks like QA.
### Dependencies and Error/Performance
**Dependencies**: `transformers`, `peft`, `torch`, `argparse`, `tqdm`, `requests`, `inspect`, `random`, `nltk`.
**Performance**: The use of `torch.float16` and `device_map="auto"` in model loading across all modules is a key performance optimization for large models on GPU. The `tqdm` library is used in `evaluate.py` to provide progress bars, enhancing user experience during long evaluation runs.
**Error Handling**: Basic file existence checks are present in `preprocess.py` (`if not os.path.exists(file_path)`). The `evaluate.py` includes assertions (`assert len(golds) == len(preds)`) to ensure data integrity before metric calculation.
### Module PlantUML Diagrams
### Module 1: Root/Demo Module
```plantuml
@startuml
title Root/Demo Module (cli_demo.py & web_demo.py)
class AutoModelForCausalLM
class AutoTokenizer
class GenerationConfig
class torch
class streamlit as st
class colorama
package "Demo Scripts" {
class cli_demo {
+ init_model()
+ clear_screen()
+ main()
}
class web_demo {
+ @st.cache_resource init_model()
+ clear_chat_history()
+ init_chat_history()
+ main()
}
}
cli_demo ..> AutoModelForCausalLM : loads
cli_demo ..> AutoTokenizer : loads
cli_demo ..> GenerationConfig : loads
cli_demo ..> torch : uses
cli_demo ..> colorama : uses
web_demo ..> AutoModelForCausalLM : loads
web_demo ..> AutoTokenizer : loads
web_demo ..> GenerationConfig : loads
web_demo ..> torch : uses
web_demo ..> st : uses
AutoModelForCausalLM <.. cli_demo : model.chat()
AutoModelForCausalLM <.. web_demo : model.chat()
@enduml
```
### Module 2: Evaluation/Core Logic Module
```plantuml
@startuml
title Evaluation/Core Logic Module (eval/evaluator)
abstract class DISCFINLLMBase {
+ generate(prompt: str): str {abstract}
}
package "LLM Wrappers (finllm.py)" {
class DISCVFINLLMChatGLM26B
class DISCVFINLLMBaichuan13BChat
class FinGPTv3
DISCFINLLMBase <|-- DISCVFINLLMChatGLM26B
DISCFINLLMBase <|-- DISCVFINLLMBaichuan13BChat
DISCFINLLMBase <|-- FinGPTv3
}
package "Data Preprocessing (preprocess.py)" {
class BBTFinCUGE {
+ download_all()
+ process_finfe()
+ process_finqa()
.. other process methods ..
}
}
package "Evaluation Logic (evaluate.py)" {
class FinFEEvaluator {
+ build_zero_shot_prompt()
+ build_few_shot_prompt()
+ evaluate(golds, preds)
+ run_evaluation(llm)
}
class FinQAEvaluator
class FinCQAEvaluator
.. other Evaluators ..
FinFEEvaluator ..> BBTFinCUGE : loads instruct samples
FinFEEvaluator ..> DISCFINLLMBase : calls generate()
}
package "Utilities (utils.py)" {
class Utils {
+ write_json()
+ load_json()
+ _mixed_segmentation()
+ _find_lcs()
+ _compute_f1_score()
}
}
package "Orchestration (autoeval.py)" {
class AutoEval {
+ model_lists
+ Eval_datasets
+ main()
}
}
AutoEval --> DISCFINLLMBase : instantiates model
AutoEval --> FinFEEvaluator : instantiates evaluator
FinFEEvaluator ..> Utils : uses metrics/text processing
BBTFinCUGE ..> Utils : uses load/write_json
@enduml
```
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
The DISC-FinLLM project is structured around a **modular, multi-expert design philosophy** centered on a clear separation of concerns between the LLM interaction, task-specific evaluation, and application demonstration.
The **core abstraction** is the `DISCFINLLMBase` abstract class defined in `finllm.py`. This class establishes a standardized interface (`generate(prompt: str) -> str`) for all underlying Large Language Models (LLMs), effectively decoupling the evaluation and application logic from the specific model implementation (e.g., ChatGLM, Baichuan, Bloomz). This allows the system to be easily extended to support new base models or different fine-tuned versions without modifying the evaluation framework.
The **design philosophy** is a **"Model-as-a-Service"** approach within the evaluation context. The LLM is treated as a black-box component that accepts a prompt and returns a response. The complexity of model loading, LoRA weight merging, and device management is encapsulated within the concrete model wrapper classes (e.g., `DISCVFINLLMBaichuan13BChat`). This encapsulation promotes code reusability and maintainability. Furthermore, the project implicitly follows a **Multi-Expert System** design, where the four data files (`consulting_part.json`, `task_part.json`, etc.) suggest the model is fine-tuned for distinct financial sub-tasks, which is then validated by the corresponding task-specific evaluators in `evaluate.py`.
The **lifecycle management** of the application is straightforward:
1. **Data Preparation**: The `preprocess.py` script manages the initial lifecycle phase by downloading and transforming raw BBT-FinCUGE data into a standardized format for evaluation.
2. **Model Loading**: The model is loaded once at the start of the application, either via `init_model()` in the demo scripts or via the `autoeval.py` orchestrator. Crucially, the use of `torch.float16` and `device_map="auto"` ensures efficient, memory-optimized loading onto available hardware.
3. **Execution**:
* **Demo Lifecycle**: The demo scripts maintain a continuous loop, managing conversation history (`messages` list) and repeatedly calling the model's `chat` method for each user turn.
* **Evaluation Lifecycle**: The `autoeval.py` script orchestrates the evaluation, instantiating the chosen model and evaluator, running the full `run_evaluation` loop, and finally writing the metrics to a JSON file.
#### 3.1.2. Component Interactions
The project exhibits two primary interaction flows: the **Demonstration Flow** and the **Evaluation Flow**.
## 1. Demonstration Flow (e.g., `cli_demo.py`)
This flow is a direct, synchronous interaction between the user interface and the LLM.
1. **Initialization**: `cli_demo.py` calls `init_model()` to load the model and tokenizer.
2. **User Input**: The user provides a `prompt`.
3. **Request**: The script appends the user's prompt to the `messages` history.
4. **Generation**: The script calls the model's custom `model.chat(tokenizer, messages, stream=True)` method.
5. **Response**: The model generates a response, which is either printed as a stream (in `cli_demo.py`) or updated in a placeholder (in `web_demo.py`).
6. **History Update**: The model's response is appended to the `messages` history, maintaining the conversational context.
## 2. Evaluation Flow (`autoeval.py` Orchestration)
This flow is more complex, involving multiple components to systematically test the LLM.
1. **Orchestration**: `autoeval.py` instantiates a specific `DISCFINLLMBase` implementation (`llm`) and one or more `*Evaluator` instances.
2. **Data Access**: The `*Evaluator` (e.g., `FinFEEvaluator`) loads its task-specific evaluation data (`finfe-eval.jsonl`) and few-shot samples (`instruct_samples.json`) using helper functions from `utils.py`.
3. **Prompt Engineering**: Inside `*Evaluator.run_evaluation()`, for each data sample, the appropriate prompt construction method (`build_zero_shot_prompt` or `build_few_shot_prompt`) is called. This is where the task-specific instruction and context are formatted for the LLM.
4. **LLM Interaction**: The evaluator calls `llm.generate(input_text)` on the model wrapper. This is the critical communication point, abstracting the underlying model's API.
5. **Metric Calculation**: The evaluator collects the model's predictions (`preds`) and compares them to the ground truth (`golds`). It uses utility functions from `utils.py` (e.g., `_remove_punctuation`, `_find_lcs`) to clean text and calculate metrics like F1 score or accuracy.
6. **Result Reporting**: The final metrics are returned to `autoeval.py`, which then aggregates and writes the results to a JSON file using `utils.write_json`.
The communication pattern between the `*Evaluator` and the `DISCFINLLMBase` is a clear example of the **Strategy Pattern**, where the evaluation logic (context) uses the model wrapper (strategy) to perform the generation task.
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml
title DISC-FinLLM Overall Architecture
skinparam componentStyle rectangle
package "Application Layer" {
[cli_demo.py] as CLI
[web_demo.py] as WEB
}
package "Core Model Abstraction" {
abstract class DISCFINLLMBase
[Model Wrappers (finllm.py)] as WRAPPER
DISCFINLLMBase <|-- WRAPPER
}
package "Evaluation Framework" {
[autoeval.py] as ORCHESTRATOR
[evaluate.py] as EVAL_LOGIC
[preprocess.py] as PREPROCESS
[utils.py] as UTILS
[Task Evaluators (e.g., FinFEEvaluator)] as EVALUATOR
EVAL_LOGIC ..> EVALUATOR
}
package "External Dependencies" {
[Hugging Face Transformers] as HF
[PEFT (LoRA)] as PEFT
[BBT-FinCUGE Data] as DATA
}
CLI --> WRAPPER : loads & interacts
WEB --> WRAPPER : loads & interacts
ORCHESTRATOR --> WRAPPER : instantiates LLM
ORCHESTRATOR --> EVALUATOR : instantiates Task Logic
EVALUATOR --> WRAPPER : calls generate()
EVALUATOR --> UTILS : uses metrics/helpers
PREPROCESS --> DATA : downloads
PREPROCESS --> UTILS : uses I/O
WRAPPER --> HF : uses AutoModel/Tokenizer
WRAPPER --> PEFT : loads LoRA weights
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
The codebase, particularly the evaluation framework, leverages several fundamental design patterns to manage complexity and promote extensibility.
## 1. Factory Pattern (Simple Factory)
* **Description**: The Factory Pattern is used to create objects without exposing the instantiation logic to the client.
* **Implementation**: In `autoeval.py`, the dictionaries `model_lists` and `Eval_datasets` act as simple factories.
* **Code Example (`autoeval.py`):**
```python
# Factory for LLM models
model_lists = {
'chatglm-6b': DISCVFINLLMChatGLM6B,
'baichuan-13b-chat': DISCVFINLLMBaichuan13BChat,
# ...
}
# Factory for Evaluators
Eval_datasets = {
'finfe': FinFEEvaluator,
'finqa': FinQAEvaluator,
# ...
}
# Client code instantiates based on string key
llm = model_lists.get(model_name)(device, lora_path)
# ...
evaluator = Eval_datasets.get(eval_data)
```
## 2. Abstract Factory / Template Method Pattern
* **Description**: The Abstract Factory pattern provides an interface for creating families of related or dependent objects without specifying their concrete classes. The Template Method pattern defines the skeleton of an algorithm in the superclass but lets subclasses override specific steps.
* **Implementation**: The `DISCFINLLMBase` abstract class defines the common interface (`generate`), while each concrete model wrapper (e.g., `DISCVFINLLMBaichuan13BChat`) implements the specific steps for model loading, tokenization, and generation logic, which varies significantly between models (e.g., ChatGLM's `chat` method vs. Baichuan's prompt templating).
## 3. Strategy Pattern
* **Description**: The Strategy Pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable.
* **Implementation**: The `*Evaluator` classes (the context) use the `DISCFINLLMBase` instance (`llm`, the strategy) to perform the text generation. The evaluation logic remains the same regardless of which concrete LLM implementation is used.
#### 3.3.2. Project Highlights
The DISC-FinLLM project demonstrates several key design strengths, primarily focused on rigorous evaluation and model flexibility.
* **Comprehensive Evaluation Framework**: The most significant highlight is the dedicated, multi-task evaluation framework. By integrating the BBT-FinCUGE benchmark and creating distinct `*Evaluator` classes for tasks like sentiment analysis (`FinFE`), question answering (`FinQA`), and relation extraction (`FinRE`), the project ensures a **systematic and reproducible assessment** of the LLM's performance across the financial domain.
* **Model Agnosticism via Abstraction**: The use of the `DISCFINLLMBase` abstract class provides excellent **extensibility**. New LLMs (e.g., Llama, Qwen) can be integrated simply by creating a new concrete wrapper class that implements the `generate` method, without altering the core evaluation or demonstration logic.
* **LoRA Fine-Tuning Support**: The model wrappers in `finllm.py` are designed to support **LoRA (Low-Rank Adaptation)** fine-tuning out-of-the-box via the `peft` library. This allows developers to load a base model and merge LoRA weights dynamically, which is crucial for efficient experimentation and deployment of specialized financial models.
* **Dual Interface for Demonstration**: Providing both a **Command-Line Interface (`cli_demo.py`)** and a **Web Interface (`web_demo.py`)** using Streamlit enhances the project's **accessibility and usability**. This dual approach caters to both developers who prefer a quick terminal check and end-users who need a more polished, graphical demonstration.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
While the project is well-structured, several areas could be improved to enhance performance, architectural robustness, and code quality.
## 1. Architectural Optimization: Model Loading
* **Suggestion**: Implement a **Singleton Pattern** or a dedicated **Model Manager** class for the LLM.
* **Reasoning**: Currently, the model loading logic is duplicated across the demo scripts and the evaluation wrappers, and the evaluation wrappers themselves contain repetitive model loading code. A Singleton pattern would ensure the large LLM is loaded only once per process, centralizing resource management and reducing memory overhead.
## 2. Code Quality: Refactoring `evaluate.py`
* **Suggestion**: Introduce a common `BaseEvaluator` class in `evaluate.py` to abstract common methods like `__init__`, `run_evaluation`, and prompt building logic.
* **Reasoning**: The current `evaluate.py` file is excessively long (nearly 1000 lines) due to the high degree of code duplication across the many `*Evaluator` classes. Abstracting the common structure (loading data, iterating samples, calling `llm.generate`, calculating metrics) would significantly reduce file size and improve maintainability.
## 3. Robustness and Error Handling
* **Suggestion**: Enhance error handling, particularly in `preprocess.py` and model loading.
* **Reasoning**: The `preprocess.py` download function only prints an error message on failure (`print('failed to download dataset {}, {}'.format(eval_dataset, e))`) but does not raise an exception or retry. In a production environment, network failures should be handled with retries or graceful failure. Similarly, model loading should include more robust exception handling for missing files or incompatible hardware.
## 4. Performance: Text Processing
* **Suggestion**: Replace the dependency on `nltk` for simple Chinese segmentation and punctuation removal in `utils.py` with a lighter, custom regex-based function or a more modern, dedicated Chinese NLP library like `jieba`.
* **Reasoning**: The current implementation relies on `nltk.word_tokenize`, which may not be optimized for Chinese text and introduces a heavy dependency for simple tasks. A more streamlined approach could improve the performance of the metric calculation step.
#### 3.4.2. Secondary Development Guide
This guide outlines the best path for developers looking to explore, modify, or extend the DISC-FinLLM project.
## 1. Code Exploration and Entry Points
* **Application Flow**: Start with `cli_demo.py` to understand how the model is loaded (`init_model`) and how the chat loop is managed. This is the simplest entry point for testing model responses.
* **Evaluation Flow**: The core logic is orchestrated by `autoeval.py`. Examine this file to see how models and evaluators are instantiated using the Factory Pattern.
* **Model Abstraction**: Study `eval/evaluator/finllm.py`. This file is crucial for understanding how different LLMs are wrapped and how LoRA weights are integrated.
## 2. Extending Model Support
To integrate a new LLM (e.g., Llama-3):
1. Create a new class in `finllm.py` (e.g., `DISCVFINLLMLlama3`) inheriting from `DISCFINLLMBase`.
2. Implement the `__init__` method to handle the specific model and tokenizer loading for Llama-3, including any necessary `trust_remote_code` or LoRA integration.
3. Implement the `generate(prompt: str)` method, ensuring it correctly formats the prompt and calls the model's generation function to return a clean string response.
4. Add the new class to the `model_lists` dictionary in `autoeval.py`.
## 3. Adding a New Evaluation Task
To add a new financial NLP task:
1. Create a new class in `evaluate.py` (e.g., `FinNewTaskEvaluator`) following the structure of existing evaluators.
2. Define the `zero_shot_prompts` and `few_shot_prompts` templates specific to the new task.
3. Implement the `evaluate(golds, preds)` static method to calculate the correct metric (e.g., F1, accuracy, exact match) for the task, leveraging helper functions in `utils.py`.
4. Add the new evaluator class to the `Eval_datasets` dictionary in `autoeval.py`.
## 4. Customizing Data and Metrics
* **Data**: The `preprocess.py` script is the place to modify how raw data is converted into the standardized `input`/`gold_answer` format.
* **Metrics**: The `utils.py` file contains the core logic for text cleaning (`_mixed_segmentation`) and metric calculation (`_compute_f1_score`). Modifications here will affect all generative evaluation tasks.
================================================
FILE: thirdparty/ElegantRL.md
================================================
# ElegantRL - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
The ElegantRL project is structured to separate the core reinforcement learning logic from examples, documentation, and utility components. The core logic resides primarily in the `elegantrl` directory, which is further divided into functional modules: `agents`, `envs`, and `train`.
```
/home/ubuntu/ElegantRL
|____.github/ # GitHub configuration files (e.g., FUNDING.yml)
|____docs/ # Documentation source files (using Sphinx/reStructuredText)
|____elegantrl/ # Core Reinforcement Learning Library
| |______init__.py # Package initialization
| |____agents/ # Implementations of various DRL agents (AgentBase, AgentPPO, AgentSAC, etc.)
| |____envs/ # Custom and wrapper environments (StockTradingEnv, CustomGymEnv, etc.)
| |____train/ # Core training components (config, evaluator, replay_buffer, run)
|____examples/ # Scripts demonstrating how to use the library with different algorithms and environments
|____figs/ # Figures and images used in documentation and README
|____helloworld/ # Simple, single-file examples for quick start and tutorials
|____requirements.txt # Python dependencies
|____rlsolver/ # A separate, specialized solver component, likely for combinatorial optimization (CO) problems
|____unit_tests/ # Test files for agents, environments, and training components
```
The primary focus is on the `elegantrl` directory, which contains the fundamental components of the DRL library. The separation into `agents`, `envs`, and `train` enforces a clear modular design, making the codebase maintainable and extensible. The top-level folders like `examples`, `helloworld`, and `unit_tests` serve to support the core library by providing usage demonstrations and ensuring code quality. The `rlsolver` folder suggests a specialized application of the DRL framework to optimization problems.
### 1.2. Core Folders for Analysis
- **elegantrl/agents**: Contains the base class `AgentBase` and concrete implementations for various Deep Reinforcement Learning (DRL) algorithms, including on-policy (PPO, A2C) and off-policy (SAC, TD3, DDPG, DQN) methods, as well as multi-agent extensions (MADDPG, MAPPO, QMix, VDN).
- **elegantrl/envs**: Houses custom and specialized environment implementations, such as `StockTradingEnv` for financial applications and wrappers for vectorized environments.
- **elegantrl/train**: Manages the training infrastructure, including configuration (`config.py`), the main execution logic (`run.py`), experience storage (`replay_buffer.py`), and performance monitoring (`evaluator.py`).
## Phase 2: Module-by-Module Deep Analysis
### 1. Module: `elegantrl/agents`
**Core Responsibility:** Implements the core logic for Deep Reinforcement Learning (DRL) agents, defining the interaction between the agent and the environment, and managing the policy and value networks.
**Key Files and Functions:**
- **`AgentBase.py`**: Defines the abstract base class `AgentBase` for all DRL agents. It handles initialization parameters (network dimensions, environment info, hyperparameters), device management (CPU/GPU), exploration logic (`explore_env`, `explore_action`), network update boilerplate (`update_net`, `optimizer_backward`, `soft_update`), and utility network classes (`ActorBase`, `CriticBase`, `build_mlp`).
- **`AgentPPO.py`**: Implements the **Proximal Policy Optimization (PPO)** algorithm, an on-policy method. It extends `AgentBase` and includes specific logic for Generalized Advantage Estimation (GAE), ratio clipping, and entropy regularization. It also contains `AgentA2C` as a simpler variant.
- **`AgentSAC.py`**: Implements the **Soft Actor-Critic (SAC)** algorithm, an off-policy, maximum entropy DRL method. It uses an ensemble of critics (`CriticEnsemble`) and includes logic for automatic temperature parameter (`alpha`) adjustment.
- **`AgentTD3.py`**: Implements the **Twin Delayed DDPG (TD3)** algorithm, an off-policy method that improves upon DDPG with clipped double Q-learning and delayed policy updates. It includes `AgentDDPG` as a simpler variant.
- **`AgentDQN.py`**: Implements **Deep Q-Network (DQN)** and its variants (Double DQN, Dueling DQN) for discrete action spaces.
- **`MAgent*.py`**: Contains multi-agent extensions like `MAgentMADDPG`, `MAgentMAPPO`, `MAgentQMix`, and `MAgentVDN`, which adapt single-agent algorithms for multi-agent systems.
**Core Implementation Details:**
- **Network Abstraction**: Agents rely on `ActorBase` and `CriticBase` (defined in `AgentBase.py`) which are essentially wrappers around PyTorch `nn.Module`s built using the `build_mlp` utility.
- **Exploration**: The `explore_env` method is central, handling the collection of trajectories from the environment, distinguishing between single-environment (`_explore_one_env`) and vectorized environment (`_explore_vec_env`) scenarios.
- **Update Logic**: The `update_net` method orchestrates the training. The core difference between on-policy (PPO) and off-policy (SAC, TD3) agents is evident here: PPO calculates advantages and reward sums from the collected batch, while off-policy agents sample from the `ReplayBuffer`.
### 2. Module: `elegantrl/envs`
**Core Responsibility:** Provides custom and specialized environment interfaces, particularly for financial and multi-agent tasks, and handles the creation of vectorized environments.
**Key Files and Functions:**
- **`CustomGymEnv.py`**: A template or wrapper for integrating custom environments that follow the OpenAI Gym/Gymnasium interface.
- **`StockTradingEnv.py`**: A specialized environment for financial reinforcement learning, a key feature of the AI4Finance foundation. It defines the state, action, and reward space for a stock trading problem.
- **`PlanIsaacGymEnv.py`**: Integration with NVIDIA's Isaac Gym for highly parallelized, high-performance simulation environments.
- **`PointChasingEnv.py`**: A simple multi-agent environment used for testing and demonstration of multi-agent algorithms.
**Core Implementation Details:**
- **Standard Interface**: All environments adhere to the standard `reset()` and `step()` methods, ensuring compatibility with the `AgentBase`'s exploration logic.
- **Vectorization**: The concept of a vectorized environment (`VecEnv` in `config.py`) is crucial, allowing multiple environment instances to run in parallel, which is essential for the "Massively Parallel" aspect of ElegantRL.
### 3. Module: `elegantrl/train`
**Core Responsibility:** Manages the overall training workflow, configuration, data storage, and performance evaluation.
**Key Files and Functions:**
- **`config.py`**: Defines the `Config` class, which holds all hyperparameters and environment metadata. It includes logic to automatically determine if an agent is on-policy or off-policy (`get_if_off_policy`) and contains the `VecEnv` and `SubEnv` classes for parallel environment execution using Python's `multiprocessing.Pipe` and `Process`.
- **`replay_buffer.py`**: Implements the `ReplayBuffer` class for off-policy algorithms. It supports both standard sampling and **Prioritized Experience Replay (PER)** using the `SumTree` data structure.
- **`run.py`**: Contains the main entry points for training (`train_agent`, `train_agent_single_process`, `train_agent_multiprocessing`). It defines the `Learner`, `Worker`, and `EvaluatorProc` classes for distributed training using Python's `multiprocessing`.
- **`evaluator.py`**: Implements the `Evaluator` class for logging, saving checkpoints, and calculating performance metrics (average return, steps, loss values). It supports both single and vectorized environment evaluation and includes utilities for plotting the learning curve.
**Core Implementation Details:**
- **Parallelism**: The multi-process architecture in `run.py` is the backbone of ElegantRL's "Massively Parallel" claim. `Worker` processes collect experience from environments, and the `Learner` process updates the agent's networks, communicating via `Pipe`s.
- **Data Flow**: In off-policy training, `Worker`s send collected experience to the `Learner`, which stores it in the `ReplayBuffer` and samples batches for network updates. In on-policy training, the collected experience is used directly for a few epochs of updates before being discarded.
### Module PlantUML Diagrams
### 1. `elegantrl/agents` Module Diagram (Simplified Core)
```puml
@startuml
skinparam classAttributeIconVisible false
abstract class AgentBase {
+ if_discrete: bool
+ if_off_policy: bool
+ net_dims: list
+ state_dim: int
+ action_dim: int
+ device: torch.device
+ act: ActorBase
+ cri: CriticBase
+ act_optimizer: Adam
+ cri_optimizer: Adam
+ explore_env(env, horizon_len)
+ explore_action(state)
+ update_net(buffer)
+ update_objectives(buffer, update_t)
+ soft_update(target_net, current_net, tau)
}
abstract class ActorBase extends nn.Module {
+ net: nn.Sequential
+ forward(state)
+ get_action(state)
}
abstract class CriticBase extends nn.Module {
+ net: nn.Sequential
+ forward(state, action)
+ get_q_values(state, action)
}
class AgentPPO extends AgentBase {
+ ratio_clip: float
+ lambda_gae_adv: float
+ get_advantages(states, rewards, undones, unmasks, values)
}
class AgentSAC extends AgentBase {
+ num_ensembles: int
+ alpha_log: Parameter
}
class AgentTD3 extends AgentBase {
+ update_freq: int
+ policy_noise_std: float
}
class ActorPPO extends ActorBase {
+ action_std_log: Parameter
+ state_norm(state)
+ get_logprob_entropy(state, action)
}
class CriticPPO extends CriticBase {
+ state_norm(state)
}
class CriticEnsemble extends CriticBase {
+ decoder_qs: list
+ get_q_values(state, action)
}
AgentBase <|-- AgentPPO
AgentBase <|-- AgentSAC
AgentBase <|-- AgentTD3
AgentBase <|-- AgentDDPG
AgentBase <|-- AgentDQN
ActorBase <|-- ActorPPO
CriticBase <|-- CriticPPO
CriticBase <|-- CriticEnsemble
AgentPPO *-- ActorPPO : uses
AgentPPO *-- CriticPPO : uses
AgentSAC *-- ActorSAC : uses
AgentSAC *-- CriticEnsemble : uses
AgentTD3 *-- Actor : uses
AgentTD3 *-- CriticTwin : uses
@enduml
```
### 2. `elegantrl/train` Module Diagram (Core Components)
```puml
@startuml
skinparam classAttributeIconVisible false
class Config {
+ num_envs: int
+ agent_class: class
+ env_class: class
+ gamma: float
+ learning_rate: float
+ batch_size: int
+ horizon_len: int
+ buffer_size: int
+ gpu_id: int
+ init_before_training()
+ get_if_off_policy()
}
class SumTree {
+ buf_len: int
+ tree: Tensor
+ update_ids(data_ids, prob)
+ important_sampling(batch_size, beg, end, per_beta)
}
class ReplayBuffer {
+ max_size: int
+ num_seqs: int
+ states: Tensor
+ actions: Tensor
+ if_use_per: bool
+ sum_trees: list[SumTree]
+ update(items)
+ sample(batch_size)
+ sample_for_per(batch_size)
}
class Evaluator {
+ cwd: str
+ total_step: int
+ max_r: float
+ recorder: list
+ evaluate_and_save(actor, steps, exp_r, logging_tuple)
+ save_training_curve_jpg()
}
class SubEnv extends Process {
+ sub_pipe0: Pipe
+ vec_pipe1: Pipe
+ run()
}
class VecEnv {
+ num_envs: int
+ sub_envs: list[SubEnv]
+ sub_pipe1s: list[Pipe]
+ vec_pipe0: Pipe
+ reset()
+ step(action)
}
class Worker extends Process {
+ worker_pipe: Pipe
+ learner_pipe: Pipe
+ run()
}
class Learner extends Process {
+ recv_pipe: Pipe
+ send_pipes: list[Pipe]
+ run()
}
Config *-- ReplayBuffer : configures
ReplayBuffer *-- SumTree : uses (for PER)
Config *-- VecEnv : creates
VecEnv *-- SubEnv : manages
Learner *-- ReplayBuffer : updates
Learner *-- Worker : communicates
Learner *-- EvaluatorProc : communicates
Worker *-- VecEnv : uses
@enduml
```
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
The ElegantRL architecture is built around a set of highly modular and decoupled abstractions, primarily focused on the Actor-Critic paradigm and parallel execution.
1. **Agent (`AgentBase`)**: The central abstraction for any DRL algorithm. It encapsulates the policy (`act`), value function (`cri`), optimization logic, and exploration strategy. Concrete implementations like `AgentPPO` and `AgentSAC` inherit from this base class, ensuring a consistent interface for the training loop.
2. **Network (`ActorBase`, `CriticBase`)**: These define the neural network structures for the policy and value functions, respectively. They are decoupled from the agent logic, allowing for flexible network designs (e.g., `CriticTwin` for TD3, `CriticEnsemble` for SAC).
3. **Configuration (`Config`)**: A single source of truth for all hyperparameters, environment details, and training settings. This abstraction simplifies experiment management and ensures consistency across the entire framework.
4. **Experience Storage (`ReplayBuffer`, `SumTree`)**: Manages the collection and sampling of experience. The inclusion of `SumTree` for Prioritized Experience Replay (PER) highlights the focus on sample efficiency.
5. **Parallelism Components (`Learner`, `Worker`, `VecEnv`)**: These are the core components enabling the "Massively Parallel" design. The `Learner` handles model updates, while `Worker`s handle environment interaction, and `VecEnv` manages multiple environment instances in parallel processes (`SubEnv`).
**Design Philosophy: Massively Parallel and Modular DRL**
ElegantRL's design philosophy is centered on two main pillars:
1. **Decoupled Parallelism**: The framework adopts a clear separation between the **data collection** (exploration) and the **model update** (learning) phases, a design common in high-throughput DRL systems. `Worker` processes run in parallel to collect massive amounts of experience, which is then asynchronously sent to the `Learner` process for efficient GPU-based training. This maximizes hardware utilization and significantly speeds up training.
2. **Modularity and Extensibility**: The codebase is highly modular, with clear boundaries between the `agents`, `envs`, and `train` components. This modularity makes it easy to implement new algorithms (by extending `AgentBase`), integrate new environments, or swap out core components like the `ReplayBuffer`.
**Lifecycle Management**
The training lifecycle is managed by the `run.py` module:
1. **Initialization**: The `Config` object is initialized, and the `Learner`, `Worker`s, and `EvaluatorProc` processes are instantiated.
2. **Exploration (Worker)**: Each `Worker` process continuously interacts with its assigned `VecEnv` instances, collecting trajectories.
3. **Learning (Learner)**: The `Learner` receives batches of experience from all `Worker`s. It stores them in the `ReplayBuffer`, samples a batch, calculates the loss, updates the networks, and soft-updates the target networks.
4. **Synchronization**: The `Learner` periodically sends the updated policy network parameters back to the `Worker`s.
5. **Evaluation (Evaluator)**: The `Evaluator` process runs evaluation episodes, logs performance metrics, and handles model checkpointing.
#### 3.1.2. Component Interactions
The inter-component communication is primarily handled by Python's `multiprocessing.Pipe` for inter-process communication (IPC), enabling the asynchronous and parallel nature of the framework.
| Component | Role | Communication Pattern | Data Flow |
| :--- | :--- | :--- | :--- |
| **Worker** | Experience Collector | Sends data to `Learner` via `Pipe`. Receives model parameters from `Learner` via `Pipe`. | Trajectories (states, actions, rewards, etc.) -> `Learner`. Latest `Actor` state dict -> `Worker`. |
| **Learner** | Model Updater | Receives data from `Worker`s. Sends model to `Worker`s and `Evaluator`. | Trajectories from `Worker`s -> `ReplayBuffer`. Sampled batches from `ReplayBuffer` -> `Agent` for update. |
| **VecEnv** | Parallel Environment Manager | Manages multiple `SubEnv` processes using `Pipe`s. | Actions from `Worker` -> `SubEnv`. New states, rewards, dones from `SubEnv` -> `Worker`. |
| **ReplayBuffer** | Experience Storage | Accessed exclusively by the `Learner` process. | Stores trajectories from `Worker`s. Provides sampled batches to `Learner`'s `Agent`. |
| **Evaluator** | Performance Monitor | Receives training statistics from `Learner` via `Pipe`. | Training metrics (step, avgR, losses) -> `Evaluator`. |
**Key Interaction Flow (Off-Policy Training):**
1. **Exploration**: `Worker` receives the latest `Actor` from `Learner`.
2. **Data Collection**: `Worker` calls `agent.explore_env(VecEnv)`, which executes `VecEnv.step()` across all `SubEnv`s in parallel, collecting a batch of trajectories.
3. **Data Transfer**: `Worker` sends the collected trajectories (e.g., 2048 steps * 8 environments) to the `Learner` via a `Pipe`.
4. **Storage**: `Learner` receives the data and calls `ReplayBuffer.update()`.
5. **Learning**: `Learner` repeatedly calls `ReplayBuffer.sample()` and passes the batch to `agent.update_net()`.
6. **Synchronization**: After a set number of learning steps, `Learner` sends the updated `Actor` weights back to the `Worker`s.
7. **Monitoring**: Periodically, `Learner` sends performance metrics to the `Evaluator` for logging and checkpointing.
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml
skinparam defaultFontName Courier
skinparam classAttributeIconVisible false
skinparam packageStyle rectangle
title ElegantRL Overall Architecture
package "elegantrl.train" {
class Config
class ReplayBuffer
class Evaluator
class Learner extends Process
class Worker extends Process
class VecEnv
class SubEnv extends Process
}
package "elegantrl.agents" {
abstract class AgentBase
abstract class ActorBase
abstract class CriticBase
}
package "elegantrl.envs" {
class Environment
}
' Relationships
' 1. Configuration and Initialization
Config .> AgentBase : configures
Config .> ReplayBuffer : configures
Config .> VecEnv : configures
' 2. Agent Core
AgentBase <|-- AgentPPO
AgentBase <|-- AgentSAC
AgentBase <|-- AgentTD3
AgentBase *-- ActorBase : uses
AgentBase *-- CriticBase : uses
' 3. Training Loop Components
Learner *-- AgentBase : updates
Learner *-- ReplayBuffer : manages
Learner .> Evaluator : sends stats (Pipe)
Worker .> AgentBase : uses for exploration
Worker *-- VecEnv : collects data
' 4. Inter-Process Communication (IPC)
Worker .> Learner : sends data (Pipe)
Learner .> Worker : sends model (Pipe)
' 5. Environment Interaction
VecEnv *-- SubEnv : manages parallel instances
VecEnv .> Environment : wraps/uses
' 6. Data Flow
ReplayBuffer .> AgentBase : samples data
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
ElegantRL leverages several established software and reinforcement learning design patterns to achieve its modularity, stability, and performance goals.
1. **Actor-Critic Pattern (Reinforcement Learning Pattern)**
* **Description**: Separates the policy (Actor) that selects actions from the value function (Critic) that estimates the expected return.
* **Implementation**:
* `AgentBase` is the abstract base for the entire pattern.
* `ActorBase` and `CriticBase` define the network interfaces.
* **Example (AgentPPO.py)**: The `AgentPPO` class explicitly instantiates `self.act = ActorPPO(...)` and `self.cri = CriticPPO(...)`, and the `update_objectives` method uses both to calculate the actor and critic losses.
2. **Target Network Pattern (Reinforcement Learning Pattern)**
* **Description**: Used in off-policy algorithms (DDPG, TD3, SAC) to stabilize training by using a separate, delayed-update copy of the Q-network.
* **Implementation**:
* The `AgentBase` constructor initializes `self.act_target` and `self.cri_target`.
* The static method `AgentBase.soft_update(target_net, current_net, tau)` implements the exponential moving average (EMA) update rule.
* **Example (AgentTD3.py)**: The `update_objectives` method calculates the target Q-value using `next_q = self.cri_target.get_q_values(next_state, next_action).min(dim=1)[0]`.
3. **Factory Method Pattern (Software Design Pattern)**
* **Description**: Defines an interface for creating an object, but lets subclasses alter the type of objects that will be created.
* **Implementation**:
* The `Config` object stores `self.agent_class` and `self.env_class`.
* The `run.py` module uses these classes to instantiate the actual objects: `agent = args.agent_class(...)` and `env = build_env(args.env_class, ...)`.
4. **Strategy Pattern (Software Design Pattern)**
* **Description**: Defines a family of algorithms, encapsulates each one, and makes them interchangeable.
* **Implementation**:
* The core training loop in `run.py` interacts only with the `AgentBase` interface (`agent.explore_env`, `agent.update_net`).
* The specific implementation is encapsulated within the concrete strategy classes (`AgentPPO`, `AgentSAC`), making them interchangeable.
5. **Observer Pattern (Software Design Pattern)**
* **Description**: Defines a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automatically.
* **Implementation**:
* The `Learner` acts as the Subject, generating updated model parameters.
* The `Worker`s and `Evaluator` act as Observers, receiving the updated model parameters (or performance data) via the IPC `Pipe`s.
#### 3.3.2. Project Highlights
ElegantRL's design includes several innovative features that contribute to its high performance and usability:
* **Massively Parallel Architecture (Cloud-Native DRL)**: The core highlight is the clear separation of concerns into `Learner` (GPU-heavy computation) and multiple `Worker`s (CPU-heavy environment interaction), communicating via IPC. This design is highly scalable and is explicitly optimized for cloud-native DRL applications, allowing for efficient utilization of multi-core CPUs and single/multi-GPU setups.
* **Vectorized Environment Support (`VecEnv`)**: The framework natively supports running multiple environment instances in parallel within a single `Worker` process, dramatically increasing the data throughput (samples per second) and reducing the wall-clock time required for training. This is a crucial feature for on-policy algorithms like PPO.
* **Prioritized Experience Replay (PER) with `SumTree`**: The implementation of PER in `replay_buffer.py` using a dedicated `SumTree` data structure is a highlight. It ensures that the most "surprising" or high-error transitions are sampled more frequently, leading to faster convergence and better sample efficiency for off-policy methods.
* **Unified Agent Interface (`AgentBase`)**: By abstracting the core DRL logic into `AgentBase`, the framework provides a clean, consistent API for all algorithms (PPO, SAC, TD3, DQN, etc.). This significantly lowers the barrier to entry for users wanting to compare or switch between different algorithms.
* **Financial Reinforcement Learning Focus**: The inclusion of specialized environments like `StockTradingEnv` and the project's association with the AI4Finance-Foundation indicate a strong focus on applying DRL to complex financial problems, which often require the stability and efficiency ElegantRL provides.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
Based on the code structure and design, the following areas could be considered for improvement:
1. **Standardize Environment Interface**: The `elegantrl/envs` module contains custom environment implementations. While functional, adopting the latest Gymnasium API standards more strictly, possibly through a dedicated wrapper layer, would improve compatibility with the broader RL ecosystem and future-proof the environment integrations.
2. **Configuration Management**: The `Config` class is a simple data container. For large-scale experiments, migrating to a more robust configuration management system (e.g., Hydra, Gin-config) would allow for easier tracking, overriding, and composition of hyperparameter sets, especially for the multi-GPU and multi-process setups.
3. **Network Abstraction for Complex Architectures**: The current network building utility (`build_mlp`) is limited to simple Multi-Layer Perceptrons. Expanding the network module to include more complex, pre-built architectures (e.g., ResNets, attention-based models) or a more flexible network composition API would simplify the implementation of state-of-the-art DRL agents that require specialized network structures.
4. **Asynchronous Communication Overhead**: The reliance on Python's `multiprocessing.Pipe` for IPC, while simple, can introduce serialization/deserialization overhead, especially when transferring large batches of data (tensors) between `Worker` and `Learner`. Investigating more efficient IPC mechanisms like shared memory (e.g., PyTorch's `multiprocessing.shared_memory` or Ray) could further reduce latency and increase the overall throughput.
5. **Type Hinting and Documentation**: While type hints are present, expanding their use, especially in the core `AgentBase` and `run.py` components, along with more comprehensive docstrings, would significantly improve code readability and maintainability for secondary developers.
#### 3.4.2. Secondary Development Guide
For developers looking to extend or build upon the ElegantRL framework, the following guide provides the best path for code exploration and secondary development:
1. **Implement a New Agent (Algorithm)**:
* **Start with `AgentBase.py`**: Create a new class (e.g., `AgentNewRL`) that inherits from `AgentBase`.
* **Define Networks**: Implement the specific Actor and Critic network architectures required by the new algorithm (e.g., `ActorNewRL`, `CriticNewRL`), inheriting from `ActorBase` and `CriticBase`.
* **Override `__init__`**: Initialize the new agent, setting algorithm-specific hyperparameters and instantiating the new networks.
* **Override `update_objectives`**: This is the most critical step. Implement the algorithm's core loss functions and optimization steps here.
2. **Integrate a New Environment**:
* **Follow Gym/Gymnasium Standard**: Ensure the new environment implements the standard `__init__`, `reset`, and `step` methods.
* **Use `elegantrl/envs` as a Template**: If the environment is complex, use `StockTradingEnv.py` as a template for structuring the state, action, and reward logic.
* **Vectorization**: Ensure the environment is compatible with the `VecEnv` wrapper defined in `config.py` for high throughput.
3. **Explore the Training Workflow**:
* **Configuration**: All experiments start with `config.py`. Understand how to set `agent_class`, `env_class`, and key hyperparameters.
* **Execution**: The `run.py` module is the entry point. Focus on the `train_agent_multiprocessing` function to understand how `Learner` and `Worker` processes are launched and communicate.
* **Data Flow**: Trace the data from `Worker.run()` (collection) through the `Pipe` to `Learner.run()` (storage and update) to fully grasp the parallel data pipeline.
4. **Debugging and Monitoring**:
* **Logging**: Use the `Evaluator` in `evaluator.py` to monitor training progress.
* **PyTorch Debugging**: Standard PyTorch debugging techniques can be applied directly within the `update_objectives` methods.
================================================
FILE: thirdparty/FinCast-fts.md
================================================
# FinCast-fts - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
**Project Name:** FinCast-fts
**Project Path:** /home/ubuntu/FinCast-fts
```
/home/ubuntu/FinCast-fts
|____.git/ (EXCLUDE: Git version control metadata)
|____.gitattributes (EXCLUDE: Git configuration)
|____.gitignore (EXCLUDE: Files to ignore for Git)
|____Inference/ (EXCLUDE: Jupyter notebook for model inference and demonstration)
| |____inference_future.ipynb
|____LICENSE (EXCLUDE: Project license)
|____README.md (EXCLUDE: Project documentation)
|____dep_install.sh (EXCLUDE: Script for dependency installation)
|____env_setup.sh (EXCLUDE: Script for environment setup)
|____experiments/ (EXCLUDE: Scripts for running long-horizon benchmarks and evaluations)
| |____long_horizon_benchmarks/
| | |____Freq_map_eval.py
| | |____run_eval_ffm.py
| | |____run_eval_ffm_dataset.py
| | |____run_eval_ffm_stock.py
|____notebooks/ (EXCLUDE: Jupyter notebook for result summary and visualization)
| |____result_summary.ipynb
|____paper.pdf (EXCLUDE: Associated research paper)
|____peft_Fincast/ (CORE: Implementation for Parameter-Efficient Fine-Tuning (PEFT) integration)
| |____peft_injector.py
|____pics/ (EXCLUDE: Example images)
| |____example1_APPL.png
| |____example2_ETHUSD.png
|____requirement_v2.txt (EXCLUDE: Project dependencies list)
|____scripts/ (EXCLUDE: Shell scripts for running PEFT and evaluation)
| |____Fincast_PEFT/
| | |____local_4090_t1.sh
| |____Fincast_eval/
| | |____eval_stock_loop.sh
| | |____eval_stock_loop_supervised42.sh
|____setup.py (EXCLUDE: Python package setup file)
|____src/ (CORE: Main source code directory)
| |______init__.py
| |____data_tools/ (CORE: Data loading, processing, and batch sampling utilities)
| | |____Inference_dataset.py
| | |____TSdataset.py
| | |____batch_sampler.py
| | |____batch_sampler_ddp.py
| |____ffm/ (CORE: Core Financial Foundation Model (FFM) implementation)
| | |______init__.py
| | |____data_loader.py
| | |____ffm_base.py
| | |____ffm_torch_moe.py
| | |____pytorch_patched_decoder_MOE.py
| | |____time_features.py
| | |____xreg_lib.py
| |____st_moe_pytorch/ (CORE: Implementation of the Spatio-Temporal Mixture of Experts (ST-MoE) layer)
| | |______init__.py
| | |____distributed.py
| | |____st_moe_pytorch.py
| |____tools/ (CORE: General utility functions, metrics, model utils, and visualization)
| | |______init__.py
| | |____inference_utils.py
| | |____metrics.py
| | |____model_utils.py
| | |____result_vis_plt.ipynb
| | |____utils.py
| |____unit_test/ (EXCLUDE: Contains a unit test script)
| | |____BS_DDP_tc4.py
```
The project is organized into five core logical modules under the root and `src/` directory: `peft_Fincast` for model adaptation, `src/data_tools` for data pipeline, `src/ffm` for the core model logic, `src/st_moe_pytorch` for the MoE implementation, and `src/tools` for utilities. The rest of the folders contain non-core elements like scripts, notebooks, and documentation.
### 1.2. Core Folders for Analysis
- `/home/ubuntu/FinCast-fts/peft_Fincast`: Implementation for Parameter-Efficient Fine-Tuning (PEFT) integration.
- `/home/ubuntu/FinCast-fts/src/data_tools`: Data loading, processing, and batch sampling utilities.
- `/home/ubuntu/FinCast-fts/src/ffm`: Core Financial Foundation Model (FFM) implementation.
- `/home/ubuntu/FinCast-fts/src/st_moe_pytorch`: Spatio-Temporal Mixture of Experts (ST-MoE) layer implementation.
- `/home/ubuntu/FinCast-fts/src/tools`: General utility functions, metrics, and model utilities.
## Phase 2: Module-by-Module Deep Analysis
## Module Analysis
The FinCast-fts project is structured around a core deep learning model, the Financial Foundation Model (FFM), and its supporting infrastructure for data handling, training utilities, and inference. The architecture is heavily influenced by the TimesFM design, with significant modifications to incorporate a Spatio-Temporal Mixture of Experts (ST-MoE) layer.
### 1. Module: `peft_Fincast` (Parameter-Efficient Fine-Tuning)
* **Files**: `peft_injector.py`
* **Core Responsibility**: This module is responsible for integrating Parameter-Efficient Fine-Tuning (PEFT), specifically LoRA (Low-Rank Adaptation) or DoRA, into the pre-trained FFM. This allows for efficient fine-tuning of the large model on downstream tasks by only training a small fraction of new parameters.
* **Key Implementation Details**:
* **`wrap_with_peft` Function**: The main entry point, which takes the base model and LoRA hyperparameters (`lora_r`, `lora_alpha`, `lora_dropout`, `lora_targets_preset`). It uses the external `peft` library's `LoraConfig` and `get_peft_model` to inject the adapters.
* **Target Selection (`_default_targets`)**: Defines presets for selecting which linear layers (`nn.Linear`) within the FFM should receive LoRA adapters. Presets include:
* `attn`: Targets the attention mechanism's query/key/value projection (`qkv_proj`) and output projection (`o_proj`).
* `attn_mlp`: Extends `attn` to include the feed-forward layers in both the input and horizon blocks.
* `attn_mlp_gating`: Further extends to include the MoE gating mechanism (`moe.gate.to_gates`), indicating a focus on routing behavior.
* `experts_heavy`: Targets the most parameters by including the experts themselves (`experts.experts`, `gate_proj`, `down_proj`).
### 2. Module: `src/data_tools` (Data Handling and Batching)
* **Files**: `Inference_dataset.py`, `TSdataset.py`, `batch_sampler.py`, `batch_sampler_ddp.py`
* **Core Responsibility**: Manages the entire data pipeline, from reading raw CSV files to preparing batched, windowed, and optionally masked time-series data for both training and inference.
* **Key Implementation Details**:
* **`TimeSeriesDataset_MultiCSV_train_Production` (`TSdataset.py`)**: The primary training dataset class. It reads multiple CSVs, converts multi-column data into a collection of univariate series, applies Z-score normalization (`sklearn.preprocessing.StandardScaler`), and generates sliding windows with a configurable stride (`data_slice_interval`) and variable context lengths (`possible_context_lengths`). It also implements input masking (`mask_ratio`) for potential pre-training objectives.
* **`TimeSeriesDataset_SingleCSV_Inference` (`Inference_dataset.py`)**: A specialized dataset for inference on a single CSV, supporting both "last window" and "sliding window" modes. It returns metadata for traceability, which is crucial for post-inference analysis and plotting.
* **`GroupByLengthBatchSampler_Production` (`batch_sampler.py`)**: A custom PyTorch `BatchSampler` that groups samples by their context length (`get_length`). This is a critical optimization, as it eliminates the need for padding within a batch, maximizing GPU efficiency for the Transformer architecture.
* **`GroupByLengthBatchSampler_DDP` (`batch_sampler_ddp.py`)**: Extends the batch sampler for Distributed Data Parallel (DDP) training, ensuring that all ranks process a synchronized, deterministically shuffled subset of the data.
### 3. Module: `src/ffm` (Financial Foundation Model Core)
* **Files**: `data_loader.py`, `ffm_base.py`, `ffm_torch_moe.py`, `pytorch_patched_decoder_MOE.py`, `time_features.py`, `xreg_lib.py`
* **Core Responsibility**: Contains the model definition, configuration, base API, and components for handling time-series features and external regressors.
* **Key Implementation Details**:
* **`FFmBase` (`ffm_base.py`)**: Defines the abstract interface for the FFM API, including shared utilities like `_normalize` and `_renormalize` for per-time-series normalization. It also includes the complex logic for integrating **eXogenous Regressors (XReg)**, supporting two modes: "timesfm + xreg" (forecast residuals) and "xreg + timesfm" (forecast on residuals).
* **`FFmTorch` (`ffm_torch_moe.py`)**: The concrete PyTorch implementation of the FFM API. It initializes the core model (`PatchedTimeSeriesDecoder_MOE`) and implements the inference loop (`_forecast`), handling checkpoint loading (including compiled models) and device placement (CPU/GPU).
* **`PatchedTimeSeriesDecoder_MOE` (`pytorch_patched_decoder_MOE.py`)**: The main model class. It implements the Transformer-based decoder architecture, which operates on time-series patches.
* **Patching**: Input time-series are reshaped into patches (`[B, N, P]`) before being passed to the transformer.
* **Feature Injection**: It uses a `ResidualBlock` (`input_ff_layer`) to project the concatenated time-series patch and padding mask (`[P*2]`) into the model's hidden dimension.
* **Frequency Embedding**: A learnable embedding (`freq_emb`) is added to the input to condition the model on the time-series frequency (e.g., high, medium, low).
* **Output Head**: A final `ResidualBlock` (`horizon_ff_layer`) projects the transformer output to the prediction horizon, outputting both the mean and multiple quantiles.
* **`TimesFMDecoderLayer` (`pytorch_patched_decoder_MOE.py`)**: The core building block of the transformer stack. It consists of:
* **Attention**: `TimesFMAttention` (a standard multi-head attention with RMSNorm).
* **Mixture of Experts (MoE)**: `SparseMoEBlock` (from `st_moe_pytorch`) is used as the feed-forward network, which is the key architectural innovation.
* **`TimeCovariates` (`time_features.py`)**: Extracts a rich set of time-based features (minute, hour, day of week/month/year, month/week of year) and optional holiday features, which are then normalized.
### 4. Module: `src/st_moe_pytorch` (Spatio-Temporal MoE)
* **Files**: `distributed.py`, `st_moe_pytorch.py`
* **Core Responsibility**: Provides the implementation for the Mixture of Experts (MoE) layer, which is integrated into the FFM's transformer blocks. This module is adapted from a general-purpose MoE library.
* **Key Implementation Details**:
* **`MoE` (`st_moe_pytorch.py`)**: The main MoE class, composed of a `TopNGating` router and an `Experts` container.
* **`TopNGating`**: The router computes raw gate logits, applies Gumbel noise (during training), and uses a differentiable top-K selection to choose the top `top_n` experts for each token. It also calculates auxiliary losses (`balance_loss`, `router_z_loss`) to encourage balanced expert usage.
* **`Experts`**: A container for the individual `Expert` modules (which are simple MLPs). It handles the dispatching of tokens to the selected experts and combining the outputs.
* **`SparseMoEBlock`**: Wraps the `MoE` layer, adding pre- and post-feed-forward layers (`ff_before`, `ff_after`) and a residual connection, which is noted in the source code as a stabilization technique.
* **`distributed.py`**: Contains utility functions (`all_gather_variable_dim`, `AllGatherFunction`) for handling distributed communication (All-Gather) of variable-sized tensors, necessary for efficient distributed training of MoE models.
### 5. Module: `src/tools` (Utilities)
* **Files**: `inference_utils.py`, `metrics.py`, `model_utils.py`, `utils.py`
* **Core Responsibility**: Provides miscellaneous utilities for model loading, evaluation, metrics calculation, and visualization.
* **Key Implementation Details**:
* **`inference_utils.py`**: Contains the high-level `FinCast_Inference` class, which orchestrates the entire inference process: dataset creation, model loading, running the `DataLoader`, and post-processing the results. It also includes functions for plotting (`plot_last_outputs`) and saving outputs to CSV.
* **`metrics.py`**: Implements standard time-series evaluation metrics using NumPy, including MAE, MSE, RMSE, MAPE, MSPE, RSE, and CORR.
* **`model_utils.py`**: Simple helper to instantiate the FFM model (`FFM`) and its configuration (`FFmHparams`) from a checkpoint path.
* **`utils.py`**: Provides logging and parameter counting utilities (`log_model_statistics`) for tracking model size and configuration.
---
## Module PlantUML Diagrams
### 1. Module: `peft_Fincast`
```puml
@startuml peft_Fincast
skinparam classAttributeIconSize 0
package "peft_Fincast" {
class peft_injector {
+ wrap_with_peft(model, ...)
-- Private --
- _default_targets(model, preset)
- resolve_linear_targets(model, patterns)
- _unfreeze_all_params(model)
}
}
package "External: peft" {
class LoraConfig
class get_peft_model
}
package "External: torch" {
class nn.Module
class nn.Linear
}
peft_injector ..> LoraConfig : uses
peft_injector ..> get_peft_model : uses
peft_injector ..> nn.Module : operates on
peft_injector ..> nn.Linear : targets
@enduml
```
### 2. Module: `src/data_tools`
```puml
@startuml data_tools
skinparam classAttributeIconSize 0
package "data_tools" {
class TimeSeriesDataset_MultiCSV_train_Production {
+ __init__(...)
+ __len__()
+ get_length(idx)
+ __getitem__(idx)
-- Private --
- _read_csvs()
- _prepare_index_records()
}
class TimeSeriesDataset_SingleCSV_Inference {
+ __init__(...)
+ __len__()
+ get_length(idx)
+ __getitem__(idx)
-- Private --
- _make_meta(series_idx, window_start)
}
class GroupByLengthBatchSampler_Production {
+ __init__(dataset, batch_size, ...)
+ __iter__()
+ __len__()
}
class GroupByLengthBatchSampler_DDP {
+ __init__(dataset, batch_size, ...)
+ __iter__()
+ __len__()
+ set_epoch(epoch)
}
object function {
+ freq_reader(file_path, freq_dict, mode)
}
}
TimeSeriesDataset_MultiCSV_train_Production ..> function : uses freq_reader
TimeSeriesDataset_SingleCSV_Inference ..> function : uses freq_reader
GroupByLengthBatchSampler_Production ..> TimeSeriesDataset_MultiCSV_train_Production : operates on
GroupByLengthBatchSampler_DDP ..> TimeSeriesDataset_MultiCSV_train_Production : operates on
TimeSeriesDataset_MultiCSV_train_Production .up.|> torch.utils.data.Dataset
TimeSeriesDataset_SingleCSV_Inference .up.|> torch.utils.data.Dataset
GroupByLengthBatchSampler_DDP .up.|> torch.utils.data.Sampler
GroupByLengthBatchSampler_Production .up.|> torch.utils.data.BatchSampler
@enduml
```
### 3. Module: `src/ffm` (Core Model)
```puml
@startuml ffm_core
skinparam classAttributeIconSize 0
package "ffm" {
class FFmHparams << (D,orchid) dataclass >> {
+ context_len : int
+ horizon_len : int
+ num_experts : int
+ gating_top_n : int
+ ...
}
abstract class FFmBase {
+ __init__(hparams, checkpoint, ...)
+ forecast(...)
+ forecast_on_df(...)
-- Private --
- _preprocess(inputs, freq)
- _forecast(...)
}
class FFmTorch {
+ __init__(hparams, checkpoint, ...)
+ load_from_checkpoint_ffm(checkpoint)
+ model_eval_mode()
-- Private --
- _forecast(...)
}
class PatchedTimeSeriesDecoder_MOE {
+ config : FFMConfig
+ input_ff_layer : ResidualBlock
+ horizon_ff_layer : ResidualBlock
+ stacked_transformer : StackedDecoder
+ decode(...)
+ forward(...)
-- Private --
- _preprocess_input(...)
- _postprocess_output(...)
- _forward_transform(...)
- _reverse_transform(...)
}
class TimeSeriesdata << (T,yellow) TensorFlow >> {
+ __init__(...)
+ train_gen()
+ test_val_gen(mode, shift)
+ tf_dataset(mode, shift)
}
class TimeCovariates {
+ __init__(datetimes, ...)
+ get_covariates()
}
class BatchedInContextXRegLinear {
+ fit(...)
+ create_covariate_matrix(...)
}
}
FFmTorch --|> FFmBase
FFmTorch o-- PatchedTimeSeriesDecoder_MOE : wraps
FFmBase o-- FFmHparams : config
FFmBase ..> BatchedInContextXRegLinear : uses for XReg
TimeSeriesdata ..> TimeCovariates : uses
PatchedTimeSeriesDecoder_MOE ..> FFMConfig : config
PatchedTimeSeriesDecoder_MOE ..> StackedDecoder : contains
PatchedTimeSeriesDecoder_MOE ..> ResidualBlock : contains
@enduml
```
### 4. Module: `src/st_moe_pytorch` (Spatio-Temporal MoE)
```puml
@startuml st_moe_pytorch
skinparam classAttributeIconSize 0
package "st_moe_pytorch" {
class MoE {
+ gate : TopNGating
+ experts : Experts
+ forward(x, ...) : MixtureOfExpertsReturn
}
class SparseMoEBlock {
+ moe : MoE
+ ff_before : Expert
+ ff_after : Expert
+ forward(x, ...) : MixtureOfExpertsReturn
}
class TopNGating {
+ to_gates : nn.Linear
+ forward(x, ...) : dispatch_tensor, combine_tensor, ...
}
class Experts {
+ experts : ModuleList
+ forward(x, ...)
}
class Expert {
+ gate_proj : nn.Linear
+ down_proj : nn.Linear
+ forward(x, paddings)
}
class AllGatherFunction << (F,darkgreen) Distributed >>
class AllGather << (M,darkgreen) Distributed >>
}
SparseMoEBlock o-- MoE
MoE o-- TopNGating
MoE o-- Experts
Experts o-- Expert
TopNGating ..> AllGather : uses (indirectly via distributed utils)
@enduml
```
### 5. Module: `src/tools` (Utilities)
```puml
@startuml tools
skinparam classAttributeIconSize 0
package "tools" {
class FinCast_Inference {
+ __init__(config)
+ run_inference(...)
-- Private --
- _make_inference_loader(...)
}
object function {
+ plot_last_outputs(...)
+ _save_outputs_to_csv(...)
+ get_model_api(...)
+ log_model_statistics(...)
+ MAE, MSE, RMSE, MAPE, RSE, CORR
}
}
FinCast_Inference ..> data_tools.TimeSeriesDataset_SingleCSV_Inference : creates
FinCast_Inference ..> ffm.FFmTorch : loads model API
FinCast_Inference ..> function : uses utilities
@enduml
```
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
## Core Abstractions, Design Philosophy, and Lifecycle Management
The FinCast-fts project implements a sophisticated architecture for financial time-series forecasting, centered on the **Financial Foundation Model (FFM)**. Its design is characterized by a set of powerful abstractions and a clear philosophy focused on scalability, efficiency, and predictive richness.
### Core Abstractions
The system is built upon four primary abstractions that govern how time-series data is processed and modeled:
1. **Time-Series Patch**: The fundamental unit of data input is not a single time step but a **patch** (defined by `patch_len`, typically 32). The input time-series is segmented into a sequence of overlapping or non-overlapping patches, transforming the 1D series into a 2D sequence (`[N_patches, Patch_len]`). This patching mechanism is a core component of the underlying TimesFM architecture, enabling the Transformer to process local temporal patterns efficiently.
2. **Frequency Embedding**: The model explicitly handles time-series data of varying frequencies (e.g., high, medium, low) by introducing a **Frequency Embedding** (`freq_emb`). This categorical embedding is added to the input representation, allowing the single FFM to condition its internal weights and attention mechanisms based on the inherent periodicity and characteristics of the input data.
3. **Spatio-Temporal Mixture of Experts (ST-MoE)**: This is the central architectural innovation. The traditional Feed-Forward Network (FFN) within the Transformer block is replaced by a **SparseMoEBlock**. This abstraction allows the model to scale its parameter count dramatically (via multiple "experts") while maintaining a constant computational cost during inference. For any given input token (patch), a router selects only the top $K$ experts to process the data, enabling high capacity with sparse activation.
4. **Quantile Forecast**: The model's output is abstracted beyond a simple point prediction (mean/median). The final layer predicts a full set of **Quantiles** (e.g., 0.1, 0.5, 0.9), providing a complete predictive distribution. This is essential for financial applications where risk assessment and uncertainty quantification are critical.
### Design Philosophy
The project's design adheres to three key philosophical tenets:
* **Foundation Model Paradigm**: The FFM is designed as a large, pre-trained model capable of zero-shot or few-shot generalization across diverse financial time-series datasets. The goal is to capture universal temporal patterns and financial market dynamics, making it a powerful base model for various downstream tasks.
* **Efficiency and Scalability**: The combination of **ST-MoE** and **Parameter-Efficient Fine-Tuning (PEFT)** drives the efficiency philosophy. ST-MoE ensures that the model can scale its capacity (number of experts) without a proportional increase in computational load. PEFT, implemented via the `peft_Fincast` module, allows for rapid, low-resource fine-tuning by only training small, low-rank adapters (LoRA) instead of the entire massive model.
* **Data-Centric Optimization**: The use of the custom `GroupByLengthBatchSampler` is a pragmatic design choice to maximize hardware utilization. By grouping time-series samples by their context length, the system eliminates the need for zero-padding within batches, ensuring that all computation is meaningful and accelerating the training process significantly.
### Lifecycle Management
The project's lifecycle is clearly delineated across its modules:
| Phase | Module(s) Responsible | Key Components |
| :--- | :--- | :--- |
| **Data Ingestion & Preparation** | `src/data_tools`, `src/ffm/time_features.py` | `TSdataset`, `Inference_dataset`, `TimeCovariates` |
| **Training Optimization** | `src/data_tools` | `GroupByLengthBatchSampler_Production`, `GroupByLengthBatchSampler_DDP` |
| **Model Definition & Training** | `src/ffm`, `src/st_moe_pytorch` | `PatchedTimeSeriesDecoder_MOE`, `MoE`, `Expert` |
| **Model Adaptation** | `peft_Fincast` | `peft_injector.py` (LoRA/DoRA) |
| **Inference & Evaluation** | `src/tools` | `FinCast_Inference`, `metrics.py`, `plot_last_outputs` |
The `FinCast_Inference` class acts as the central orchestrator for the inference lifecycle, managing the loading of the model, the data flow from the `Inference_dataset`, and the final post-processing and visualization of the quantile forecasts.
#### 3.1.2. Component Interactions
## Component Interactions, Data Flow, and Communication Patterns
The FinCast-fts architecture is a tightly integrated system where data flows sequentially from raw input through data preparation, model processing, and finally to output generation. The core interaction pattern is a pipeline-style data transformation, with a critical internal loop governed by the Mixture of Experts (MoE) mechanism.
### 1. Data Flow Pipeline
The overall data flow can be broken down into three main stages:
| Stage | Source Module | Destination Module | Data Transformation |
| :--- | :--- | :--- | :--- |
| **Input & Preprocessing** | Raw CSV Files | `src/data_tools` | Raw time-series data is read, normalized (Z-score), and segmented into context windows and future horizons. Time features (e.g., day of week, month) are extracted by `TimeCovariates` and potentially used as eXogenous Regressors (XReg). |
| **Model Forward Pass** | `src/data_tools` (Batches) | `src/ffm` (Model) | Batches of time-series windows (`x_context`, `x_padding`, `freq`) are fed into the `PatchedTimeSeriesDecoder_MOE`. The input is patched, normalized, and embedded with frequency information. |
| **Output & Post-processing** | `src/ffm` (Forecasts) | `src/tools` | The model outputs a tensor of mean and quantile forecasts. This is denormalized, sliced to the required horizon, and then processed by `FinCast_Inference` for saving to CSV or visualization (`plot_last_outputs`). |
### 2. Core Model Interaction: The Transformer Block with MoE
The most complex interaction occurs within the `PatchedTimeSeriesDecoder_MOE` (the FFM). Each layer of the `StackedDecoder` (a `TimesFMDecoderLayer`) involves a sequence of interactions:
1. **Input**: The hidden state (`hidden_states`) from the previous layer enters the current layer.
2. **Attention**: The hidden state first passes through the **TimesFMAttention** module. This is a standard self-attention mechanism, where the input interacts with itself to capture long-range temporal dependencies.
3. **Normalization**: The output of the attention block is normalized using **RMSNorm** before entering the MoE block.
4. **MoE Routing (Sparse Activation)**:
* The normalized hidden state enters the **SparseMoEBlock**.
* The **TopNGating** module (the router) calculates the probability of sending the token (patch) to each expert.
* It selects the top $K$ experts (e.g., $K=2$) based on these probabilities.
* A **Dispatch Tensor** is created, which maps each token to its selected expert(s) and their position within the expert's mini-batch.
5. **Expert Computation**:
* The tokens are dispatched to the **Experts** module.
* Each expert (a simple MLP) processes its assigned subset of tokens in parallel.
6. **MoE Combination**:
* The **Combine Tensor** (containing the weights from the router) is used to aggregate the outputs from the activated experts back into the original sequence order and dimension.
7. **Output**: The combined output is added to the input via a residual connection, and the process repeats for the next layer.
This sparse activation pattern is the key communication pattern: it ensures that only a small, dynamic subset of the model's total parameters is activated for any given input, enabling the model's high capacity.
### 3. Communication Patterns (Distributed)
The `src/st_moe_pytorch/distributed.py` module reveals the project's design for handling distributed training (DDP), which is essential for scaling MoE models:
* **All-Gather for Variable-Sized Tensors**: The `AllGather` class and its underlying `AllGatherFunction` are designed to collect tensors from all Distributed Data Parallel (DDP) ranks. Crucially, it handles **variable sequence lengths** (`all_gather_variable_dim`).
* In a typical MoE setup, the tokens dispatched to an expert on one GPU might have a different batch size than the tokens dispatched to the same expert on another GPU.
* The `AllGather` mechanism ensures that the necessary data is collected across all ranks, padded to a uniform size (`max_size`), and then unpadded after the operation, allowing for correct processing and gradient flow in a distributed environment.
This pattern is a low-level optimization to ensure that the MoE's routing and expert computation can be correctly synchronized and scaled across multiple GPUs.
### 4. External Regressor (XReg) Interaction
The `FFmBase` class includes complex logic for integrating external regressors using `xreg_lib.py`. This interaction is highly configurable:
* **Data Preparation**: The `BatchedInContextXRegLinear` class prepares the time-series data (`targets`) and the external covariates (numerical, categorical, static, dynamic) into a flattened, batched matrix format (`x_train`, `x_test`).
* **Two-Way Interaction**:
* **Mode 1 (`timesfm + xreg`)**: The FFM forecasts the time-series, and the XReg model is trained on the *residuals* (the difference between the FFM's forecast and the true value). The final forecast is the FFM output plus the XReg residual forecast.
* **Mode 2 (`xreg + timesfm`)**: The XReg model is trained on the *raw time-series*. The FFM is then trained on the *residuals* (the difference between the XReg model's forecast and the true value). The final forecast is the XReg output plus the FFM residual forecast.
This flexible interaction pattern allows the FFM to focus on complex, non-linear temporal dependencies while offloading the modeling of linear, exogenous effects to a simpler, more interpretable linear regression model.
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml FinCast_Architecture_v4
!theme toy
title FinCast-fts Overall Architecture
' Define Modules (Packages)
package "Data Pipeline (src/data_tools)" as Data {
class TSdataset
class Inference_dataset
class BatchSampler
}
package "Model Core (src/ffm)" as FFM {
class FFmTorch
class PatchedTimeSeriesDecoder_MOE
class TimeCovariates
class BatchedInContextXRegLinear
}
package "MoE Implementation (src/st_moe_pytorch)" as MoE {
class SparseMoEBlock
class MoE_Router
class Expert_MLP
}
package "Utilities & Inference (src/tools)" as Tools {
class FinCast_Inference
class Metrics
}
package "Adaptation (peft_Fincast)" as PEFT {
class peft_injector
}
' External Entities
[Raw CSV Data] as RawData
[External Libraries] as ExtLibs
' 1. Data Flow
RawData --> TSdataset : Reads
TSdataset --> FFmTorch : Supplies Batches
' 2. Model Instantiation and Configuration
FFmTorch o-- PatchedTimeSeriesDecoder_MOE : Instantiates
' 3. Model Structure (FFM)
PatchedTimeSeriesDecoder_MOE o-- SparseMoEBlock : Uses (in Transformer Layer)
PatchedTimeSeriesDecoder_MOE ..> TimeCovariates : Uses for Time Features
PatchedTimeSeriesDecoder_MOE ..> BatchedInContextXRegLinear : Uses for XReg
' 4. MoE Structure
SparseMoEBlock o-- MoE_Router : Routes Tokens
SparseMoEBlock o-- Expert_MLP : Executes Computation
' 5. Inference and Output
FinCast_Inference ..> Inference_dataset : Uses Dataset
FinCast_Inference ..> FFmTorch : Calls Forecast API
FinCast_Inference ..> Metrics : Calculates Performance
FFmTorch --> FinCast_Inference : Returns Forecasts
' 6. Adaptation
peft_injector ..> PatchedTimeSeriesDecoder_MOE : Wraps Model for Fine-Tuning
' 7. Data Flow within Data Module
TSdataset ..> BatchSampler : Uses for Batching
' 8. External Dependencies
ExtLibs .up.> MoE : (einops, torch.distributed)
ExtLibs .up.> PEFT : (peft library)
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
## Design Patterns
The FinCast-fts codebase employs several established software design patterns and specialized architectural patterns common in deep learning to achieve modularity, flexibility, and performance.
### 1. Architectural Pattern: Mixture of Experts (MoE)
The core architectural pattern is the **Mixture of Experts (MoE)**, which is implemented in the `src/st_moe_pytorch` module and integrated into the FFM's transformer layers.
* **Pattern**: Replaces the standard Feed-Forward Network (FFN) with a collection of expert networks and a trainable gating network (router).
* **Implementation**:
* The `MoE` class in `st_moe_pytorch.py` encapsulates the entire mechanism.
* The `TopNGating` component acts as the router, using a soft-max over logits to determine the weight of each expert for a given token.
* The `Expert` class represents the individual, specialized MLPs.
* **Code Example (from `st_moe_pytorch.py`):**
```python
# MoE class initialization
self.gate = TopNGating(...)
self.experts = Experts(...)
# MoE forward pass
dispatch_tensor, combine_tensor, balance_loss, router_z_loss = self.gate(x, ...)
expert_inputs = einsum('b n d, b n e c -> b e c d', x, dispatch_tensor)
expert_outputs = self.experts(expert_inputs, ...)
output = einsum('b e c d, b n e c -> b n d', expert_outputs, combine_tensor)
```
### 2. Structural Pattern: Adapter
The **Adapter Pattern** is used to reconcile the core model implementation with the desired external API interface.
* **Pattern**: Converts the interface of a class into another interface clients expect.
* **Implementation**: The `FFmTorch` class (`ffm_torch_moe.py`) acts as an adapter, inheriting from the abstract `FFmBase` (`ffm_base.py`) and wrapping the concrete PyTorch model (`PatchedTimeSeriesDecoder_MOE`). This allows the model to conform to the TimesFM-inspired API (`forecast`, `forecast_on_df`) while using a custom PyTorch implementation.
### 3. Behavioral Pattern: Strategy
The integration of eXogenous Regressors (XReg) follows the **Strategy Pattern**, allowing the user to select one of two distinct XReg integration methods at runtime.
* **Pattern**: Defines a family of algorithms, encapsulates each one, and makes them interchangeable.
* **Implementation**: The `FFmBase` class's `forecast_with_xreg` method accepts an `xreg_mode` parameter (`"timesfm + xreg"` or `"xreg + timesfm"`), which determines the strategy for combining the FFM forecast with the linear regressor (`BatchedInContextXRegLinear`).
### 4. Creational Pattern: Factory Method
A simple form of the **Factory Method Pattern** is used for model instantiation.
* **Pattern**: Defines an interface for creating an object, but lets subclasses decide which class to instantiate.
* **Implementation**: The `get_model_FFM` function in `src/tools/model_utils.py` centralizes the logic for creating the FFM model instance (`FFM`) and its configuration (`FFmHparams`) from a checkpoint path, abstracting the complex setup from the main inference logic.
### 5. Idiomatic Pattern: Skip Connections (Residual Block)
The **Residual Block** pattern is fundamental to the stability and training of deep neural networks.
* **Pattern**: Adds the input of a layer to its output, bypassing one or more layers.
* **Implementation**:
* The `ResidualBlock` class in `pytorch_patched_decoder_MOE.py` explicitly implements this pattern for the input and horizon feed-forward layers.
* The `TimesFMDecoderLayer` and `SparseMoEBlock` also utilize residual connections around their main computational units (attention and MoE).
* **Code Example (from `pytorch_patched_decoder_MOE.py`):**
```python
class ResidualBlock(nn.Module):
# ... (hidden_layer, output_layer, residual_layer defined)
def forward(self, x):
hidden = self.hidden_layer(x)
output = self.output_layer(hidden)
residual = self.residual_layer(x)
return output + residual # The skip connection
```
#### 3.3.2. Project Highlights
## Project Highlights
The FinCast-fts project showcases several innovative features and design choices that contribute to its effectiveness, extensibility, and efficiency in financial time-series forecasting.
* **Spatio-Temporal Mixture of Experts (ST-MoE) Integration**:
* **Highlight**: The core innovation is the seamless integration of the MoE architecture into the Transformer decoder, replacing the standard FFN. This allows the model to achieve a massive parameter count (high capacity) while maintaining a low, constant computational cost during the forward pass (sparse activation).
* **Benefit**: This is crucial for foundation models, as it enables the FFM to learn highly specialized patterns (experts) for different types of time-series or market regimes without becoming prohibitively slow or expensive to run. The `st_moe_pytorch` module, with its custom `TopNGating` and auxiliary loss functions, ensures the experts are used efficiently and balanced during training.
* **Efficient Training via Length-Based Batching**:
* **Highlight**: The use of the custom `GroupByLengthBatchSampler` in `src/data_tools` is a significant performance optimization. This sampler groups time-series samples with identical context lengths into the same batch.
* **Benefit**: In a Transformer architecture, padding is a major source of wasted computation. By eliminating intra-batch padding, the project maximizes the utilization of GPU memory and compute, leading to faster training times and higher throughput, especially when dealing with time-series of varying lengths.
* **Parameter-Efficient Fine-Tuning (PEFT) Support**:
* **Highlight**: The dedicated `peft_Fincast` module provides first-class support for PEFT techniques like LoRA and DoRA. It includes predefined presets (`attn`, `attn_mlp_gating`, `experts_heavy`) to target specific layers for adapter injection.
* **Benefit**: This design choice directly addresses the challenge of fine-tuning large foundation models. Instead of retraining the entire FFM, users can fine-tune a small set of parameters (the adapters) for a new task, drastically reducing training time, memory footprint, and storage requirements for task-specific models. This enhances the model's **extensibility** to new financial datasets.
* **Comprehensive Time-Series Feature Engineering**:
* **Highlight**: The `TimeCovariates` class in `src/ffm/time_features.py` extracts a rich, normalized set of temporal features (e.g., minute-of-hour, day-of-year, holiday proximity).
* **Benefit**: This feature set provides the model with explicit, high-quality information about the time context, which is vital for financial data where seasonality and calendar effects (like holidays) are strong predictors. This design improves the model's **flexibility** and predictive power across different time granularities.
* **Quantile Forecasting for Risk Management**:
* **Highlight**: The model's output head is designed to predict not just the mean, but a full distribution of quantiles (e.g., 0.1 to 0.9).
* **Benefit**: In finance, point forecasts are often insufficient. By providing a full predictive distribution, the FFM enables advanced risk management, Value-at-Risk (VaR) calculations, and confidence interval estimation, making the model's output more **actionable** for trading and investment strategies.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
## Improvement Suggestions
Based on the comprehensive analysis of the FinCast-fts codebase, the following suggestions are proposed to address potential performance bottlenecks, optimize the architecture, and enhance code quality.
### 1. Performance Bottlenecks and Optimization
| Area | Suggestion | Rationale and Impact |
| :--- | :--- | :--- |
| **Data Loading (CPU)** | Implement a more efficient data loading mechanism for large-scale datasets, potentially using Apache Arrow or Parquet format instead of CSV. | The current implementation in `TSdataset.py` relies on `pd.read_csv` and `np.vstack`, which can be slow and memory-intensive for massive financial datasets. Using columnar formats and memory-mapped files can significantly reduce I/O overhead and memory usage. |
| **XReg Solver** | Replace the JAX-based `BatchedInContextXRegLinear` with a PyTorch-native or highly optimized C++/CUDA linear algebra solver (e.g., using `torch.linalg.solve`). | The current XReg implementation in `xreg_lib.py` uses JAX, which introduces a dependency on a separate ecosystem and requires data transfer between PyTorch (model) and JAX (XReg). A unified PyTorch solution would eliminate this overhead and simplify the dependency stack. |
| **MoE Dispatch** | Optimize the MoE dispatch and combine operations for GPU. | The `st_moe_pytorch` module relies heavily on `einsum` and tensor manipulation (`rearrange`, `pack`, `unpack`). While flexible, these operations can be less performant than highly optimized custom CUDA kernels used in production-grade MoE implementations (e.g., Fairseq's Fused MoE). Investigating a fused kernel implementation for the dispatch/combine steps could yield significant speedups. |
### 2. Architecture Optimization
* **Decouple FFM from XReg**: The tight coupling of the FFM (`FFmBase`) with the XReg logic makes the core model API complex. It is recommended to separate the XReg functionality into a standalone wrapper class that takes a trained FFM model and applies the XReg logic externally. This would simplify the `FFmBase` interface and make the core model more modular.
* **Standardize Configuration Management**: The current configuration is spread across `FFmHparams` (dataclass) and `FFMConfig` (dataclass). It is recommended to consolidate all hyperparameters into a single, canonical configuration class (e.g., using `dataclasses` or `pydantic`) and pass this single object throughout the system. This improves clarity and reduces the risk of inconsistent parameter settings.
* **Refactor `pytorch_patched_decoder_MOE.py`**: This file is excessively large (over 800 lines) and contains multiple classes (`FFMConfig`, `TimesFMAttention`, `TimesFMDecoderLayer`, `PatchedTimeSeriesDecoder_MOE`). Breaking this file into smaller, more focused modules (e.g., `attention.py`, `decoder_layer.py`, `model.py`) would significantly improve code navigation and maintainability.
### 3. Code Quality and Maintainability
* **Type Hinting and Docstrings**: While type hints are present, consistency can be improved, especially in utility functions and complex tensor manipulation code. Comprehensive docstrings following a standard format (e.g., Google or NumPy style) should be added to all public methods and classes, particularly in the `st_moe_pytorch` module, which is complex due to its distributed nature.
* **Remove Redundant TensorFlow Code**: The `src/ffm/data_loader.py` file contains a TensorFlow-based data loader (`TimeSeriesdata`). Since the rest of the project is PyTorch-native, this file appears to be vestigial code from the original TimesFM project. It should be removed or clearly marked as deprecated to avoid confusion and unnecessary dependencies.
* **Consistent Naming Conventions**: The project uses a mix of naming conventions (e.g., `FFmTorch`, `PatchedTimeSeriesDecoder_MOE`, `peft_injector`). Adopting a consistent style (e.g., all classes using `PascalCase` and all functions using `snake_case`) across all modules would enhance readability.
#### 3.4.2. Secondary Development Guide
## Secondary Development Guide
This guide provides a structured approach for exploring the FinCast-fts codebase and conducting secondary development, such as fine-tuning, adding new features, or integrating new data sources.
### 1. Code Exploration Path
To understand the project, follow the data flow and model architecture sequentially:
1. **Data Preparation (`src/data_tools`)**:
* Start with `src/data_tools/TSdataset.py` to understand how raw CSV data is converted into univariate time-series and how sliding windows are generated for training.
* Examine `src/data_tools/batch_sampler.py` to grasp the length-based batching optimization, which is crucial for efficient training.
2. **Model Core and Configuration (`src/ffm`)**:
* Review `src/ffm/ffm_base.py` and `src/ffm/ffm_torch_moe.py` to understand the high-level API and model loading process.
* The core model logic is in `src/ffm/pytorch_patched_decoder_MOE.py`. Focus on the `PatchedTimeSeriesDecoder_MOE` class, particularly the `_preprocess_input` method (patching, normalization) and the `forward` method (Transformer stack, frequency embedding).
3. **Architectural Innovation (`src/st_moe_pytorch`)**:
* Deep dive into `src/st_moe_pytorch/st_moe_pytorch.py`. This module defines the MoE mechanism. Understanding the `TopNGating` (router) and `MoE` (expert dispatch/combine) is key to modifying the model's capacity or routing behavior.
### 2. Best Practices for Fine-Tuning (PEFT)
The recommended path for secondary development is **Parameter-Efficient Fine-Tuning (PEFT)** using the provided `peft_Fincast` module.
* **Select a Target Preset**: Use the `peft_injector.py` to wrap your pre-trained FFM. Start with a minimal preset like `"attn"` or `"attn_mlp"` to ensure stability. For maximum capacity increase, use `"experts_heavy"`.
* **Hyperparameter Tuning**: Focus on tuning the LoRA rank (`lora_r`) and alpha (`lora_alpha`). A higher rank increases the number of trainable parameters and model capacity but also increases memory usage.
* **Training Loop**: The fine-tuning process should be identical to the original training loop, but only the LoRA adapter parameters will have `requires_grad=True`.
### 3. Adding New Features
* **New Time Features**: To add a new temporal covariate (e.g., lunar cycle, specific market hours), modify the `TimeCovariates` class in `src/ffm/time_features.py`. Ensure the new feature is correctly normalized and added to the output DataFrame.
* **New Exogenous Regressors (XReg)**: If you are adding new external data (e.g., sentiment scores, macroeconomic indicators), ensure they are prepared in the `FFmBase`'s `forecast_with_xreg` method and integrated into the `BatchedInContextXRegLinear` in `src/ffm/xreg_lib.py`. This requires providing the new data as `dynamic_numerical_covariates` or `static_numerical_covariates` to the XReg fitting process.
* **Custom Expert**: To experiment with a different expert architecture (e.g., a different activation function or a deeper MLP), modify the `Expert` class definition in `src/st_moe_pytorch/st_moe_pytorch.py`. Ensure the input and output dimensions remain consistent with the model's `hidden_size`.
================================================
FILE: thirdparty/FinGPT.md
================================================
# FinGPT - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
The FinGPT repository is structured as a collection of distinct, yet related, sub-projects, each focusing on a specific financial application of Large Language Models (LLMs). This modular structure facilitates independent development and deployment of different FinLLM capabilities.
```
/home/ubuntu/FinGPT/
├── fingpt/
│ ├── FinGPT_Benchmark/ # Module 1: Benchmarking and Fine-tuning Utilities
│ │ ├── benchmarks/ # Contains scripts for various financial NLP benchmarks (e.g., ConvFinQA, FiQA).
│ │ ├── data/ # Data download and preparation scripts for benchmarks.
│ │ ├── train_lora.py # Script for LoRA-based fine-tuning of models on benchmark datasets.
│ │ └── utils.py # Utility functions for model path parsing, dataset loading, and tokenization.
│ ├── FinGPT_FinancialReportAnalysis/ # Module 2: Financial Report Analysis (RAG)
│ │ ├── reportanalysis.ipynb # Jupyter notebook demonstrating the RAG analysis flow.
│ │ └── utils/ # Core RAG implementation, including document formatting and clustering (Raptor).
│ │ ├── earning_calls.py # Utilities for processing earning call transcripts.
│ │ ├── format_pdf.py # Utilities for formatting PDF documents.
│ │ └── rag.py # Core implementation of the Recursive Abstractive Clustering (Raptor) RAG system.
│ ├── FinGPT_Forecaster/ # Module 3: Financial Forecasting
│ │ ├── AAAI-Good-Data/ # Sub-module for a specific dataset/training configuration (e.g., AAAI paper data).
│ │ ├── FinGPT-Forecaster-Chinese/ # Sub-module for Chinese-specific forecasting data and models.
│ │ ├── app.py # Streamlit or Flask application for the forecaster interface.
│ │ ├── data_pipeline.py # Script for data acquisition, prompt generation, and dataset creation.
│ │ ├── data.py # Core data preparation functions.
│ │ ├── indices.py # Definitions of financial indices (DOW, EURO-STOXX, CRYPTO).
│ │ └── prompt.py # Functions for generating prompts for the LLM.
│ ├── FinGPT_MultiAgentsRAG/ # Module 4: Multi-Agent RAG and Evaluation (Experimental)
│ │ ├── Evaluation_methods/ # Contains evaluation scripts (HaluEval, MMLU, TruthfulQA).
│ │ ├── Fine_tune_model/ # Notebooks for fine-tuning models (e.g., GLM2, Llama2).
│ │ ├── MultiAgents/ # Notebooks demonstrating multi-agent inference.
│ │ └── RAG/ # Notebooks for RAG implementation.
│ ├── FinGPT_Others/ # Module 5: Miscellaneous/Older Projects
│ │ ├── FinGPT_Low_Code_Development/ # Low-code development examples.
│ │ ├── FinGPT_Robo_Advisor/ # Robo-advisor examples.
│ │ └── FinGPT_Trading/ # Trading examples.
│ ├── FinGPT_RAG/ # Module 6: General RAG and Data Scraping
│ │ ├── instruct-FinGPT/ # Scripts for supervised fine-tuning (SFT) and inference.
│ │ └── multisource_retrieval/ # Web scraping and data retrieval utilities.
│ │ ├── external_LLMs/ # Utilities for external LLM integration.
│ │ ├── scrapers/ # Specific web scrapers (Yahoo, CNBC, Google, etc.).
│ │ └── utils/ # Classification and formatting utilities.
│ ├── FinGPT_Sentiment_Analysis_v1/ # Module 7: Sentiment Analysis (Older Version)
│ └── FinGPT_Sentiment_Analysis_v3/ # Module 8: Sentiment Analysis (Latest Version)
│ ├── benchmark/ # Benchmarking notebooks.
│ ├── data/ # Data preparation notebooks.
│ │ └── training_parallel/ # Parallel training scripts (e.g., using DeepSpeed).
├── requirements.txt # Project dependencies.
└── setup.py # Installation script.
```
### 1.2. Core Folders for Analysis
* `/home/ubuntu/FinGPT/fingpt/FinGPT_Benchmark`: Contains the infrastructure for evaluating and fine-tuning FinLLMs on various financial NLP tasks. It includes utilities for data preparation, model loading, and LoRA-based training.
* `/home/ubuntu/FinGPT/fingpt/FinGPT_FinancialReportAnalysis/utils`: Houses the core logic for the RAG system applied to financial documents, notably the **Raptor** (Recursive Abstractive Clustering) implementation for document chunking and summarization.
* `/home/ubuntu/FinGPT/fingpt/FinGPT_Forecaster`: Contains the complete pipeline for financial forecasting, from data acquisition and prompt engineering to dataset creation for model training.
* `/home/ubuntu/FinGPT/fingpt/FinGPT_RAG/multisource_retrieval`: The primary module for web scraping and multi-source data retrieval, which is a critical component for feeding real-time financial news into the LLM.
* `/home/ubuntu/FinGPT/fingpt/FinGPT_Sentiment_Analysis_v3`: The latest implementation for sentiment analysis model training, including parallel training configurations and benchmarking tools.
## Phase 2: Module-by-Module Deep Analysis
### Module 1: FinGPT_Benchmark
- **Core Responsibility**: Provides a standardized environment for fine-tuning and evaluating various base LLMs (Llama2, ChatGLM2, Qwen, etc.) on financial tasks using the LoRA technique.
- **Key Files**:
- `utils.py`: Defines model-specific LoRA target modules (`lora_module_dict`), prompt templates (`template_dict`), model path parsing (`parse_model_name`), and a robust dataset loading mechanism (`load_dataset`) that supports replication and remote/local loading.
- `train_lora.py`: The main training script. It loads the model, tokenizer, and dataset, applies LoRA configuration, and uses the Hugging Face `Trainer` with DeepSpeed for efficient, parallelized fine-tuning. It also integrates with **WandB** for experiment tracking.
- **Implementation Details**: The `tokenize` function in `utils.py` is critical, handling the concatenation of instruction, input, and output, and ensuring the sequence length does not exceed the model's maximum length, a common challenge in LLM fine-tuning. The use of `parse_model_name` centralizes the mapping between a simple model name (e.g., 'llama2') and its corresponding Hugging Face repository path.
### Module 2: FinGPT_FinancialReportAnalysis/utils
- **Core Responsibility**: Implements the **Raptor** (Recursive Abstractive Processing for Tree-Organized Retrieval) RAG framework for processing large financial documents (like earnings call transcripts or PDFs) by recursively clustering and summarizing text chunks to create a hierarchical index.
- **Key Files**:
- `rag.py`: Contains the `Raptor` class. This class uses **UMAP** for dimensionality reduction and **Gaussian Mixture Model (GMM)** with **BIC** for optimal cluster determination. The key methods are `recursive_embed_cluster_summarize` and `text_spliter`, which implement the hierarchical chunking and summarization process.
- `format_pdf.py`: Handles the initial processing and formatting of PDF documents.
- `earning_calls.py`: Contains specific logic for handling earnings call data.
- **Implementation Details**: The `Raptor` class is a sophisticated implementation of hierarchical RAG. It first splits the text using `RecursiveCharacterTextSplitter`, then iteratively applies embedding, UMAP reduction, GMM clustering (using BIC for optimal cluster count), and LLM-based summarization. This recursive process creates a multi-layered knowledge base, significantly improving the context quality for RAG queries on long documents.
### Module 3: FinGPT_Forecaster
- **Core Responsibility**: Manages the end-to-end pipeline for generating structured financial forecasting datasets suitable for LLM fine-tuning.
- **Key Files**:
- `data_pipeline.py`: The orchestrator. It defines the flow: 1) Acquire data for symbols in a given index (DOW, EURO, CRYPTO) via `prepare_data_for_symbol`. 2) Generate prompts and query an external LLM (GPT-4) for forecasts/rationales via `query_gpt4`. 3) Transform the results into a final training dataset via `create_dataset`.
- `indices.py`: Simple file defining lists of stock/crypto symbols for different indices.
- `prompt.py`: Contains the logic for constructing the detailed, structured prompts used to query the external LLM for forecasting.
- **Implementation Details**: The pipeline is a strong example of using an LLM for data labeling and rationale generation. The `query_gpt4` function is the bottleneck, as it relies on an external, non-deterministic API call to enrich the raw financial data with LLM-generated forecasts and explanations, which are then used as the "output" for the fine-tuning dataset.
### Module 4: FinGPT_RAG/multisource_retrieval
- **Core Responsibility**: A comprehensive web scraping and data retrieval layer designed to gather real-time financial news from multiple sources, which serves as the knowledge base for the RAG system.
- **Key Files**:
- `news_scraper.py`: The main scraping logic. It uses `requests` and `BeautifulSoup` for static scraping and includes logic for handling various financial news sites (Seeking Alpha, Reuters, Bloomberg, Yahoo, CNBC, MarketWatch). It also contains a `select_column_and_classify` function, suggesting an interactive or GUI-driven workflow for data labeling.
- `scrapers/`: Sub-directory containing site-specific scraping implementations (e.g., `scrape_yahoo.py`, `scrape_cnbc.py`).
- `external_LLMs/`: Utilities for tokenization and interaction with external LLMs (e.g., ChatGPT, g4f).
- **Implementation Details**: The scraping logic is highly decentralized, with a central dispatcher (`scraping_by_url` in `news_scraper.py`) delegating to site-specific scrapers. This design is necessary due to the varied HTML structures of different news sites but makes the system fragile to website changes. The use of `similarity_score` attempts to filter for relevance before extracting the full article text.
### Module 5: FinGPT_Sentiment_Analysis_v3
- **Core Responsibility**: Provides the latest, optimized training pipeline for sentiment analysis models, focusing on efficiency and parallel processing.
- **Key Files**:
- `training_parallel/train_lora.py`: A specialized LoRA training script, similar to the benchmark one but with custom `ModifiedTrainer` and `data_collator` classes. The `ModifiedTrainer` overrides `compute_loss` and `prediction_step` to handle the specific input/output format of the sentiment task, and customizes `save_model` to only save the LoRA adapter weights. It is configured for DeepSpeed and parallel training.
- **Implementation Details**: The custom `ModifiedTrainer` is a key feature, allowing the project to bypass the standard Hugging Face Trainer's assumptions about loss calculation and model saving, which is often necessary when working with specialized models like ChatGLM or when only saving adapter weights. The `data_collator` handles padding and label masking specific to the sentiment fine-tuning task.
### Module PlantUML Diagrams
@startuml FinGPT_Benchmark
title FinGPT_Benchmark Module Class Diagram
package "HuggingFace/PEFT" {
class AutoModelForCausalLM
class AutoTokenizer
class TrainingArguments
class Trainer
class LoraConfig
class get_peft_model
}
package "Datasets" {
class Dataset
class concatenate_datasets
}
package "Benchmark Utilities" {
class Utils {
+ template_dict: Dict
+ lora_module_dict: Dict
+ get_prompt(template, instruction, input_text)
+ tokenize(args, tokenizer, feature)
+ parse_model_name(name, from_remote)
+ load_dataset(names, from_remote)
}
class TrainLoRA {
- main(args)
}
}
TrainLoRA ..> Utils : uses
TrainLoRA ..> AutoModelForCausalLM : loads
TrainLoRA ..> AutoTokenizer : loads
TrainLoRA ..> TrainingArguments : configures
TrainLoRA ..> Trainer : initializes
TrainLoRA ..> LoraConfig : configures
TrainLoRA ..> get_peft_model : applies
TrainLoRA ..> concatenate_datasets : combines
Utils ..> Dataset : loads
Utils ..> AutoTokenizer : uses in tokenize
@enduml
@startuml FinGPT_FinancialReportAnalysis_RAG
title FinGPT_FinancialReportAnalysis RAG Module Class Diagram
package "LangChain/Utils" {
class ChatPromptTemplate
class StrOutputParser
class RecursiveCharacterTextSplitter
}
package "Clustering/Reduction" {
class UMAP
class GaussianMixture
}
class Raptor {
- model: LLM
- embd: Embeddings
+ global_cluster_embeddings(embeddings, dim)
+ local_cluster_embeddings(embeddings, dim)
+ get_optimal_clusters(embeddings) : int
+ GMM_cluster(embeddings, threshold) : Tuple[labels, n_clusters]
+ perform_clustering(embeddings, dim, threshold) : List[np.ndarray]
+ embed(texts) : np.ndarray
+ embed_cluster_texts(texts) : DataFrame
+ fmt_txt(df) : str
+ embed_cluster_summarize_texts(texts, level) : Tuple[DataFrame, DataFrame]
+ recursive_embed_cluster_summarize(texts, level, n_levels) : Dict
+ text_spliter(text, chunk_size_tok, level, n_levels) : List[str]
}
Raptor ..> UMAP : uses for reduction
Raptor ..> GaussianMixture : uses for clustering
Raptor ..> ChatPromptTemplate : uses for summarization prompt
Raptor ..> StrOutputParser : uses for summarization output
Raptor ..> RecursiveCharacterTextSplitter : uses for initial chunking
Raptor "1" *-- "1" UMAP
Raptor "1" *-- "1" GaussianMixture
Raptor "1" *-- "1" ChatPromptTemplate
Raptor "1" *-- "1" StrOutputParser
Raptor "1" *-- "1" RecursiveCharacterTextSplitter
@enduml
@startuml FinGPT_Forecaster
title FinGPT_Forecaster Module Class Diagram
package "Data Components" {
class Indices {
+ DOW_30: List[str]
+ EURO_STOXX_50: List[str]
+ CRYPTO: List[str]
}
class Data {
+ prepare_data_for_symbol(symbol, data_dir, start_date, end_date, with_basics)
+ query_gpt4(index, data_dir, start_date, end_date, min_past_weeks, max_past_weeks, with_basics)
+ create_dataset(index, data_dir, start_date, end_date, train_ratio, with_basics)
}
class Prompt {
+ get_all_prompts(index, data_dir, start_date, end_date, min_past_weeks, max_past_weeks, with_basics)
}
class DataInferenceFetch {
+ get_curday()
+ fetch_all_data()
+ get_all_prompts_online()
}
}
class DataPipeline {
+ main(args)
}
DataPipeline ..> Indices : uses
DataPipeline ..> Data : uses
DataPipeline ..> Prompt : uses
DataPipeline ..> DataInferenceFetch : uses
@enduml
@startuml FinGPT_RAG_MultisourceRetrieval
title FinGPT_RAG Multisource Retrieval Module Class Diagram
package "Web Scraping Tools" {
class BeautifulSoup
class requests_get
class split_sentence
class similarity_score
}
package "Site Specific Scrapers" {
class ScrapeYahoo
class ScrapeCNBC
class ScrapeMarketScreener
class ScrapeGoogle
}
class NewsScraper {
+ scraping_by_url(link, subject) : Tuple[url, subject]
+ scrape_bloomberg(subject) : List[str]
+ scrape_reuters(subject) : Tuple[url, subject]
+ scrape_market_watch_article_page(url, subject) : Tuple[url, subject]
+ select_column_and_classify() : void
}
NewsScraper ..> BeautifulSoup : uses
NewsScraper ..> requests_get : uses
NewsScraper ..> split_sentence : uses
NewsScraper ..> similarity_score : uses
NewsScraper ..> ScrapeYahoo : delegates
NewsScraper ..> ScrapeCNBC : delegates
NewsScraper ..> ScrapeMarketScreener : delegates
NewsScraper ..> ScrapeGoogle : delegates
@enduml
@startuml FinGPT_Sentiment_Analysis_v3
title FinGPT_Sentiment_Analysis_v3 Training Module Class Diagram
package "HuggingFace/PEFT" {
class AutoModel
class AutoTokenizer
class TrainingArguments
class Trainer
class LoraConfig
class get_peft_model
}
class ModifiedTrainer extends Trainer {
+ compute_loss(model, inputs, return_outputs=False)
+ prediction_step(model, inputs, prediction_loss_only, ignore_keys)
+ save_model(output_dir)
}
class CastOutputToFloat {
+ forward(x)
}
class TrainLoRA {
+ main()
}
class DataCollator {
+ data_collator(features: list) : dict
}
TrainLoRA ..> AutoModel : loads
TrainLoRA ..> AutoTokenizer : loads
TrainLoRA ..> TrainingArguments : configures
TrainLoRA ..> ModifiedTrainer : initializes
TrainLoRA ..> LoraConfig : configures
TrainLoRA ..> get_peft_model : applies
ModifiedTrainer ..> DataCollator : uses (via trainer init)
TrainLoRA ..> DataCollator : uses
@enduml
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
The FinGPT project is built upon a **modular, LLM-centric, and data-driven design philosophy**, aiming to provide an accessible, open-source framework for financial LLMs. The core abstractions are centered around three main pillars: **Parameter-Efficient Fine-Tuning (PEFT)**, **Hierarchical Retrieval-Augmented Generation (RAG)**, and **End-to-End Data Pipelines**.
The **LoRA Adapter** is the central abstraction for the model layer. Instead of fine-tuning the entire large language model, the project utilizes LoRA (Low-Rank Adaptation) to inject a small number of trainable parameters into the base LLM (e.g., Llama2, ChatGLM2). This abstraction allows for efficient domain adaptation with minimal computational resources, making the project highly accessible. The `lora_module_dict` in `FinGPT_Benchmark/utils.py` explicitly manages which modules of different base models are targeted for adaptation, demonstrating a flexible approach to model heterogeneity.
The **Raptor (Recursive Abstractive Processing for Tree-Organized Retrieval)** system, implemented in `FinGPT_FinancialReportAnalysis/utils/rag.py`, is the key abstraction for handling large, unstructured financial documents. It abstracts the complex process of document chunking, embedding, dimensionality reduction (UMAP), optimal clustering (GMM/BIC), and recursive summarization into a single, hierarchical RAG index. This allows the LLM to retrieve context from multiple levels of abstraction (raw text, cluster summaries, meta-summaries), significantly improving the quality of grounded responses.
The **Data Pipeline** abstraction, exemplified by `FinGPT_Forecaster/data_pipeline.py`, manages the entire lifecycle of creating a structured dataset. This pipeline abstracts data acquisition, prompt engineering, external LLM querying (e.g., GPT-4 for labeling/rationales), and final dataset transformation into a sequential, reproducible process.
The project’s **lifecycle management** follows a clear sequence:
1. **Data Acquisition**: Raw financial data (news, reports) is gathered via the `multisource_retrieval` layer.
2. **Data Preparation**: Data is cleaned, structured, and transformed into domain-specific datasets (Forecasting, Sentiment) or hierarchical RAG indices (Raptor).
3. **Model Adaptation**: Base LLMs are fine-tuned using the LoRA Adapter via the `train_lora.py` scripts.
4. **Application**: The adapted FinLLM is deployed within application agents (Forecaster, Sentiment Classifier, RAG Query Engine) to serve end-user tasks.
#### 3.1.2. Component Interactions
The FinGPT architecture is characterized by a unidirectional, layered data flow, starting from external sources and culminating in the application layer.
**Data Flow:**
1. **External Sources** (Websites, APIs, PDFs) feed into the **Data Acquisition Layer** (`multisource_retrieval`).
2. The **Scraper/Retriever** component extracts raw text and links.
3. Raw text is routed to two main paths:
* **Structured Dataset Path**: Text is processed by `data_pipeline.py` (Forecaster) or similar scripts (Sentiment) to generate `instruction` and `output` pairs, often involving an external LLM (GPT-4) for initial labeling or rationale generation. This results in a Hugging Face `Dataset` object.
* **RAG Index Path**: Large documents are processed by the **Raptor** component (`rag.py`), which generates a multi-level index of summaries and embeddings.
4. The **Fine-Tuning Layer** (`train_lora.py`) consumes the structured `Dataset` and applies the LoRA Adapter to the **Base LLM**.
5. The resulting **FinLLM Core** (Base LLM + LoRA Adapter) is used by the **Application Agents** (RAG Query Engine, Forecaster Agent, Sentiment Classifier) for inference.
**Communication Patterns:**
* **Hugging Face Ecosystem**: The primary communication pattern for model training is the Hugging Face `Trainer` class, which manages the entire training loop, including data loading, optimization, and checkpointing. This is heavily integrated with the **PEFT** library for LoRA.
* **LangChain-Style Chains**: The RAG component in `rag.py` uses a functional chain pattern (`prompt | self.model | StrOutputParser()`) for summarization, a pattern popularized by LangChain, demonstrating a clear separation of prompt, model, and output parsing.
* **Inter-Module Python Calls**: Data flow within the pipelines (e.g., `data_pipeline.py` calling `indices.py`, `data.py`, and `prompt.py`) relies on standard Python function and class imports, maintaining a tightly coupled but clear execution sequence.
* **External API Calls**: The system communicates with external services for two main purposes: web scraping (`requests`, `BeautifulSoup` in `news_scraper.py`) and external LLM querying (e.g., `query_gpt4` in `data.py`, which is assumed to make an API call).
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml FinGPT_Overall_Architecture
title FinGPT Overall Architecture
skinparam componentStyle rectangle
package "1. Data Acquisition Layer" as DataAcquisition {
[Multisource Retrieval] as Scraper
[Data Fetchers] as Fetchers
[Financial News Sources] as Sources
Sources --> Scraper : Scrapes raw data
Scraper --> Fetchers : Provides raw data
}
package "2. Data Processing & Preparation" as DataProcessing {
[Forecaster Data Pipeline] as ForecasterDP
[Sentiment Data Preparation] as SentimentDP
[Document Chunking & Clustering] as Raptor
[Financial Documents (PDFs)] as Docs
Fetchers --> ForecasterDP : Structured data
Fetchers --> SentimentDP : Labeled data
Docs --> Raptor : Unstructured text
}
package "3. Model Fine-Tuning Layer" as FineTuning {
[Base LLM (e.g., Llama2)] as BaseLLM
[LoRA Adapter] as Adapter
[Training Scripts (DeepSpeed)] as Trainer
ForecasterDP --> Trainer : Forecasting Dataset
SentimentDP --> Trainer : Sentiment Dataset
Trainer --> Adapter : Fine-tunes weights
BaseLLM <--> Adapter : Loads adapter
}
package "4. Application & Inference Layer" as Application {
[FinLLM Core] as FinLLM
[RAG Query Engine] as RAGEngine
[Forecasting Agent] as ForecasterAgent
[Sentiment Classifier] as SentimentAgent
BaseLLM -[hidden]right-> Adapter
BaseLLM --> FinLLM : Core Model
Adapter --> FinLLM : Domain Knowledge
Raptor --> RAGEngine : Hierarchical Index
FinLLM --> RAGEngine : Contextual Generation
FinLLM --> ForecasterAgent : Prediction
FinLLM --> SentimentAgent : Classification
}
' Interactions
DataAcquisition --> DataProcessing : Raw Data Flow
DataProcessing --> FineTuning : Structured Datasets
DataProcessing --> Application : Knowledge Base (Raptor Index)
RAGEngine .> FinLLM : Queries for grounded response
ForecasterAgent .> FinLLM : Queries for prediction
SentimentAgent .> FinLLM : Queries for classification
[User/API] --> ForecasterAgent
[User/API] --> SentimentAgent
[User/API] --> RAGEngine
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
The FinGPT codebase employs several established software design patterns to manage complexity and promote modularity:
1. **Adapter Pattern (LoRA)**:
* **Description**: The LoRA mechanism acts as an adapter, allowing a new interface (domain-specific fine-tuning) to be used with an existing class (the frozen base LLM).
* **Implementation**: In `FinGPT_Benchmark/train_lora.py`, the `LoraConfig` and `get_peft_model` functions wrap the `AutoModelForCausalLM` instance, effectively adapting its behavior for financial tasks without modifying its massive original weights.
* **Code Example**:
```python
# FinGPT_Benchmark/train_lora.py
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8,
lora_alpha=32,
target_modules=lora_module_dict[args.base_model], # The adaptation logic
# ...
)
model = get_peft_model(model, peft_config) # The adapter application
```
2. **Pipeline Pattern (Data Flow)**:
* **Description**: A sequence of processing steps where the output of one step becomes the input of the next.
* **Implementation**: The `main` function in `FinGPT_Forecaster/data_pipeline.py` clearly defines the pipeline stages: Acquire Data -> Generate Prompt/Query GPT-4 -> Transform to Training Format.
* **Code Example**:
```python
# FinGPT_Forecaster/data_pipeline.py (Simplified)
# 1. Acquire data
for symbol in tqdm(index):
prepare_data_for_symbol(symbol, data_dir, start_date, end_date, with_basics=with_basics)
# 2. Generate prompt and query GPT-4
query_gpt4(index, data_dir, start_date, end_date, min_past_weeks, max_past_weeks, with_basics=with_basics)
# 3. Transform into training format
dataset = create_dataset(index, data_dir, start_date, end_date, train_ratio, with_basics=with_basics)
```
3. **Strategy Pattern (Model Configuration)**:
* **Description**: Defines a family of algorithms, encapsulates each one, and makes them interchangeable.
* **Implementation**: The `lora_module_dict` in `FinGPT_Benchmark/utils.py` holds different strategies (target modules) for applying LoRA based on the specific base model architecture (e.g., `chatglm2` uses `query_key_value`, while `llama2` uses `q_proj`, `k_proj`, `v_proj`).
* **Code Example**:
```python
# FinGPT_Benchmark/utils.py
lora_module_dict = {
'chatglm2': ['query_key_value'],
'llama2': ['q_proj', 'k_proj', 'v_proj'],
# ...
}
# ...
target_modules=lora_module_dict[args.base_model],
```
4. **Composite Pattern (Raptor RAG)**:
* **Description**: Composes objects into tree structures to represent part-whole hierarchies.
* **Implementation**: The `recursive_embed_cluster_summarize` function in `rag.py` recursively processes summaries from one level as the "documents" for the next level, creating a hierarchical index where a cluster summary is a composite of its underlying document chunks.
#### 3.3.2. Project Highlights
The FinGPT project demonstrates several innovative features that enhance its utility and flexibility in the financial domain:
* **Hierarchical RAG with Raptor**: The most innovative feature is the **Raptor** RAG system. By combining **UMAP** (dimensionality reduction) and **Gaussian Mixture Models (GMM)** for clustering, it creates a multi-level index of document summaries. This allows the RAG engine to retrieve not just granular text chunks but also high-level conceptual summaries, leading to more coherent and contextually rich answers from the LLM.
* **Accessibility through PEFT**: The core focus on **LoRA-based fine-tuning** significantly lowers the barrier to entry for financial LLM development. It allows researchers and developers to adapt massive models to financial tasks using consumer-grade GPUs, promoting the open-source spirit of the project.
* **End-to-End Financial Forecasting Pipeline**: The `FinGPT_Forecaster` module provides a complete, runnable example of how to convert raw market data into a structured, LLM-ready dataset, including the crucial step of using an external LLM for generating rationales and labels. This is a highly valuable, innovative feature for quantitative finance.
* **Robust Multisource Data Retrieval**: The dedicated `multisource_retrieval` component, with its site-specific scrapers (Yahoo, CNBC, Bloomberg), ensures the LLM can be grounded in up-to-date, real-world financial news, which is critical for time-sensitive financial applications.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
While the project is robust, several areas could be improved to enhance performance, maintainability, and architectural clarity:
* **Standardization and Code Consolidation**:
* **Suggestion**: Consolidate the redundant `train_lora.py` and `utils.py` files found in multiple sub-projects (`FinGPT_Benchmark`, `FinGPT_Forecaster`, `FinGPT_Sentiment_Analysis_v3`).
* **Benefit**: Reduces code duplication, simplifies maintenance, and ensures a single source of truth for core utilities like `tokenize` and `load_dataset`.
* **External Dependency Abstraction**:
* **Suggestion**: Abstract the external LLM calls (e.g., `query_gpt4` in `data.py`) into a dedicated, configurable service layer (e.g., an `ExternalLLMService` class).
* **Benefit**: Decouples the data pipeline from specific LLM providers, making it easier to switch between GPT-4, Claude, or other models, and simplifies API key management.
* **RAG System Optimization**:
* **Suggestion**: The Raptor RAG system is computationally intensive due to UMAP and GMM clustering. Implement caching for the clustered embeddings and summaries, especially for static documents like financial reports.
* **Benefit**: Reduces processing time and cost for repeated queries or application restarts.
* **Web Scraping Robustness**:
* **Suggestion**: The `news_scraper.py` is highly dependent on HTML structure. Implement more resilient scraping techniques (e.g., using a general-purpose content extraction library) and add robust retry logic with exponential backoff to handle transient network errors and rate limits.
#### 3.4.2. Secondary Development Guide
For developers looking to explore or extend the FinGPT codebase, the following path is recommended:
1. **Initial Exploration (Fine-Tuning)**:
* Start by examining the **FinGPT_Benchmark** module. The `utils.py` file is essential for understanding model-specific configurations (LoRA targets) and data handling.
* Review `train_lora.py` to grasp the standard fine-tuning workflow using Hugging Face and LoRA. This is the template for all model adaptation tasks.
2. **Understanding Data Flow (Forecasting)**:
* The **FinGPT_Forecaster** module provides the clearest example of an end-to-end pipeline. Analyze `data_pipeline.py` to see how raw data is transformed into a structured dataset suitable for LLM training.
3. **Secondary Development - New Application Agent**:
* To create a new financial application (e.g., a Merger & Acquisition Agent), the best approach is to reuse the existing components:
* **Data**: Use the `multisource_retrieval` scrapers to gather M&A news.
* **Model**: Use the `FinGPT_Benchmark/train_lora.py` script to fine-tune a base LLM on a new M&A-specific dataset.
* **RAG**: If the task involves large documents (e.g., SEC filings), integrate the **Raptor** system from `FinGPT_FinancialReportAnalysis/utils/rag.py` to build the knowledge base.
4. **Contribution Focus**:
* Focus contributions on developing new, robust scrapers in the `multisource_retrieval/scrapers` directory or creating new, standardized financial datasets for the community.
* When adding new models, ensure the `lora_module_dict` in the core `utils.py` is updated with the correct target modules.
================================================
FILE: thirdparty/FinGenius.md
================================================
# FinGenius - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
The FinGenius project exhibits a clean, modular structure typical of a well-organized Python application, with a clear separation of concerns between the core framework, agents, environments, and external capabilities.
```
/home/ubuntu/FinGenius
├── config/ # Configuration files for LLM settings and MCP server endpoints.
│ ├── config.example.toml # Primary configuration for LLM, logging, and general settings.
│ └── mcp.example.json # Configuration for Model Context Protocol (MCP) server addresses.
├── docs/ # Documentation and visual assets (architecture diagrams, flow charts).
├── main.py # The application's entry point and primary orchestration script.
├── requirements.txt # Lists all Python dependencies (e.g., pydantic, akshare, loguru).
└── src/ # The core source code directory.
    ├── agent/ # **Core Module 1: Agent Definitions**
    │ ├── base.py # Defines BaseAgent, the abstract foundation for all agents.
    │ ├── react.py # Implements the ReAct (Reasoning and Acting) pattern.
    │ ├── mcp.py # Defines MCPAgent, integrating the Model Context Protocol.
    │ └── [specialized].py# Contains the concrete, domain-specific agents (e.g., chip_analysis.py).
    ├── environment/ # **Core Module 2: Execution Contexts**
    │ ├── base.py # Defines BaseEnvironment and the EnvironmentFactory.
    │ ├── research.py # Implements the Research Phase (data collection and analysis).
    │ └── battle.py # Implements the Battle Phase (adversarial debate and voting).
    ├── tool/ # **Core Module 3: External Capabilities**
    │ ├── base.py # Defines BaseTool and ToolCollection, the tool interface.
    │ ├── battle.py # The tool agents use to interact within the BattleEnvironment.
    │ ├── search/ # Contains various web search tools (Baidu, Google, DuckDuckGo).
    │ └── [specialized].py# Contains tools for financial data fetching (e.g., big_deal_analysis.py).
    ├── mcp/ # **Core Module 4: MCP Server Stubs**
    │ └── [server].py # Contains stubs for the specialized financial data servers (e.g., sentiment_server.py).
    ├── prompt/ # **Core Module 5: Agent Prompts**
    │ └── [agent_name].py # Stores the extensive system and next-step prompts for each agent.
    ├── schema.py # Pydantic models for data structures (Message, Memory, AgentState).
    ├── llm.py # Wrapper for LLM API calls.
    └── logger.py # Configuration for the loguru logging system.
```
The structure clearly separates the core framework (`src/`), configuration (`config/`), and entry point (`main.py`). The `src/` directory is further divided into functional modules: `agent` for the actors, `environment` for the stages, `tool` for the capabilities, and `prompt` for the agent's "mindset." This organization adheres to the principles of modular design and separation of concerns, which is essential for a complex multi-agent system.
### 1.2. Core Folders for Analysis
* `/home/ubuntu/FinGenius/src/agent`: Contains the definitions for all specialized AI agents, including the base classes (`BaseAgent`, `ReActAgent`, `ToolCallAgent`, `MCPAgent`) and the domain-specific agents (e.g., `ChipAnalysisAgent`, `HotMoneyAgent`).
* `/home/ubuntu/FinGenius/src/environment`: Defines the two core operational environments (`ResearchEnvironment`, `BattleEnvironment`) and their base class (`BaseEnvironment`), which manage agent execution and interaction flow.
* `/home/ubuntu/FinGenius/src/tool`: Houses the definitions for all external capabilities and internal actions available to the agents, such as data fetching tools (`BigDealAnalysisTool`) and interaction tools (`Battle`, `Terminate`).
* `/home/ubuntu/FinGenius/src/mcp`: Contains the logic for the Model Context Protocol (MCP) integration, including the client-side logic used by `MCPAgent` and the server-side stubs for the specialized financial data services.
* `/home/ubuntu/FinGenius/src/prompt`: Stores the extensive system and next-step prompt templates (in Python string format) used to guide the behavior and reasoning of the various agents.
* `/home/ubuntu/FinGenius/src`: Contains core utility files and foundational classes like `llm.py`, `logger.py`, `schema.py`, and the main entry point logic.
## Phase 2: Module-by-Module Deep Analysis
The FinGenius project is structured around five core Python modules, each serving a distinct purpose in the multi-agent system.
### 1. `src/agent` Module (The Actors)
This module defines the entire agent hierarchy, from the abstract base to the specialized financial experts.
* **Files Enumerated:** `base.py`, `react.py`, `toolcall.py`, `mcp.py`, `chip_analysis.py`, `big_deal_analysis.py`, `hot_money.py`, `risk_control.py`, `sentiment.py`, `technical_analysis.py`, `report.py`.
* **Core Responsibility:** To provide the foundational logic for agent execution, memory management, LLM interaction, and to define the specific roles and capabilities of each financial expert agent.
* **Key Implementation Details:**
* **`BaseAgent` (`base.py`):** Implements the main `run()` loop, state transitions (`AgentState`), and memory updates. It includes logic to detect and handle a "stuck state" (duplicate responses) by modifying the `next_step_prompt`.
* **`ReActAgent` (`react.py`):** Overrides `step()` to implement the **ReAct pattern**, parsing the LLM's response to determine if the next action is a `thought` or a `tool_call`.
* **`MCPAgent` (`mcp.py`):** The final base class, which integrates the `MCPClient` for specialized tool access. All domain agents inherit from this, ensuring they are "MCP-enabled."
* **Specialized Agents:** Agents like `ChipAnalysisAgent` and `BigDealAnalysisAgent` are simple, highly-configured classes. Their primary implementation is setting their unique `name`, `description`, `system_prompt`, and the specific `ToolCollection` they are allowed to use. This adheres to the **Strategy Pattern**.
### 2. `src/environment` Module (The Stage)
This module defines the execution contexts that govern agent interaction and the overall workflow.
* **Files Enumerated:** `base.py`, `research.py`, `battle.py`.
* **Core Responsibility:** To manage the lifecycle of agents, define the rules of engagement, and orchestrate the two-phase analysis process (Research and Battle).
* **Key Implementation Details:**
* **`BaseEnvironment` (`base.py`):** Provides the abstract interface and a factory (`EnvironmentFactory`) for creating environments. It manages the registration and retrieval of agents.
* **`ResearchEnvironment` (`research.py`):** Manages the initial data collection. Its `run()` method executes all specialized agents, typically in parallel, and aggregates their final reports into a single `research_results` dictionary.
* **`BattleEnvironment` (`battle.py`):** Implements the core innovation: the adversarial debate. It uses the **`BattleState`** class to track the debate history, agent order, and voting results. The `run()` method manages the multi-round debate, constructing a **cumulative context** (research results + previous speeches) for each agent before its turn. It acts as a **Mediator** for agent communication via the `Battle` tool.
### 3. `src/tool` Module (The Capabilities)
This module provides the external and internal actions available to the agents, serving as the interface between the LLM-driven logic and the external world.
* **Files Enumerated:** `base.py`, `terminate.py`, `tool_collection.py`, `battle.py`, `big_deal_analysis.py`, `chip_analysis.py`, `search/` (various web search tools).
* **Core Responsibility:** To define a standard interface (`BaseTool`) for all capabilities and to implement the logic for data fetching, web searching, and inter-agent communication.
* **Key Implementation Details:**
* **`BaseTool` (`base.py`):** An abstract class that defines the `name`, `description`, `parameters` (for LLM function calling), and the `async execute()` method. It also includes utility classes like `ToolResult` and `ToolFailure`.
* **`ToolCollection` (`tool_collection.py`):** A container class that holds all available tools for an agent, mapping tool names to instances and providing the list of tool schemas to the LLM.
* **`BigDealAnalysisTool` (`big_deal_analysis.py`):** A specialized tool that wraps the `akshare` library to fetch and process big order fund flow data, including a simple retry mechanism for unstable API calls.
* **`Battle` (`battle.py`):** A unique tool that allows agents to `speak` and `vote` within the `BattleEnvironment`, acting as the communication channel for the debate.
### 4. `src/mcp` Module (The Protocol Integration)
This module handles the Model Context Protocol (MCP) integration, which is key to accessing specialized financial data.
* **Files Enumerated:** `__init__.py`, `battle_server.py`, `big_deal_analysis_server.py`, `server.py`, etc.
* **Core Responsibility:** To define the server-side stubs for the specialized financial data services. These stubs are likely used in a separate deployment environment but are included here to define the protocol endpoints that the `MCPAgent`s are designed to call.
* **Key Implementation Details:** The files primarily contain `MCPServer` implementations (or stubs) for services like `sentiment_server` and `chip_analysis_server`, defining the expected input and output schemas for the financial data APIs.
### 5. `src/prompt` Module (The Agent Mindset)
This module contains the extensive, Chinese-language prompt templates that define the personality, role, and instructions for each agent.
* **Files Enumerated:** `battle.py`, `big_deal_analysis.py`, `chip_analysis.py`, `hot_money.py`, `risk_control.py`, `sentiment.py`, `technical_analysis.py`, etc.
* **Core Responsibility:** To provide the system prompts (`SYSTEM_PROMPT`) and next-step prompts (`NEXT_STEP_PROMPT_ZN`) that guide the LLM's behavior within the ReAct loop, ensuring the agents adhere to their specialized financial roles and the rules of the environment. The prompts are critical for the project's A-share market specialization.
### Module PlantUML Diagrams
## Agent Module PlantUML Diagram
```plantuml
@startuml
skinparam classAttributeIconSize 0
skinparam defaultFontName Monospaced
skinparam defaultFontSize 12
package "src.agent" {
abstract class BaseAgent {
+ name: str
+ memory: Memory
+ state: AgentState
+ run(request)
+ {abstract} step()
+ is_stuck()
}
abstract class ReActAgent {
+ step()
- _parse_llm_response()
}
abstract class ToolCallAgent {
+ available_tools: ToolCollection
+ step()
- _execute_tool(tool_call)
}
class MCPAgent {
+ mcp_client: MCPClient
}
class ChipAnalysisAgent
class BigDealAnalysisAgent
class HotMoneyAgent
class RiskControlAgent
class SentimentAgent
class TechnicalAnalysisAgent
class ReportAgent
BaseAgent <|-- ReActAgent
ReActAgent <|-- ToolCallAgent
ToolCallAgent <|-- MCPAgent
MCPAgent <|-- ChipAnalysisAgent
MCPAgent <|-- BigDealAnalysisAgent
MCPAgent <|-- HotMoneyAgent
MCPAgent <|-- RiskControlAgent
MCPAgent <|-- SentimentAgent
MCPAgent <|-- TechnicalAnalysisAgent
MCPAgent <|-- ReportAgent
BaseAgent ..> [src.schema.Memory] : uses
ToolCallAgent ..> [src.tool.ToolCollection] : manages
MCPAgent ..> [src.mcp.MCPClient] : uses
}
@enduml
```
## Environment Module PlantUML Diagram
```plantuml
@startuml
skinparam classAttributeIconSize 0
skinparam defaultFontName Monospaced
skinparam defaultFontSize 12
package "src.environment" {
abstract class BaseEnvironment {
+ name: str
+ agents: Dict[str, BaseAgent]
+ register_agent(agent)
+ {abstract} run()
}
class ResearchEnvironment {
+ run()
- _create_agents()
- _aggregate_results()
}
class BattleEnvironment {
+ battle_state: BattleState
+ run()
+ handle_speak(agent_id, speak)
+ handle_vote(agent_id, vote)
- _get_cumulative_context()
}
class BattleState {
+ agent_order: List[str]
+ debate_history: List[Dict]
+ final_votes: Dict[str, str]
+ _recalculate_vote_results()
}
class EnvironmentFactory {
+ {static} create_environment(type, agents)
}
BaseEnvironment <|-- ResearchEnvironment
BaseEnvironment <|-- BattleEnvironment
BattleEnvironment o-- BattleState : manages
BaseEnvironment ..> [src.agent.BaseAgent] : contains
EnvironmentFactory ..> BaseEnvironment : creates
}
@enduml
```
## Tool Module PlantUML Diagram
```plantuml
@startuml
skinparam classAttributeIconSize 0
skinparam defaultFontName Monospaced
skinparam defaultFontSize 12
package "src.tool" {
abstract class BaseTool {
+ name: str
+ description: str
+ parameters: Dict
+ {abstract} execute(**kwargs)
+ to_param()
}
class ToolResult {
+ output: Any
+ error: Optional[str]
}
class ToolCollection {
+ tools: Dict[str, BaseTool]
+ get_tool_schemas()
+ execute_tool(name, **kwargs)
}
class Terminate
class Battle {
+ agent_id: str
+ controller: BattleEnvironment
+ execute(speak, vote)
}
class BigDealAnalysisTool {
+ execute(stock_code)
- _safe_fetch(akshare_func)
}
class ChipAnalysisTool
class CreateChatCompletion
class WebSearchTool
BaseTool <|-- Terminate
BaseTool <|-- Battle
BaseTool <|-- BigDealAnalysisTool
BaseTool <|-- ChipAnalysisTool
BaseTool <|-- CreateChatCompletion
BaseTool <|-- WebSearchTool
ToolCollection o-- BaseTool : aggregates
BaseTool ..> ToolResult : returns
Battle ..> [src.environment.BattleEnvironment] : interacts with (controller)
}
@enduml
```
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
The FinGenius architecture is built upon a set of well-defined core abstractions that facilitate the multi-agent, dual-environment design.
**1. Agent Hierarchy (The Actors):**
The agent system follows a clear inheritance chain, embodying the **Strategy Pattern** and **Template Method Pattern**.
* **`BaseAgent` (`src/agent/base.py`):** The foundational abstract class. It provides core agent capabilities: state management (`AgentState`), memory (`Memory`), logging, and the main execution loop (`run()`). It enforces the abstract method `step()`, which is the single unit of work for any agent.
* **`ReActAgent` (`src/agent/react.py`):** Implements the **ReAct (Reasoning and Acting) pattern**. It extends `BaseAgent` by structuring the `step()` method to alternate between internal thought (reasoning) and external action (tool use).
* **`ToolCallAgent` (`src/agent/toolcall.py`):** Extends `ReActAgent` to manage and execute tools. It handles the parsing of LLM responses for function calls and the execution of the tools contained within the `ToolCollection`.
* **`MCPAgent` (`src/agent/mcp.py`):** The final, specialized base class. It extends `ToolCallAgent` to integrate the **Model Context Protocol (MCP)**, allowing agents to access specialized financial data servers via `MCPClient`. All domain-specific agents (e.g., `ChipAnalysisAgent`) inherit from this class.
**2. Environment Hierarchy (The Stage):**
The environments define the context and rules of interaction for the agents.
* **`BaseEnvironment` (`src/environment/base.py`):** The abstract base class for all environments. It manages a collection of agents (`self.agents`) and defines the abstract `run()` method. It also includes an `EnvironmentFactory` for creating specific environment types.
* **`ResearchEnvironment` (`src/environment/research.py`):** Implements the data collection and initial analysis phase. It is responsible for initializing the specialized agents and running them to gather their individual reports.
* **`BattleEnvironment` (`src/environment/battle.py`):** Implements the adversarial validation phase. It manages the structured debate, tracks the debate history, and records agent votes using the **`BattleState`** class. This environment acts as a **Mediator**, controlling the flow of communication between agents.
**3. Data and Utility Abstractions:**
* **`Memory` and `Message` (`src/schema.py`):** These Pydantic models define the structure for agent memory and communication. `Memory` stores a list of `Message` objects, which adhere to the OpenAI chat format (system, user, assistant, tool roles).
* **`BaseTool` and `ToolCollection` (`src/tool/base.py`):** `BaseTool` is the abstract interface for all external capabilities, enforcing the `execute()` method. `ToolCollection` is a container that maps tool names to `BaseTool` instances, simplifying tool management for agents.
* **`LLM` (`src/llm.py`):** A wrapper class for interacting with the Large Language Model API, centralizing LLM configuration and call logic.
The design philosophy is a modular, layered approach, separating the core agent logic, the interaction protocols (environments), and the external capabilities (tools). This separation of concerns ensures high extensibility, allowing new agents, tools, or even new debate formats to be introduced with minimal impact on the core framework. The use of Pydantic for data models enforces strict data validation and structure across the system.
#### 3.1.2. Component Interactions
The FinGenius system operates on a two-stage, sequential pipeline: **Research** followed by **Battle**. The entire process is orchestrated by `main.py`.
**1. Initialization and Research Phase (Data Collection & Analysis):**
* **`main.py`** acts as the orchestrator. It initializes the `EnvironmentFactory` to create the `ResearchEnvironment` and a team of specialized `MCPAgent`s (e.g., `ChipAnalysisAgent`, `HotMoneyAgent`).
* **`ResearchEnvironment.run()`** executes the agents, typically in parallel or a defined sequence.
* **`MCPAgent.run()`** initiates the agent's ReAct loop, calling `step()` repeatedly.
* **`ToolCallAgent.step()`** (inherited by `MCPAgent`) is the core of the interaction. It sends the current memory and prompt to the `LLM` to decide on the next action.
* **LLM** responds with a `tool_call` (e.g., `big_deal_analysis_tool`).
* **`ToolCallAgent`** executes the tool via the **`ToolCollection`**.
* **`BigDealAnalysisTool.execute()`** (a specialized `BaseTool`) uses external libraries like `akshare` to fetch real-time financial data. This is the primary external data flow.
* The tool returns a `ToolResult` (structured data) to the agent.
* The agent incorporates the tool result into its memory and continues the ReAct loop until it decides to `Terminate`.
* The `ResearchEnvironment` collects the final output from all agents into a comprehensive `research_results` dictionary.
**2. Battle Phase (Adversarial Validation & Decision):**
* **`main.py`** then initializes the `BattleEnvironment`, passing the `research_results` as context.
* **`BattleEnvironment.run()`** starts the multi-round debate, managed by the `BattleState`.
* Agents are instructed to speak and vote using the **`Battle`** tool.
* **`MCPAgent`** receives the full research context and the debate history (cumulative context) and uses the `Battle` tool to submit its argument (`speak`) and final decision (`vote`).
* **`Battle.execute()`** is handled by the `BattleEnvironment`'s controller, which records the speech in the `debate_history` and updates the `BattleState`'s `final_votes`.
* After a set number of rounds, the `BattleEnvironment` synthesizes the final conclusion based on the vote results (`vote_results` in `BattleState`).
**3. Final Reporting:**
* The final decision and report are passed back to `main.py`, which uses the `ReportAgent` (or a similar mechanism) to format the output into a structured HTML or JSON report for the user.
The communication pattern is primarily **sequential orchestration** (`main.py` -> Research -> Battle) with **internal parallel execution** (agents running concurrently in the `ResearchEnvironment`) and a **Mediator pattern** (`BattleEnvironment` managing agent interactions via the `Battle` tool).
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml
skinparam classAttributeIconSize 0
skinparam defaultFontName Monospaced
skinparam defaultFontSize 12
package "FinGenius" {
package "src" {
package "agent" {
abstract class BaseAgent
abstract class ReActAgent
abstract class ToolCallAgent
class MCPAgent
class ChipAnalysisAgent
class BigDealAnalysisAgent
class HotMoneyAgent
class RiskControlAgent
class SentimentAgent
class TechnicalAnalysisAgent
class ReportAgent
}
package "environment" {
abstract class BaseEnvironment
class ResearchEnvironment
class BattleEnvironment
class EnvironmentFactory
class BattleState
}
package "tool" {
abstract class BaseTool
class ToolCollection
class Terminate
class Battle
class BigDealAnalysisTool
class ChipAnalysisTool
class CreateChatCompletion
class FinancialDeepSearchTool
class WebSearchTool
}
package "mcp" {
class MCPClient
class MCPServer
}
package "core" {
class LLM
class Memory
class Message
class AgentState
}
[main.py]
}
}
' Inheritance
BaseAgent <|-- ReActAgent
ReActAgent <|-- ToolCallAgent
ToolCallAgent <|-- MCPAgent
MCPAgent <|-- ChipAnalysisAgent
MCPAgent <|-- BigDealAnalysisAgent
MCPAgent <|-- HotMoneyAgent
MCPAgent <|-- RiskControlAgent
MCPAgent <|-- SentimentAgent
MCPAgent <|-- TechnicalAnalysisAgent
MCPAgent <|-- ReportAgent
BaseEnvironment <|-- ResearchEnvironment
BaseEnvironment <|-- BattleEnvironment
' Dependencies
BaseAgent ..> LLM : uses
BaseAgent ..> Memory : uses
BaseAgent ..> AgentState : manages
MCPAgent ..> MCPClient : uses
ToolCallAgent ..> ToolCollection : manages
ToolCollection o-- BaseTool : aggregates
ResearchEnvironment o-- MCPAgent : contains (Research Team)
BattleEnvironment o-- MCPAgent : contains (Battle Team)
BattleEnvironment ..> BattleState : manages
BattleEnvironment ..> Battle : uses (Tool)
[main.py] ..> EnvironmentFactory : creates
[main.py] ..> ResearchEnvironment : runs
[main.py] ..> BattleEnvironment : runs
BaseTool <|-- Battle
BaseTool <|-- BigDealAnalysisTool
BaseTool <|-- ChipAnalysisTool
BaseTool <|-- Terminate
' Data Flow / Interaction
[main.py] --> ResearchEnvironment : Start Analysis
ResearchEnvironment --> MCPAgent : Execute Step
MCPAgent --> ToolCollection : Call Tool
ToolCollection --> BaseTool : Execute
ResearchEnvironment --> BattleEnvironment : Pass Results
BattleEnvironment --> MCPAgent : Debate Round
MCPAgent --> Battle : Speak/Vote
BattleEnvironment --> [main.py] : Final Report
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
The FinGenius project effectively utilizes several key design patterns to manage complexity, promote modularity, and implement the multi-agent logic.
**1. Chain of Responsibility / Template Method Pattern (Agent Hierarchy):**
The agent structure is a classic example of the **Template Method Pattern** implemented via a **Chain of Responsibility**.
* **Implementation:** The inheritance chain `BaseAgent` -> `ReActAgent` -> `ToolCallAgent` -> `MCPAgent` defines a fixed sequence of responsibilities. `BaseAgent` handles the execution loop, `ReActAgent` injects the reasoning/acting logic, and `ToolCallAgent` adds tool execution. The abstract `step()` method in `BaseAgent` is the template method that is refined at each level.
* **Example:** `MCPAgent`'s `step()` method calls `ToolCallAgent`'s logic, which in turn relies on `ReActAgent`'s logic to decide whether to reason or call a tool.
**2. Strategy Pattern (Specialized Agents):**
The domain-specific agents (e.g., `ChipAnalysisAgent`, `HotMoneyAgent`) are concrete strategies that implement the agent interface defined by `MCPAgent`.
* **Implementation:** Each specialized agent is configured with a unique `system_prompt` and a specific `ToolCollection` containing only the tools relevant to its domain (e.g., `ChipAnalysisAgent` gets `ChipAnalysisTool`).
* **Example:** The difference between a `RiskControlAgent` and a `SentimentAgent` is primarily their system prompt (strategy) and the set of tools they are allowed to use (capabilities).
**3. Mediator Pattern (BattleEnvironment):**
The `BattleEnvironment` acts as a mediator, controlling the interactions between the agents during the debate phase.
* **Implementation:** Agents do not communicate directly. Instead, they use the **`Battle`** tool, which routes their `speak` and `vote` actions to the `BattleEnvironment`'s controller. The environment then updates the shared `BattleState` and broadcasts the new context to the next agent.
* **Example:** When an agent calls `battle(speak="...", vote="bullish")`, the `BattleEnvironment` processes this, records it in `debate_history`, and then constructs the cumulative context for the next agent, ensuring controlled, structured communication.
**4. Factory Method Pattern (EnvironmentFactory):**
The `EnvironmentFactory` is responsible for creating and initializing the correct environment type (`ResearchEnvironment` or `BattleEnvironment`) based on an input parameter.
* **Implementation:** The static method `EnvironmentFactory.create_environment(environment_type, ...)` encapsulates the logic for instantiating the correct environment class and registering the necessary agents. This decouples the client (`main.py`) from the concrete environment classes.
**5. Adapter Pattern (BaseTool and ToolCollection):**
The `BaseTool` and `ToolCollection` serve as an adapter layer to integrate external capabilities (like `akshare` or the `Battle` mechanism) into the LLM's function-calling interface.
* **Implementation:** `BaseTool.to_param()` converts the Python class definition into the required JSON schema for the LLM. The `execute()` method then adapts the LLM's call into the actual Python function logic.
| Pattern | Component | Role in FinGenius |
| :--- | :--- | :--- |
| **Template Method** | `BaseAgent` | Defines the skeleton of the agent's execution loop (`run`, `step`). |
| **Strategy** | Specialized Agents | Each agent is a strategy with a unique prompt and toolset for a specific financial domain. |
| **Mediator** | `BattleEnvironment` | Controls and structures the communication and debate flow between agents. |
| **Factory Method** | `EnvironmentFactory` | Centralizes the creation and initialization of `Research` and `Battle` environments. |
| **Adapter** | `BaseTool` / `ToolCollection` | Adapts external functions and internal logic for the LLM's function-calling interface. |
#### 3.3.2. Project Highlights
The FinGenius project stands out due to its innovative approach to financial analysis, leveraging a sophisticated multi-agent architecture tailored for the Chinese A-share market.
* **Research–Battle Dual-Environment Architecture:** This is the core innovation. The system separates the process into two distinct phases: the **Research Environment** for parallel, specialized data collection and analysis, and the **Battle Environment** for adversarial validation. This dual structure ensures that the final conclusion is not just a summary of individual findings but a synthesis derived from a structured, competitive debate, significantly reducing the risk of LLM "hallucination."
* **A-Share Market Specialization and Localization:** The project is explicitly designed to overcome the "acclimatization problem" (水土不服) that general-purpose AI exhibits in the Chinese financial context. This is achieved through:
* **Specialized Agents:** Agents like the **Hot Money Agent (游资agent)** and **Chip Agent (筹码agent)** are based on unique A-share market concepts (e.g., Dragon and Tiger Lists, chip distribution).
* **Localized Tools:** Integration with Chinese financial data APIs like `akshare` and localized search tools (Baidu search) ensures relevance and accuracy.
* **Chinese Prompts:** The use of extensive, high-quality Chinese system prompts in `src/prompt` ensures the LLM's reasoning is grounded in the correct market terminology and context.
* **Cumulative Debate Mechanism:** The `BattleEnvironment` implements a sophisticated debate structure where each agent's argument is informed by the full research context and the speeches of all preceding agents in the current round. This **cumulative context** fosters a deeper, more context-aware discussion, simulating a real-world, progressive analysis process.
* **Modular and Extensible Design:** The clear separation of concerns using the **Agent-Environment-Tool** architecture (Strategy and Factory patterns) makes the system highly extensible. Adding a new financial expert (Agent) or a new data source (Tool) requires minimal changes to the core framework, primarily involving configuration and inheritance.
* **Robust State and Memory Management:** The use of Pydantic models for `Message`, `Memory`, and `BattleState` enforces strict data structure and validation. The `BaseAgent`'s built-in logic to detect and handle "stuck states" (duplicate responses) enhances the robustness of the autonomous execution loop.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
The FinGenius project is architecturally sound, but several areas can be optimized for performance, robustness, and maintainability.
**1. Performance and Robustness:**
* **Asynchronous Data Fetching and Caching:** The current tool implementations, particularly those relying on external APIs like `akshare` (e.g., `BigDealAnalysisTool`), appear to use synchronous calls within an `async` framework. While the `execute` method is `async`, the internal `_with_retry` and `_safe_fetch` functions use `time.sleep()`, which blocks the event loop.
* **Suggestion:** Refactor all external API calls to use `aiohttp` or an asynchronous wrapper for `akshare` to prevent blocking the main event loop, significantly improving concurrency in the `ResearchEnvironment`. Implement a time-to-live (TTL) cache (e.g., using Redis) for frequently requested financial data to reduce redundant API calls and improve speed.
* **Tool Execution Timeout:** The `ToolCallAgent` should implement a strict timeout mechanism for tool execution to prevent a single unresponsive tool from stalling the entire agent's `run()` loop.
**2. Architecture Optimization:**
* **Dynamic Tool Registration:** The `ToolCollection` is currently a static container. For a highly extensible system, consider implementing a dynamic tool discovery mechanism (e.g., using Python entry points or a configuration file) that automatically loads tools into the `ToolCollection` based on the agent's configuration, rather than requiring manual import and instantiation in each agent file.
* **Environment State Management:** The `BattleState` is a large Pydantic model. While effective, for long-running debates, consider offloading the `battle_history` and `debate_history` to a persistent store (e.g., a database) to reduce memory footprint and enable recovery from crashes.
**3. Code Quality and Maintainability:**
* **Prompt Management Refinement:** The system prompts are stored as large Python string variables in `src/prompt/*.py`. This is difficult to manage and version control.
* **Suggestion:** Consolidate prompts into a structured format (e.g., YAML or JSON files) or use a dedicated prompt management library. This would allow for easier localization, versioning, and separation of prompt content from Python logic.
* **Type Hinting Consistency:** While Pydantic is used extensively, the use of `Any` in critical areas (e.g., `controller: Optional[Any]` in `Battle` tool) reduces type safety. Replace `Any` with specific protocol classes or forward references to improve static analysis and code clarity.
* **Error Handling in Tools:** The `_safe_fetch` function in `BigDealAnalysisTool` returns `None` on failure. While safe, this can lead to silent failures.
* **Suggestion:** Tools should return a `ToolFailure` object with a detailed error message, allowing the agent's ReAct loop to explicitly reason about the failure and attempt a recovery strategy, rather than simply receiving `None` data.
#### 3.4.2. Secondary Development Guide
The FinGenius project is highly modular, making secondary development straightforward by focusing on the three core components: **Agents**, **Tools**, and **Environments**.
### 1. Code Exploration Path
To understand the system flow, follow this path:
1. **Entry Point:** Start with `main.py` to see the high-level orchestration: environment creation, sequential execution of Research and Battle phases, and final report generation.
2. **Environment Flow:** Examine `src/environment/research.py` and `src/environment/battle.py` to understand the rules and data flow for each phase.
3. **Agent Logic:** Study the agent hierarchy in `src/agent/base.py` and `src/agent/toolcall.py` to grasp the ReAct loop and tool-calling mechanism.
4. **Capabilities:** Review `src/tool/base.py` and the specific tool implementations (e.g., `src/tool/big_deal_analysis.py`) to see how external data is fetched and processed.
### 2. Adding a New Specialized Agent
To introduce a new financial expert (e.g., a "Policy Agent"):
1. **Define the Agent:** Create a new file (e.g., `src/agent/policy.py`) inheriting from `MCPAgent`.
```python
class PolicyAgent(MCPAgent):
name: str = "policy_agent"
description: str = "分析宏观政策和行业监管变动。"
system_prompt: str = POLICY_SYSTEM_PROMPT # Define this prompt
available_tools: ToolCollection = Field(
default_factory=lambda: ToolCollection(PolicyTool(), Terminate())
)
```
2. **Create Necessary Tools:** If the agent needs new capabilities, create a `BaseTool` implementation (e.g., `PolicyTool`) in `src/tool/`.
3. **Register the Agent:** Modify `src/environment/research.py`'s `_create_agents` method to instantiate and include the new `PolicyAgent` in the research team.
### 3. Adding a New Tool (External Capability)
To integrate a new data source or function:
1. **Define the Tool:** Create a new file (e.g., `src/tool/new_data_source.py`) inheriting from `BaseTool`.
2. **Implement Execution:** Implement the `async def execute(...)` method, which contains the logic for interacting with the external service (e.g., a new financial API).
3. **Update Agent Toolset:** Add the new tool to the `ToolCollection` of the relevant specialized agent(s) in `src/agent/`.
### 4. Configuration
* **LLM Configuration:** Modify `config/config.example.toml` to change the LLM model, API key, and other parameters.
* **MCP Configuration:** Adjust `config/mcp.example.json` to configure the endpoints for the specialized financial data servers that the `MCPAgent`s connect to.
By adhering to the established agent hierarchy and the Tool/Environment separation, new features can be added with high confidence and minimal side effects.
================================================
FILE: thirdparty/FinRL-Meta.md
================================================
# FinRL-Meta - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
The project structure is highly modular, with the core logic encapsulated within the `meta/` directory. This design facilitates clear separation of concerns between data handling, environment simulation, and agent implementation.
```
FinRL-Meta/
├── meta/ # Core library source code for FinRL-Meta framework.
│ ├── agents/ # DRL Agent implementations and wrappers for various DRL libraries (ElegantRL, RLLib, Stable-Baselines3).
│ ├── config.py # Global configuration constants, including ticker lists, time zones, and API key placeholders.
│ ├── data_processor.py # The Facade class that orchestrates the entire data pipeline, selecting and running the appropriate data source processor.
│ ├── data_processors/ # Module containing concrete implementations for fetching and cleaning data from different financial APIs.
│ │ ├── _base.py # Abstract base class defining the common interface for all data processors (Strategy Pattern).
│ │ ├── yahoofinance.py # Implementation for fetching data from Yahoo Finance.
│ │ ├── binance.py # Implementation for fetching data from Binance.
│ │ └── ... # Other data source implementations (Alpaca, Tushare, etc.).
│ ├── env_crypto_trading/ # Module for cryptocurrency trading environments.
│ │ ├── env_multiple_crypto.py # Multi-asset cryptocurrency trading environment, adhering to the OpenAI Gym interface.
│ │ ├── env_btc_ccxt.py # Single-asset Bitcoin trading environment.
│ │ └── alpaca_paper_trade_multicrypto.py # Interface for live/paper trading execution using the Alpaca API.
│ └── env_execution_optimizing/ # Module for specialized execution optimization problems.
│ └── liquidation/ # Sub-module for the optimal liquidation problem.
│ ├── env_execution_optimizing.py # Market environment based on the Almgren-Chriss model.
│ ├── ddpg_agent.py # Implementation of the DDPG agent for continuous control.
│ └── model.py # Neural network definitions (Actor and Critic) for the DDPG agent.
├── README.md # Project documentation and usage examples.
├── setup.py # Python package setup file.
└── ... # Non-core files (e.g., examples, notebooks, docs).
```
The structure clearly delineates the **Data Layer** (`data_processors/`), the **Environment Layer** (`env_crypto_trading/`, `env_execution_optimizing/`), and the **Agent Layer** (`agents/`, `liquidation/ddpg_agent.py`), supporting the project's modular design philosophy. The use of a central `data_processor.py` and `config.py` provides global control and configuration points. The separation of environments into distinct domains (crypto trading vs. execution optimizing) allows for specialized modeling of market dynamics.
### 1.2. Core Folders for Analysis
* `/home/ubuntu/FinnewsHunter/thirdparty/FinRL-Meta/meta/data_processors`: Contains the core logic for fetching, cleaning, and transforming financial market data from various sources.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinRL-Meta/meta/env_crypto_trading`: Contains the reinforcement learning environments and live trading interfaces for cryptocurrency portfolio management.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinRL-Meta/meta/env_execution_optimizing/liquidation`: Contains the specialized environment and DRL agent implementation for the optimal trade execution (liquidation) problem.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinRL-Meta/meta/agents`: Contains the wrappers and base classes for integrating various external DRL libraries.
## Phase 2: Module-by-Module Deep Analysis
## Module Analysis
### 1. Module: `meta/agents`
* **Files Enumerated**: `elegantrl_models.py`, `rllib_models.py`, `stablebaselines3_models.py`.
* **Module Core Responsibility**: To provide a standardized interface and wrappers for integrating various external Deep Reinforcement Learning (DRL) libraries (ElegantRL, RLLib, Stable-Baselines3) with the FinRL-Meta environments. This module abstracts the library-specific agent creation and training logic.
* **Key File Identification**:
* `stablebaselines3_models.py`: Contains the `DRLAgent` class, which acts as a wrapper for Stable-Baselines3 algorithms (e.g., A2C, PPO, DDPG). It handles the creation of the agent, training, and testing, providing a unified API for the main workflow.
* `elegantrl_models.py`: Provides similar wrappers for ElegantRL agents.
* `rllib_models.py`: Provides wrappers for RLLib agents.
* **Core Implementation**: The `DRLAgent` classes typically take an environment, a model name, and hyperparameters. They encapsulate the boilerplate code for agent initialization, model saving/loading, and the training loop (`train_model`, `get_model`).
* **Dependencies**: Depends heavily on external DRL libraries (Stable-Baselines3, ElegantRL, RLLib) and the custom environments defined in the `meta/env_...` modules.
### 2. Module: `meta/data_processors`
* **Files Enumerated**: `_base.py`, `alpaca.py`, `binance.py`, `ccxt.py`, `tushare.py`, `yahoofinance.py`.
* **Module Core Responsibility**: To provide concrete implementations for fetching, cleaning, and transforming raw financial data from various sources into a standardized format (Pandas DataFrame) and ultimately into NumPy arrays for the RL environments.
* **Key File Identification**:
* `_base.py`: Defines the abstract base class `_Base`, which outlines the common interface (`download_data`, `clean_data`, `add_technical_indicator`, `df_to_array`) that all concrete processors must implement. This is the core of the Strategy Pattern.
* `yahoofinance.py`: Implements data fetching using the `yfinance` library, including specific logic for price adjustment and handling time intervals.
* `binance.py`: Implements data fetching from the Binance exchange, handling specific API calls and data aggregation logic.
* **Core Implementation**: The `download_data` methods handle API interaction. The `clean_data` methods are crucial for filling missing values and ensuring data integrity. The `df_to_array` method transforms the final DataFrame into the required NumPy arrays (`price_array`, `tech_array`, `turbulence_array`) for the RL environment.
* **Dependencies**: Depends on external data libraries (`yfinance`, `ccxt`, `tushare`, `alpaca_trade_api`) and common data science libraries (`pandas`, `numpy`).
### 3. Module: `meta/env_crypto_trading`
* **Files Enumerated**: `alpaca_paper_trade_multicrypto.py`, `create_crypto_env.py`, `env_btc_ccxt.py`, `env_multiple_crypto.py`.
* **Module Core Responsibility**: To define the simulation environments for cryptocurrency trading and provide an interface for live/paper trading execution.
* **Key File Identification**:
* `env_multiple_crypto.py`: Defines the `CryptoEnv` class, the primary multi-asset RL environment. It implements the core `reset()` and `step()` methods, managing the portfolio state (cash, stocks) and calculating the reward based on asset value change.
* `alpaca_paper_trade_multicrypto.py`: Defines `AlpacaPaperTradingMultiCrypto`, which acts as the execution layer. It loads a trained DRL policy, fetches real-time data, infers an action, and executes trades via the Alpaca API.
* **Core Implementation**: The `CryptoEnv.step()` method contains the critical trading logic: action normalization (to handle large price differences), transaction cost calculation, and portfolio update. The state is constructed by stacking normalized cash, stocks, and a lookback window of technical indicators.
* **Dependencies**: Depends on the `meta/data_processors` for data, and external libraries like `gym`, `numpy`, `pandas`, and `alpaca_trade_api`.
### 4. Module: `meta/env_execution_optimizing/liquidation`
* **Files Enumerated**: `ddpg_agent.py`, `env_execution_optimizing.py`, `model.py`, `utils.py`.
* **Module Core Responsibility**: To provide a specialized environment and DRL agent for the optimal trade execution problem, specifically the Almgren-Chriss liquidation model.
* **Key File Identification**:
* `env_execution_optimizing.py`: Defines `MarketEnvironment`, which models the stock price dynamics under market impact (permanent and temporary) and calculates the reward based on the Almgren-Chriss utility function.
* `ddpg_agent.py`: Defines the `Agent` class, a standard implementation of the DDPG algorithm, including `Actor` and `Critic` networks, `ReplayBuffer`, and `OU_Noise`.
* **Core Implementation**: The `MarketEnvironment.step()` method is the core, implementing the price evolution and market impact equations. The DDPG `Agent.learn()` method implements the standard DDPG update rules for the Actor and Critic networks.
* **Dependencies**: Depends on `numpy`, `torch`, and standard DRL components.
### Module PlantUML Diagrams
@startuml
title Agents Module (Stable-Baselines3)
abstract class BaseCallback {
+ _on_step()
}
class TensorboardCallback {
+ _on_step(): bool
}
class DRLAgent {
+ __init__(env)
+ get_model(model_name, policy, policy_kwargs, model_kwargs, verbose, seed)
+ train_model(model, tb_log_name, total_timesteps)
+ DRL_prediction(model, environment)
+ DRL_prediction_load_from_file(model_name, environment, cwd)
}
class DRLEnsembleAgent {
+ __init__(df, train_period, ...)
+ get_model(model_name, env, ...)
+ train_model(model, model_name, tb_log_name, iter_num, total_timesteps)
+ get_validation_sharpe(iteration, model_name)
+ DRL_validation(model, test_data, test_env, test_obs)
+ DRL_prediction(model, name, last_state, iter_num, ...)
+ run_ensemble_strategy(A2C_model_kwargs, PPO_model_kwargs, DDPG_model_kwargs, timesteps_dict)
}
TensorboardCallback --|> BaseCallback
DRLAgent ..> MODELS : uses
DRLEnsembleAgent ..> MODELS : uses
DRLEnsembleAgent ..> DRLAgent : uses methods
note right of DRLAgent::get_model
Initializes SB3 model (A2C, PPO, DDPG, SAC, TD3)
Handles action noise configuration
end note
note right of DRLEnsembleAgent::run_ensemble_strategy
Core logic for rolling-window training
and model selection based on Sharpe ratio
end note
@enduml
@startuml
skinparam classAttributeIconVisible true
package "Data Processors" {
enum DataSource {
akshare
alpaca
alphavantage
baostock
binance
ccxt
iexcloud
joinquant
quandl
quantconnect
ricequant
tushare
wrds
yahoofinance
}
abstract class _Base {
+ data_source: str
+ start_date: str
+ end_date: str
+ time_interval: str
+ dataframe: pd.DataFrame
--
+ download_data(ticker_list: List[str])
+ clean_data()
+ fillna()
+ add_technical_indicator(tech_indicator_list: List[str])
+ add_turbulence()
+ calculate_turbulence(): pd.DataFrame
+ add_vix()
+ df_to_array(tech_indicator_list: List[str], if_vix: bool)
+ calc_nonstandard_time_interval(): str
+ transfer_standard_ticker_to_nonstandard(ticker: str): str
+ save_data(path)
+ load_data(path)
}
class DataProcessor {
- processor: _Base
+ data_source: DataSource
+ start_date: str
+ end_date: str
+ time_interval: str
+ dataframe: pd.DataFrame
--
+ __init__(data_source: DataSource, ...)
+ download_data(ticker_list)
+ clean_data()
+ add_technical_indicator(tech_indicator_list: List[str])
+ add_turbulence()
+ add_vix()
+ df_to_array(if_vix: bool): np.array
+ data_split(df, start, end)
+ fillna()
+ run(ticker_list: str, technical_indicator_list: List[str], if_vix: bool)
}
class Yahoofinance {
+ download_data(ticker_list: List[str])
}
class Alpaca {
+ api: tradeapi.REST
+ download_data(ticker_list)
+ clean_data()
+ get_trading_days(start, end)
}
class Binance {
+ download_data(ticker_list: List[str])
+ dataframe_with_limit(symbol)
+ fetch_n_combine(startDate, endDate, tickers)
}
class Tushare {
+ token: str
+ adj: str
+ download_data(ticker_list: List[str])
}
DataProcessor o-- _Base : delegates
_Base <|-- Yahoofinance
_Base <|-- Alpaca
_Base <|-- Binance
_Base <|-- Tushare
DataProcessor o-- DataSource : uses
}
@enduml
@startuml
skinparam classAttributeIconVisible true
package "RL Environments (meta.envs)" {
package "Crypto Trading" {
class CryptoEnv {
+ lookback: int
+ initial_cash: float
+ buy_cost_pct: float
+ sell_cost_pct: float
+ price_array: np.ndarray
+ tech_array: np.ndarray
+ stocks: np.ndarray
--
+ __init__(config, lookback, initial_capital, ...)
+ reset(): np.ndarray
+ step(actions): (np.ndarray, float, bool, None)
+ get_state(): np.ndarray
- _generate_action_normalizer()
}
class BitcoinEnv {
+ stock_dim: int = 1
+ initial_account: float
+ transaction_fee_percent: float
--
+ __init__(...)
+ reset(): np.ndarray
+ step(action): (np.ndarray, float, bool, None)
+ draw_cumulative_return(...)
- load_data(...)
}
class AlpacaPaperTradingMultiCrypto {
- alpaca: tradeapi.REST
- act: AgentPPO.act
- CCTX_time_interval: str
- time_interval: int
- stocks: np.ndarray
- cash: float
--
+ __init__(...)
+ run()
+ trade()
+ get_state()
+ submitOrder(qty, stock, side, resp)
}
class create_crypto_env {
+ create_train_env(...)
+ create_test_env(...)
}
CryptoEnv <.. create_crypto_env : creates
BitcoinEnv .up.|> CryptoEnv : specialized single-asset env (conceptual)
AlpacaPaperTradingMultiCrypto ..> CryptoEnv : uses concepts (state/action space)
AlpacaPaperTradingMultiCrypto ..> meta.data_processors.Ccxt : data source
AlpacaPaperTradingMultiCrypto ..> elegantrl.agent.AgentPPO : loads agent
}
package "Execution Optimizing" {
class Agent << (A, #FF7700) DDPG Agent >> {
+ state_size: int
+ action_size: int
- actor_local: Actor
- critic_local: Critic
- noise: OUNoise
- memory: ReplayBuffer
--
+ __init__(state_size, action_size, random_seed)
+ step(state, action, reward, next_state, done)
+ act(state, add_noise=True)
+ learn(experiences, gamma)
+ soft_update(local_model, target_model, tau)
}
class OUNoise {
- mu: np.ndarray
- theta: float
- sigma: float
--
+ __init__(size, seed, mu, theta, sigma)
+ reset()
+ sample()
}
class ReplayBuffer {
- memory: deque
- experience: namedtuple
--
+ __init__(action_size, buffer_size, batch_size, seed)
+ add(state, action, reward, next_state, done)
+ sample()
}
Agent *-- OUNoise : uses
Agent *-- ReplayBuffer : uses
Agent ..> Actor : trains
Agent ..> Critic : trains
}
}
@enduml
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
## Core Abstractions, Design Philosophy, and Lifecycle Management
The FinRL-Meta project is built upon a highly modular and layered architecture, primarily following the **Facade** and **Strategy** design patterns to achieve flexibility and extensibility. The core abstractions revolve around three main components: Data, Environment, and Agent.
### 1. Data Abstraction
The data layer abstracts the complex process of connecting to various financial data sources (e.g., Yahoo Finance, Binance, Alpaca) into a unified interface.
* **`DataSource` Enum**: This is the fundamental abstraction, listing all supported data providers (`akshare`, `alpaca`, `yahoofinance`, etc.).
* **`_Base` Class**: An abstract base class (`meta/data_processors/_base.py`) that defines the common interface for all concrete data processors. It includes core methods like `download_data()`, `clean_data()`, `add_technical_indicator()`, and `df_to_array()`. This enforces a standard contract across all data sources.
* **`DataProcessor` Class**: This acts as a **Facade** (`meta/data_processor.py`). It takes a `DataSource` enum in its constructor and dynamically instantiates the corresponding concrete processor (e.g., `Yahoofinance`, `Binance`). Its `run()` method orchestrates the entire data pipeline: download, clean, add indicators, and transform the data into NumPy arrays (`price_array`, `tech_array`, `turbulence_array`) suitable for the RL environment.
### 2. Environment Abstraction
The environment layer provides a standard interface for the Deep Reinforcement Learning (DRL) agents, adhering to the OpenAI Gym standard (`reset`, `step`).
* **`CryptoEnv` / `BitcoinEnv`**: These classes (`meta/env_crypto_trading/env_multiple_crypto.py`, `meta/env_crypto_trading/env_btc_ccxt.py`) abstract the trading logic, portfolio management, and reward calculation. They manage the state space (cash, holdings, technical indicators) and the action space (buy/sell/hold).
* **State Representation**: The state is a flattened NumPy array, typically a concatenation of normalized cash, normalized stock holdings, and a lookback window of normalized technical indicators. This design choice simplifies the state space for DRL algorithms.
### 3. Agent and Execution Abstraction
The agent layer is designed to be decoupled from the core framework, allowing for easy integration of external DRL libraries (e.g., ElegantRL).
* **`DDPG_Agent`**: A concrete implementation of a DRL agent, demonstrating the use of the **Actor-Critic** architecture for continuous action spaces. It uses helper classes like `ReplayBuffer` and `OUNoise`.
* **`AlpacaPaperTradingMultiCrypto`**: This class in the execution layer acts as a bridge between the trained DRL policy and a live trading API (Alpaca). It handles the real-time data fetching, state construction, policy inference, and order submission, managing the entire **live trading lifecycle**.
### Lifecycle Management
The typical lifecycle involves:
1. **Initialization**: `DataProcessor` is initialized with a `DataSource` and time parameters.
2. **Data Preparation**: `DataProcessor.run()` fetches and processes historical data, outputting NumPy arrays.
3. **Environment Setup**: An environment (`CryptoEnv`) is instantiated with the processed data arrays.
4. **Training/Testing**: A DRL agent interacts with the environment using `reset()` and `step()` methods.
5. **Deployment (Live Trading)**: The trained agent's policy is loaded into an execution class (`AlpacaPaperTradingMultiCrypto`), which runs a continuous loop to fetch real-time data, generate actions, and execute trades. The `run()` method in this class manages the continuous trading loop.
#### 3.1.2. Component Interactions
## Component Interactions, Data Flow, and Communication Patterns
The FinRL-Meta architecture is characterized by a clear separation of concerns, with data flowing sequentially from the Data Layer to the Environment Layer, and control/action signals flowing between the Environment and the Agent Layer.
### 1. Data Flow (Offline/Training Phase)
The primary data flow during the offline training phase is a one-way pipeline from the data source to the reinforcement learning environment.
| Source Component | Target Component | Data Format | Communication Pattern | Description |
| :--- | :--- | :--- | :--- | :--- |
| **Data Processor** | **RL Environment** | NumPy Arrays | Synchronous Call | The `DataProcessor.run()` method orchestrates the data pipeline, culminating in the output of three key NumPy arrays: `price_array`, `tech_array`, and `turbulence_array`. These arrays, which represent the entire historical dataset, are passed directly to the `CryptoEnv` constructor. |
| **RL Environment** | **DRL Agent** | NumPy Array (State) | Synchronous Call | In each `step()` call, the `CryptoEnv` calculates the next state (`get_state()`) and returns it to the DRL agent. The state is a flattened, normalized vector of market data and portfolio information. |
The `DataProcessor` acts as a **Strategy Pattern** selector, dynamically choosing a concrete data source module (e.g., `Yahoofinance`, `Binance`) based on the `DataSource` enum provided by the user. This ensures that the downstream components (the RL environments) only interact with the standardized NumPy array format, completely decoupling them from the complexities of external APIs.
### 2. Control Flow (Training Phase)
The control flow adheres strictly to the standard **OpenAI Gym interface** for reinforcement learning.
1. **Initialization**: The DRL training loop calls `env.reset()`. The environment initializes the portfolio (cash, stocks) and returns the initial state vector.
2. **Action Selection**: The DRL agent receives the state and uses its neural network policy (`Actor.forward()`) to select an action (a continuous vector of target stock allocations).
3. **State Transition**: The DRL training loop calls `env.step(action)`.
4. **Environment Logic**: Inside `env.step()`, the environment:
* Applies the action (simulates trades, updating `cash` and `stocks`).
* Calculates the reward (change in total asset value).
* Advances the time step.
* Determines the next state (`get_state()`).
* Checks for termination (`done`).
5. **Feedback**: The environment returns `(next_state, reward, done, info)` to the agent, closing the loop.
### 3. Communication Patterns (Online/Live Trading Phase)
The `AlpacaPaperTradingMultiCrypto` class manages the real-time interaction with external services, introducing asynchronous and external API communication.
1. **Real-Time Data Fetch**: The `get_state()` method within the live trading class uses a data processor (specifically `Ccxt` in the example) to fetch the latest market data via HTTP requests to the exchange API (e.g., Binance). This is a synchronous, blocking call to retrieve the necessary historical lookback window.
2. **Policy Inference**: The fetched data is transformed into the state vector, which is then passed to the loaded DRL policy (`self.act(s_tensor)`). This is a local, synchronous operation.
3. **Trade Execution**: The resulting action is translated into market orders. The `submitOrder()` method uses the Alpaca API (`alpaca.submit_order()`) to send the order to the broker. This is typically an external, asynchronous HTTP call, although the provided code wraps it in a `threading.Thread` and uses `join()` to make it functionally synchronous within the main loop, ensuring one trade is processed before the next time step.
This layered design ensures that the core RL logic remains clean and platform-agnostic, while the complexity of external data fetching and live execution is encapsulated in dedicated modules.
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml
skinparam componentStyle rectangle
skinparam classAttributeIconVisible true
title FinRL-Meta High-Level Architecture
package "Data Layer" {
class DataSource << (E, #ADD8E6) Enum >>
abstract class _Base << (A, #ADD8E6) Base Processor >>
class DataProcessor << (F, #ADD8E6) Facade >>
class Yahoofinance << (C, #ADD8E6) Concrete Processor >>
class Binance << (C, #ADD8E6) Concrete Processor >>
' ... other concrete processors ...
}
package "Environment Layer" {
class CryptoEnv << (E, #90EE90) RL Environment >> {
+ price_array: np.ndarray
+ tech_array: np.ndarray
+ stocks: np.ndarray
--
+ reset()
+ step(actions)
+ get_state()
}
class MarketEnvironment << (E, #90EE90) Liquidation Env >> {
+ shares_remaining
+ timeHorizon
--
+ step(action)
}
}
package "Agent Layer" {
class DDPG_Agent << (A, #FFB6C1) Deep RL Agent >>
class Actor << (N, #FFB6C1) Neural Network >>
class Critic << (N, #FFB6C1) Neural Network >>
class OUNoise << (H, #FFB6C1) Helper >>
class ReplayBuffer << (H, #FFB6C1) Helper >>
}
package "Execution Layer" {
class AlpacaPaperTradingMultiCrypto << (T, #FFA07A) Trading Interface >> {
- alpaca: tradeapi.REST
- act: Agent.act
--
+ run()
+ trade()
+ get_state()
}
}
' Relationships
' Data Flow
DataProcessor .up.> DataSource : uses
DataProcessor .right.> _Base : delegates
Yahoofinance .up.|> _Base
Binance .up.|> _Base
DataProcessor --> CryptoEnv : feeds (price, tech, turbulence arrays)
DataProcessor --> MarketEnvironment : feeds (implicitly via parameters)
' Environment to Agent
CryptoEnv .left.> DDPG_Agent : state/reward/action space
' Agent Internals
DDPG_Agent *-- Actor : trains/uses
DDPG_Agent *-- Critic : trains/uses
DDPG_Agent *-- OUNoise
DDPG_Agent *-- ReplayBuffer
' Execution Flow
AlpacaPaperTradingMultiCrypto .up.> CryptoEnv : conceptual interface
AlpacaPaperTradingMultiCrypto .up.> DDPG_Agent : loads/uses policy (act)
AlpacaPaperTradingMultiCrypto .up.> Binance : data fetching (via Ccxt)
AlpacaPaperTradingMultiCrypto .up.> Alpaca : trade execution
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
## Design Patterns Used in the Codebase
The FinRL-Meta project effectively utilizes several software design patterns to achieve modularity, flexibility, and maintainability, particularly in handling diverse data sources and complex reinforcement learning components.
### 1. Facade Pattern (DataProcessor)
The `DataProcessor` class (`meta/data_processor.py`) serves as a **Facade** to the entire data processing subsystem. It provides a simple, unified interface (`run()`, `download_data()`, `clean_data()`) for the complex operations of fetching, cleaning, and transforming data from multiple sources.
* **Implementation**: The `DataProcessor.__init__` method takes a `DataSource` enum and dynamically instantiates the appropriate concrete processor (e.g., `Yahoofinance`, `Binance`). All subsequent method calls on `DataProcessor` are delegated to the internal concrete processor instance.
* **Code Example (meta/data_processor.py)**:
```python
class DataProcessor:
def __init__(self, data_source: DataSource, ...):
# ... dynamic instantiation logic ...
self.processor = processor_dict.get(self.data_source)(...)
def download_data(self, ticker_list):
self.processor.download_data(ticker_list=ticker_list)
self.dataframe = self.processor.dataframe
```
### 2. Strategy Pattern (Data Processors)
The various data source classes (e.g., `Yahoofinance`, `Alpaca`, `Binance`) implement the **Strategy Pattern**. They all inherit from the abstract base class `_Base` (`meta/data_processors/_base.py`), which defines the common interface (the "Strategy"). Each concrete class provides its own specific implementation (the "Concrete Strategy") for methods like `download_data()` and `clean_data()`, tailored to the requirements of its respective API.
* **Implementation**: The `_Base` class defines the contract, and classes like `Yahoofinance` and `Binance` provide the specific logic for their data fetching and cleaning. The `DataProcessor` (the "Context") selects and uses the appropriate strategy object.
* **Code Example (meta/data_processors/_base.py)**:
```python
class _Base:
def download_data(self, ticker_list: List[str]):
pass # Defined in concrete classes
```
### 3. Actor-Critic Pattern (DDPG Agent)
The Deep Deterministic Policy Gradient (DDPG) agent implementation (`meta/env_execution_optimizing/liquidation/ddpg_agent.py`) is a prime example of the **Actor-Critic** architecture, a fundamental pattern in Reinforcement Learning.
* **Implementation**: The agent consists of two main neural networks:
* **Actor (`Actor` class)**: The policy network that takes the state as input and outputs the action (the policy).
* **Critic (`Critic` class)**: The value network that takes the state and action as input and outputs the Q-value (the value function).
* **Code Example (meta/env_execution_optimizing/liquidation/ddpg_agent.py)**:
```python
# Actor Network (w/ Target Network)
self.actor_local = Actor(state_size, action_size, random_seed).to(device)
# Critic Network (w/ Target Network)
self.critic_local = Critic(state_size, action_size, random_seed).to(device)
# In learn method:
# Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
```
### 4. Template Method Pattern (RL Environment)
The base environment structure, particularly in `CryptoEnv` and `BitcoinEnv`, follows the **Template Method Pattern**. The base class defines the skeleton of the algorithm (`reset`, `step`) but defers the implementation of specific steps (like state normalization or action scaling) to helper methods or configuration parameters.
* **Implementation**: The `step()` method in `CryptoEnv` is the template, which calls the concrete implementation of action normalization via `_generate_action_normalizer()` and applies the core trading logic. The overall structure is inherited from the OpenAI Gym interface, which itself is a form of the Template Method.
* **Code Example (meta/env_crypto_trading/env_multiple_crypto.py)**:
```python
class CryptoEnv:
# ...
def step(self, actions) -> (np.ndarray, float, bool, None):
# Template step 1: Normalize action (deferred to helper)
for i in range(self.action_dim):
norm_vector_i = self.action_norm_vector[i]
actions[i] = actions[i] * norm_vector_i
# Template step 2: Execute trades (core logic)
# ... sell logic ...
# ... buy logic ...
# Template step 3: Update state and calculate reward (core logic)
# ...
```
#### 3.3.2. Project Highlights
## Project Highlights: Innovative Features, Extensibility, and Flexibility Design
The FinRL-Meta project exhibits several innovative features and strong design choices that contribute to its extensibility and flexibility, making it a robust platform for financial reinforcement learning research and application.
* **Unified Data Pipeline Abstraction**:
* **Innovation**: The use of the `DataProcessor` Facade over a set of concrete data source strategies (`Yahoofinance`, `Binance`, etc.) is a major highlight. This design abstracts away the heterogeneity of financial data APIs, which often have different data formats, time zone conventions, and rate limits.
* **Flexibility**: Researchers can easily add support for a new data source by simply creating a new class that inherits from `_Base` and implementing the required methods. The core RL environment remains completely unaware of the data source's origin, only consuming the standardized NumPy arrays.
* **Decoupled RL Environment and Agent**:
* **Extensibility**: The core RL environments (`CryptoEnv`, `MarketEnvironment`) are designed to be agnostic to the specific DRL algorithm used. They adhere to the standard OpenAI Gym interface (`reset`, `step`), which is the universal contract for RL. This allows the project to seamlessly integrate agents from different DRL libraries (e.g., ElegantRL, Stable-Baselines3, RLLib), as seen in the `AlpacaPaperTradingMultiCrypto` class which dynamically loads the policy.
* **Innovation**: The environment state space is carefully engineered to be a fixed-size, normalized vector, making it directly compatible with standard deep learning models (e.g., fully connected layers in the Actor/Critic networks). The normalization factors (e.g., `cash * 2**-18`) are hardcoded to scale the state variables into a manageable range for neural network training.
* **Real-Time Trading Integration**:
* **Innovation**: The inclusion of the `AlpacaPaperTradingMultiCrypto` module demonstrates a clear path from research to real-world application. This module encapsulates the complexity of live trading, including API communication, order submission, and real-time state construction. It bridges the gap between a simulated environment and a live paper trading account.
* **Flexibility**: By separating the trading logic from the core RL environment, the project allows for different execution strategies (e.g., market orders, limit orders, different brokers) to be implemented without modifying the core training environment.
* **Domain-Specific Environment Modeling**:
* **Innovation**: The `MarketEnvironment` for execution optimization, based on the Almgren-Chriss model, is a sophisticated, domain-specific environment. It models complex financial phenomena like **permanent and temporary market impact** and uses a reward function based on the change in the Almgren-Chriss utility function. This highlights the project's focus on advanced financial modeling beyond simple portfolio management.
* **Extensibility**: The environment is parameterized with financial constants (`ANNUAL_VOLAT`, `BID_ASK_SP`, `LLAMBDA1`), allowing researchers to easily modify the market dynamics to test the robustness of their agents under different simulated conditions.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
## Improvement Suggestions: Performance, Architecture, and Code Quality
Based on the comprehensive analysis of the FinRL-Meta codebase, the following suggestions are proposed to enhance performance, optimize the architecture, and improve overall code quality.
### 1. Performance Bottlenecks and Optimization
| Area | Bottleneck/Issue | Suggestion for Improvement |
| :--- | :--- | :--- |
| **Data Processing (Pandas)** | Excessive use of `pd.concat()` and `df.append()` in data processors (e.g., `Yahoofinance.download_data`). These operations create new DataFrames in memory, leading to significant performance degradation and memory overhead, especially with large datasets. | **Pre-allocate Lists and Concatenate Once**: Instead of appending to a DataFrame in a loop, collect the individual DataFrames into a Python list and perform a single `pd.concat(list_of_dfs)` operation outside the loop. |
| **State Normalization** | Hardcoded magic numbers for state normalization (e.g., `cash * 2**-18`, `stocks * 2**-3`) are used across multiple environment files (`CryptoEnv`, `BitcoinEnv`). This makes tuning and debugging difficult. | **Centralize Normalization Constants**: Define all normalization constants in a single configuration file (e.g., `meta/config.py`) and load them dynamically. This improves maintainability and allows for easier hyperparameter tuning of the state space. |
| **Live Trading Latency** | The `AlpacaPaperTradingMultiCrypto.trade()` method uses `threading.Thread` with `join()` for `submitOrder`. This effectively makes the order submission synchronous and blocks the main trading loop, increasing latency. | **Asynchronous Order Submission**: Implement true asynchronous order submission using `asyncio` and non-blocking API calls (if supported by the Alpaca SDK) or a dedicated, non-blocking worker queue/process for trade execution. |
### 2. Architecture Optimization
* **Formalize the Environment Base Class**: Currently, the RL environments (`CryptoEnv`, `BitcoinEnv`) do not explicitly inherit from a common abstract base class, other than the implicit contract of the OpenAI Gym interface.
* **Suggestion**: Introduce a formal `BaseEnv` class in `meta/envs/_base.py` that inherits from `gym.Env` (or a modern equivalent) and defines abstract methods for `_calculate_reward()`, `_update_portfolio()`, and `_get_state()`. This would enforce a stricter contract and improve the clarity of the environment's responsibilities.
* **Decouple DRL Library Loading**: The `AlpacaPaperTradingMultiCrypto` class contains hardcoded imports and logic for `elegantrl` (lines 7164-7175). This tightly couples the execution layer to a specific DRL framework.
* **Suggestion**: Use a **Factory Pattern** to load the agent. The execution class should only accept a path to a saved model and a configuration, and a separate utility function should handle the framework-specific loading and policy instantiation.
### 3. Code Quality and Maintainability
* **Consistent Type Hinting**: While some files use type hints, consistency is lacking across the entire codebase.
* **Suggestion**: Adopt comprehensive Python type hinting for all function signatures and class attributes. This significantly improves code readability, enables static analysis tools, and reduces runtime errors.
* **Magic Number Elimination**: The `MarketEnvironment` in the execution optimization module is heavily parameterized with financial constants (e.g., `LLAMBDA1 = 1e-6`, `NUM_N = 60`).
* **Suggestion**: Move all these constants to a dedicated configuration file or a class-level attribute with clear documentation, making the environment's parameters transparent and easily adjustable.
* **Refactor `AlpacaPaperTradingMultiCrypto` State Logic**: The state construction logic in `get_state()` is complex, involving multiple array stacking and normalization steps.
* **Suggestion**: Encapsulate the state construction into a dedicated `StateBuilder` class or a static method. This would isolate the complex logic and make the state representation easier to verify and modify.
#### 3.4.2. Secondary Development Guide
## Secondary Development Guide: Best Practices for Code Exploration and Extension
This guide provides a structured approach for developers looking to explore, modify, or extend the FinRL-Meta codebase.
### 1. Code Exploration Path
Start your exploration by focusing on the three core layers of the architecture:
1. **Configuration and Entry Point (`meta/config.py` and `meta/data_processor.py`)**:
* Examine `meta/config.py` to understand the global constants, default ticker lists, and time zone settings.
* Review `meta/data_processor.py` to grasp how data sources are selected and the standardized data arrays (`price_array`, `tech_array`, `turbulence_array`) are generated. This is the **input** to the entire RL system.
2. **Environment Layer (`meta/env_crypto_trading/`)**:
* Focus on `meta/env_crypto_trading/env_multiple_crypto.py` (`CryptoEnv`). This is the heart of the simulation.
* Analyze the `__init__`, `reset()`, and `step(actions)` methods to understand the state space definition, reward function, and transaction logic (cost calculation, portfolio update).
3. **Agent/Execution Layer (`meta/env_execution_optimizing/` and `meta/env_crypto_trading/`)**:
* For DRL implementation details, study `meta/env_execution_optimizing/liquidation/ddpg_agent.py` and `model.py` to see the Actor-Critic network structure and training loop.
* For real-world application, examine `meta/env_crypto_trading/alpaca_paper_trade_multicrypto.py` to understand how a trained policy is deployed for live trading.
### 2. Best Practices for Extension
* **Adding a New Data Source**:
1. Create a new file in `meta/data_processors/` (e.g., `new_source.py`).
2. Define a class that inherits from `meta/data_processors/_base._Base`.
3. Implement the required methods, especially `download_data()` and `clean_data()`, ensuring the final `self.dataframe` adheres to the expected format (columns: `time`, `open`, `high`, `low`, `close`, `volume`, `tic`).
4. Update the `DataSource` enum and the `processor_dict` mapping in `meta/data_processor.py` to include your new class.
* **Creating a New Trading Environment**:
1. Create a new file in `meta/envs/` (e.g., `env_forex_trading.py`).
2. Define a new environment class (e.g., `ForexEnv`) that mimics the structure of `CryptoEnv`, implementing `reset()` and `step()`.
3. Crucially, redefine the **state space** (`self.state_dim`) and **action space** (`self.action_dim`) to match the requirements of the new domain (e.g., different asset types, different technical indicators).
4. Adjust the reward function and transaction cost logic to reflect the new market's characteristics.
* **Integrating a New DRL Algorithm**:
1. Ensure your new algorithm's policy can be loaded and called with a NumPy state array to return a NumPy action array.
2. If integrating into the live trading module, modify the agent loading section in `AlpacaPaperTradingMultiCrypto.__init__` to correctly load your new model and expose the `self.act` function.
3. If the new algorithm requires a different environment interface (e.g., discrete action space), you will need to create a new environment wrapper that translates the continuous actions of the existing environments into the required format.
================================================
FILE: thirdparty/FinRL.md
================================================
# FinRL - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
The FinRL project structure is organized into a core Python package (`finrl`) and several supporting directories, following a clear separation of concerns for a machine learning framework.

```
FinRL/
├── .git/ # Git version control metadata (Excluded from analysis)
├── .github/ # GitHub configuration (e.g., issue templates, workflows) (Excluded)
├── docker/ # Docker setup for containerized environments (Excluded)
├── docs/ # Documentation source files (Excluded)
├── examples/ # Jupyter notebooks and scripts demonstrating usage (Excluded)
├── figs/ # Project figures and logos (Excluded)
├── finrl/ # **CORE SOURCE CODE PACKAGE** - The heart of the framework
│ ├── agents/ # **DRL Agents and Wrappers**: Integrates and adapts various DRL libraries (Stable-Baselines3, ElegantRL, RLlib) to the FinRL environment interface.
│ │ ├── elegantrl/ # Integration with ElegantRL DRL library
│ │ ├── portfolio_optimization/ # Specific agents for portfolio optimization tasks
│ │ ├── rllib/ # Integration with RLlib DRL library
│ │ └── stablebaselines3/ # Integration with Stable-Baselines3 DRL library
│ ├── applications/ # **Financial Application Templates**: Provides end-to-end examples and specific configurations for different financial tasks.
│ │ ├── cryptocurrency_trading/
│ │ ├── high_frequency_trading/
│ │ ├── portfolio_allocation/
│ │ └── stock_trading/ # Example implementations for stock trading, including ensemble methods
│ ├── meta/ # **Meta/Environment Components**: The infrastructure layer for data and environment modeling.
│ │ ├── data_processors/ # Data acquisition and feature engineering from various sources (Yahoo, Alpaca, etc.)
│ │ ├── env_*/ # Custom OpenAI Gym environments for different financial tasks (stock trading, crypto, portfolio)
│ │ ├── paper_trading/ # Real-time/paper trading integration (e.g., Alpaca)
│ │ └── preprocessor/ # Legacy/alternative data downloaders
│ ├── config.py # Global configuration constants (dates, indicators, model params)
│ ├── main.py # Main entry point for CLI (train, test, trade modes)
│ ├── train.py # Core DRL training workflow logic
│ ├── trade.py # Core trading workflow logic (backtesting/paper trading)
│ └── plot.py # Utility for plotting results and performance metrics
├── unit_tests/ # Unit tests (Excluded)
└── ... # Other configuration files (README, LICENSE, setup.py, etc.) (Excluded)
```
The structure is highly modular, with the `finrl` package acting as the primary container. The **`meta`** module handles the crucial task of transforming raw financial data into a standardized Reinforcement Learning problem (State, Action, Reward), while the **`agents`** module abstracts the complexity of different DRL algorithms. The top-level files (`main.py`, `train.py`, `trade.py`) serve as the **orchestration layer**, tying these components together to execute the full DRL pipeline. This design ensures that the core logic is separated from configuration, data handling, and algorithm implementation.
### 1.2. Core Folders for Analysis
* `/home/ubuntu/FinRL/finrl`: The root of the core package, containing entry points and global configurations.
* `/home/ubuntu/FinRL/finrl/agents`: The module responsible for integrating and wrapping various DRL libraries (Stable-Baselines3, ElegantRL, RLlib) into a unified `DRLAgent` interface.
* `/home/ubuntu/FinRL/finrl/meta`: The meta-module that provides the necessary infrastructure for DRL in finance, including data processing, custom Gym environments, and paper trading interfaces.
* `/home/ubuntu/FinRL/finrl/applications`: Contains application-specific, end-to-end examples and templates for different financial tasks.
## Phase 2: Module-by-Module Deep Analysis
## 1. Core/Entry Module (`finrl`)
**Module Core Responsibility**: This module serves as the **entry point** and **orchestrator** for the entire FinRL workflow. It defines global configurations and implements the high-level logic for the three main modes of operation: `train`, `test`, and `trade`.
**Key File Identification**:
* `config.py`: Defines all global constants, including data directories (`DATA_SAVE_DIR`), date ranges (`TRAIN_START_DATE`), technical indicators (`INDICATORS` - e.g., `macd`, `rsi_30`), and default DRL model hyperparameters (`A2C_PARAMS`, `PPO_PARAMS`, etc.). This file centralizes all experiment parameters.
* `main.py`: The command-line interface entry point. It parses the `--mode` argument (`train`, `test`, `trade`) and calls the corresponding function from `finrl.train`, `finrl.test`, or `finrl.trade`. It ensures necessary directories are created for saving data and models.
* `train.py`: Implements the DRL training pipeline. It orchestrates the data flow: `DataProcessor` -> `download_data` -> `clean_data` -> `add_technical_indicator` -> `df_to_array` -> `StockTradingEnv` configuration -> `DRLAgent` initialization and `train_model`. It supports conditional loading of agents from `elegantrl`, `rllib`, or `stable_baselines3`.
* `trade.py`: Implements the trading pipeline, supporting two sub-modes: `backtesting` (which delegates to `finrl.test`) and `paper_trading` (which uses the `AlpacaPaperTrading` class from the `meta` module).
## 2. Meta/Environment Module (`finrl/meta`)
**Module Core Responsibility**: This is the **infrastructure layer** that adapts financial data and tasks into the standard Reinforcement Learning paradigm (Gym environments). It handles data acquisition, feature engineering, and the definition of the trading environment's state, action, and reward space.
**Key File Identification**:
* `data_processor.py`: The main facade class, `DataProcessor`. It acts as a factory/wrapper for various data source-specific processors (e.g., `YahooFinanceProcessor`, `AlpacaProcessor`). It provides a unified interface for data downloading, cleaning, adding technical indicators, and converting the final DataFrame into the NumPy arrays (`price_array`, `tech_array`, `turbulence_array`) required by the Gym environments.
* `data_processors/processor_yahoofinance.py`: A concrete implementation of a data processor. It uses the `yfinance` library (and potentially Selenium for scraping) to fetch data and includes methods for data cleaning and feature engineering (e.g., adding the VIX index).
* `env_stock_trading/env_stocktrading.py`: The core custom Gym environment, `StockTradingEnv`.
* **State Space**: A 1D NumPy array representing `[cash, stock_price_1, ..., stock_price_N, stock_shares_1, ..., stock_shares_N, tech_indicator_1, ..., tech_indicator_M, turbulence]`.
* **Action Space**: A continuous `Box` space, where each element corresponds to the percentage of total assets to allocate to a stock (ranging from -1 to 1, representing sell/buy).
* **Reward Function**: The reward is the change in the total portfolio value (cash + stock holdings) between the current step and the previous step, scaled by `reward_scaling`.
* **Turbulence**: The environment incorporates a **turbulence index** (`risk_indicator_col`) to model market volatility. If turbulence exceeds a threshold, the agent is forced to liquidate all positions, a critical risk management mechanism.
## 3. Agent Module (`finrl/agents`)
**Module Core Responsibility**: This module provides the necessary **wrappers and interfaces** to seamlessly integrate popular DRL libraries (Stable-Baselines3, ElegantRL, RLlib) with the custom FinRL Gym environments. This abstracts the DRL implementation details from the main workflow.
**Key File Identification**:
* `stablebaselines3/models.py`: Defines the `DRLAgent` class, which wraps SB3 models (A2C, PPO, SAC, TD3, DDPG). It uses the **Adapter Pattern** to make SB3 algorithms conform to the FinRL training and prediction interface. The `DRL_prediction` method handles the testing/backtesting loop using the trained model on a vectorized environment (`DummyVecEnv`).
* `elegantrl/models.py`: Defines the `DRLAgent` class for ElegantRL integration. This wrapper is more tightly coupled with the environment's internal arrays (`price_array`, `tech_array`) as ElegantRL uses a custom `Config` object for environment and agent setup.
* `portfolio_optimization/algorithms.py`: Contains specific algorithms for portfolio optimization, demonstrating the framework's flexibility beyond standard stock trading.
## 4. Application Module (`finrl/applications`)
**Module Core Responsibility**: This module provides **ready-to-use, end-to-end examples** for various financial tasks. These files serve as templates and demonstrations, showing how to combine the `meta` (data/env) and `agents` (DRL models) modules to solve a specific problem.
**Key File Identification**:
* `stock_trading/ensemble_stock_trading.py`: A key example demonstrating the use of an **ensemble strategy** where multiple DRL agents (e.g., PPO, A2C, DDPG) are trained and their performance is validated to select the best one for trading. This highlights a key feature of the FinRL framework.
* Other files (e.g., `cryptocurrency_trading`, `portfolio_allocation`) provide specialized configurations and environment settings for those specific domains, showcasing the framework's adaptability.
### Module PlantUML Diagrams
@startuml Module_Meta
title FinRL Meta Module (Data and Environment)
package "finrl.meta" {
class DataProcessor {
- processor: AbstractProcessor
+ __init__(data_source, ...)
+ download_data(...)
+ clean_data(...)
+ add_technical_indicator(...)
+ add_turbulence(...)
+ add_vix(...)
+ df_to_array(...) : price_array, tech_array, turbulence_array
}
package "data_processors" {
interface AbstractProcessor {
+ download_data()
+ clean_data()
+ add_technical_indicator()
+ add_turbulence()
+ add_vix()
+ df_to_array()
}
class YahooFinanceProcessor
class AlpacaProcessor
class WrdsProcessor
}
package "env_stock_trading" {
class StockTradingEnv {
- df: DataFrame
- state: np.array
- day: int
- initial_amount: int
- asset_memory: list
+ __init__(...)
+ step(actions) : state, reward, done, info
+ reset() : state
+ _sell_stock(index, action)
+ _buy_stock(index, action)
+ get_sb_env() : DummyVecEnv
}
StockTradingEnv -up-|> gym.Env
}
package "paper_trading" {
class AlpacaPaperTrading {
- api_key
- api_secret
- model
+ run()
}
}
}
DataProcessor o-- AbstractProcessor : uses
YahooFinanceProcessor -up-|> AbstractProcessor
AlpacaProcessor -up-|> AbstractProcessor
WrdsProcessor -up-|> AbstractProcessor
StockTradingEnv ..> DataProcessor : receives arrays from df_to_array()
AlpacaPaperTrading ..> StockTradingEnv : uses for state/action logic
@enduml
@startuml Module_Agents
title FinRL Agents Module (DRL Wrappers)
package "finrl.agents" {
interface DRLAgentInterface {
+ get_model(model_name, ...)
+ train_model(model, ...)
+ DRL_prediction(model, environment)
}
package "stablebaselines3" {
class DRLAgent_SB3 {
- env: StockTradingEnv
+ get_model(model_name, ...)
+ train_model(model, ...)
+ DRL_prediction(model, environment)
}
class TensorboardCallback
}
package "elegantrl" {
class DRLAgent_ElegantRL {
- env_config
+ get_model(model_name, model_kwargs)
+ train_model(model, cwd, total_timesteps)
}
}
package "rllib" {
class DRLAgent_RLlib {
+ get_model(model_name)
+ train_model(model, ...)
}
}
}
DRLAgent_SB3 -up-|> DRLAgentInterface
DRLAgent_ElegantRL -up-|> DRLAgentInterface
DRLAgent_RLlib -up-|> DRLAgentInterface
DRLAgent_SB3 ..> TensorboardCallback : uses
DRLAgent_SB3 ..> StockTradingEnv : wraps/uses
DRLAgent_ElegantRL ..> StockTradingEnv : wraps/uses
@enduml
@startuml Module_Core
title FinRL Core Module (Orchestration)
package "finrl" {
class Config {
+ TRAIN_START_DATE
+ INDICATORS
+ PPO_PARAMS
+ ...
}
class Main {
+ main()
+ build_parser()
}
class Train {
+ train(...)
}
class Trade {
+ trade(...)
}
}
Main ..> Config : reads constants
Main ..> Train : calls train()
Main ..> Trade : calls trade()
Train ..> DataProcessor : uses for data prep
Train ..> StockTradingEnv : instantiates environment
Train ..> DRLAgentInterface : uses for model training
Trade ..> StockTradingEnv : instantiates environment
Trade ..> DRLAgentInterface : uses for prediction (backtesting)
Trade ..> AlpacaPaperTrading : uses for paper trading
@enduml
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
The FinRL framework is fundamentally built on the **Reinforcement Learning (RL) Paradigm** applied to quantitative finance, adhering closely to the **OpenAI Gym interface** for environment standardization.
**Core Abstractions**:
1. **Data Processor**: This serves as an abstraction layer over diverse financial data sources (Yahoo Finance, Alpaca, WRDS, etc.). It is responsible for standardizing raw data into a clean, feature-engineered format (DataFrame) suitable for the RL environment. This abstraction ensures the core DRL logic remains independent of the data source.
2. **Environment (`StockTradingEnv`)**: This is the central abstraction that models the financial market as a Markov Decision Process (MDP). It rigorously defines the three core components of the RL problem:
* **State**: The observation space, which includes cash, stock prices, stock shares, technical indicators, and the market turbulence index.
* **Action**: The action space, a continuous `Box` representing the normalized allocation of total assets to each stock (ranging from -1 for selling to 1 for buying).
* **Reward**: The immediate reward, calculated as the change in the total portfolio value (cash + stock holdings) between time steps.
3. **DRL Agent Wrapper (`DRLAgent`)**: This is a critical abstraction over different DRL libraries (Stable-Baselines3, ElegantRL, RLlib). It allows users to swap out the underlying DRL algorithm with minimal code changes, promoting modularity, experimentation, and comparison of different algorithms on the same financial task.
**Design Philosophy**:
* **Modularity and Extensibility**: The clear separation of concerns between Data (`DataProcessor`), Environment (`Env`), and Algorithm (`DRLAgent`) is the cornerstone of the design. This structure allows for easy extension: new data sources require only a new processor implementation, new financial tasks require a new Gym environment, and new DRL algorithms require a new `DRLAgent` wrapper.
* **Risk-Awareness**: The framework demonstrates a focus on real-world risk management by explicitly including a **turbulence index** in the state space. The environment's logic includes a mechanism for forced liquidation of all positions if market turbulence exceeds a predefined threshold, a crucial feature for financial stability.
* **Ensemble Learning Focus**: The design encourages the use of ensemble strategies, as evidenced by the application templates, to mitigate the high variance and improve the robustness of DRL models in volatile financial markets.
**Lifecycle Management**:
The lifecycle is managed by the core orchestration scripts (`main.py`, `train.py`, `trade.py`). The process flows from configuration (`config.py`) -> data preparation (`DataProcessor`) -> environment setup (`StockTradingEnv`) -> model training (`DRLAgent`) -> model persistence (saving trained models) -> and finally, deployment for backtesting or paper trading. This sequential, modular lifecycle ensures reproducibility and clear debugging paths.
#### 3.1.2. Component Interactions
The FinRL system follows a clear, sequential data flow, primarily orchestrated by the `train.py` and `trade.py` scripts, ensuring a structured pipeline from data to decision-making.
**1. Data Acquisition and Preprocessing**:
The process begins in `train.py` which calls the `DataProcessor` (from `finrl/meta/data_processor.py`). The `DataProcessor` acts as a facade, instantiating a source-specific processor (e.g., `YahooFinanceProcessor` in `finrl/meta/data_processors/processor_yahoofinance.py`). This processor fetches raw financial data, cleans it, adds technical indicators, and incorporates market volatility measures like the VIX index. The final output is a set of three NumPy arrays: `price_array`, `tech_array`, and `turbulence_array`, which are passed back to the core workflow.
**2. Environment Initialization**:
These NumPy arrays are used to configure and instantiate the custom Gym environment, typically `StockTradingEnv` (from `finrl/meta/env_stock_trading/env_stocktrading.py`). The environment uses these arrays to define its state space and to simulate the passage of time (days), making the financial market an accessible Markov Decision Process (MDP) for the DRL agent.
**3. Training Loop (Agent-Environment Interaction)**:
The `train.py` script initializes the appropriate `DRLAgent` wrapper (e.g., `DRLAgent_SB3` from `finrl/agents/stablebaselines3/models.py`) and calls its `train_model()` method.
* **Interaction**: The DRL model interacts with the `StockTradingEnv` by calling `env.step(action)`. The DRL model outputs an `action` (a normalized portfolio allocation vector).
* **Execution**: The `StockTradingEnv.step()` method executes the simulated trade, updates the portfolio state (`self.state`), calculates the `reward` (change in portfolio value), and returns the new state, reward, and terminal status to the DRL algorithm.
* Training results are logged via the `TensorboardCallback` for monitoring.
**4. Testing/Trading Loop**:
The `trade.py` or `test.py` scripts handle post-training execution.
* The trained model is loaded via `DRLAgent.DRL_prediction()`.
* The model predicts an action for each day in the test/trade period, and the environment is stepped through.
* For performance evaluation, the `asset_memory` and `actions_memory` are recorded.
* For **paper trading**, the `AlpacaPaperTrading` class in `finrl/meta/paper_trading/alpaca.py` continuously monitors the market and executes trades via the Alpaca API based on the DRL model's predictions, bridging the gap between simulation and real-world application.
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml FinRL_Architecture
title FinRL Overall Architecture
skinparam component {
BackgroundColor<<Core>> LightBlue
BorderColor<<Core>> Blue
BackgroundColor<<Meta>> LightGreen
BorderColor<<Meta>> Green
BackgroundColor<<Agent>> LightYellow
BorderColor<<Agent>> Orange
}
component [Config] <<Core>> as C
component [Main Entry Point] <<Core>> as M
component [Train Workflow] <<Core>> as T
component [Trade Workflow] <<Core>> as TR
package "finrl.meta" <<Meta>> {
component [DataProcessor] <<Meta>> as DP
component [Data Sources] <<Meta>> as DS
component [StockTradingEnv (Gym)] <<Meta>> as E
component [Paper Trading Interface] <<Meta>> as PT
}
package "finrl.agents" <<Agent>> {
component [DRLAgent (SB3, ERL, RLlib)] <<Agent>> as A
component [DRL Libraries] <<Agent>> as DRL
}
M --> C : Reads global parameters
M --> T : Calls train()
M --> TR : Calls trade()
T --> DP : 1. Initializes
DP --> DS : 2. Downloads & Preprocesses Data
DP --> T : 3. Returns price/tech/turbulence arrays
T --> E : 4. Instantiates Environment (with arrays)
T --> A : 5. Initializes DRL Agent (with Env)
A --> DRL : 6. Trains Model
TR --> E : Instantiates Environment
TR --> A : Loads Trained Model
TR --> PT : Executes Paper Trading (if mode=paper_trading)
E .right.> A : State/Action/Reward Loop (step())
A .left.> E : State/Action/Reward Loop (predict())
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
The FinRL codebase effectively utilizes several software design patterns to achieve its goals of modularity, extensibility, and separation of concerns.
1. **Adapter Pattern**
* **Description**: This pattern allows the interface of an existing class to be used as another interface. In FinRL, it is used to unify the interfaces of disparate DRL libraries.
* **Implementation**: The `DRLAgent` classes in `finrl/agents/stablebaselines3/models.py`, `finrl/agents/elegantrl/models.py`, and `finrl/agents/rllib/models.py` all conform to a common interface (`get_model`, `train_model`, `DRL_prediction`). Each class adapts the specific API calls of its underlying DRL library (SB3, ElegantRL, or RLlib) to this single, unified interface, allowing the core `train.py` script to treat them interchangeably.
2. **Factory Method Pattern (Implicit)**
* **Description**: This pattern provides an interface for creating objects in a superclass, but allows subclasses to alter the type of objects that will be created.
* **Implementation**: The `DataProcessor` class in `finrl/meta/data_processor.py` acts as a simple factory. Based on the `data_source` string passed to its constructor (e.g., `"alpaca"`, `"yahoofinance"`), it dynamically instantiates the correct concrete data processor object (e.g., `AlpacaProcessor`, `YahooFinanceProcessor`).
* **Code Example (from `data_processor.py`)**:
```python
class DataProcessor:
def __init__(self, data_source, ...):
if data_source == "alpaca":
self.processor = Alpaca(...)
elif data_source == "yahoofinance":
self.processor = YahooFinance()
# ... other data sources
```
3. **Strategy Pattern**
* **Description**: This pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable. Strategy lets the algorithm vary independently from the clients that use it.
* **Implementation**: The overall training workflow in `train.py` allows the user to select a "strategy" (the DRL algorithm, e.g., PPO, SAC, DDPG) and the DRL library (e.g., `stable_baselines3`, `elegantrl`) at runtime. The `train` function then dynamically loads and uses the corresponding `DRLAgent` and DRL model based on these parameters, enabling easy comparison of different trading strategies.
#### 3.3.2. Project Highlights
The FinRL framework includes several innovative features and design choices that enhance its utility and flexibility for financial reinforcement learning:
* **Unified DRL Framework**: FinRL provides a single, consistent API that abstracts away the differences between multiple state-of-the-art DRL libraries, including Stable-Baselines3, ElegantRL, and RLlib. This allows researchers and practitioners to easily switch between and compare algorithms (e.g., PPO, SAC, DDPG) without modifying the core data or environment logic.
* **Financial Market Modeling with Risk Awareness**: The custom Gym environments, such as `StockTradingEnv`, are specifically tailored for finance. They incorporate essential real-world elements like **transaction costs** (`buy_cost_pct`, `sell_cost_pct`) and, critically, a **turbulence index**. This index is used to model market volatility, and the environment enforces a **risk-management mechanism** (forced liquidation) when turbulence exceeds a threshold, making the simulation more realistic and risk-aware.
* **Data Source Agnosticism**: Through the `DataProcessor` abstraction, the framework achieves a high degree of data source agnosticism. The same DRL pipeline can be run on data from various providers (Yahoo Finance, Alpaca, WRDS, etc.) by simply changing a configuration parameter, significantly reducing the effort required for data integration.
* **Real-World Readiness and Paper Trading**: The inclusion of a dedicated `trade.py` module with the `AlpacaPaperTrading` class provides a direct and seamless path from backtesting to live paper trading. This feature is a major highlight, enabling users to test their trained agents in a simulated live market environment before committing real capital.
* **Ensemble Learning Support**: The framework is explicitly designed to facilitate the training and validation of multiple agents, supporting robust **ensemble strategies** (as demonstrated in `ensemble_stock_trading.py`). This is a key feature for improving the stability and performance of DRL models in the highly stochastic financial domain.
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
The FinRL framework is robust, but several areas can be optimized to improve performance, maintainability, and flexibility:
1. **Environment Performance and Vectorization**:
* **Issue**: The core `StockTradingEnv` in `env_stocktrading.py` is implemented using standard Python/Pandas/NumPy logic, which can be slow for high-frequency or large-scale backtesting due to Python's overhead in the simulation loop.
* **Suggestion**: Implement a fully **vectorized environment** for training. This involves processing all time steps for all assets in parallel using NumPy or a library like JAX/PyTorch, drastically reducing the number of Python function calls and improving training speed. The current `DummyVecEnv` wrapper only vectorizes the environment interface, not the internal simulation logic.
2. **Data Acquisition Reliability and Brittle Code**:
* **Issue**: The `YahooFinanceProcessor` shows a mix of `yfinance` library usage and brittle web scraping techniques (Selenium/BeautifulSoup) for data acquisition. Web scraping is highly susceptible to breaking when the target website's structure changes.
* **Suggestion**: Standardize data acquisition to rely solely on stable, official APIs (like Alpaca, which is already integrated) or robust data providers. Remove the reliance on Selenium/scraping to ensure long-term stability and maintainability of the data pipeline.
3. **Configuration Management Modernization**:
* **Issue**: The use of global constants in `config.py` is simple but limits the flexibility required for complex, reproducible experiments. Modifying a global constant affects all parts of the code.
* **Suggestion**: Adopt a modern configuration management library like **Hydra** or use **Pydantic Settings**. This would allow for structured, hierarchical configuration files (YAML/JSON), easy command-line overrides, and better separation of configuration from the core codebase, making experiment tracking and parameter tuning more robust.
4. **Code Quality and Documentation**:
* **Issue**: While type hints are present, the documentation, particularly docstrings for the core `DRLAgent` methods and environment parameters, could be more comprehensive.
* **Suggestion**: Enforce a documentation standard (e.g., NumPy or Google style docstrings) for all public methods and classes. This will significantly improve code clarity and reduce the learning curve for secondary developers.
#### 3.4.2. Secondary Development Guide
The FinRL framework is designed for extensibility, making secondary development straightforward by focusing on the three core modular components: Data, Environment, and Agent.
1. **Start with `config.py`**:
* The first step for any new experiment is to define the scope by modifying the global constants in `finrl/config.py`. This includes setting the `TRAIN_START_DATE`, `TRAIN_END_DATE`, the list of `INDICATORS`, and the hyperparameters for the DRL models (e.g., `PPO_PARAMS`).
2. **Define the Task (Environment)**:
* For standard tasks (stock trading, crypto), use the existing environments in `finrl/meta/env_stock_trading`.
* To create a new financial task (e.g., options trading, futures), create a new custom Gym environment class that inherits from `gym.Env` and defines the unique state, action, and reward mechanisms specific to that task. Ensure the `step()` method correctly calculates the reward and updates the state based on the action.
3. **Prepare Data (DataProcessor)**:
* If your data source is supported (Yahoo, Alpaca, etc.), use the existing `DataProcessor` facade.
* To integrate a new data source, create a new `processor_yourname.py` file in `finrl/meta/data_processors`. This new class must implement the required methods: `download_data`, `clean_data`, `add_technical_indicator`, and crucially, `df_to_array` to convert the data into the NumPy arrays expected by the environment.
4. **Select/Implement Agent**:
* Choose a DRL library (Stable-Baselines3 is recommended for its comprehensive documentation). The `DRLAgent` wrappers handle the integration.
* To add a new DRL algorithm not currently supported, extend the appropriate `DRLAgent` class in `finrl/agents` and implement the `get_model`, `train_model`, and `DRL_prediction` methods to wrap the new algorithm's API.
5. **Execute via `main.py`**:
* Use the command-line interface (`python main.py --mode=train`) to execute the workflow. The orchestration logic in `main.py`, `train.py`, and `trade.py` will handle the rest, ensuring the data, environment, and agent are correctly linked.
This modular approach ensures that developers can focus on one component at a time without needing to rewrite the entire pipeline.
================================================
FILE: thirdparty/FinRobot.md
================================================
# FinRobot - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
```
```
### 1.2. Core Folders for Analysis
## Phase 2: Module-by-Module Deep Analysis
### Module PlantUML Diagrams
## Phase 3: Overall Architecture & Summary
### 3.1. Overall Architecture Analysis
#### 3.1.1. Core Abstractions
#### 3.1.2. Component Interactions
### 3.2. Overall Architecture PlantUML Diagram
```plantuml
@startuml
@enduml
```
### 3.3. Design Patterns & Highlights
#### 3.3.1. Design Patterns
#### 3.3.2. Project Highlights
### 3.4. Summary & Recommendations
#### 3.4.1. Potential Improvements
#### 3.4.2. Secondary Development Guide
================================================
FILE: thirdparty/FinceptTerminal.md
================================================
# FinceptTerminal - In-Depth Source Code Analysis
## Phase 1: Global Scan & Planning
### 1.1. Full Directory Structure
```
/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal (Root of the project)
|-- .github (Configuration for GitHub workflows and templates)
|-- docs (Project documentation, likely Docusaurus or similar)
|-- fincept-terminal-desktop (The main application source code)
| |-- public (Static assets for the frontend)
| |-- src-tauri (Rust backend code for Tauri)
| | |-- src (Core Rust source files)
| | | |-- commands (Tauri commands for data fetching and utilities, over 30 data sources)
| | | |-- data_sources (Rust-side data source implementations)
| | | |-- utils (Utility functions, notably the Python execution bridge)
| | | |-- lib.rs (Main Rust library, process management, IPC setup)
| | | |-- main.rs (Tauri entry point)
| |-- src (TypeScript/React frontend code)
| | |-- assets (Frontend static assets)
| | |-- components (Reusable UI components)
| | | |-- tabs (Major feature views like data-mapping, trading, portfolio, node-editor)
| | | |-- ui (Design system components)
| | |-- constants (Application-wide configuration values)
| | |-- contexts (React Context providers for global state)
| | |-- hooks (Custom React hooks for logic reuse)
| | |-- lib (Frontend utility functions)
| | |-- services (Core business logic and data orchestration)
| | | |-- backtesting (Logic for backtesting strategies)
| | | |-- websocket (Real-time data handling)
| | | |-- trading (Order management logic)
| | |-- stockBrokers (Brokerage API integration adapters, e.g., ZerodhaKite)
| | |-- types (TypeScript interfaces and type definitions)
| | |-- App.tsx (Main React application component)
|-- images (Marketing and documentation images)
```

The project structure clearly delineates the **Hybrid Architecture**. The `fincept-terminal-desktop` directory houses the core application, split into the `src-tauri` (Rust backend) and `src` (React/TypeScript frontend) folders. This separation of concerns is fundamental, with the Rust layer managing system-level tasks and the Python bridge, while the TypeScript layer handles the rich user interface and business logic via services. The extensive `commands` directory in the Rust backend highlights the project's focus on being a comprehensive financial data aggregator.
### 1.2. Core Folders for Analysis
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src-tauri/src`: The core Rust backend, handling IPC, process management, and data source delegation.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/components`: The React frontend's presentation layer, including all UI elements and feature views.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/services`: The frontend's business logic layer, containing core features like workflow management, backtesting, and trading.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/stockBrokers`: Brokerage integration adapters, implementing the Adapter Pattern for trading.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/types`: Shared TypeScript interfaces and type definitions for application-wide data structures.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/constants`: Application-wide configuration values and magic strings.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/contexts`: React Context providers for global state management.
* `/home/ubuntu/FinnewsHunter/thirdparty/FinceptTerminal/fincept-terminal-desktop/src/hooks`: Custom React hooks for logic reuse across components.
## Phase 2: Module-by-Module Deep Analysis
## Module 1: `src-tauri/src` (Rust Backend)
**Core Responsibility:** The Rust backend, built with Tauri, serves as the **core application logic and data gateway**. Its primary function is to manage system-level interactions, handle inter-process communication (IPC) with the frontend, and act as a secure, performant bridge to various external data sources and computational backends (like Python). It is responsible for managing the lifecycle of external processes, such as the MCP (Model Context Protocol) server.
**Key Files and Functions:**
* `lib.rs`: Defines the core state management (`MCPState` and `MCPProcess`) for external processes.
* `commands/mod.rs`: The central registry for all Tauri commands, revealing **extensive data source integration** (e.g., `yfinance`, `polygon`, `fred`, `worldbank`).
* `utils/python.rs`: A critical file that implements the logic to locate and execute the Python interpreter across different operating systems, confirming that the Rust backend delegates data fetching and heavy computation to Python scripts.
**Core Implementation & Dependencies:** The module uses Rust and Tauri, relying on `std::sync::{Arc, Mutex}` for safe, concurrent management of external processes. Tauri's `#[tauri::command]` macro is used extensively to expose Rust functions to the TypeScript/React frontend.
## Module 2: `src/components` (Frontend UI Components)
**Core Responsibility:** This module contains the React/TypeScript components that form the user interface, responsible for visual presentation and user interaction.
**Key Files and Functions (Inferred from Directory Structure):**
* `components/tabs/*`: Contains the main feature views, such as `data-mapping`, `equity-research`, `node-editor`, `portfolio`, and `trading`, indicating a highly modular, tab-based application structure.
* `components/charts`: Dedicated components for financial data visualization.
**Core Implementation & Dependencies:** Built with TypeScript and React, the components rely on the Tauri API (`@tauri-apps/api`) to call the Rust commands for data and system interaction.
## Module 3: `src/services` (Frontend Business Logic)
**Core Responsibility:** This module encapsulates the complex business logic and data orchestration for the frontend, separating it from the presentation layer.
**Key Files and Functions:**
* `workflowService.ts`: Manages the creation, storage, execution, and state of user-defined **workflows**, suggesting a core feature is a visual programming or automation tool.
* `services/backtesting`: Contains logic for financial backtesting, likely integrating with Python libraries like `vectorbt` or `lean`.
* `services/websocket`: Handles real-time data streaming, essential for a financial terminal.
**Core Implementation & Dependencies:** This module implements the **Service Layer** pattern and depends on the Tauri IPC layer to communicate with the Rust backend for data and process control.
## Module 4: `src/stockBrokers` (Brokerage Integration)
**Core Responsibility:** Provides a standardized interface for connecting to and interacting with various stock brokerage APIs.
**Key Files and Functions:**
* `stockBrokers/india/zerodhaKite`: A concrete implementation for a specific Indian brokerage, indicating a focus on the Indian market or a modular design for regional expansion.
**Core Implementation & Dependencies:** The module likely uses the **Adapter Pattern** to normalize the different brokerage APIs into a single interface used by the `trading` service.
## Module 5: `src/types` (Shared Data Structures)
**Core Responsibility:** Defines the core TypeScript data structures and interfaces used across the entire frontend application, ensuring type safety and consistency. This adheres to the **Single Source of Truth** principle for data types.
## Module 6: `src/lib`, `src/hooks`, `src/constants`, `src/contexts` (Utilities and State)
**Core Responsibility:** Contains common utilities, custom React hooks, application-wide constants, and React context providers for global state. This module uses the **Context Pattern** for dependency injection and state management throughout the frontend.
**Conclusion:** The project is a **hybrid desktop application** built with **Tauri (Rust) and React/TypeScript**. The Rust backend acts as a secure data API gateway, leveraging Python for data fetching, while the React frontend provides a rich, modular, tab-based user interface with core features like **workflow automation**, **backtesting**, and **brokerage integration**.
### Module PlantUML Diagrams
# Rust Backend Module (`src-tauri/src`)
@startuml
title Rust Backend Module (`src-tauri/src`)
package "Core Logic" {
class AppHandle
class MCPState {
- processes: Mutex<HashMap<String, MCPProcess>>
}
class MCPProcess {
- child: Child
- stdin: Arc<Mutex<ChildStdin>>
- response_rx: Receiver<String>
}
class SpawnResult
interface TauriCommand
}
package "Utilities" {
class PythonUtils {
+ get_python_path(app: &AppHandle)
+ execute_python_command(...)
}
}
package "Commands" {
class YFinanceCommand <<TauriCommand>>
class PolygonCommand <