Showing preview only (608K chars total). Download the full file or copy to clipboard to get everything.
Repository: stanford-oval/storm
Branch: main
Commit: fb951af7744d
Files: 66
Total size: 581.6 KB
Directory structure:
gitextract_m_9bgqy0/
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ └── bug_report.md
│ └── workflows/
│ ├── format-check.yml
│ └── python-package.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── examples/
│ ├── costorm_examples/
│ │ └── run_costorm_gpt.py
│ └── storm_examples/
│ ├── README.md
│ ├── helper/
│ │ └── process_kaggle_arxiv_abstract_dataset.py
│ ├── run_storm_wiki_claude.py
│ ├── run_storm_wiki_deepseek.py
│ ├── run_storm_wiki_gemini.py
│ ├── run_storm_wiki_gpt.py
│ ├── run_storm_wiki_gpt_with_VectorRM.py
│ ├── run_storm_wiki_groq.py
│ ├── run_storm_wiki_mistral.py
│ ├── run_storm_wiki_ollama.py
│ ├── run_storm_wiki_ollama_with_searxng.py
│ └── run_storm_wiki_serper.py
├── frontend/
│ └── demo_light/
│ ├── .streamlit/
│ │ └── config.toml
│ ├── README.md
│ ├── demo_util.py
│ ├── pages_util/
│ │ ├── CreateNewArticle.py
│ │ └── MyArticles.py
│ ├── requirements.txt
│ ├── stoc.py
│ └── storm.py
├── knowledge_storm/
│ ├── __init__.py
│ ├── collaborative_storm/
│ │ ├── __init__.py
│ │ ├── engine.py
│ │ └── modules/
│ │ ├── __init__.py
│ │ ├── article_generation.py
│ │ ├── callback.py
│ │ ├── co_storm_agents.py
│ │ ├── collaborative_storm_utils.py
│ │ ├── costorm_expert_utterance_generator.py
│ │ ├── expert_generation.py
│ │ ├── grounded_question_answering.py
│ │ ├── grounded_question_generation.py
│ │ ├── information_insertion_module.py
│ │ ├── knowledge_base_summary.py
│ │ ├── simulate_user.py
│ │ └── warmstart_hierarchical_chat.py
│ ├── dataclass.py
│ ├── encoder.py
│ ├── interface.py
│ ├── lm.py
│ ├── logging_wrapper.py
│ ├── rm.py
│ ├── storm_wiki/
│ │ ├── __init__.py
│ │ ├── engine.py
│ │ └── modules/
│ │ ├── __init__.py
│ │ ├── article_generation.py
│ │ ├── article_polish.py
│ │ ├── callback.py
│ │ ├── knowledge_curation.py
│ │ ├── outline_generation.py
│ │ ├── persona_generator.py
│ │ ├── retriever.py
│ │ └── storm_dataclass.py
│ └── utils.py
├── requirements.txt
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
title: "[BUG]"
labels: ''
assignees: ''
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Report following things
1. Input topic name
2. All output files generated for this topic as a zip file.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Environment:**
- OS: [e.g. iOS, Windows]
- Browser [e.g. chrome, safari] if the bug report is UI problem
================================================
FILE: .github/workflows/format-check.yml
================================================
name: Check Python formatting with Black
on:
pull_request:
branches:
- main
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: psf/black@stable
with:
black_args: "knowledge_storm --check"
================================================
FILE: .github/workflows/python-package.yml
================================================
name: Build and upload Python package
on:
workflow_dispatch: # Allows manual triggering of the workflow
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@master
- name: Set up Python 3.11
uses: actions/setup-python@v3
with:
python-version: "3.11"
- name: Compare versions in setup.py and knowledge_storm/__init__.py
run: |
VERSION_SETUP=$(grep -oP '(?<=version=\").*(?=\")' setup.py)
VERSION_INIT=$(grep -oP '(?<=__version__ = \").*(?=\")' knowledge_storm/__init__.py)
echo "Version in setup.py: $VERSION_SETUP"
echo "Version in __init__.py: $VERSION_INIT"
if [ "$VERSION_SETUP" != "$VERSION_INIT" ]; then
echo "Error: Version mismatch between setup.py ($VERSION_SETUP) and knowledge_storm/__init__.py ($VERSION_INIT)"
exit 1
fi
shell: bash
- name: Install dependencies
run: python3 -m pip install setuptools wheel twine
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip setuptools wheel
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Build a binary wheel
run: python3 setup.py sdist bdist_wheel
- name: Publish package to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# mac
.DS_Store
# Other
.vscode
*.tsv
*.pt
gpt*.txt
*.env
local/
local_*
build/
*.egg-info/
.idea
.venv
# Project-specific
secrets.toml
*.log
*/assertion.log
*results/
.venv/
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/psf/black
rev: 24.8.0
hooks:
- id: black
name: Format Python code with black
entry: black
args: ["knowledge_storm/"]
language: python
pass_filenames: true
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing
Thank you for your interest in contributing to STORM!
Contributions aren't just about code. Currently (last edit: 7/22/2024), we are accepting the following forms of contribution:
- Pull requests for additional language model support to `knowledge_storm/lm.py`.
- Pull requests for additional retrieval model/search engine support to `knowledge_storm/rm.py`.
- Pull requests for new features to `frontend/demo_light` to assist other developers.
- Identification and reporting of issues or bugs.
- Helping each other by responding to issues.
Please note that we are not accepting code refactoring PRs at this time to avoid conflicts with our team's efforts.
## Development
This section contains technical instructions & hints for contributors.
### Setting up
1. Fork this repository and clone your forked repository.
2. Install the required packages:
```
conda create -n storm python=3.11
conda activate storm
pip install -r requirements.txt
```
3. If you want to contribute to `frontend/demo_light`, follow its [Setup guide](https://github.com/stanford-oval/storm/tree/main/frontend/demo_light#setup) to install additional packages.
### PR suggestions
Following the suggested format can lead to a faster review process.
**Title:**
[New LM/New RM/Demo Enhancement] xxx
**Description:**
- For new language model support, (1) describe how to use the new LM class, (2) create an example script following the style of existing example scripts under `examples/`, (3) attach an input-output example of the example script.
- For new retrieval model/search engine support, (1) describe how to use the new RM class and (2) attach input-output examples of the RM class.
- For demo light enhancements, (1) describe what's new and (2) attach screenshots to demonstrate the UI change.
- Please clearly describe the required API keys and provide instructions on how to get them (if applicable). This project manages API key with `secrets.toml`.
**Code Format:**
We adopt [`black`](https://github.com/psf/black) for arranging and formatting Python code. To streamline the contribution process, we set up a [pre-commit hook](https://pre-commit.com/) to format the code under `knowledge_storm/` before committing. To install the pre-commit hook, run:
```
pip install pre-commit
pre-commit install
```
The hook will automatically format the code before each commit.
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2024 Stanford Open Virtual Assistant Lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: MANIFEST.in
================================================
include requirements.txt
include LICENSE
include README.md
================================================
FILE: README.md
================================================
<p align="center">
<img src="assets/logo.svg" style="width: 25%; height: auto;">
</p>
# STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
<p align="center">
| <a href="http://storm.genie.stanford.edu"><b>Research preview</b></a> | <a href="https://arxiv.org/abs/2402.14207"><b>STORM Paper</b></a>| <a href="https://www.arxiv.org/abs/2408.15232"><b>Co-STORM Paper</b></a> | <a href="https://storm-project.stanford.edu/"><b>Website</b></a> |
</p>
**Latest News** 🔥
- [2025/01] We add [litellm](https://github.com/BerriAI/litellm) integration for language models and embedding models in `knowledge-storm` v1.1.0.
- [2024/09] Co-STORM codebase is now released and integrated into `knowledge-storm` python package v1.0.0. Run `pip install knowledge-storm --upgrade` to check it out.
- [2024/09] We introduce collaborative STORM (Co-STORM) to support human-AI collaborative knowledge curation! [Co-STORM Paper](https://www.arxiv.org/abs/2408.15232) has been accepted to EMNLP 2024 main conference.
- [2024/07] You can now install our package with `pip install knowledge-storm`!
- [2024/07] We add `VectorRM` to support grounding on user-provided documents, complementing existing support of search engines (`YouRM`, `BingSearch`). (check out [#58](https://github.com/stanford-oval/storm/pull/58))
- [2024/07] We release demo light for developers a minimal user interface built with streamlit framework in Python, handy for local development and demo hosting (checkout [#54](https://github.com/stanford-oval/storm/pull/54))
- [2024/06] We will present STORM at NAACL 2024! Find us at Poster Session 2 on June 17 or check our [presentation material](assets/storm_naacl2024_slides.pdf).
- [2024/05] We add Bing Search support in [rm.py](knowledge_storm/rm.py). Test STORM with `GPT-4o` - we now configure the article generation part in our demo using `GPT-4o` model.
- [2024/04] We release refactored version of STORM codebase! We define [interface](knowledge_storm/interface.py) for STORM pipeline and reimplement STORM-wiki (check out [`src/storm_wiki`](knowledge_storm/storm_wiki)) to demonstrate how to instantiate the pipeline. We provide API to support customization of different language models and retrieval/search integration.
[](https://github.com/psf/black)
## Overview [(Try STORM now!)](https://storm.genie.stanford.edu/)
<p align="center">
<img src="assets/overview.svg" style="width: 90%; height: auto;">
</p>
STORM is a LLM system that writes Wikipedia-like articles from scratch based on Internet search. Co-STORM further enhanced its feature by enabling human to collaborative LLM system to support more aligned and preferred information seeking and knowledge curation.
While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage.
**More than 70,000 people have tried our [live research preview](https://storm.genie.stanford.edu/). Try it out to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**
## How STORM & Co-STORM works
### STORM
STORM breaks down generating long articles with citations into two steps:
1. **Pre-writing stage**: The system conducts Internet-based research to collect references and generates an outline.
2. **Writing stage**: The system uses the outline and references to generate the full-length article with citations.
<p align="center">
<img src="assets/two_stages.jpg" style="width: 60%; height: auto;">
</p>
STORM identifies the core of automating the research process as automatically coming up with good questions to ask. Directly prompting the language model to ask questions does not work well. To improve the depth and breadth of the questions, STORM adopts two strategies:
1. **Perspective-Guided Question Asking**: Given the input topic, STORM discovers different perspectives by surveying existing articles from similar topics and uses them to control the question-asking process.
2. **Simulated Conversation**: STORM simulates a conversation between a Wikipedia writer and a topic expert grounded in Internet sources to enable the language model to update its understanding of the topic and ask follow-up questions.
### CO-STORM
Co-STORM proposes **a collaborative discourse protocol** which implements a turn management policy to support smooth collaboration among
- **Co-STORM LLM experts**: This type of agent generates answers grounded on external knowledge sources and/or raises follow-up questions based on the discourse history.
- **Moderator**: This agent generates thought-provoking questions inspired by information discovered by the retriever but not directly used in previous turns. Question generation can also be grounded!
- **Human user**: The human user will take the initiative to either (1) observe the discourse to gain deeper understanding of the topic, or (2) actively engage in the conversation by injecting utterances to steer the discussion focus.
<p align="center">
<img src="assets/co-storm-workflow.jpg" style="width: 60%; height: auto;">
</p>
Co-STORM also maintains a dynamic updated **mind map**, which organize collected information into a hierarchical concept structure, aiming to **build a shared conceptual space between the human user and the system**. The mind map has been proven to help reduce the mental load when the discourse goes long and in-depth.
Both STORM and Co-STORM are implemented in a highly modular way using [dspy](https://github.com/stanfordnlp/dspy).
## Installation
To install the knowledge storm library, use `pip install knowledge-storm`.
You could also install the source code which allows you to modify the behavior of STORM engine directly.
1. Clone the git repository.
```shell
git clone https://github.com/stanford-oval/storm.git
cd storm
```
2. Install the required packages.
```shell
conda create -n storm python=3.11
conda activate storm
pip install -r requirements.txt
```
## API
Currently, our package support:
- Language model components: All language models supported by litellm as listed [here](https://docs.litellm.ai/docs/providers)
- Embedding model components: All embedding models supported by litellm as listed [here](https://docs.litellm.ai/docs/embedding/supported_embedding)
- retrieval module components: `YouRM`, `BingSearch`, `VectorRM`, `SerperRM`, `BraveRM`, `SearXNG`, `DuckDuckGoSearchRM`, `TavilySearchRM`, `GoogleSearch`, and `AzureAISearch` as
:star2: **PRs for integrating more search engines/retrievers into [knowledge_storm/rm.py](knowledge_storm/rm.py) are highly appreciated!**
Both STORM and Co-STORM are working in the information curation layer, you need to set up the information retrieval module and language model module to create their `Runner` classes respectively.
### STORM
The STORM knowledge curation engine is defined as a simple Python `STORMWikiRunner` class. Here is an example of using You.com search engine and OpenAI models.
```python
import os
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.lm import LitellmModel
from knowledge_storm.rm import YouRM
lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
'api_key': os.getenv("OPENAI_API_KEY"),
'temperature': 1.0,
'top_p': 0.9,
}
# STORM is a LM system so different components can be powered by different models to reach a good balance between cost and quality.
# For a good practice, choose a cheaper/faster model for `conv_simulator_lm` which is used to split queries, synthesize answers in the conversation.
# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.
gpt_35 = LitellmModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
gpt_4 = LitellmModel(model='gpt-4o', max_tokens=3000, **openai_kwargs)
lm_configs.set_conv_simulator_lm(gpt_35)
lm_configs.set_question_asker_lm(gpt_35)
lm_configs.set_outline_gen_lm(gpt_4)
lm_configs.set_article_gen_lm(gpt_4)
lm_configs.set_article_polish_lm(gpt_4)
# Check out the STORMWikiRunnerArguments class for more configurations.
engine_args = STORMWikiRunnerArguments(...)
rm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
```
The `STORMWikiRunner` instance can be evoked with the simple `run` method:
```python
topic = input('Topic: ')
runner.run(
topic=topic,
do_research=True,
do_generate_outline=True,
do_generate_article=True,
do_polish_article=True,
)
runner.post_run()
runner.summary()
```
- `do_research`: if True, simulate conversations with difference perspectives to collect information about the topic; otherwise, load the results.
- `do_generate_outline`: if True, generate an outline for the topic; otherwise, load the results.
- `do_generate_article`: if True, generate an article for the topic based on the outline and the collected information; otherwise, load the results.
- `do_polish_article`: if True, polish the article by adding a summarization section and (optionally) removing duplicate content; otherwise, load the results.
### Co-STORM
The Co-STORM knowledge curation engine is defined as a simple Python `CoStormRunner` class. Here is an example of using Bing search engine and OpenAI models.
```python
from knowledge_storm.collaborative_storm.engine import CollaborativeStormLMConfigs, RunnerArgument, CoStormRunner
from knowledge_storm.lm import LitellmModel
from knowledge_storm.logging_wrapper import LoggingWrapper
from knowledge_storm.rm import BingSearch
# Co-STORM adopts the same multi LM system paradigm as STORM
lm_config: CollaborativeStormLMConfigs = CollaborativeStormLMConfigs()
openai_kwargs = {
"api_key": os.getenv("OPENAI_API_KEY"),
"api_provider": "openai",
"temperature": 1.0,
"top_p": 0.9,
"api_base": None,
}
question_answering_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=1000, **openai_kwargs)
discourse_manage_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=500, **openai_kwargs)
utterance_polishing_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=2000, **openai_kwargs)
warmstart_outline_gen_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=500, **openai_kwargs)
question_asking_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=300, **openai_kwargs)
knowledge_base_lm = LitellmModel(model=gpt_4o_model_name, max_tokens=1000, **openai_kwargs)
lm_config.set_question_answering_lm(question_answering_lm)
lm_config.set_discourse_manage_lm(discourse_manage_lm)
lm_config.set_utterance_polishing_lm(utterance_polishing_lm)
lm_config.set_warmstart_outline_gen_lm(warmstart_outline_gen_lm)
lm_config.set_question_asking_lm(question_asking_lm)
lm_config.set_knowledge_base_lm(knowledge_base_lm)
# Check out the Co-STORM's RunnerArguments class for more configurations.
topic = input('Topic: ')
runner_argument = RunnerArgument(topic=topic, ...)
logging_wrapper = LoggingWrapper(lm_config)
bing_rm = BingSearch(bing_search_api_key=os.environ.get("BING_SEARCH_API_KEY"),
k=runner_argument.retrieve_top_k)
costorm_runner = CoStormRunner(lm_config=lm_config,
runner_argument=runner_argument,
logging_wrapper=logging_wrapper,
rm=bing_rm)
```
The `CoStormRunner` instance can be evoked with the `warmstart()` and `step(...)` methods.
```python
# Warm start the system to build shared conceptual space between Co-STORM and users
costorm_runner.warm_start()
# Step through the collaborative discourse
# Run either of the code snippets below in any order, as many times as you'd like
# To observe the conversation:
conv_turn = costorm_runner.step()
# To inject your utterance to actively steer the conversation:
costorm_runner.step(user_utterance="YOUR UTTERANCE HERE")
# Generate report based on the collaborative discourse
costorm_runner.knowledge_base.reorganize()
article = costorm_runner.generate_report()
print(article)
```
## Quick Start with Example Scripts
We provide scripts in our [examples folder](examples) as a quick start to run STORM and Co-STORM with different configurations.
We suggest using `secrets.toml` to set up the API keys. Create a file `secrets.toml` under the root directory and add the following content:
```shell
# ============ language model configurations ============
# Set up OpenAI API key.
OPENAI_API_KEY="your_openai_api_key"
# If you are using the API service provided by OpenAI, include the following line:
OPENAI_API_TYPE="openai"
# If you are using the API service provided by Microsoft Azure, include the following lines:
OPENAI_API_TYPE="azure"
AZURE_API_BASE="your_azure_api_base_url"
AZURE_API_VERSION="your_azure_api_version"
# ============ retriever configurations ============
BING_SEARCH_API_KEY="your_bing_search_api_key" # if using bing search
# ============ encoder configurations ============
ENCODER_API_TYPE="openai" # if using openai encoder
```
### STORM examples
**To run STORM with `gpt` family models with default configurations:**
Run the following command.
```bash
python examples/storm_examples/run_storm_wiki_gpt.py \
--output-dir $OUTPUT_DIR \
--retriever bing \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
**To run STORM using your favorite language models or grounding on your own corpus:** Check out [examples/storm_examples/README.md](examples/storm_examples/README.md).
### Co-STORM examples
To run Co-STORM with `gpt` family models with default configurations,
1. Add `BING_SEARCH_API_KEY="xxx"` and `ENCODER_API_TYPE="xxx"` to `secrets.toml`
2. Run the following command
```bash
python examples/costorm_examples/run_costorm_gpt.py \
--output-dir $OUTPUT_DIR \
--retriever bing
```
## Customization of the Pipeline
### STORM
If you have installed the source code, you can customize STORM based on your own use case. STORM engine consists of 4 modules:
1. Knowledge Curation Module: Collects a broad coverage of information about the given topic.
2. Outline Generation Module: Organizes the collected information by generating a hierarchical outline for the curated knowledge.
3. Article Generation Module: Populates the generated outline with the collected information.
4. Article Polishing Module: Refines and enhances the written article for better presentation.
The interface for each module is defined in `knowledge_storm/interface.py`, while their implementations are instantiated in `knowledge_storm/storm_wiki/modules/*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).
### Co-STORM
If you have installed the source code, you can customize Co-STORM based on your own use case
1. Co-STORM introduces multiple LLM agent types (i.e. Co-STORM experts and Moderator). LLM agent interface is defined in `knowledge_storm/interface.py` , while its implementation is instantiated in `knowledge_storm/collaborative_storm/modules/co_storm_agents.py`. Different LLM agent policies can be customized.
2. Co-STORM introduces a collaborative discourse protocol, with its core function centered on turn policy management. We provide an example implementation of turn policy management through `DiscourseManager` in `knowledge_storm/collaborative_storm/engine.py`. It can be customized and further improved.
## Datasets
To facilitate the study of automatic knowledge curation and complex information seeking, our project releases the following datasets:
### FreshWiki
The FreshWiki Dataset is a collection of 100 high-quality Wikipedia articles focusing on the most-edited pages from February 2022 to September 2023. See Section 2.1 in [STORM paper](https://arxiv.org/abs/2402.14207) for more details.
You can download the dataset from [huggingface](https://huggingface.co/datasets/EchoShao8899/FreshWiki) directly. To ease the data contamination issue, we archive the [source code](https://github.com/stanford-oval/storm/tree/NAACL-2024-code-backup/FreshWiki) for the data construction pipeline that can be repeated at future dates.
### WildSeek
To study users’ interests in complex information seeking tasks in the wild, we utilized data collected from the web research preview to create the WildSeek dataset. We downsampled the data to ensure the diversity of the topics and the quality of the data. Each data point is a pair comprising a topic and the user’s goal for conducting deep search on the topic. For more details, please refer to Section 2.2 and Appendix A of [Co-STORM paper](https://www.arxiv.org/abs/2408.15232).
The WildSeek dataset is available [here](https://huggingface.co/datasets/YuchengJiang/WildSeek).
## Replicate STORM & Co-STORM paper result
For STORM paper experiments, please switch to the branch `NAACL-2024-code-backup` [here](https://github.com/stanford-oval/storm/tree/NAACL-2024-code-backup).
For Co-STORM paper experiments, please switch to the branch `EMNLP-2024-code-backup` (placeholder for now, will be updated soon).
## Roadmap & Contributions
Our team is actively working on:
1. Human-in-the-Loop Functionalities: Supporting user participation in the knowledge curation process.
2. Information Abstraction: Developing abstractions for curated information to support presentation formats beyond the Wikipedia-style report.
If you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!
Contact person: [Yijia Shao](mailto:shaoyj@stanford.edu) and [Yucheng Jiang](mailto:yuchengj@stanford.edu)
## Acknowledgement
We would like to thank Wikipedia for its excellent open-source content. The FreshWiki dataset is sourced from Wikipedia, licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
We are very grateful to [Michelle Lam](https://michelle123lam.github.io/) for designing the logo for this project and [Dekun Ma](https://dekun.me) for leading the UI development.
Thanks to Vercel for their support of [open-source software](https://storm.genie.stanford.edu)
## Citation
Please cite our paper if you use this code or part of it in your work:
```bibtex
@inproceedings{jiang-etal-2024-unknown,
title = "Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations",
author = "Jiang, Yucheng and
Shao, Yijia and
Ma, Dekun and
Semnani, Sina and
Lam, Monica",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.554/",
doi = "10.18653/v1/2024.emnlp-main.554",
pages = "9917--9955",
}
@inproceedings{shao-etal-2024-assisting,
title = "Assisting in Writing {W}ikipedia-like Articles From Scratch with Large Language Models",
author = "Shao, Yijia and
Jiang, Yucheng and
Kanell, Theodore and
Xu, Peter and
Khattab, Omar and
Lam, Monica",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.347/",
doi = "10.18653/v1/2024.naacl-long.347",
pages = "6252--6278",
}
```
================================================
FILE: examples/costorm_examples/run_costorm_gpt.py
================================================
"""
Co-STORM pipeline powered by GPT-4o/4o-mini and Bing search engine.
You need to set up the following environment variables to run this script:
- OPENAI_API_KEY: OpenAI API key
- OPENAI_API_TYPE: OpenAI API type (e.g., 'openai' or 'azure')
- AZURE_API_BASE: Azure API base URL if using Azure API
- AZURE_API_VERSION: Azure API version if using Azure API
- BING_SEARCH_API_KEY: Biang search API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
Output will be structured as below
args.output_dir/
log.json # Log of information-seeking conversation
report.txt # Final article generated
"""
import os
import json
from argparse import ArgumentParser
from knowledge_storm.collaborative_storm.engine import (
CollaborativeStormLMConfigs,
RunnerArgument,
CoStormRunner,
)
from knowledge_storm.collaborative_storm.modules.callback import (
LocalConsolePrintCallBackHandler,
)
from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel
from knowledge_storm.logging_wrapper import LoggingWrapper
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_config: CollaborativeStormLMConfigs = CollaborativeStormLMConfigs()
openai_kwargs = (
{
"api_key": os.getenv("OPENAI_API_KEY"),
"api_provider": "openai",
"temperature": 1.0,
"top_p": 0.9,
"api_base": None,
}
if os.getenv("OPENAI_API_TYPE") == "openai"
else {
"api_key": os.getenv("AZURE_API_KEY"),
"temperature": 1.0,
"top_p": 0.9,
"api_base": os.getenv("AZURE_API_BASE"),
"api_version": os.getenv("AZURE_API_VERSION"),
}
)
ModelClass = (
OpenAIModel if os.getenv("OPENAI_API_TYPE") == "openai" else AzureOpenAIModel
)
# If you are using Azure service, make sure the model name matches your own deployed model name.
# The default name here is only used for demonstration and may not match your case.
gpt_4o_mini_model_name = "gpt-4o-mini"
gpt_4o_model_name = "gpt-4o"
if os.getenv("OPENAI_API_TYPE") == "azure":
openai_kwargs["api_base"] = os.getenv("AZURE_API_BASE")
openai_kwargs["api_version"] = os.getenv("AZURE_API_VERSION")
# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
question_answering_lm = ModelClass(
model=gpt_4o_model_name, max_tokens=1000, **openai_kwargs
)
discourse_manage_lm = ModelClass(
model=gpt_4o_model_name, max_tokens=500, **openai_kwargs
)
utterance_polishing_lm = ModelClass(
model=gpt_4o_model_name, max_tokens=2000, **openai_kwargs
)
warmstart_outline_gen_lm = ModelClass(
model=gpt_4o_model_name, max_tokens=500, **openai_kwargs
)
question_asking_lm = ModelClass(
model=gpt_4o_model_name, max_tokens=300, **openai_kwargs
)
knowledge_base_lm = ModelClass(
model=gpt_4o_model_name, max_tokens=1000, **openai_kwargs
)
lm_config.set_question_answering_lm(question_answering_lm)
lm_config.set_discourse_manage_lm(discourse_manage_lm)
lm_config.set_utterance_polishing_lm(utterance_polishing_lm)
lm_config.set_warmstart_outline_gen_lm(warmstart_outline_gen_lm)
lm_config.set_question_asking_lm(question_asking_lm)
lm_config.set_knowledge_base_lm(knowledge_base_lm)
topic = input("Topic: ")
runner_argument = RunnerArgument(
topic=topic,
retrieve_top_k=args.retrieve_top_k,
max_search_queries=args.max_search_queries,
total_conv_turn=args.total_conv_turn,
max_search_thread=args.max_search_thread,
max_search_queries_per_turn=args.max_search_queries_per_turn,
warmstart_max_num_experts=args.warmstart_max_num_experts,
warmstart_max_turn_per_experts=args.warmstart_max_turn_per_experts,
warmstart_max_thread=args.warmstart_max_thread,
max_thread_num=args.max_thread_num,
max_num_round_table_experts=args.max_num_round_table_experts,
moderator_override_N_consecutive_answering_turn=args.moderator_override_N_consecutive_answering_turn,
node_expansion_trigger_count=args.node_expansion_trigger_count,
)
logging_wrapper = LoggingWrapper(lm_config)
callback_handler = (
LocalConsolePrintCallBackHandler() if args.enable_log_print else None
)
# Co-STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=runner_argument.retrieve_top_k,
)
case "you":
rm = YouRM(
ydc_api_key=os.getenv("YDC_API_KEY"), k=runner_argument.retrieve_top_k
)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=runner_argument.retrieve_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=runner_argument.retrieve_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=runner_argument.retrieve_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"),
k=runner_argument.retrieve_top_k,
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
costorm_runner = CoStormRunner(
lm_config=lm_config,
runner_argument=runner_argument,
logging_wrapper=logging_wrapper,
rm=rm,
callback_handler=callback_handler,
)
# warm start the system
costorm_runner.warm_start()
# Below is an example of how users may interact with Co-STORM to seek information together
# In actual deployment, we suggest allowing the user to decide whether to observe the agent utterance or inject a turn
# observing Co-STORM LLM agent utterance for 5 turns
for _ in range(1):
conv_turn = costorm_runner.step()
print(f"**{conv_turn.role}**: {conv_turn.utterance}\n")
# active engaging by injecting your utterance
your_utterance = input("Your utterance: ")
costorm_runner.step(user_utterance=your_utterance)
# continue observing
conv_turn = costorm_runner.step()
print(f"**{conv_turn.role}**: {conv_turn.utterance}\n")
# generate report
costorm_runner.knowledge_base.reorganize()
article = costorm_runner.generate_report()
# save results
os.makedirs(args.output_dir, exist_ok=True)
# Save article
with open(os.path.join(args.output_dir, "report.md"), "w") as f:
f.write(article)
# Save instance dump
instance_copy = costorm_runner.to_dict()
with open(os.path.join(args.output_dir, "instance_dump.json"), "w") as f:
json.dump(instance_copy, f, indent=2)
# Save logging
log_dump = costorm_runner.dump_logging_and_reset()
with open(os.path.join(args.output_dir, "log.json"), "w") as f:
json.dump(log_dump, f, indent=2)
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/co-storm",
help="Directory to store the outputs.",
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
# hyperparameters for co-storm
parser.add_argument(
"--retrieve_top_k",
type=int,
default=10,
help="Retrieve top k results for each query in retriever.",
)
parser.add_argument(
"--max_search_queries",
type=int,
default=2,
help="Maximum number of search queries to consider for each question.",
)
parser.add_argument(
"--total_conv_turn",
type=int,
default=20,
help="Maximum number of turns in conversation.",
)
parser.add_argument(
"--max_search_thread",
type=int,
default=5,
help="Maximum number of parallel threads for retriever.",
)
parser.add_argument(
"--max_search_queries_per_turn",
type=int,
default=3,
help="Maximum number of search queries to consider in each turn.",
)
parser.add_argument(
"--warmstart_max_num_experts",
type=int,
default=3,
help="Max number of experts in perspective-guided QA during warm start.",
)
parser.add_argument(
"--warmstart_max_turn_per_experts",
type=int,
default=2,
help="Max number of turns per perspective during warm start.",
)
parser.add_argument(
"--warmstart_max_thread",
type=int,
default=3,
help="Max number of threads for parallel perspective-guided QA during warm start.",
)
parser.add_argument(
"--max_thread_num",
type=int,
default=10,
help=(
"Maximum number of threads to use. "
"Consider reducing it if you keep getting 'Exceed rate limit' errors when calling the LM API."
),
)
parser.add_argument(
"--max_num_round_table_experts",
type=int,
default=2,
help="Max number of active experts in round table discussion.",
)
parser.add_argument(
"--moderator_override_N_consecutive_answering_turn",
type=int,
default=3,
help=(
"Number of consecutive expert answering turns before the moderator overrides the conversation."
),
)
parser.add_argument(
"--node_expansion_trigger_count",
type=int,
default=10,
help="Trigger node expansion for nodes that contain more than N snippets.",
)
# Boolean flags
parser.add_argument(
"--enable_log_print",
action="store_true",
help="If set, enable console log print.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/README.md
================================================
# Examples
We host a number of example scripts for various customization of STORM (e.g., use your favorite language models, use your own corpus, etc.). These examples can be starting points for your own customizations and you are welcome to contribute your own examples by submitting a pull request to this directory.
## Run STORM with your own language model
[run_storm_wiki_gpt.py](run_storm_wiki_gpt.py) provides an example of running STORM with GPT models, and [run_storm_wiki_claude.py](run_storm_wiki_claude.py) provides an example of running STORM with Claude models. Besides using close-source models, you can also run STORM with models with open weights.
`run_storm_wiki_mistral.py` provides an example of running STORM with `Mistral-7B-Instruct-v0.2` using [VLLM](https://docs.vllm.ai/en/stable/) server:
1. Set up a VLLM server with the `Mistral-7B-Instruct-v0.2` model running.
2. Run the following command under the root directory of the repository:
```
python examples/storm_examples/run_storm_wiki_mistral.py \
--url $URL \
--port $PORT \
--output-dir $OUTPUT_DIR \
--retriever you \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
- `--url` URL of the VLLM server.
- `--port` Port of the VLLM server.
Besides VLLM server, STORM is also compatible with [TGI](https://huggingface.co/docs/text-generation-inference/en/index) server or [Together.ai](https://www.together.ai/products#inference) endpoint.
## Run STORM with your own corpus
By default, STORM is grounded on the Internet using the search engine, but it can also be grounded on your own corpus using `VectorRM`. [run_storm_wiki_with_gpt_with_VectorRM.py](run_storm_wiki_gpt_with_VectorRM.py) provides an example of running STORM grounding on your provided data.
1. Set up API keys.
- Make sure you have set up the OpenAI API key.
- `VectorRM` use [Qdrant](https://github.com/qdrant/qdrant-client) to create a vector store. If you want to set up this vector store online on a [Qdrant cloud server](https://cloud.qdrant.io/login), you need to set up `QDRANT_API_KEY` in `secrets.toml` as well; if you want to save the vector store locally, make sure you provide a location for the vector store.
2. Prepare your corpus. The documents should be provided as a single CSV file with the following format:
| content | title | url | description |
|------------------------|------------|------------|------------------------------------|
| I am a document. | Document 1 | docu-n-112 | A self-explanatory document. |
| I am another document. | Document 2 | docu-l-13 | Another self-explanatory document. |
| ... | ... | ... | ... |
- `url` will be used as a unique identifier of the document in STORM engine, so ensure different documents have different urls.
- The contents for `title` and `description` columns are optional. If not provided, the script will use default empty values.
- The content column is crucial and should be provided for each document.
3. Run the command under the root directory of the repository:
To create the vector store offline, run
```
python examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py \
--output-dir $OUTPUT_DIR \
--vector-db-mode offline \
--offline-vector-db-dir $OFFLINE_VECTOR_DB_DIR \
--csv-file-path $CSV_FILE_PATH \
--device $DEVICE_FOR_EMBEDDING(mps, cuda, cpu) \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
To create the vector store online on a Qdrant server, run
```
python examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py \
--output-dir $OUTPUT_DIR \
--vector-db-mode online \
--online-vector-db-url $ONLINE_VECTOR_DB_URL \
--csv-file-path $CSV_FILE_PATH \
--device $DEVICE_FOR_EMBEDDING(mps, cuda, cpu) \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
4. **Quick test with Kaggle arXiv Paper Abstracts dataset**:
- Download `arxiv_data_210930-054931.csv` from [here](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts).
- Run the following command under the root directory to downsample the dataset by filtering papers with terms `[cs.CV]` and get a csv file that match the format mentioned above.
```
python examples/storm_examples/helper/process_kaggle_arxiv_abstract_dataset.py --input-path $PATH_TO_THE_DOWNLOADED_FILE --output-path $PATH_TO_THE_PROCESSED_CSV
```
- Run the following command to run STORM grounding on the processed dataset. You can input a topic related to computer vision (e.g., "The progress of multimodal models in computer vision") to see the generated article. (Note that the generated article may not include enough details since the quick test only use the abstracts of arxiv papers.)
```
python examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py \
--output-dir $OUTPUT_DIR \
--vector-db-mode offline \
--offline-vector-db-dir $OFFLINE_VECTOR_DB_DIR \
--csv-file-path $PATH_TO_THE_PROCESSED_CSV \
--device $DEVICE_FOR_EMBEDDING(mps, cuda, cpu) \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
- For a quicker run, you can also download the pre-embedded vector store directly from [here](https://drive.google.com/file/d/1bijFkw5BKU7bqcmXMhO-5hg2fdKAL9bf/view?usp=share_link).
```
python examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py \
--output-dir $OUTPUT_DIR \
--vector-db-mode offline \
--offline-vector-db-dir $DOWNLOADED_VECTOR_DB_DR \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
================================================
FILE: examples/storm_examples/helper/process_kaggle_arxiv_abstract_dataset.py
================================================
"""Process `arxiv_data_210930-054931.csv`
from https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts
to a csv file that is compatible with VectorRM.
"""
from argparse import ArgumentParser
import pandas as pd
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument(
"--input-path", type=str, help="Path to arxiv_data_210930-054931.csv."
)
parser.add_argument(
"--output-path",
type=str,
help="Path to store the csv file that is compatible with VectorRM.",
)
args = parser.parse_args()
df = pd.read_csv(args.input_path)
print(f"The original dataset has {len(df)} samples.")
# Downsample the dataset.
df = df[df["terms"] == "['cs.CV']"]
# Reformat the dataset to match the VectorRM input format.
df.rename(columns={"abstracts": "content", "titles": "title"}, inplace=True)
df["url"] = [
"uid_" + str(idx) for idx in range(len(df))
] # Ensure the url is unique.
df["description"] = ""
print(f"The downsampled dataset has {len(df)} samples.")
df.to_csv(args.output_path, index=False)
================================================
FILE: examples/storm_examples/run_storm_wiki_claude.py
================================================
"""
STORM Wiki pipeline powered by Claude family models and You.com search engine.
You need to set up the following environment variables to run this script:
- ANTHROPIC_API_KEY: Anthropic API key
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import ClaudeModel
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
claude_kwargs = {
"api_key": os.getenv("ANTHROPIC_API_KEY"),
"temperature": 1.0,
"top_p": 0.9,
}
# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
conv_simulator_lm = ClaudeModel(
model="claude-3-haiku-20240307", max_tokens=500, **claude_kwargs
)
question_asker_lm = ClaudeModel(
model="claude-3-sonnet-20240229", max_tokens=500, **claude_kwargs
)
outline_gen_lm = ClaudeModel(
model="claude-3-opus-20240229", max_tokens=400, **claude_kwargs
)
article_gen_lm = ClaudeModel(
model="claude-3-opus-20240229", max_tokens=700, **claude_kwargs
)
article_polish_lm = ClaudeModel(
model="claude-3-opus-20240229", max_tokens=4000, **claude_kwargs
)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/claude",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_deepseek.py
================================================
"""
STORM Wiki pipeline powered by DeepSeek models and You.com or Bing search engine.
You need to set up the following environment variables to run this script:
- DEEPSEEK_API_KEY: DeepSeek API key
- DEEPSEEK_API_BASE: DeepSeek API base URL (default is https://api.deepseek.com)
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
import re
import logging
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import DeepSeekModel
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm.utils import load_api_key
def sanitize_topic(topic):
"""
Sanitize the topic name for use in file names.
Remove or replace characters that are not allowed in file names.
"""
# Replace spaces with underscores
topic = topic.replace(" ", "_")
# Remove any character that isn't alphanumeric, underscore, or hyphen
topic = re.sub(r"[^a-zA-Z0-9_-]", "", topic)
# Ensure the topic isn't empty after sanitization
if not topic:
topic = "unnamed_topic"
return topic
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
logger = logging.getLogger(__name__)
# Ensure DEEPSEEK_API_KEY is set
if not os.getenv("DEEPSEEK_API_KEY"):
raise ValueError(
"DEEPSEEK_API_KEY environment variable is not set. Please set it in your secrets.toml file."
)
deepseek_kwargs = {
"api_key": os.getenv("DEEPSEEK_API_KEY"),
"api_base": os.getenv("DEEPSEEK_API_BASE", "https://api.deepseek.com"),
"temperature": args.temperature,
"top_p": args.top_p,
}
# DeepSeek offers two main models: 'deepseek-chat' for general tasks and 'deepseek-coder' for coding tasks
# Users can choose the appropriate model based on their needs
conv_simulator_lm = DeepSeekModel(
model=args.model, max_tokens=500, **deepseek_kwargs
)
question_asker_lm = DeepSeekModel(
model=args.model, max_tokens=500, **deepseek_kwargs
)
outline_gen_lm = DeepSeekModel(model=args.model, max_tokens=400, **deepseek_kwargs)
article_gen_lm = DeepSeekModel(model=args.model, max_tokens=700, **deepseek_kwargs)
article_polish_lm = DeepSeekModel(
model=args.model, max_tokens=4000, **deepseek_kwargs
)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
topic = input("Topic: ")
sanitized_topic = sanitize_topic(topic)
try:
runner.run(
topic=sanitized_topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
remove_duplicate=args.remove_duplicate,
)
runner.post_run()
runner.summary()
except Exception as e:
logger.exception(f"An error occurred: {str(e)}")
raise
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/deepseek",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
parser.add_argument(
"--model",
type=str,
choices=["deepseek-chat", "deepseek-coder"],
default="deepseek-chat",
help='DeepSeek model to use. "deepseek-chat" for general tasks, "deepseek-coder" for coding tasks.',
)
parser.add_argument(
"--temperature", type=float, default=1.0, help="Sampling temperature to use."
)
parser.add_argument(
"--top_p", type=float, default=0.9, help="Top-p sampling parameter."
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_gemini.py
================================================
"""
STORM Wiki pipeline powered by Google Gemini models and search engine.
You need to set up the following environment variables to run this script:
- GOOGLE_API_KEY: Google API key (Can be obtained from https://ai.google.dev/gemini-api/docs/api-key)
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import GoogleModel
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
gemini_kwargs = {
"api_key": os.getenv("GOOGLE_API_KEY"),
"temperature": 1.0,
"top_p": 0.9,
}
# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
# To check out available Google models, see:
# https://ai.google.dev/gemini-api/docs/get-started/tutorial?lang=python#list_models
conv_simulator_lm = GoogleModel(
model="models/gemini-1.5-flash", max_tokens=500, **gemini_kwargs
)
question_asker_lm = GoogleModel(
model="models/gemini-1.5-flash", max_tokens=500, **gemini_kwargs
)
outline_gen_lm = GoogleModel(
model="models/gemini-1.5-pro-exp-0801", max_tokens=400, **gemini_kwargs
)
article_gen_lm = GoogleModel(
model="models/gemini-1.5-pro-exp-0801", max_tokens=700, **gemini_kwargs
)
article_polish_lm = GoogleModel(
model="models/gemini-1.5-pro-exp-0801", max_tokens=4000, **gemini_kwargs
)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/gemini",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_gpt.py
================================================
"""
STORM Wiki pipeline powered by GPT-3.5/4 and You.com search engine.
You need to set up the following environment variables to run this script:
- OPENAI_API_KEY: OpenAI API key
- OPENAI_API_TYPE: OpenAI API type (e.g., 'openai' or 'azure')
- AZURE_API_BASE: Azure API base URL if using Azure API
- AZURE_API_VERSION: Azure API version if using Azure API
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
AzureAISearch,
)
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
"api_key": os.getenv("OPENAI_API_KEY"),
"temperature": 1.0,
"top_p": 0.9,
}
ModelClass = (
OpenAIModel if os.getenv("OPENAI_API_TYPE") == "openai" else AzureOpenAIModel
)
# If you are using Azure service, make sure the model name matches your own deployed model name.
# The default name here is only used for demonstration and may not match your case.
gpt_35_model_name = (
"gpt-3.5-turbo" if os.getenv("OPENAI_API_TYPE") == "openai" else "gpt-35-turbo"
)
gpt_4_model_name = "gpt-4o"
if os.getenv("OPENAI_API_TYPE") == "azure":
openai_kwargs["api_base"] = os.getenv("AZURE_API_BASE")
openai_kwargs["api_version"] = os.getenv("AZURE_API_VERSION")
# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
conv_simulator_lm = ModelClass(
model=gpt_35_model_name, max_tokens=500, **openai_kwargs
)
question_asker_lm = ModelClass(
model=gpt_35_model_name, max_tokens=500, **openai_kwargs
)
outline_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=400, **openai_kwargs)
article_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=700, **openai_kwargs)
article_polish_lm = ModelClass(
model=gpt_4_model_name, max_tokens=4000, **openai_kwargs
)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case "azure_ai_search":
rm = AzureAISearch(
azure_ai_search_api_key=os.getenv("AZURE_AI_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", "searxng", or "azure_ai_search"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/gpt",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=[
"bing",
"you",
"brave",
"serper",
"duckduckgo",
"tavily",
"searxng",
"azure_ai_search",
],
help="The search engine API to use for retrieving information.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py
================================================
"""
This STORM Wiki pipeline powered by GPT-3.5/4 and local retrieval model that uses Qdrant.
You need to set up the following environment variables to run this script:
- OPENAI_API_KEY: OpenAI API key
- OPENAI_API_TYPE: OpenAI API type (e.g., 'openai' or 'azure')
- QDRANT_API_KEY: Qdrant API key (needed ONLY if online vector store was used)
You will also need an existing Qdrant vector store either saved in a folder locally offline or in a server online.
If not, then you would need a CSV file with documents, and the script is going to create the vector store for you.
The CSV should be in the following format:
content | title | url | description
I am a document. | Document 1 | docu-n-112 | A self-explanatory document.
I am another document. | Document 2 | docu-l-13 | Another self-explanatory document.
Notice that the URL will be a unique identifier for the document so ensure different documents have different urls.
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.rm import VectorRM
from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel
from knowledge_storm.utils import load_api_key, QdrantVectorStoreManager
def main(args):
# Load API key from the specified toml file path
load_api_key(toml_file_path="secrets.toml")
# Initialize the language model configurations
engine_lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
"api_key": os.getenv("OPENAI_API_KEY"),
"temperature": 1.0,
"top_p": 0.9,
}
ModelClass = (
OpenAIModel if os.getenv("OPENAI_API_TYPE") == "openai" else AzureOpenAIModel
)
# If you are using Azure service, make sure the model name matches your own deployed model name.
# The default name here is only used for demonstration and may not match your case.
gpt_35_model_name = (
"gpt-3.5-turbo" if os.getenv("OPENAI_API_TYPE") == "openai" else "gpt-35-turbo"
)
gpt_4_model_name = "gpt-4o"
if os.getenv("OPENAI_API_TYPE") == "azure":
openai_kwargs["api_base"] = os.getenv("AZURE_API_BASE")
openai_kwargs["api_version"] = os.getenv("AZURE_API_VERSION")
# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
conv_simulator_lm = ModelClass(
model=gpt_35_model_name, max_tokens=500, **openai_kwargs
)
question_asker_lm = ModelClass(
model=gpt_35_model_name, max_tokens=500, **openai_kwargs
)
outline_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=400, **openai_kwargs)
article_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=700, **openai_kwargs)
article_polish_lm = ModelClass(
model=gpt_4_model_name, max_tokens=4000, **openai_kwargs
)
engine_lm_configs.set_conv_simulator_lm(conv_simulator_lm)
engine_lm_configs.set_question_asker_lm(question_asker_lm)
engine_lm_configs.set_outline_gen_lm(outline_gen_lm)
engine_lm_configs.set_article_gen_lm(article_gen_lm)
engine_lm_configs.set_article_polish_lm(article_polish_lm)
# Initialize the engine arguments
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# Create / update the vector store with the documents in the csv file
if args.csv_file_path:
kwargs = {
"file_path": args.csv_file_path,
"content_column": "content",
"title_column": "title",
"url_column": "url",
"desc_column": "description",
"batch_size": args.embed_batch_size,
"vector_db_mode": args.vector_db_mode,
"collection_name": args.collection_name,
"embedding_model": args.embedding_model,
"device": args.device,
}
if args.vector_db_mode == "offline":
QdrantVectorStoreManager.create_or_update_vector_store(
vector_store_path=args.offline_vector_db_dir, **kwargs
)
elif args.vector_db_mode == "online":
QdrantVectorStoreManager.create_or_update_vector_store(
url=args.online_vector_db_url,
api_key=os.getenv("QDRANT_API_KEY"),
**kwargs
)
# Setup VectorRM to retrieve information from your own data
rm = VectorRM(
collection_name=args.collection_name,
embedding_model=args.embedding_model,
device=args.device,
k=engine_args.search_top_k,
)
# initialize the vector store, either online (store the db on Qdrant server) or offline (store the db locally):
if args.vector_db_mode == "offline":
rm.init_offline_vector_db(vector_store_path=args.offline_vector_db_dir)
elif args.vector_db_mode == "online":
rm.init_online_vector_db(
url=args.online_vector_db_url, api_key=os.getenv("QDRANT_API_KEY")
)
# Initialize the STORM Wiki Runner
runner = STORMWikiRunner(engine_args, engine_lm_configs, rm)
# run the pipeline
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/gpt_retrieval",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
# provide local corpus and set up vector db
parser.add_argument(
"--collection-name",
type=str,
default="my_documents",
help="The collection name for vector store.",
)
parser.add_argument(
"--embedding_model",
type=str,
default="BAAI/bge-m3",
help="The collection name for vector store.",
)
parser.add_argument(
"--device",
type=str,
default="mps",
help="The device used to run the retrieval model (mps, cuda, cpu, etc).",
)
parser.add_argument(
"--vector-db-mode",
type=str,
choices=["offline", "online"],
help="The mode of the Qdrant vector store (offline or online).",
)
parser.add_argument(
"--offline-vector-db-dir",
type=str,
default="./vector_store",
help="If use offline mode, please provide the directory to store the vector store.",
)
parser.add_argument(
"--online-vector-db-url",
type=str,
help="If use online mode, please provide the url of the Qdrant server.",
)
parser.add_argument(
"--csv-file-path",
type=str,
default=None,
help="The path of the custom document corpus in CSV format. The CSV file should include "
"content, title, url, and description columns.",
)
parser.add_argument(
"--embed-batch-size",
type=int,
default=64,
help="Batch size for embedding the documents in the csv file.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_groq.py
================================================
"""
STORM Wiki pipeline powered by llama3-70b-8192 hosted by Groq server and You.com search engine.
You need to set up the following environment variables to run this script:
- GROQ_API_KEY: You can get your Groq API Key at https://console.groq.com/keys
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
You also need to have a VLLM server running with the Mistral-7B-Instruct-v0.2 model. Specify `--url` and `--port` accordingly.
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
import re
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
# Now import lm directly
import lm
from lm import GroqModel
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm.utils import load_api_key
def sanitize_topic(topic):
"""
Sanitize the topic name for use in file names.
Remove or replace characters that are not allowed in file names.
"""
# Replace spaces with underscores
topic = topic.replace(" ", "_")
# Remove any character that isn't alphanumeric, underscore, or hyphen
topic = re.sub(r"[^a-zA-Z0-9_-]", "", topic)
# Ensure the topic isn't empty after sanitization
if not topic:
topic = "unnamed_topic"
return topic
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
# Ensure GROQ_API_KEY is set
if not os.getenv("GROQ_API_KEY"):
raise ValueError(
"GROQ_API_KEY environment variable is not set. Please set it in your secrets.toml file."
)
groq_kwargs = {
"api_key": os.getenv("GROQ_API_KEY"),
"api_base": "https://api.groq.com/openai/v1",
"temperature": args.temperature,
"top_p": args.top_p,
}
# Groq currently offers the "llama3-70b-8192" model with generous free API credits and the llama3.1 family of models as a preview for paying customers
conv_simulator_lm = GroqModel(
model="llama3-70b-8192", max_tokens=500, **groq_kwargs
)
question_asker_lm = GroqModel(
model="llama3-70b-8192", max_tokens=500, **groq_kwargs
)
outline_gen_lm = GroqModel(model="llama3-70b-8192", max_tokens=400, **groq_kwargs)
article_gen_lm = GroqModel(model="llama3-70b-8192", max_tokens=700, **groq_kwargs)
article_polish_lm = GroqModel(
model="llama3-70b-8192", max_tokens=4000, **groq_kwargs
)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
topic = input("Topic: ")
sanitized_topic = sanitize_topic(topic)
try:
runner.run(
topic=sanitized_topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
remove_duplicate=args.remove_duplicate,
)
runner.post_run()
runner.summary()
except Exception as e:
logger.exception(f"An error occurred: {str(e)}")
raise
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/groq",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
parser.add_argument(
"--temperature", type=float, default=1.0, help="Sampling temperature to use."
)
parser.add_argument(
"--top_p", type=float, default=0.9, help="Top-p sampling parameter."
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_mistral.py
================================================
"""
STORM Wiki pipeline powered by Mistral-7B-Instruct-v0.2 hosted by VLLM server and You.com search engine.
You need to set up the following environment variables to run this script:
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
You also need to have a VLLM server running with the Mistral-7B-Instruct-v0.2 model. Specify `--url` and `--port` accordingly.
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
from argparse import ArgumentParser
from dspy import Example
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import VLLMClient
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
mistral_kwargs = {
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"port": args.port,
"url": args.url,
"stop": (
"\n\n---",
), # dspy uses "\n\n---" to separate examples. Open models sometimes generate this.
}
conv_simulator_lm = VLLMClient(max_tokens=500, **mistral_kwargs)
question_asker_lm = VLLMClient(max_tokens=500, **mistral_kwargs)
outline_gen_lm = VLLMClient(max_tokens=400, **mistral_kwargs)
article_gen_lm = VLLMClient(max_tokens=700, **mistral_kwargs)
article_polish_lm = VLLMClient(max_tokens=4000, **mistral_kwargs)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
# Open LMs are generally weaker in following output format.
# One way for mitigation is to add one-shot example to the prompt to exemplify the desired output format.
# For example, we can add the following examples to the two prompts used in StormPersonaGenerator.
# Note that the example should be an object of dspy.Example with fields matching the InputField
# and OutputField in the prompt (i.e., dspy.Signature).
find_related_topic_example = Example(
topic="Knowledge Curation",
related_topics="https://en.wikipedia.org/wiki/Knowledge_management\n"
"https://en.wikipedia.org/wiki/Information_science\n"
"https://en.wikipedia.org/wiki/Library_science\n",
)
gen_persona_example = Example(
topic="Knowledge Curation",
examples="Title: Knowledge management\n"
"Table of Contents: History\nResearch\n Dimensions\n Strategies\n Motivations\nKM technologies"
"\nKnowledge barriers\nKnowledge retention\nKnowledge audit\nKnowledge protection\n"
" Knowledge protection methods\n Formal methods\n Informal methods\n"
" Balancing knowledge protection and knowledge sharing\n Knowledge protection risks",
personas="1. Historian of Knowledge Systems: This editor will focus on the history and evolution of knowledge curation. They will provide context on how knowledge curation has changed over time and its impact on modern practices.\n"
"2. Information Science Professional: With insights from 'Information science', this editor will explore the foundational theories, definitions, and philosophy that underpin knowledge curation\n"
"3. Digital Librarian: This editor will delve into the specifics of how digital libraries operate, including software, metadata, digital preservation.\n"
"4. Technical expert: This editor will focus on the technical aspects of knowledge curation, such as common features of content management systems.\n"
"5. Museum Curator: The museum curator will contribute expertise on the curation of physical items and the transition of these practices into the digital realm.",
)
runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.find_related_topic.demos = [
find_related_topic_example
]
runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.gen_persona.demos = [
gen_persona_example
]
# A trade-off of adding one-shot example is that it will increase the input length of the prompt. Also, some
# examples may be very long (e.g., an example for writing a section based on the given information), which may
# confuse the model. For these cases, you can create a pseudo-example that is short and easy to understand to steer
# the model's output format.
# For example, we can add the following pseudo-examples to the prompt used in WritePageOutlineFromConv and
# ConvToSection.
write_page_outline_example = Example(
topic="Example Topic",
conv="Wikipedia Writer: ...\nExpert: ...\nWikipedia Writer: ...\nExpert: ...",
old_outline="# Section 1\n## Subsection 1\n## Subsection 2\n"
"# Section 2\n## Subsection 1\n## Subsection 2\n"
"# Section 3",
outline="# New Section 1\n## New Subsection 1\n## New Subsection 2\n"
"# New Section 2\n"
"# New Section 3\n## New Subsection 1\n## New Subsection 2\n## New Subsection 3",
)
runner.storm_outline_generation_module.write_outline.write_page_outline.demos = [
write_page_outline_example
]
write_section_example = Example(
info="[1]\nInformation in document 1\n[2]\nInformation in document 2\n[3]\nInformation in document 3",
topic="Example Topic",
section="Example Section",
output="# Example Topic\n## Subsection 1\n"
"This is an example sentence [1]. This is another example sentence [2][3].\n"
"## Subsection 2\nThis is one more example sentence [1].",
)
runner.storm_article_generation.section_gen.write_section.demos = [
write_section_example
]
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--url", type=str, default="http://localhost", help="URL of the VLLM server."
)
parser.add_argument(
"--port", type=int, default=8000, help="Port of the VLLM server."
)
parser.add_argument(
"--output-dir",
type=str,
default="./results/mistral_7b",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_ollama.py
================================================
"""
STORM Wiki pipeline powered by local model hosted by Ollama server and You.com or Bing search engine.
You need to set up the following environment variables to run this script:
- YDC_API_KEY: You.com API key; BING_SEARCH_API_KEY: Bing Search API key, SERPER_API_KEY: Serper API key, BRAVE_API_KEY: Brave API key, or TAVILY_API_KEY: Tavily API key
You also need to have a Ollama server running with the llama3 model or other. Specify `--url`, `--port` and `--model` accordingly.
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
import sys
from argparse import ArgumentParser
from dspy import Example
from knowledge_storm.lm import OllamaClient
from knowledge_storm.rm import (
YouRM,
BingSearch,
BraveRM,
SerperRM,
DuckDuckGoSearchRM,
TavilySearchRM,
SearXNG,
)
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
ollama_kwargs = {
"model": args.model,
"port": args.port,
"url": args.url,
"stop": (
"\n\n---",
), # dspy uses "\n\n---" to separate examples. Open models sometimes generate this.
}
conv_simulator_lm = OllamaClient(max_tokens=500, **ollama_kwargs)
question_asker_lm = OllamaClient(max_tokens=500, **ollama_kwargs)
outline_gen_lm = OllamaClient(max_tokens=400, **ollama_kwargs)
article_gen_lm = OllamaClient(max_tokens=700, **ollama_kwargs)
article_polish_lm = OllamaClient(max_tokens=4000, **ollama_kwargs)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# STORM is a knowledge curation system which consumes information from the retrieval module.
# Currently, the information source is the Internet and we use search engine API as the retrieval module.
match args.retriever:
case "bing":
rm = BingSearch(
bing_search_api=os.getenv("BING_SEARCH_API_KEY"),
k=engine_args.search_top_k,
)
case "you":
rm = YouRM(ydc_api_key=os.getenv("YDC_API_KEY"), k=engine_args.search_top_k)
case "brave":
rm = BraveRM(
brave_search_api_key=os.getenv("BRAVE_API_KEY"),
k=engine_args.search_top_k,
)
case "duckduckgo":
rm = DuckDuckGoSearchRM(
k=engine_args.search_top_k, safe_search="On", region="us-en"
)
case "serper":
rm = SerperRM(
serper_search_api_key=os.getenv("SERPER_API_KEY"),
query_params={"autocorrect": True, "num": 10, "page": 1},
)
case "tavily":
rm = TavilySearchRM(
tavily_search_api_key=os.getenv("TAVILY_API_KEY"),
k=engine_args.search_top_k,
include_raw_content=True,
)
case "searxng":
rm = SearXNG(
searxng_api_key=os.getenv("SEARXNG_API_KEY"), k=engine_args.search_top_k
)
case _:
raise ValueError(
f'Invalid retriever: {args.retriever}. Choose either "bing", "you", "brave", "duckduckgo", "serper", "tavily", or "searxng"'
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
# Open LMs are generally weaker in following output format.
# One way for mitigation is to add one-shot example to the prompt to exemplify the desired output format.
# For example, we can add the following examples to the two prompts used in StormPersonaGenerator.
# Note that the example should be an object of dspy.Example with fields matching the InputField
# and OutputField in the prompt (i.e., dspy.Signature).
find_related_topic_example = Example(
topic="Knowledge Curation",
related_topics="https://en.wikipedia.org/wiki/Knowledge_management\n"
"https://en.wikipedia.org/wiki/Information_science\n"
"https://en.wikipedia.org/wiki/Library_science\n",
)
gen_persona_example = Example(
topic="Knowledge Curation",
examples="Title: Knowledge management\n"
"Table of Contents: History\nResearch\n Dimensions\n Strategies\n Motivations\nKM technologies"
"\nKnowledge barriers\nKnowledge retention\nKnowledge audit\nKnowledge protection\n"
" Knowledge protection methods\n Formal methods\n Informal methods\n"
" Balancing knowledge protection and knowledge sharing\n Knowledge protection risks",
personas="1. Historian of Knowledge Systems: This editor will focus on the history and evolution of knowledge curation. They will provide context on how knowledge curation has changed over time and its impact on modern practices.\n"
"2. Information Science Professional: With insights from 'Information science', this editor will explore the foundational theories, definitions, and philosophy that underpin knowledge curation\n"
"3. Digital Librarian: This editor will delve into the specifics of how digital libraries operate, including software, metadata, digital preservation.\n"
"4. Technical expert: This editor will focus on the technical aspects of knowledge curation, such as common features of content management systems.\n"
"5. Museum Curator: The museum curator will contribute expertise on the curation of physical items and the transition of these practices into the digital realm.",
)
runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.find_related_topic.demos = [
find_related_topic_example
]
runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.gen_persona.demos = [
gen_persona_example
]
# A trade-off of adding one-shot example is that it will increase the input length of the prompt. Also, some
# examples may be very long (e.g., an example for writing a section based on the given information), which may
# confuse the model. For these cases, you can create a pseudo-example that is short and easy to understand to steer
# the model's output format.
# For example, we can add the following pseudo-examples to the prompt used in WritePageOutlineFromConv and
# ConvToSection.
write_page_outline_example = Example(
topic="Example Topic",
conv="Wikipedia Writer: ...\nExpert: ...\nWikipedia Writer: ...\nExpert: ...",
old_outline="# Section 1\n## Subsection 1\n## Subsection 2\n"
"# Section 2\n## Subsection 1\n## Subsection 2\n"
"# Section 3",
outline="# New Section 1\n## New Subsection 1\n## New Subsection 2\n"
"# New Section 2\n"
"# New Section 3\n## New Subsection 1\n## New Subsection 2\n## New Subsection 3",
)
runner.storm_outline_generation_module.write_outline.write_page_outline.demos = [
write_page_outline_example
]
write_section_example = Example(
info="[1]\nInformation in document 1\n[2]\nInformation in document 2\n[3]\nInformation in document 3",
topic="Example Topic",
section="Example Section",
output="# Example Topic\n## Subsection 1\n"
"This is an example sentence [1]. This is another example sentence [2][3].\n"
"## Subsection 2\nThis is one more example sentence [1].",
)
runner.storm_article_generation.section_gen.write_section.demos = [
write_section_example
]
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--url", type=str, default="http://localhost", help="URL of the Ollama server."
)
parser.add_argument(
"--port", type=int, default=11434, help="Port of the Ollama server."
)
parser.add_argument(
"--model", type=str, default="llama3:latest", help="Model of the Ollama server."
)
parser.add_argument(
"--output-dir",
type=str,
default="./results/ollama",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "brave", "serper", "duckduckgo", "tavily", "searxng"],
help="The search engine API to use for retrieving information.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_ollama_with_searxng.py
================================================
import os
from argparse import ArgumentParser
from dspy import Example
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import OllamaClient
from knowledge_storm.rm import SearXNG
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
ollama_kwargs = {
"model": args.model,
"port": args.port,
"url": args.url,
"stop": ("\n\n---",),
}
conv_simulator_lm = OllamaClient(max_tokens=500, **ollama_kwargs)
question_asker_lm = OllamaClient(max_tokens=500, **ollama_kwargs)
outline_gen_lm = OllamaClient(max_tokens=400, **ollama_kwargs)
article_gen_lm = OllamaClient(max_tokens=700, **ollama_kwargs)
article_polish_lm = OllamaClient(max_tokens=4000, **ollama_kwargs)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
rm = SearXNG(
searxng_api_url=args.searxng_api_url,
searxng_api_key=os.getenv("SEARXNG_API_KEY"),
k=engine_args.search_top_k,
)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
find_related_topic_example = Example(
topic="Knowledge Curation",
related_topics="https://en.wikipedia.org/wiki/Knowledge_management\n"
"https://en.wikipedia.org/wiki/Information_science\n"
"https://en.wikipedia.org/wiki/Library_science\n",
)
gen_persona_example = Example(
topic="Knowledge Curation",
examples="Title: Knowledge management\n"
"Table of Contents: History\nResearch\n Dimensions\n Strategies\n Motivations\nKM technologies"
"\nKnowledge barriers\nKnowledge retention\nKnowledge audit\nKnowledge protection\n"
" Knowledge protection methods\n Formal methods\n Informal methods\n"
" Balancing knowledge protection and knowledge sharing\n Knowledge protection risks",
personas=(
"1. Historian of Knowledge Systems: This editor will focus on the history and evolution of knowledge "
"curation. They will provide context on how knowledge curation has changed over time and its impact on "
"modern practices.\n"
"2. Information Science Professional: With insights from 'Information science', this editor will "
"explore the foundational theories, definitions, and philosophy that underpin knowledge curation\n"
"3. Digital Librarian: This editor will delve into the specifics of how digital libraries operate, "
"including software, metadata, digital preservation.\n"
"4. Technical expert: This editor will focus on the technical aspects of knowledge curation, "
"such as common features of content management systems.\n"
"5. Museum Curator: The museum curator will contribute expertise on the curation of physical items and "
"the transition of these practices into the digital realm."
),
)
runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.find_related_topic.demos = [
find_related_topic_example
]
runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.gen_persona.demos = [
gen_persona_example
]
write_page_outline_example = Example(
topic="Example Topic",
conv="Wikipedia Writer: ...\nExpert: ...\nWikipedia Writer: ...\nExpert: ...",
old_outline="# Section 1\n## Subsection 1\n## Subsection 2\n"
"# Section 2\n## Subsection 1\n## Subsection 2\n"
"# Section 3",
outline="# New Section 1\n## New Subsection 1\n## New Subsection 2\n"
"# New Section 2\n"
"# New Section 3\n## New Subsection 1\n## New Subsection 2\n## New Subsection 3",
)
runner.storm_outline_generation_module.write_outline.write_page_outline.demos = [
write_page_outline_example
]
write_section_example = Example(
info="[1]\nInformation in document 1\n[2]\nInformation in document 2\n[3]\nInformation in document 3",
topic="Example Topic",
section="Example Section",
output="# Example Topic\n## Subsection 1\n"
"This is an example sentence [1]. This is another example sentence [2][3].\n"
"## Subsection 2\nThis is one more example sentence [1].",
)
runner.storm_article_generation.section_gen.write_section.demos = [
write_section_example
]
topic = input("Topic: ")
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--url", type=str, default="http://localhost", help="URL of the Ollama server."
)
parser.add_argument(
"--port", type=int, default=11434, help="Port of the Ollama server."
)
parser.add_argument(
"--model", type=str, default="llama3:latest", help="Model of the Ollama server."
)
parser.add_argument(
"--output-dir",
type=str,
default="./results/ollama",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["searxng"],
help="The search engine API to use for retrieving information.",
)
parser.add_argument(
"--searxng-api-url", type=str, required=True, help="URL of the SearXNG API."
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: examples/storm_examples/run_storm_wiki_serper.py
================================================
"""
STORM Wiki pipeline powered by Claude family models and serper search engine.
You need to set up the following environment variables to run this script:
- ANTHROPIC_API_KEY: Anthropic API key
- SERPER_API_KEY: Serper.dev api key
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""
import os
from argparse import ArgumentParser
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import ClaudeModel
from knowledge_storm.rm import SerperRM
from knowledge_storm.utils import load_api_key
def main(args):
load_api_key(toml_file_path="secrets.toml")
lm_configs = STORMWikiLMConfigs()
claude_kwargs = {
"api_key": os.getenv("ANTHROPIC_API_KEY"),
"temperature": 1.0,
"top_p": 0.9,
}
# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
conv_simulator_lm = ClaudeModel(
model="claude-3-haiku-20240307", max_tokens=500, **claude_kwargs
)
question_asker_lm = ClaudeModel(
model="claude-3-sonnet-20240229", max_tokens=500, **claude_kwargs
)
outline_gen_lm = ClaudeModel(
model="claude-3-opus-20240229", max_tokens=400, **claude_kwargs
)
article_gen_lm = ClaudeModel(
model="claude-3-opus-20240229", max_tokens=700, **claude_kwargs
)
article_polish_lm = ClaudeModel(
model="claude-3-opus-20240229", max_tokens=4000, **claude_kwargs
)
lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)
engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
max_thread_num=args.max_thread_num,
)
# Documentation to generate the data is available here:
# https://serper.dev/playground
# Important to note that tbs(date range is hardcoded values).
# num is results per pages and is recommended to use in increments of 10(10, 20, etc).
# page is how many pages will be searched.
# h1 is where the google search will orginate from.
topic = input("topic: ")
data = {"autocorrect": True, "num": 10, "page": 1}
rm = SerperRM(serper_search_api_key=os.getenv("SERPER_API_KEY"), query_params=data)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()
if __name__ == "__main__":
parser = ArgumentParser()
# global arguments
parser.add_argument(
"--output-dir",
type=str,
default="./results/serper",
help="Directory to store the outputs.",
)
parser.add_argument(
"--max-thread-num",
type=int,
default=3,
help="Maximum number of threads to use. The information seeking part and the article generation"
"part can speed up by using multiple threads. Consider reducing it if keep getting "
'"Exceed rate limit" error when calling LM API.',
)
parser.add_argument(
"--retriever",
type=str,
choices=["bing", "you", "serper"],
help="The search engine API to use for retrieving information.",
)
# stage of the pipeline
parser.add_argument(
"--do-research",
action="store_true",
help="If True, simulate conversation to research the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-outline",
action="store_true",
help="If True, generate an outline for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-generate-article",
action="store_true",
help="If True, generate an article for the topic; otherwise, load the results.",
)
parser.add_argument(
"--do-polish-article",
action="store_true",
help="If True, polish the article by adding a summarization section and (optionally) removing "
"duplicate content.",
)
# hyperparameters for the pre-writing stage
parser.add_argument(
"--max-conv-turn",
type=int,
default=3,
help="Maximum number of questions in conversational question asking.",
)
parser.add_argument(
"--max-perspective",
type=int,
default=3,
help="Maximum number of perspectives to consider in perspective-guided question asking.",
)
parser.add_argument(
"--search-top-k",
type=int,
default=3,
help="Top k search results to consider for each search query.",
)
# hyperparameters for the writing stage
parser.add_argument(
"--retrieve-top-k",
type=int,
default=3,
help="Top k collected references for each section title.",
)
parser.add_argument(
"--remove-duplicate",
action="store_true",
help="If True, remove duplicate content from the article.",
)
main(parser.parse_args())
================================================
FILE: frontend/demo_light/.streamlit/config.toml
================================================
[client]
showErrorDetails = false
toolbarMode = "minimal"
[theme]
primaryColor = "#F63366"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"
================================================
FILE: frontend/demo_light/README.md
================================================
# STORM Minimal User Interface
This is a minimal user interface for `STORMWikiRunner` which includes the following features:
1. Allowing user to create a new article through the "Create New Article" page.
2. Showing the intermediate steps of STORMWikiRunner in real-time when creating an article.
3. Displaying the written article and references side by side.
4. Allowing user to view previously created articles through the "My Articles" page.
<p align="center">
<img src="assets/create_article.jpg" style="width: 70%; height: auto;">
</p>
<p align="center">
<img src="assets/article_display.jpg" style="width: 70%; height: auto;">
</p>
## Setup
1. Make sure you have installed `knowledge-storm` or set up the source code correctly.
2. Install additional packages required by the user interface:
```bash
pip install -r requirements.txt
```
2. Make sure you set up the API keys following the instructions in the main README file. Create a copy of `secrets.toml` and place it under `.streamlit/`.
3. Run the following command to start the user interface:
```bash
streamlit run storm.py
```
The user interface will create a `DEMO_WORKING_DIR` directory in the current directory to store the outputs.
## Customization
You can customize the `STORMWikiRunner` powering the user interface according to [the guidelines](https://github.com/stanford-oval/storm?tab=readme-ov-file#customize-storm) in the main README file.
The `STORMWikiRunner` is initialized in `set_storm_runner()` in [demo_util.py](demo_util.py). You can change `STORMWikiRunnerArguments`, `STORMWikiLMConfigs`, or use a different retrieval model according to your need.
================================================
FILE: frontend/demo_light/demo_util.py
================================================
import base64
import datetime
import json
import os
import re
from typing import Optional
import markdown
import pytz
import streamlit as st
# If you install the source code instead of the `knowledge-storm` package,
# Uncomment the following lines:
# import sys
# sys.path.append('../../')
from knowledge_storm import (
STORMWikiRunnerArguments,
STORMWikiRunner,
STORMWikiLMConfigs,
)
from knowledge_storm.lm import OpenAIModel
from knowledge_storm.rm import YouRM
from knowledge_storm.storm_wiki.modules.callback import BaseCallbackHandler
from knowledge_storm.utils import truncate_filename
from stoc import stoc
class DemoFileIOHelper:
@staticmethod
def read_structure_to_dict(articles_root_path):
"""
Reads the directory structure of articles stored in the given root path and
returns a nested dictionary. The outer dictionary has article names as keys,
and each value is another dictionary mapping file names to their absolute paths.
Args:
articles_root_path (str): The root directory path containing article subdirectories.
Returns:
dict: A dictionary where each key is an article name, and each value is a dictionary
of file names and their absolute paths within that article's directory.
"""
articles_dict = {}
for topic_name in os.listdir(articles_root_path):
topic_path = os.path.join(articles_root_path, topic_name)
if os.path.isdir(topic_path):
# Initialize or update the dictionary for the topic
articles_dict[topic_name] = {}
# Iterate over all files within a topic directory
for file_name in os.listdir(topic_path):
file_path = os.path.join(topic_path, file_name)
articles_dict[topic_name][file_name] = os.path.abspath(file_path)
return articles_dict
@staticmethod
def read_txt_file(file_path):
"""
Reads the contents of a text file and returns it as a string.
Args:
file_path (str): The path to the text file to be read.
Returns:
str: The content of the file as a single string.
"""
with open(file_path) as f:
return f.read()
@staticmethod
def read_json_file(file_path):
"""
Reads a JSON file and returns its content as a Python dictionary or list,
depending on the JSON structure.
Args:
file_path (str): The path to the JSON file to be read.
Returns:
dict or list: The content of the JSON file. The type depends on the
structure of the JSON file (object or array at the root).
"""
with open(file_path) as f:
return json.load(f)
@staticmethod
def read_image_as_base64(image_path):
"""
Reads an image file and returns its content encoded as a base64 string,
suitable for embedding in HTML or transferring over networks where binary
data cannot be easily sent.
Args:
image_path (str): The path to the image file to be encoded.
Returns:
str: The base64 encoded string of the image, prefixed with the necessary
data URI scheme for images.
"""
with open(image_path, "rb") as f:
data = f.read()
encoded = base64.b64encode(data)
data = "data:image/png;base64," + encoded.decode("utf-8")
return data
@staticmethod
def set_file_modification_time(file_path, modification_time_string):
"""
Sets the modification time of a file based on a given time string in the California time zone.
Args:
file_path (str): The path to the file.
modification_time_string (str): The desired modification time in 'YYYY-MM-DD HH:MM:SS' format.
"""
california_tz = pytz.timezone("America/Los_Angeles")
modification_time = datetime.datetime.strptime(
modification_time_string, "%Y-%m-%d %H:%M:%S"
)
modification_time = california_tz.localize(modification_time)
modification_time_utc = modification_time.astimezone(datetime.timezone.utc)
modification_timestamp = modification_time_utc.timestamp()
os.utime(file_path, (modification_timestamp, modification_timestamp))
@staticmethod
def get_latest_modification_time(path):
"""
Returns the latest modification time of all files in a directory in the California time zone as a string.
Args:
directory_path (str): The path to the directory.
Returns:
str: The latest file's modification time in 'YYYY-MM-DD HH:MM:SS' format.
"""
california_tz = pytz.timezone("America/Los_Angeles")
latest_mod_time = None
file_paths = []
if os.path.isdir(path):
for root, dirs, files in os.walk(path):
for file in files:
file_paths.append(os.path.join(root, file))
else:
file_paths = [path]
for file_path in file_paths:
modification_timestamp = os.path.getmtime(file_path)
modification_time_utc = datetime.datetime.utcfromtimestamp(
modification_timestamp
)
modification_time_utc = modification_time_utc.replace(
tzinfo=datetime.timezone.utc
)
modification_time_california = modification_time_utc.astimezone(
california_tz
)
if (
latest_mod_time is None
or modification_time_california > latest_mod_time
):
latest_mod_time = modification_time_california
if latest_mod_time is not None:
return latest_mod_time.strftime("%Y-%m-%d %H:%M:%S")
else:
return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@staticmethod
def assemble_article_data(article_file_path_dict):
"""
Constructs a dictionary containing the content and metadata of an article
based on the available files in the article's directory. This includes the
main article text, citations from a JSON file, and a conversation log if
available. The function prioritizes a polished version of the article if
both a raw and polished version exist.
Args:
article_file_paths (dict): A dictionary where keys are file names relevant
to the article (e.g., the article text, citations
in JSON format, conversation logs) and values
are their corresponding file paths.
Returns:
dict or None: A dictionary containing the parsed content of the article,
citations, and conversation log if available. Returns None
if neither the raw nor polished article text exists in the
provided file paths.
"""
if (
"storm_gen_article.txt" in article_file_path_dict
or "storm_gen_article_polished.txt" in article_file_path_dict
):
full_article_name = (
"storm_gen_article_polished.txt"
if "storm_gen_article_polished.txt" in article_file_path_dict
else "storm_gen_article.txt"
)
article_data = {
"article": DemoTextProcessingHelper.parse(
DemoFileIOHelper.read_txt_file(
article_file_path_dict[full_article_name]
)
)
}
if "url_to_info.json" in article_file_path_dict:
article_data["citations"] = _construct_citation_dict_from_search_result(
DemoFileIOHelper.read_json_file(
article_file_path_dict["url_to_info.json"]
)
)
if "conversation_log.json" in article_file_path_dict:
article_data["conversation_log"] = DemoFileIOHelper.read_json_file(
article_file_path_dict["conversation_log.json"]
)
return article_data
return None
class DemoTextProcessingHelper:
@staticmethod
def remove_citations(sent):
return (
re.sub(r"\[\d+", "", re.sub(r" \[\d+", "", sent))
.replace(" |", "")
.replace("]", "")
)
@staticmethod
def parse_conversation_history(json_data):
"""
Given conversation log data, return list of parsed data of following format
(persona_name, persona_description, list of dialogue turn)
"""
parsed_data = []
for persona_conversation_data in json_data:
if ": " in persona_conversation_data["perspective"]:
name, description = persona_conversation_data["perspective"].split(
": ", 1
)
elif "- " in persona_conversation_data["perspective"]:
name, description = persona_conversation_data["perspective"].split(
"- ", 1
)
else:
name, description = "", persona_conversation_data["perspective"]
cur_conversation = []
for dialogue_turn in persona_conversation_data["dlg_turns"]:
cur_conversation.append(
{"role": "user", "content": dialogue_turn["user_utterance"]}
)
cur_conversation.append(
{
"role": "assistant",
"content": DemoTextProcessingHelper.remove_citations(
dialogue_turn["agent_utterance"]
),
}
)
parsed_data.append((name, description, cur_conversation))
return parsed_data
@staticmethod
def parse(text):
regex = re.compile(r']:\s+"(.*?)"\s+http')
text = regex.sub("]: http", text)
return text
@staticmethod
def add_markdown_indentation(input_string):
lines = input_string.split("\n")
processed_lines = [""]
for line in lines:
num_hashes = 0
for char in line:
if char == "#":
num_hashes += 1
else:
break
num_hashes -= 1
num_spaces = 4 * num_hashes
new_line = " " * num_spaces + line
processed_lines.append(new_line)
return "\n".join(processed_lines)
@staticmethod
def get_current_time_string():
"""
Returns the current time in the California time zone as a string.
Returns:
str: The current California time in 'YYYY-MM-DD HH:MM:SS' format.
"""
california_tz = pytz.timezone("America/Los_Angeles")
utc_now = datetime.datetime.now(datetime.timezone.utc)
california_now = utc_now.astimezone(california_tz)
return california_now.strftime("%Y-%m-%d %H:%M:%S")
@staticmethod
def compare_time_strings(
time_string1, time_string2, time_format="%Y-%m-%d %H:%M:%S"
):
"""
Compares two time strings to determine if they represent the same point in time.
Args:
time_string1 (str): The first time string to compare.
time_string2 (str): The second time string to compare.
time_format (str): The format of the time strings, defaults to '%Y-%m-%d %H:%M:%S'.
Returns:
bool: True if the time strings represent the same time, False otherwise.
"""
# Parse the time strings into datetime objects
time1 = datetime.datetime.strptime(time_string1, time_format)
time2 = datetime.datetime.strptime(time_string2, time_format)
# Compare the datetime objects
return time1 == time2
@staticmethod
def add_inline_citation_link(article_text, citation_dict):
# Regular expression to find citations like [i]
pattern = r"\[(\d+)\]"
# Function to replace each citation with its Markdown link
def replace_with_link(match):
i = match.group(1)
url = citation_dict.get(int(i), {}).get("url", "#")
return f"[[{i}]]({url})"
# Replace all citations in the text with Markdown links
return re.sub(pattern, replace_with_link, article_text)
@staticmethod
def generate_html_toc(md_text):
toc = []
for line in md_text.splitlines():
if line.startswith("#"):
level = line.count("#")
title = line.strip("# ").strip()
anchor = title.lower().replace(" ", "-").replace(".", "")
toc.append(
f"<li style='margin-left: {20 * (level - 1)}px;'><a href='#{anchor}'>{title}</a></li>"
)
return "<ul>" + "".join(toc) + "</ul>"
@staticmethod
def construct_bibliography_from_url_to_info(url_to_info):
bibliography_list = []
sorted_url_to_unified_index = dict(
sorted(
url_to_info["url_to_unified_index"].items(), key=lambda item: item[1]
)
)
for url, index in sorted_url_to_unified_index.items():
title = url_to_info["url_to_info"][url]["title"]
bibliography_list.append(f"[{index}]: [{title}]({url})")
bibliography_string = "\n\n".join(bibliography_list)
return f"# References\n\n{bibliography_string}"
class DemoUIHelper:
def st_markdown_adjust_size(content, font_size=20):
st.markdown(
f"""
<span style='font-size: {font_size}px;'>{content}</span>
""",
unsafe_allow_html=True,
)
@staticmethod
def get_article_card_UI_style(boarder_color="#9AD8E1"):
return {
"card": {
"width": "100%",
"height": "116px",
"max-width": "640px",
"background-color": "#FFFFF",
"border": "1px solid #CCC",
"padding": "20px",
"border-radius": "5px",
"border-left": f"0.5rem solid {boarder_color}",
"box-shadow": "0 0.15rem 1.75rem 0 rgba(58, 59, 69, 0.15)",
"margin": "0px",
},
"title": {
"white-space": "nowrap",
"overflow": "hidden",
"text-overflow": "ellipsis",
"font-size": "17px",
"color": "rgb(49, 51, 63)",
"text-align": "left",
"width": "95%",
"font-weight": "normal",
},
"text": {
"white-space": "nowrap",
"overflow": "hidden",
"text-overflow": "ellipsis",
"font-size": "25px",
"color": "rgb(49, 51, 63)",
"text-align": "left",
"width": "95%",
},
"filter": {"background-color": "rgba(0, 0, 0, 0)"},
}
@staticmethod
def customize_toast_css_style():
# Note padding is top right bottom left
st.markdown(
"""
<style>
div[data-testid=stToast] {
padding: 20px 10px 40px 10px;
background-color: #FF0000; /* red */
width: 40%;
}
[data-testid=toastContainer] [data-testid=stMarkdownContainer] > p {
font-size: 25px;
font-style: normal;
font-weight: 400;
color: #FFFFFF; /* white */
line-height: 1.5; /* Adjust this value as needed */
}
</style>
""",
unsafe_allow_html=True,
)
@staticmethod
def article_markdown_to_html(article_title, article_content):
return f"""
<html>
<head>
<meta charset="utf-8">
<title>{article_title}</title>
<style>
.title {{
text-align: center;
}}
</style>
</head>
<body>
<div class="title">
<h1>{article_title.replace('_', ' ')}</h1>
</div>
<h2>Table of Contents</h2>
{DemoTextProcessingHelper.generate_html_toc(article_content)}
{markdown.markdown(article_content)}
</body>
</html>
"""
def _construct_citation_dict_from_search_result(search_results):
if search_results is None:
return None
citation_dict = {}
for url, index in search_results["url_to_unified_index"].items():
citation_dict[index] = {
"url": url,
"title": search_results["url_to_info"][url]["title"],
"snippets": search_results["url_to_info"][url]["snippets"],
}
return citation_dict
def _display_main_article_text(article_text, citation_dict, table_content_sidebar):
# Post-process the generated article for better display.
if "Write the lead section:" in article_text:
article_text = article_text[
article_text.find("Write the lead section:")
+ len("Write the lead section:") :
]
if article_text[0] == "#":
article_text = "\n".join(article_text.split("\n")[1:])
article_text = DemoTextProcessingHelper.add_inline_citation_link(
article_text, citation_dict
)
# '$' needs to be changed to '\$' to avoid being interpreted as LaTeX in st.markdown()
article_text = article_text.replace("$", "\\$")
stoc.from_markdown(article_text, table_content_sidebar)
def _display_references(citation_dict):
if citation_dict:
reference_list = [f"reference [{i}]" for i in range(1, len(citation_dict) + 1)]
selected_key = st.selectbox("Select a reference", reference_list)
citation_val = citation_dict[reference_list.index(selected_key) + 1]
citation_val["title"] = citation_val["title"].replace("$", "\\$")
st.markdown(f"**Title:** {citation_val['title']}")
st.markdown(f"**Url:** {citation_val['url']}")
snippets = "\n\n".join(citation_val["snippets"]).replace("$", "\\$")
st.markdown(f"**Highlights:**\n\n {snippets}")
else:
st.markdown("**No references available**")
def _display_persona_conversations(conversation_log):
"""
Display persona conversation in dialogue UI
"""
# get personas list as (persona_name, persona_description, dialogue turns list) tuple
parsed_conversation_history = DemoTextProcessingHelper.parse_conversation_history(
conversation_log
)
# construct tabs for each persona conversation
persona_tabs = st.tabs([name for (name, _, _) in parsed_conversation_history])
for idx, persona_tab in enumerate(persona_tabs):
with persona_tab:
# show persona description
st.info(parsed_conversation_history[idx][1])
# show user / agent utterance in dialogue UI
for message in parsed_conversation_history[idx][2]:
message["content"] = message["content"].replace("$", "\\$")
with st.chat_message(message["role"]):
if message["role"] == "user":
st.markdown(f"**{message['content']}**")
else:
st.markdown(message["content"])
def _display_main_article(
selected_article_file_path_dict, show_reference=True, show_conversation=True
):
article_data = DemoFileIOHelper.assemble_article_data(
selected_article_file_path_dict
)
with st.container(height=1000, border=True):
table_content_sidebar = st.sidebar.expander(
"**Table of contents**", expanded=True
)
_display_main_article_text(
article_text=article_data.get("article", ""),
citation_dict=article_data.get("citations", {}),
table_content_sidebar=table_content_sidebar,
)
# display reference panel
if show_reference and "citations" in article_data:
with st.sidebar.expander("**References**", expanded=True):
with st.container(height=800, border=False):
_display_references(citation_dict=article_data.get("citations", {}))
# display conversation history
if show_conversation and "conversation_log" in article_data:
with st.expander(
"**STORM** is powered by a knowledge agent that proactively research a given topic by asking good questions coming from different perspectives.\n\n"
":sunglasses: Click here to view the agent's brain**STORM**ing process!"
):
_display_persona_conversations(
conversation_log=article_data.get("conversation_log", {})
)
def get_demo_dir():
return os.path.dirname(os.path.abspath(__file__))
def clear_other_page_session_state(page_index: Optional[int]):
if page_index is None:
keys_to_delete = [key for key in st.session_state if key.startswith("page")]
else:
keys_to_delete = [
key
for key in st.session_state
if key.startswith("page") and f"page{page_index}" not in key
]
for key in set(keys_to_delete):
del st.session_state[key]
def set_storm_runner():
current_working_dir = os.path.join(get_demo_dir(), "DEMO_WORKING_DIR")
if not os.path.exists(current_working_dir):
os.makedirs(current_working_dir)
# configure STORM runner
llm_configs = STORMWikiLMConfigs()
llm_configs.init_openai_model(
openai_api_key=st.secrets["OPENAI_API_KEY"], openai_type="openai"
)
llm_configs.set_question_asker_lm(
OpenAIModel(
model="gpt-4-1106-preview",
api_key=st.secrets["OPENAI_API_KEY"],
api_provider="openai",
max_tokens=500,
temperature=1.0,
top_p=0.9,
)
)
engine_args = STORMWikiRunnerArguments(
output_dir=current_working_dir,
max_conv_turn=3,
max_perspective=3,
search_top_k=3,
retrieve_top_k=5,
)
rm = YouRM(ydc_api_key=st.secrets["YDC_API_KEY"], k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, llm_configs, rm)
st.session_state["runner"] = runner
def display_article_page(
selected_article_name,
selected_article_file_path_dict,
show_title=True,
show_main_article=True,
):
if show_title:
st.markdown(
f"<h2 style='text-align: center;'>{selected_article_name.replace('_', ' ')}</h2>",
unsafe_allow_html=True,
)
if show_main_article:
_display_main_article(selected_article_file_path_dict)
class StreamlitCallbackHandler(BaseCallbackHandler):
def __init__(self, status_container):
self.status_container = status_container
def on_identify_perspective_start(self, **kwargs):
self.status_container.info(
"Start identifying different perspectives for researching the topic."
)
def on_identify_perspective_end(self, perspectives: list[str], **kwargs):
perspective_list = "\n- ".join(perspectives)
self.status_container.success(
f"Finish identifying perspectives. Will now start gathering information"
f" from the following perspectives:\n- {perspective_list}"
)
def on_information_gathering_start(self, **kwargs):
self.status_container.info("Start browsing the Internet.")
def on_dialogue_turn_end(self, dlg_turn, **kwargs):
urls = list(set([r.url for r in dlg_turn.search_results]))
for url in urls:
self.status_container.markdown(
f"""
<style>
.small-font {{
font-size: 14px;
margin: 0px;
padding: 0px;
}}
</style>
<div class="small-font">Finish browsing <a href="{url}" class="small-font" target="_blank">{url}</a>.</div>
""",
unsafe_allow_html=True,
)
def on_information_gathering_end(self, **kwargs):
self.status_container.success("Finish collecting information.")
def on_information_organization_start(self, **kwargs):
self.status_container.info(
"Start organizing information into a hierarchical outline."
)
def on_direct_outline_generation_end(self, outline: str, **kwargs):
self.status_container.success(
f"Finish leveraging the internal knowledge of the large language model."
)
def on_outline_refinement_end(self, outline: str, **kwargs):
self.status_container.success(f"Finish leveraging the collected information.")
================================================
FILE: frontend/demo_light/pages_util/CreateNewArticle.py
================================================
import os
import time
import demo_util
import streamlit as st
from demo_util import (
DemoFileIOHelper,
DemoTextProcessingHelper,
DemoUIHelper,
truncate_filename,
)
def handle_not_started():
if st.session_state["page3_write_article_state"] == "not started":
_, search_form_column, _ = st.columns([2, 5, 2])
with search_form_column:
with st.form(key="search_form"):
# Text input for the search topic
DemoUIHelper.st_markdown_adjust_size(
content="Enter the topic you want to learn in depth:", font_size=18
)
st.session_state["page3_topic"] = st.text_input(
label="page3_topic", label_visibility="collapsed"
)
pass_appropriateness_check = True
# Submit button for the form
submit_button = st.form_submit_button(label="Research")
# only start new search when button is clicked, not started, or already finished previous one
if submit_button and st.session_state["page3_write_article_state"] in [
"not started",
"show results",
]:
if not st.session_state["page3_topic"].strip():
pass_appropriateness_check = False
st.session_state["page3_warning_message"] = (
"topic could not be empty"
)
st.session_state["page3_topic_name_cleaned"] = (
st.session_state["page3_topic"]
.replace(" ", "_")
.replace("/", "_")
)
st.session_state["page3_topic_name_truncated"] = truncate_filename(
st.session_state["page3_topic_name_cleaned"]
)
if not pass_appropriateness_check:
st.session_state["page3_write_article_state"] = "not started"
alert = st.warning(
st.session_state["page3_warning_message"], icon="⚠️"
)
time.sleep(5)
alert.empty()
else:
st.session_state["page3_write_article_state"] = "initiated"
def handle_initiated():
if st.session_state["page3_write_article_state"] == "initiated":
current_working_dir = os.path.join(demo_util.get_demo_dir(), "DEMO_WORKING_DIR")
if not os.path.exists(current_working_dir):
os.makedirs(current_working_dir)
if "runner" not in st.session_state:
demo_util.set_storm_runner()
st.session_state["page3_current_working_dir"] = current_working_dir
st.session_state["page3_write_article_state"] = "pre_writing"
def handle_pre_writing():
if st.session_state["page3_write_article_state"] == "pre_writing":
status = st.status(
"I am brain**STORM**ing now to research the topic. (This may take 2-3 minutes.)"
)
st_callback_handler = demo_util.StreamlitCallbackHandler(status)
with status:
# STORM main gen outline
st.session_state["runner"].run(
topic=st.session_state["page3_topic"],
do_research=True,
do_generate_outline=True,
do_generate_article=False,
do_polish_article=False,
callback_handler=st_callback_handler,
)
conversation_log_path = os.path.join(
st.session_state["page3_current_working_dir"],
st.session_state["page3_topic_name_truncated"],
"conversation_log.json",
)
demo_util._display_persona_conversations(
DemoFileIOHelper.read_json_file(conversation_log_path)
)
st.session_state["page3_write_article_state"] = "final_writing"
status.update(label="brain**STORM**ing complete!", state="complete")
def handle_final_writing():
if st.session_state["page3_write_article_state"] == "final_writing":
# polish final article
with st.status(
"Now I will connect the information I found for your reference. (This may take 4-5 minutes.)"
) as status:
st.info(
"Now I will connect the information I found for your reference. (This may take 4-5 minutes.)"
)
st.session_state["runner"].run(
topic=st.session_state["page3_topic"],
do_research=False,
do_generate_outline=False,
do_generate_article=True,
do_polish_article=True,
remove_duplicate=False,
)
# finish the session
st.session_state["runner"].post_run()
# update status bar
st.session_state["page3_write_article_state"] = "prepare_to_show_result"
status.update(label="information snythesis complete!", state="complete")
def handle_prepare_to_show_result():
if st.session_state["page3_write_article_state"] == "prepare_to_show_result":
_, show_result_col, _ = st.columns([4, 3, 4])
with show_result_col:
if st.button("show final article"):
st.session_state["page3_write_article_state"] = "completed"
st.rerun()
def handle_completed():
if st.session_state["page3_write_article_state"] == "completed":
# display polished article
current_working_dir_paths = DemoFileIOHelper.read_structure_to_dict(
st.session_state["page3_current_working_dir"]
)
current_article_file_path_dict = current_working_dir_paths[
st.session_state["page3_topic_name_truncated"]
]
demo_util.display_article_page(
selected_article_name=st.session_state["page3_topic_name_cleaned"],
selected_article_file_path_dict=current_article_file_path_dict,
show_title=True,
show_main_article=True,
)
def create_new_article_page():
demo_util.clear_other_page_session_state(page_index=3)
if "page3_write_article_state" not in st.session_state:
st.session_state["page3_write_article_state"] = "not started"
handle_not_started()
handle_initiated()
handle_pre_writing()
handle_final_writing()
handle_prepare_to_show_result()
handle_completed()
================================================
FILE: frontend/demo_light/pages_util/MyArticles.py
================================================
import os
import demo_util
import streamlit as st
from demo_util import DemoFileIOHelper, DemoUIHelper
from streamlit_card import card
# set page config and display title
def my_articles_page():
with st.sidebar:
_, return_button_col = st.columns([2, 5])
with return_button_col:
if st.button(
"Select another article",
disabled="page2_selected_my_article" not in st.session_state,
):
if "page2_selected_my_article" in st.session_state:
del st.session_state["page2_selected_my_article"]
st.rerun()
# sync my articles
if "page2_user_articles_file_path_dict" not in st.session_state:
local_dir = os.path.join(demo_util.get_demo_dir(), "DEMO_WORKING_DIR")
os.makedirs(local_dir, exist_ok=True)
st.session_state["page2_user_articles_file_path_dict"] = (
DemoFileIOHelper.read_structure_to_dict(local_dir)
)
# if no feature demo selected, display all featured articles as info cards
def article_card_setup(column_to_add, card_title, article_name):
with column_to_add:
cleaned_article_title = article_name.replace("_", " ")
hasClicked = card(
title=" / ".join(card_title),
text=article_name.replace("_", " "),
image=DemoFileIOHelper.read_image_as_base64(
os.path.join(demo_util.get_demo_dir(), "assets", "void.jpg")
),
styles=DemoUIHelper.get_article_card_UI_style(boarder_color="#9AD8E1"),
)
if hasClicked:
st.session_state["page2_selected_my_article"] = article_name
st.rerun()
if "page2_selected_my_article" not in st.session_state:
# display article cards
my_article_columns = st.columns(3)
if len(st.session_state["page2_user_articles_file_path_dict"]) > 0:
# get article names
article_names = sorted(
list(st.session_state["page2_user_articles_file_path_dict"].keys())
)
# configure pagination
pagination = st.container()
bottom_menu = st.columns((1, 4, 1, 1, 1))[1:-1]
with bottom_menu[2]:
batch_size = st.selectbox("Page Size", options=[24, 48, 72])
with bottom_menu[1]:
total_pages = (
int(len(article_names) / batch_size)
if int(len(article_names) / batch_size) > 0
else 1
)
current_page = st.number_input(
"Page", min_value=1, max_value=total_pages, step=1
)
with bottom_menu[0]:
st.markdown(f"Page **{current_page}** of **{total_pages}** ")
# show article cards
with pagination:
my_article_count = 0
start_index = (current_page - 1) * batch_size
end_index = min(current_page * batch_size, len(article_names))
for article_name in article_names[start_index:end_index]:
column_to_add = my_article_columns[my_article_count % 3]
my_article_count += 1
article_card_setup(
column_to_add=column_to_add,
card_title=["My Article"],
article_name=article_name,
)
else:
with my_article_columns[0]:
hasClicked = card(
title="Get started",
text="Start your first research!",
image=DemoFileIOHelper.read_image_as_base64(
os.path.join(demo_util.get_demo_dir(), "assets", "void.jpg")
),
styles=DemoUIHelper.get_article_card_UI_style(),
)
if hasClicked:
st.session_state.selected_page = 1
st.session_state["manual_selection_override"] = True
st.session_state["rerun_requested"] = True
st.rerun()
else:
selected_article_name = st.session_state["page2_selected_my_article"]
selected_article_file_path_dict = st.session_state[
"page2_user_articles_file_path_dict"
][selected_article_name]
demo_util.display_article_page(
selected_article_name=selected_article_name,
selected_article_file_path_dict=selected_article_file_path_dict,
show_title=True,
show_main_article=True,
)
================================================
FILE: frontend/demo_light/requirements.txt
================================================
streamlit==1.31.1
streamlit-card
markdown
unidecode
extra-streamlit-components==0.1.60
streamlit_extras
deprecation==2.1.0
st-pages==0.4.5
streamlit-float
streamlit-option-menu
================================================
FILE: frontend/demo_light/stoc.py
================================================
"""https://github.com/arnaudmiribel/stoc"""
import re
import streamlit as st
import unidecode
DISABLE_LINK_CSS = """
<style>
a.toc {
color: inherit;
text-decoration: none; /* no underline */
}
</style>"""
class stoc:
def __init__(self):
self.toc_items = list()
def h1(self, text: str, write: bool = True):
if write:
st.write(f"# {text}")
self.toc_items.append(("h1", text))
def h2(self, text: str, write: bool = True):
if write:
st.write(f"## {text}")
self.toc_items.append(("h2", text))
def h3(self, text: str, write: bool = True):
if write:
st.write(f"### {text}")
self.toc_items.append(("h3", text))
def toc(self, expander):
st.write(DISABLE_LINK_CSS, unsafe_allow_html=True)
# st.sidebar.caption("Table of contents")
if expander is None:
expander = st.sidebar.expander("**Table of contents**", expanded=True)
with expander:
with st.container(height=600, border=False):
markdown_toc = ""
for title_size, title in self.toc_items:
h = int(title_size.replace("h", ""))
markdown_toc += (
" " * 2 * h
+ "- "
+ f'<a href="#{normalize(title)}" class="toc"> {title}</a> \n'
)
# st.sidebar.write(markdown_toc, unsafe_allow_html=True)
st.write(markdown_toc, unsafe_allow_html=True)
@classmethod
def get_toc(cls, markdown_text: str, topic=""):
def increase_heading_depth_and_add_top_heading(markdown_text, new_top_heading):
lines = markdown_text.splitlines()
# Increase the depth of each heading by adding an extra '#'
increased_depth_lines = [
"#" + line if line.startswith("#") else line for line in lines
]
# Add the new top-level heading at the beginning
increased_depth_lines.insert(0, f"# {new_top_heading}")
# Re-join the modified lines back into a single string
modified_text = "\n".join(increased_depth_lines)
return modified_text
if topic:
markdown_text = increase_heading_depth_and_add_top_heading(
markdown_text, topic
)
toc = []
for line in markdown_text.splitlines():
if line.startswith("#"):
# Remove the '#' characters and strip leading/trailing spaces
heading_text = line.lstrip("#").strip()
# Create slug (lowercase, spaces to hyphens, remove non-alphanumeric characters)
slug = (
re.sub(r"[^a-zA-Z0-9\s-]", "", heading_text)
.lower()
.replace(" ", "-")
)
# Determine heading level for indentation
level = line.count("#") - 1
# Add to the table of contents
toc.append(" " * level + f"- [{heading_text}](#{slug})")
return "\n".join(toc)
@classmethod
def from_markdown(cls, text: str, expander=None):
self = cls()
for line in text.splitlines():
if line.startswith("###"):
self.h3(line[3:], write=False)
elif line.startswith("##"):
self.h2(line[2:], write=False)
elif line.startswith("#"):
self.h1(line[1:], write=False)
# customize markdown font size
custom_css = """
<style>
/* Adjust the font size for headings */
h1 { font-size: 28px; }
h2 { font-size: 24px; }
h3 { font-size: 22px; }
h4 { font-size: 20px; }
h5 { font-size: 18px; }
/* Adjust the font size for normal text */
p { font-size: 18px; }
</style>
"""
st.markdown(custom_css, unsafe_allow_html=True)
st.write(text)
self.toc(expander=expander)
def normalize(s):
"""
Normalize titles as valid HTML ids for anchors
>>> normalize("it's a test to spot how Things happ3n héhé")
"it-s-a-test-to-spot-how-things-happ3n-h-h"
"""
# Replace accents with "-"
s_wo_accents = unidecode.unidecode(s)
accents = [s for s in s if s not in s_wo_accents]
for accent in accents:
s = s.replace(accent, "-")
# Lowercase
s = s.lower()
# Keep only alphanum and remove "-" suffix if existing
normalized = (
"".join([char if char.isalnum() else "-" for char in s]).strip("-").lower()
)
return normalized
================================================
FILE: frontend/demo_light/storm.py
================================================
import os
script_dir = os.path.dirname(os.path.abspath(__file__))
wiki_root_dir = os.path.dirname(os.path.dirname(script_dir))
import demo_util
from pages_util import MyArticles, CreateNewArticle
from streamlit_float import *
from streamlit_option_menu import option_menu
def main():
global database
st.set_page_config(layout="wide")
if "first_run" not in st.session_state:
st.session_state["first_run"] = True
# set api keys from secrets
if st.session_state["first_run"]:
for key, value in st.secrets.items():
if type(value) == str:
os.environ[key] = value
# initialize session_state
if "selected_article_index" not in st.session_state:
st.session_state["selected_article_index"] = 0
if "selected_page" not in st.session_state:
st.session_state["selected_page"] = 0
if st.session_state.get("rerun_requested", False):
st.session_state["rerun_requested"] = False
st.rerun()
st.write(
"<style>div.block-container{padding-top:2rem;}</style>", unsafe_allow_html=True
)
menu_container = st.container()
with menu_container:
pages = ["My Articles", "Create New Article"]
styles = {
"container": {"padding": "0.2rem 0", "background-color": "#22222200"},
}
menu_selection = option_menu(
None,
pages,
icons=["house", "search"],
menu_icon="cast",
default_index=0,
orientation="horizontal",
manual_select=st.session_state.selected_page,
styles=styles,
key="menu_selection",
)
if st.session_state.get("manual_selection_override", False):
menu_selection = pages[st.session_state["selected_page"]]
st.session_state["manual_selection_override"] = False
st.session_state["selected_page"] = None
if menu_selection == "My Articles":
demo_util.clear_other_page_session_state(page_index=2)
MyArticles.my_articles_page()
elif menu_selection == "Create New Article":
demo_util.clear_other_page_session_state(page_index=3)
CreateNewArticle.create_new_article_page()
if __name__ == "__main__":
main()
================================================
FILE: knowledge_storm/__init__.py
================================================
from .storm_wiki import *
from .collaborative_storm import *
from .encoder import *
from .interface import *
from .lm import *
from .rm import *
from .utils import *
from .dataclass import *
__version__ = "1.1.0"
================================================
FILE: knowledge_storm/collaborative_storm/__init__.py
================================================
from .modules import *
from .engine import *
================================================
FILE: knowledge_storm/collaborative_storm/engine.py
================================================
import dspy
import os
from dataclasses import dataclass, field, asdict
from typing import List, Union, Literal, Optional, Dict
from .modules import collaborative_storm_utils as collaborative_storm_utils
from .modules.callback import BaseCallbackHandler
from .modules.co_storm_agents import (
SimulatedUser,
PureRAGAgent,
Moderator,
CoStormExpert,
)
from .modules.expert_generation import GenerateExpertModule
from .modules.warmstart_hierarchical_chat import WarmStartModule
from ..dataclass import ConversationTurn, KnowledgeBase
from ..encoder import Encoder
from ..interface import LMConfigs, Agent
from ..logging_wrapper import LoggingWrapper
from ..lm import LitellmModel
from ..rm import BingSearch
class CollaborativeStormLMConfigs(LMConfigs):
"""Configurations for LLM used in different parts of Co-STORM.
Given that different parts in Co-STORM framework have different complexity, we use different LLM configurations
to achieve a balance between quality and efficiency. If no specific configuration is provided, we use the default
setup in the paper.
"""
def __init__(self):
self.question_answering_lm = None
self.discourse_manage_lm = None
self.utterance_polishing_lm = None
self.warmstart_outline_gen_lm = None
self.question_asking_lm = None
self.knowledge_base_lm = None
def init(
self,
lm_type: Literal["openai", "azure", "together"],
temperature: Optional[float] = 1.0,
top_p: Optional[float] = 0.9,
):
if lm_type and lm_type == "openai":
openai_kwargs = {
"api_key": os.getenv("OPENAI_API_KEY"),
"temperature": temperature,
"top_p": top_p,
"api_base": None,
}
self.question_answering_lm = LitellmModel(
model="gpt-4o-2024-05-13", max_tokens=1000, **openai_kwargs
)
self.discourse_manage_lm = LitellmModel(
model="gpt-4o-2024-05-13", max_tokens=500, **openai_kwargs
)
self.utterance_polishing_lm = LitellmModel(
model="gpt-4o-2024-05-13", max_tokens=2000, **openai_kwargs
)
self.warmstart_outline_gen_lm = LitellmModel(
model="gpt-4-1106-preview", max_tokens=500, **openai_kwargs
)
self.question_asking_lm = LitellmModel(
model="gpt-4o-2024-05-13", max_tokens=300, **openai_kwargs
)
self.knowledge_base_lm = LitellmModel(
model="gpt-4o-2024-05-13", max_tokens=1000, **openai_kwargs
)
elif lm_type and lm_type == "azure":
azure_kwargs = {
"api_key": os.getenv("AZURE_API_KEY"),
"temperature": temperature,
"top_p": top_p,
"api_base": os.getenv("AZURE_API_BASE"),
"api_version": os.getenv("AZURE_API_VERSION"),
}
self.question_answering_lm = LitellmModel(
model="azure/gpt-4o", max_tokens=1000, **azure_kwargs, model_type="chat"
)
self.discourse_manage_lm = LitellmModel(
model="azure/gpt-4o", max_tokens=500, **azure_kwargs, model_type="chat"
)
self.utterance_polishing_lm = LitellmModel(
model="azure/gpt-4o", max_tokens=2000, **azure_kwargs, model_type="chat"
)
self.warmstart_outline_gen_lm = LitellmModel(
model="azure/gpt-4o", max_tokens=300, **azure_kwargs, model_type="chat"
)
self.question_asking_lm = LitellmModel(
model="azure/gpt-4o", max_tokens=300, **azure_kwargs, model_type="chat"
)
self.knowledge_base_lm = LitellmModel(
model="azure/gpt-4o", max_tokens=1000, **azure_kwargs, model_type="chat"
)
elif lm_type and lm_type == "together":
together_kwargs = {
"api_key": os.getenv("TOGETHER_API_KEY"),
"temperature": temperature,
"top_p": top_p,
}
self.question_answering_lm = LitellmModel(
model="together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
max_tokens=1000,
model_type="chat",
**together_kwargs,
)
self.discourse_manage_lm = LitellmModel(
model="together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
max_tokens=500,
model_type="chat",
**together_kwargs,
)
self.utterance_polishing_lm = LitellmModel(
model="together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
max_tokens=2000,
model_type="chat",
**together_kwargs,
)
self.warmstart_outline_gen_lm = LitellmModel(
model="together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
max_tokens=500,
model_type="chat",
**together_kwargs,
)
self.question_asking_lm = LitellmModel(
model="together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
max_tokens=300,
model_type="chat",
**together_kwargs,
)
self.knowledge_base_lm = LitellmModel(
model="together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
max_tokens=1000,
model_type="chat",
**together_kwargs,
)
else:
raise Exception(
"No valid OpenAI API provider is provided. Cannot use default LLM configurations."
)
def set_question_answering_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
self.question_answering_lm = model
def set_discourse_manage_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
self.discourse_manage_lm = model
def set_utterance_polishing_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
self.utterance_polishing_lm = model
def set_warmstart_outline_gen_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
self.warmstart_outline_gen_lm = model
def set_question_asking_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
self.question_asking_lm = model
def set_knowledge_base_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
self.knowledge_base_lm = model
def collect_and_reset_lm_usage(self):
lm_usage = {}
for attr_name in self.__dict__:
if "_lm" in attr_name and hasattr(
getattr(self, attr_name), "get_usage_and_reset"
):
usage = getattr(self, attr_name).get_usage_and_reset()
if any(
value["prompt_tokens"] != 0 or value["completion_tokens"] != 0
for value in usage.values()
):
lm_usage[attr_name] = usage
return lm_usage
def to_dict(self):
"""
Converts the CollaborativeStormLMConfigs instance to a dictionary representation.
Returns:
dict: The dictionary representation of the CollaborativeStormLMConfigs.
"""
config_dict = {}
for attr_name in self.__dict__:
config_dict[attr_name] = getattr(self, attr_name).kwargs
return config_dict
@dataclass
class RunnerArgument:
"""Arguments for controlling the STORM Wiki pipeline."""
topic: str = field(
metadata={"help": "Topic of discourse"},
)
retrieve_top_k: int = field(
default=10,
metadata={"help": "retrieve top k results for each query in retriever"},
)
max_search_queries: int = field(
default=2,
metadata={
"help": "Maximum number of search queries to consider for each question."
},
)
total_conv_turn: int = field(
default=20,
metadata={"help": "Maximum number turn in conversation."},
)
max_search_thread: int = field(
default=5,
metadata={"help": "Maximum number of parallel thread for retriever"},
)
max_search_queries_per_turn: int = field(
default=3,
metadata={"help": "Maximum number of search queries to consider in each turn."},
)
warmstart_max_num_experts: int = field(
default=3,
metadata={
"help": "Max number of experts in perspective guided QA in warm start process"
},
)
warmstart_max_turn_per_experts: int = field(
default=2,
metadata={"help": "Max number of turns per perspective in warm start process"},
)
warmstart_max_thread: int = field(
default=3,
metadata={
"help": "Max number thread for parallel perspective guided QA in warm start process"
},
)
max_thread_num: int = field(
default=10,
metadata={
"help": "Maximum number of threads to use. "
"Consider reducing it if keep getting 'Exceed rate limit' error when calling LM API."
},
)
max_num_round_table_experts: int = field(
default=2,
metadata={"help": "Max number of active experts in round table discussion."},
)
moderator_override_N_consecutive_answering_turn: int = field(
default=3,
metadata={
"help": "Number of consecutive experts answering turn before moderator override the conversation"
},
)
node_expansion_trigger_count: int = field(
default=10,
metadata={
"help": "Trigger node expansion for node that contain more than N snippets"
},
)
disable_moderator: bool = field(
default=False,
metadata={"help": "If True, disable moderator."},
)
disable_multi_experts: bool = field(
default=False,
metadata={"help": "If True, disable moderator."},
)
rag_only_baseline_mode: bool = field(
default=False,
metadata={"help": "If True, switch to rag online baseline mode"},
)
def to_dict(self):
"""
Converts the RunnerArgument instance to a dictionary representation.
Returns:
dict: The dictionary representation of the RunnerArgument.
"""
return asdict(self)
@classmethod
def from_dict(cls, data):
"""
Constructs a RunnerArgument instance from a dictionary representation.
Args:
data (dict): The dictionary representation of the RunnerArgument.
Returns:
RunnerArgument: The constructed RunnerArgument instance.
"""
return cls(**data)
@dataclass
class TurnPolicySpec:
"""
Represents the policy specifications for determining the behavior of a conversation turn.
Attributes:
should_reorganize_knowledge_base (bool):
A flag that indicates whether the knowledge base should be reorganized after the current turn.
should_update_experts_list (bool):
A flag that indicates whether the list of experts should be updated based on the conversation context.
should_polish_utterance (bool):
A flag that indicates whether the generated utterance should be polished (e.g., refined or rephrase
gitextract_m_9bgqy0/ ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ └── bug_report.md │ └── workflows/ │ ├── format-check.yml │ └── python-package.yml ├── .gitignore ├── .pre-commit-config.yaml ├── CONTRIBUTING.md ├── LICENSE ├── MANIFEST.in ├── README.md ├── examples/ │ ├── costorm_examples/ │ │ └── run_costorm_gpt.py │ └── storm_examples/ │ ├── README.md │ ├── helper/ │ │ └── process_kaggle_arxiv_abstract_dataset.py │ ├── run_storm_wiki_claude.py │ ├── run_storm_wiki_deepseek.py │ ├── run_storm_wiki_gemini.py │ ├── run_storm_wiki_gpt.py │ ├── run_storm_wiki_gpt_with_VectorRM.py │ ├── run_storm_wiki_groq.py │ ├── run_storm_wiki_mistral.py │ ├── run_storm_wiki_ollama.py │ ├── run_storm_wiki_ollama_with_searxng.py │ └── run_storm_wiki_serper.py ├── frontend/ │ └── demo_light/ │ ├── .streamlit/ │ │ └── config.toml │ ├── README.md │ ├── demo_util.py │ ├── pages_util/ │ │ ├── CreateNewArticle.py │ │ └── MyArticles.py │ ├── requirements.txt │ ├── stoc.py │ └── storm.py ├── knowledge_storm/ │ ├── __init__.py │ ├── collaborative_storm/ │ │ ├── __init__.py │ │ ├── engine.py │ │ └── modules/ │ │ ├── __init__.py │ │ ├── article_generation.py │ │ ├── callback.py │ │ ├── co_storm_agents.py │ │ ├── collaborative_storm_utils.py │ │ ├── costorm_expert_utterance_generator.py │ │ ├── expert_generation.py │ │ ├── grounded_question_answering.py │ │ ├── grounded_question_generation.py │ │ ├── information_insertion_module.py │ │ ├── knowledge_base_summary.py │ │ ├── simulate_user.py │ │ └── warmstart_hierarchical_chat.py │ ├── dataclass.py │ ├── encoder.py │ ├── interface.py │ ├── lm.py │ ├── logging_wrapper.py │ ├── rm.py │ ├── storm_wiki/ │ │ ├── __init__.py │ │ ├── engine.py │ │ └── modules/ │ │ ├── __init__.py │ │ ├── article_generation.py │ │ ├── article_polish.py │ │ ├── callback.py │ │ ├── knowledge_curation.py │ │ ├── outline_generation.py │ │ ├── persona_generator.py │ │ ├── retriever.py │ │ └── storm_dataclass.py │ └── utils.py ├── requirements.txt └── setup.py
SYMBOL INDEX (633 symbols across 45 files)
FILE: examples/costorm_examples/run_costorm_gpt.py
function main (line 41) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_claude.py
function main (line 40) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_deepseek.py
function sanitize_topic (line 43) | def sanitize_topic(topic):
function main (line 61) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_gemini.py
function main (line 40) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_gpt.py
function main (line 44) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py
function main (line 42) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_groq.py
function sanitize_topic (line 45) | def sanitize_topic(topic):
function main (line 63) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_mistral.py
function main (line 42) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_ollama.py
function main (line 43) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_ollama_with_searxng.py
function main (line 16) | def main(args):
FILE: examples/storm_examples/run_storm_wiki_serper.py
function main (line 32) | def main(args):
FILE: frontend/demo_light/demo_util.py
class DemoFileIOHelper (line 28) | class DemoFileIOHelper:
method read_structure_to_dict (line 30) | def read_structure_to_dict(articles_root_path):
method read_txt_file (line 56) | def read_txt_file(file_path):
method read_json_file (line 70) | def read_json_file(file_path):
method read_image_as_base64 (line 86) | def read_image_as_base64(image_path):
method set_file_modification_time (line 106) | def set_file_modification_time(file_path, modification_time_string):
method get_latest_modification_time (line 124) | def get_latest_modification_time(path):
method assemble_article_data (line 169) | def assemble_article_data(article_file_path_dict):
class DemoTextProcessingHelper (line 219) | class DemoTextProcessingHelper:
method remove_citations (line 221) | def remove_citations(sent):
method parse_conversation_history (line 229) | def parse_conversation_history(json_data):
method parse (line 263) | def parse(text):
method add_markdown_indentation (line 269) | def add_markdown_indentation(input_string):
method get_current_time_string (line 286) | def get_current_time_string():
method compare_time_strings (line 299) | def compare_time_strings(
method add_inline_citation_link (line 321) | def add_inline_citation_link(article_text, citation_dict):
method generate_html_toc (line 335) | def generate_html_toc(md_text):
method construct_bibliography_from_url_to_info (line 348) | def construct_bibliography_from_url_to_info(url_to_info):
class DemoUIHelper (line 362) | class DemoUIHelper:
method st_markdown_adjust_size (line 363) | def st_markdown_adjust_size(content, font_size=20):
method get_article_card_UI_style (line 372) | def get_article_card_UI_style(boarder_color="#9AD8E1"):
method customize_toast_css_style (line 409) | def customize_toast_css_style():
method article_markdown_to_html (line 434) | def article_markdown_to_html(article_title, article_content):
function _construct_citation_dict_from_search_result (line 458) | def _construct_citation_dict_from_search_result(search_results):
function _display_main_article_text (line 471) | def _display_main_article_text(article_text, citation_dict, table_conten...
function _display_references (line 488) | def _display_references(citation_dict):
function _display_persona_conversations (line 502) | def _display_persona_conversations(conversation_log):
function _display_main_article (line 526) | def _display_main_article(
function get_demo_dir (line 560) | def get_demo_dir():
function clear_other_page_session_state (line 564) | def clear_other_page_session_state(page_index: Optional[int]):
function set_storm_runner (line 577) | def set_storm_runner():
function display_article_page (line 611) | def display_article_page(
class StreamlitCallbackHandler (line 627) | class StreamlitCallbackHandler(BaseCallbackHandler):
method __init__ (line 628) | def __init__(self, status_container):
method on_identify_perspective_start (line 631) | def on_identify_perspective_start(self, **kwargs):
method on_identify_perspective_end (line 636) | def on_identify_perspective_end(self, perspectives: list[str], **kwargs):
method on_information_gathering_start (line 643) | def on_information_gathering_start(self, **kwargs):
method on_dialogue_turn_end (line 646) | def on_dialogue_turn_end(self, dlg_turn, **kwargs):
method on_information_gathering_end (line 663) | def on_information_gathering_end(self, **kwargs):
method on_information_organization_start (line 666) | def on_information_organization_start(self, **kwargs):
method on_direct_outline_generation_end (line 671) | def on_direct_outline_generation_end(self, outline: str, **kwargs):
method on_outline_refinement_end (line 676) | def on_outline_refinement_end(self, outline: str, **kwargs):
FILE: frontend/demo_light/pages_util/CreateNewArticle.py
function handle_not_started (line 14) | def handle_not_started():
function handle_initiated (line 60) | def handle_initiated():
function handle_pre_writing (line 72) | def handle_pre_writing():
function handle_final_writing (line 100) | def handle_final_writing():
function handle_prepare_to_show_result (line 125) | def handle_prepare_to_show_result():
function handle_completed (line 134) | def handle_completed():
function create_new_article_page (line 151) | def create_new_article_page():
FILE: frontend/demo_light/pages_util/MyArticles.py
function my_articles_page (line 10) | def my_articles_page():
FILE: frontend/demo_light/stoc.py
class stoc (line 17) | class stoc:
method __init__ (line 18) | def __init__(self):
method h1 (line 21) | def h1(self, text: str, write: bool = True):
method h2 (line 26) | def h2(self, text: str, write: bool = True):
method h3 (line 31) | def h3(self, text: str, write: bool = True):
method toc (line 36) | def toc(self, expander):
method get_toc (line 55) | def get_toc(cls, markdown_text: str, topic=""):
method from_markdown (line 90) | def from_markdown(cls, text: str, expander=None):
function normalize (line 118) | def normalize(s):
FILE: frontend/demo_light/storm.py
function main (line 12) | def main():
FILE: knowledge_storm/collaborative_storm/engine.py
class CollaborativeStormLMConfigs (line 24) | class CollaborativeStormLMConfigs(LMConfigs):
method __init__ (line 32) | def __init__(self):
method init (line 40) | def init(
method set_question_answering_lm (line 144) | def set_question_answering_lm(self, model: Union[dspy.dsp.LM, dspy.dsp...
method set_discourse_manage_lm (line 147) | def set_discourse_manage_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.H...
method set_utterance_polishing_lm (line 150) | def set_utterance_polishing_lm(self, model: Union[dspy.dsp.LM, dspy.ds...
method set_warmstart_outline_gen_lm (line 153) | def set_warmstart_outline_gen_lm(self, model: Union[dspy.dsp.LM, dspy....
method set_question_asking_lm (line 156) | def set_question_asking_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HF...
method set_knowledge_base_lm (line 159) | def set_knowledge_base_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFM...
method collect_and_reset_lm_usage (line 162) | def collect_and_reset_lm_usage(self):
method to_dict (line 176) | def to_dict(self):
class RunnerArgument (line 190) | class RunnerArgument:
method to_dict (line 270) | def to_dict(self):
method from_dict (line 280) | def from_dict(cls, data):
class TurnPolicySpec (line 294) | class TurnPolicySpec:
class DiscourseManager (line 319) | class DiscourseManager:
method __init__ (line 320) | def __init__(
method serialize_experts (line 383) | def serialize_experts(self) -> List[Dict]:
method deserialize_experts (line 393) | def deserialize_experts(self, data: List[Dict]):
method _should_generate_question (line 408) | def _should_generate_question(
method _parse_expert_names_to_agent (line 425) | def _parse_expert_names_to_agent(self, expert_descriptions: Union[str,...
method _update_expert_list_from_utterance (line 446) | def _update_expert_list_from_utterance(self, focus: str, background_in...
method _is_last_turn_questioning (line 455) | def _is_last_turn_questioning(self, conversation_history: List[Convers...
method get_next_turn_policy (line 461) | def get_next_turn_policy(
class CoStormRunner (line 505) | class CoStormRunner:
method __init__ (line 506) | def __init__(
method to_dict (line 540) | def to_dict(self):
method from_dict (line 555) | def from_dict(cls, data, callback_handler: BaseCallbackHandler = None):
method warm_start (line 582) | def warm_start(self):
method generate_report (line 642) | def generate_report(self) -> str:
method dump_logging_and_reset (line 658) | def dump_logging_and_reset(self):
method step (line 661) | def step(
FILE: knowledge_storm/collaborative_storm/modules/article_generation.py
class ArticleGenerationModule (line 9) | class ArticleGenerationModule(dspy.Module):
method __init__ (line 12) | def __init__(
method _get_cited_information_string (line 20) | def _get_cited_information_string(
method gen_section (line 39) | def gen_section(
method forward (line 64) | def forward(self, knowledge_base: KnowledgeBase):
class WriteSection (line 110) | class WriteSection(dspy.Signature):
FILE: knowledge_storm/collaborative_storm/modules/callback.py
class BaseCallbackHandler (line 5) | class BaseCallbackHandler:
method on_turn_policy_planning_start (line 8) | def on_turn_policy_planning_start(self, **kwargs):
method on_expert_action_planning_start (line 12) | def on_expert_action_planning_start(self, **kwargs):
method on_expert_action_planning_end (line 16) | def on_expert_action_planning_end(self, **kwargs):
method on_expert_information_collection_start (line 20) | def on_expert_information_collection_start(self, **kwargs):
method on_expert_information_collection_end (line 24) | def on_expert_information_collection_end(self, info: List[Information]...
method on_expert_utterance_generation_end (line 28) | def on_expert_utterance_generation_end(self, **kwargs):
method on_expert_utterance_polishing_start (line 32) | def on_expert_utterance_polishing_start(self, **kwargs):
method on_mindmap_insert_start (line 36) | def on_mindmap_insert_start(self, **kwargs):
method on_mindmap_insert_end (line 40) | def on_mindmap_insert_end(self, **kwargs):
method on_mindmap_reorg_start (line 44) | def on_mindmap_reorg_start(self, **kwargs):
method on_expert_list_update_start (line 48) | def on_expert_list_update_start(self, **kwargs):
method on_article_generation_start (line 52) | def on_article_generation_start(self, **kwargs):
method on_warmstart_update (line 56) | def on_warmstart_update(self, message, **kwargs):
class LocalConsolePrintCallBackHandler (line 61) | class LocalConsolePrintCallBackHandler(BaseCallbackHandler):
method __init__ (line 62) | def __init__(self):
method on_turn_policy_planning_start (line 65) | def on_turn_policy_planning_start(self, **kwargs):
method on_expert_action_planning_start (line 69) | def on_expert_action_planning_start(self, **kwargs):
method on_expert_information_collection_start (line 73) | def on_expert_information_collection_start(self, **kwargs):
method on_expert_information_collection_end (line 77) | def on_expert_information_collection_end(self, info: List[Information]...
method on_expert_utterance_generation_end (line 84) | def on_expert_utterance_generation_end(self, **kwargs):
method on_expert_utterance_polishing_start (line 88) | def on_expert_utterance_polishing_start(self, **kwargs):
method on_mindmap_insert_start (line 92) | def on_mindmap_insert_start(self, **kwargs):
method on_mindmap_insert_end (line 96) | def on_mindmap_insert_end(self, **kwargs):
method on_mindmap_reorg_start (line 100) | def on_mindmap_reorg_start(self, **kwargs):
method on_expert_list_update_start (line 104) | def on_expert_list_update_start(self, **kwargs):
method on_warmstart_update (line 108) | def on_warmstart_update(self, message, **kwargs):
FILE: knowledge_storm/collaborative_storm/modules/co_storm_agents.py
class CoStormExpert (line 24) | class CoStormExpert(Agent):
method __init__ (line 42) | def __init__(
method _get_costorm_expert_utterance_generator (line 62) | def _get_costorm_expert_utterance_generator(
method generate_utterance (line 78) | def generate_utterance(
class SimulatedUser (line 110) | class SimulatedUser(Agent):
method __init__ (line 118) | def __init__(
method generate_utterance (line 139) | def generate_utterance(
class Moderator (line 159) | class Moderator(Agent):
method __init__ (line 169) | def __init__(
method _get_conv_turn_unused_information (line 190) | def _get_conv_turn_unused_information(
method _get_sorted_unused_snippets (line 248) | def _get_sorted_unused_snippets(
method generate_utterance (line 285) | def generate_utterance(
class PureRAGAgent (line 314) | class PureRAGAgent(Agent):
method __init__ (line 322) | def __init__(
method _gen_utterance_from_question (line 344) | def _gen_utterance_from_question(self, question: str):
method generate_topic_background (line 362) | def generate_topic_background(self):
method generate_utterance (line 365) | def generate_utterance(
FILE: knowledge_storm/collaborative_storm/modules/collaborative_storm_utils.py
function extract_storm_info_snippet (line 15) | def extract_storm_info_snippet(info: Information, snippet_index: int) ->...
function format_search_results (line 36) | def format_search_results(
function extract_cited_storm_info (line 86) | def extract_cited_storm_info(
function trim_output_after_hint (line 108) | def trim_output_after_hint(response: str, hint: str) -> str:
function separate_citations (line 125) | def separate_citations(text: str) -> str:
function extract_and_remove_citations (line 146) | def extract_and_remove_citations(text: str) -> Tuple[str, List[int]]:
function keep_first_and_last_paragraph (line 171) | def keep_first_and_last_paragraph(text: str) -> str:
function clean_up_section (line 194) | def clean_up_section(text):
function load_api_key (line 228) | def load_api_key(toml_file_path):
function _get_answer_question_module_instance (line 243) | def _get_answer_question_module_instance(
FILE: knowledge_storm/collaborative_storm/modules/costorm_expert_utterance_generator.py
class GenExpertActionPlanning (line 17) | class GenExpertActionPlanning(dspy.Signature):
class CoStormExpertUtteranceGenerationModule (line 42) | class CoStormExpertUtteranceGenerationModule(dspy.Module):
method __init__ (line 43) | def __init__(
method parse_action (line 59) | def parse_action(self, action):
method polish_utterance (line 73) | def polish_utterance(
method forward (line 103) | def forward(
FILE: knowledge_storm/collaborative_storm/modules/expert_generation.py
class GenerateExpertGeneral (line 6) | class GenerateExpertGeneral(dspy.Signature):
class GenerateExpertWithFocus (line 24) | class GenerateExpertWithFocus(dspy.Signature):
class GenerateExpertModule (line 43) | class GenerateExpertModule(dspy.Module):
method __init__ (line 44) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method trim_background (line 49) | def trim_background(self, background: str, max_words: int = 100):
method forward (line 58) | def forward(
FILE: knowledge_storm/collaborative_storm/modules/grounded_question_answering.py
class QuestionToQuery (line 16) | class QuestionToQuery(dspy.Signature):
class AnswerQuestion (line 32) | class AnswerQuestion(dspy.Signature):
class AnswerQuestionModule (line 50) | class AnswerQuestionModule(dspy.Module):
method __init__ (line 51) | def __init__(
method retrieve_information (line 66) | def retrieve_information(self, topic, question):
method forward (line 92) | def forward(
FILE: knowledge_storm/collaborative_storm/modules/grounded_question_generation.py
class KnowledgeBaseSummmary (line 23) | class KnowledgeBaseSummmary(dspy.Signature):
class ConvertUtteranceStyle (line 33) | class ConvertUtteranceStyle(dspy.Signature):
class GroundedQuestionGeneration (line 55) | class GroundedQuestionGeneration(dspy.Signature):
class GroundedQuestionGenerationModule (line 74) | class GroundedQuestionGenerationModule(dspy.Module):
method __init__ (line 75) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 81) | def forward(
FILE: knowledge_storm/collaborative_storm/modules/information_insertion_module.py
class InsertInformation (line 16) | class InsertInformation(dspy.Signature):
class InsertInformationCandidateChoice (line 39) | class InsertInformationCandidateChoice(dspy.Signature):
class InsertInformationModule (line 53) | class InsertInformationModule(dspy.Module):
method __init__ (line 54) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel], encod...
method _construct_intent (line 60) | def _construct_intent(self, question: str, query: str):
method _get_navigation_choice (line 72) | def _get_navigation_choice(
method layer_by_layer_navigation_placement (line 108) | def layer_by_layer_navigation_placement(
method _get_sorted_embed_sim_section (line 149) | def _get_sorted_embed_sim_section(
method _parse_selected_index (line 165) | def _parse_selected_index(self, string: str):
method choose_candidate_from_embedding_ranking (line 175) | def choose_candidate_from_embedding_ranking(
method _info_list_to_intent_mapping (line 213) | def _info_list_to_intent_mapping(self, information_list: List[Informat...
method forward (line 221) | def forward(
class ExpandSection (line 316) | class ExpandSection(dspy.Signature):
class ExpandNodeModule (line 334) | class ExpandNodeModule(dspy.Module):
method __init__ (line 335) | def __init__(
method _get_cited_info_meta_string (line 346) | def _get_cited_info_meta_string(self, node, knowledge_base):
method _get_expand_subnode_names (line 355) | def _get_expand_subnode_names(self, node, knowledge_base):
method _find_first_node_to_expand (line 373) | def _find_first_node_to_expand(
method _expand_node (line 391) | def _expand_node(self, node: KnowledgeNode, knowledge_base: KnowledgeB...
method forward (line 415) | def forward(self, knowledge_base: KnowledgeBase):
FILE: knowledge_storm/collaborative_storm/modules/knowledge_base_summary.py
class KnowledgeBaseSummmary (line 6) | class KnowledgeBaseSummmary(dspy.Signature):
class KnowledgeBaseSummaryModule (line 16) | class KnowledgeBaseSummaryModule(dspy.Module):
method __init__ (line 17) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 21) | def forward(self, knowledge_base: KnowledgeBase):
FILE: knowledge_storm/collaborative_storm/modules/simulate_user.py
class GenSimulatedUserUtterance (line 9) | class GenSimulatedUserUtterance(dspy.Module):
method __init__ (line 10) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method gen_conv_history_string (line 14) | def gen_conv_history_string(self, conversation_turns: List[Conversatio...
method forward (line 30) | def forward(self, topic: str, intent: str, conv_history: List[Conversa...
FILE: knowledge_storm/collaborative_storm/modules/warmstart_hierarchical_chat.py
class WarmStartModerator (line 31) | class WarmStartModerator(dspy.Signature):
class SectionToConvTranscript (line 50) | class SectionToConvTranscript(dspy.Signature):
class ReportToConversation (line 70) | class ReportToConversation(dspy.Module):
method __init__ (line 71) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 75) | def forward(self, knowledge_base: KnowledgeBase):
class WarmStartConversation (line 126) | class WarmStartConversation(dspy.Module):
method __init__ (line 127) | def __init__(
method format_dialogue_question_history_string (line 148) | def format_dialogue_question_history_string(
method generate_warmstart_experts (line 157) | def generate_warmstart_experts(self, topic: str):
method get_background_info (line 167) | def get_background_info(self, topic: str):
method forward (line 183) | def forward(self, topic: str):
class GenerateWarmStartOutline (line 259) | class GenerateWarmStartOutline(dspy.Signature):
class GenerateWarmStartOutlineModule (line 279) | class GenerateWarmStartOutlineModule(dspy.Module):
method __init__ (line 280) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method extract_questions_and_queries (line 285) | def extract_questions_and_queries(self, conv: List[ConversationTurn]):
method get_draft_outline (line 297) | def get_draft_outline(self, topic: str):
method forward (line 301) | def forward(self, topic: str, conv: List[ConversationTurn]):
class WarmStartModule (line 312) | class WarmStartModule:
method __init__ (line 313) | def __init__(
method initiate_warm_start (line 346) | def initiate_warm_start(self, topic: str, knowledge_base: KnowledgeBase):
FILE: knowledge_storm/dataclass.py
class ConversationTurn (line 11) | class ConversationTurn:
method __init__ (line 27) | def __init__(
method get_all_citation_index (line 50) | def get_all_citation_index(self):
method to_dict (line 54) | def to_dict(self):
method from_dict (line 69) | def from_dict(cls, conv_turn_dict: Dict):
class KnowledgeNode (line 86) | class KnowledgeNode:
method __init__ (line 97) | def __init__(
method collect_all_content (line 121) | def collect_all_content(self):
method has_child (line 133) | def has_child(self, child_node_name: str):
method add_child (line 139) | def add_child(self, child_node_name: str, duplicate_handling: str = "s...
method get_parent (line 157) | def get_parent(self):
method get_children (line 166) | def get_children(self):
method get_children_names (line 175) | def get_children_names(self):
method __repr__ (line 181) | def __repr__(self):
method get_path_from_root (line 190) | def get_path_from_root(self, root: Optional["KnowledgeNode"] = None):
method insert_information (line 206) | def insert_information(self, information_index: int):
method get_all_descendents (line 211) | def get_all_descendents(self) -> List["KnowledgeNode"]:
method get_all_predecessors (line 228) | def get_all_predecessors(self) -> List["KnowledgeNode"]:
method to_dict (line 242) | def to_dict(self):
method from_dict (line 259) | def from_dict(cls, data):
class KnowledgeBase (line 291) | class KnowledgeBase:
method __init__ (line 308) | def __init__(
method to_dict (line 362) | def to_dict(self):
method from_dict (line 374) | def from_dict(
method get_knowledge_base_structure_embedding (line 399) | def get_knowledge_base_structure_embedding(
method traverse_down (line 425) | def traverse_down(self, node):
method traverse_up (line 445) | def traverse_up(self, node):
method collect_all_nodes (line 461) | def collect_all_nodes(self):
method insert_node (line 472) | def insert_node(
method find_node (line 495) | def find_node(self, current_node, node_name):
method insert_from_outline_string (line 514) | def insert_from_outline_string(self, outline_string, duplicate_handlin...
method get_node_hierarchy_string (line 540) | def get_node_hierarchy_string(
method find_node_by_path (line 638) | def find_node_by_path(
method insert_information (line 680) | def insert_information(
method trim_empty_leaf_nodes (line 715) | def trim_empty_leaf_nodes(self):
method get_all_leaf_nodes (line 734) | def get_all_leaf_nodes(self):
method merge_single_child_nodes (line 752) | def merge_single_child_nodes(self):
method update_all_info_path (line 773) | def update_all_info_path(self):
method update_from_conv_turn (line 784) | def update_from_conv_turn(
method get_knowledge_base_summary (line 825) | def get_knowledge_base_summary(self):
method reorganize (line 828) | def reorganize(self):
method to_report (line 848) | def to_report(self):
FILE: knowledge_storm/encoder.py
class LitellmPlaceholder (line 27) | class LitellmPlaceholder:
method __getattr__ (line 28) | def __getattr__(self, _):
class Encoder (line 36) | class Encoder:
method __init__ (line 57) | def __init__(
method get_total_token_usage (line 97) | def get_total_token_usage(self, reset: bool = False) -> int:
method encode (line 112) | def encode(self, texts: Union[str, List[str]], max_workers: int = 5) -...
method _get_single_text_embedding (line 124) | def _get_single_text_embedding(self, text):
method _get_text_embeddings (line 132) | def _get_text_embeddings(
FILE: knowledge_storm/interface.py
class InformationTable (line 23) | class InformationTable(ABC):
method __init__ (line 33) | def __init__(self):
method retrieve_information (line 37) | def retrieve_information(**kwargs):
class Information (line 41) | class Information:
method __init__ (line 54) | def __init__(self, url, description, snippets, title, meta=None):
method __hash__ (line 70) | def __hash__(self):
method __eq__ (line 78) | def __eq__(self, other):
method __hash__ (line 87) | def __hash__(self):
method _meta_str (line 93) | def _meta_str(self):
method _md5_hash (line 97) | def _md5_hash(self, value):
method from_dict (line 104) | def from_dict(cls, info_dict):
method to_dict (line 125) | def to_dict(self):
class ArticleSectionNode (line 136) | class ArticleSectionNode:
method __init__ (line 142) | def __init__(self, section_name: str, content=None):
method add_child (line 152) | def add_child(self, new_child_node, insert_to_front=False):
method remove_child (line 158) | def remove_child(self, child):
class Article (line 162) | class Article(ABC):
method __init__ (line 163) | def __init__(self, topic_name):
method find_section (line 166) | def find_section(
method to_string (line 188) | def to_string(self) -> str:
method get_outline_tree (line 193) | def get_outline_tree(self):
method get_first_level_section_names (line 232) | def get_first_level_section_names(self) -> List[str]:
method from_string (line 240) | def from_string(cls, topic_name: str, article_text: str):
method prune_empty_nodes (line 246) | def prune_empty_nodes(self, node=None):
class Retriever (line 260) | class Retriever:
method __init__ (line 269) | def __init__(self, rm: dspy.Retrieve, max_thread: int = 1):
method collect_and_reset_rm_usage (line 273) | def collect_and_reset_rm_usage(self):
method retrieve (line 288) | def retrieve(
class KnowledgeCurationModule (line 322) | class KnowledgeCurationModule(ABC):
method __init__ (line 327) | def __init__(self, retriever: Retriever):
method research (line 334) | def research(self, topic) -> InformationTable:
class OutlineGenerationModule (line 347) | class OutlineGenerationModule(ABC):
method generate_outline (line 354) | def generate_outline(
class ArticleGenerationModule (line 372) | class ArticleGenerationModule(ABC):
method generate_article (line 379) | def generate_article(
class ArticlePolishingModule (line 395) | class ArticlePolishingModule(ABC):
method polish_article (line 402) | def polish_article(self, topic: str, draft_article: Article, **kwargs)...
function log_execution_time (line 411) | def log_execution_time(func):
class LMConfigs (line 427) | class LMConfigs(ABC):
method __init__ (line 433) | def __init__(self):
method init_check (line 436) | def init_check(self):
method collect_and_reset_lm_history (line 443) | def collect_and_reset_lm_history(self):
method collect_and_reset_lm_usage (line 452) | def collect_and_reset_lm_usage(self):
method log (line 475) | def log(self):
class Engine (line 485) | class Engine(ABC):
method __init__ (line 486) | def __init__(self, lm_configs: LMConfigs):
method log_execution_time_and_lm_rm_usage (line 492) | def log_execution_time_and_lm_rm_usage(self, func):
method apply_decorators (line 512) | def apply_decorators(self):
method run_knowledge_curation_module (line 525) | def run_knowledge_curation_module(self, **kwargs) -> Optional[Informat...
method run_outline_generation_module (line 529) | def run_outline_generation_module(self, **kwarg) -> Article:
method run_article_generation_module (line 533) | def run_article_generation_module(self, **kwarg) -> Article:
method run_article_polishing_module (line 537) | def run_article_polishing_module(self, **kwarg) -> Article:
method run (line 541) | def run(self, **kwargs):
method summary (line 544) | def summary(self):
method reset (line 559) | def reset(self):
class Agent (line 565) | class Agent(ABC):
method __init__ (line 590) | def __init__(self, topic: str, role_name: str, role_description: str):
method get_role_description (line 595) | def get_role_description(self):
method generate_utterance (line 601) | def generate_utterance(
FILE: knowledge_storm/lm.py
class LM (line 57) | class LM:
method __init__ (line 58) | def __init__(
method __call__ (line 78) | def __call__(self, prompt=None, messages=None, **kwargs):
method inspect_history (line 111) | def inspect_history(self, n: int = 1):
function cached_litellm_completion (line 116) | def cached_litellm_completion(request):
function litellm_completion (line 120) | def litellm_completion(request, cache={"no-cache": True, "no-store": Tru...
function cached_litellm_text_completion (line 126) | def cached_litellm_text_completion(request):
function litellm_text_completion (line 132) | def litellm_text_completion(request, cache={"no-cache": True, "no-store"...
function _green (line 158) | def _green(text: str, end: str = "\n"):
function _red (line 162) | def _red(text: str, end: str = "\n"):
function _inspect_history (line 166) | def _inspect_history(lm, n: int = 1):
class LitellmModel (line 192) | class LitellmModel(LM):
method __init__ (line 198) | def __init__(
method log_usage (line 210) | def log_usage(self, response):
method get_usage_and_reset (line 218) | def get_usage_and_reset(self):
method __call__ (line 233) | def __call__(self, prompt=None, messages=None, **kwargs):
class OpenAIModel (line 276) | class OpenAIModel(dspy.OpenAI):
method __init__ (line 279) | def __init__(
method log_usage (line 291) | def log_usage(self, response):
method get_usage_and_reset (line 299) | def get_usage_and_reset(self):
method __call__ (line 313) | def __call__(
class DeepSeekModel (line 366) | class DeepSeekModel(dspy.OpenAI):
method __init__ (line 369) | def __init__(
method log_usage (line 388) | def log_usage(self, response):
method get_usage_and_reset (line 396) | def get_usage_and_reset(self):
method _create_completion (line 415) | def _create_completion(self, prompt: str, **kwargs):
method __call__ (line 432) | def __call__(
class AzureOpenAIModel (line 461) | class AzureOpenAIModel(dspy.LM):
method __init__ (line 467) | def __init__(
method basic_request (line 510) | def basic_request(self, prompt: str, **kwargs) -> Any:
method _get_choice_text (line 538) | def _get_choice_text(self, choice: Any) -> str:
method log_usage (line 544) | def log_usage(self, response):
method get_usage_and_reset (line 552) | def get_usage_and_reset(self):
method __call__ (line 564) | def __call__(
class GroqModel (line 595) | class GroqModel(dspy.OpenAI):
method __init__ (line 598) | def __init__(
method log_usage (line 617) | def log_usage(self, response):
method get_usage_and_reset (line 625) | def get_usage_and_reset(self):
method _create_completion (line 644) | def _create_completion(self, prompt: str, **kwargs):
method __call__ (line 680) | def __call__(
class ClaudeModel (line 709) | class ClaudeModel(dspy.dsp.modules.lm.LM):
method __init__ (line 712) | def __init__(
method log_usage (line 749) | def log_usage(self, response):
method get_usage_and_reset (line 757) | def get_usage_and_reset(self):
method basic_request (line 770) | def basic_request(self, prompt: str, **kwargs):
method request (line 811) | def request(self, prompt: str, **kwargs):
method __call__ (line 815) | def __call__(self, prompt, only_completed=True, return_sorted=False, *...
class VLLMClient (line 845) | class VLLMClient(dspy.dsp.LM):
method __init__ (line 851) | def __init__(
method basic_request (line 873) | def basic_request(self, prompt, **kwargs):
method request (line 886) | def request(self, prompt: str, **kwargs):
method log_usage (line 889) | def log_usage(self, response):
method get_usage_and_reset (line 897) | def get_usage_and_reset(self):
method __call__ (line 911) | def __call__(self, prompt: str, **kwargs):
class OllamaClient (line 935) | class OllamaClient(dspy.OllamaLocal):
method __init__ (line 938) | def __init__(self, model, port, url="http://localhost", **kwargs):
class TGIClient (line 948) | class TGIClient(dspy.HFClientTGI):
method __init__ (line 949) | def __init__(self, model, port, url, http_request_kwargs=None, **kwargs):
method _generate (line 958) | def _generate(self, prompt, **kwargs):
class TogetherClient (line 1010) | class TogetherClient(dspy.HFModel):
method __init__ (line 1013) | def __init__(
method log_usage (line 1067) | def log_usage(self, response):
method get_usage_and_reset (line 1075) | def get_usage_and_reset(self):
method _generate (line 1094) | def _generate(self, prompt, **kwargs):
class GoogleModel (line 1159) | class GoogleModel(dspy.dsp.modules.lm.LM):
method __init__ (line 1162) | def __init__(
method log_usage (line 1210) | def log_usage(self, response):
method get_usage_and_reset (line 1218) | def get_usage_and_reset(self):
method basic_request (line 1231) | def basic_request(self, prompt: str, **kwargs):
method request (line 1261) | def request(self, prompt: str, **kwargs):
method __call__ (line 1265) | def __call__(
FILE: knowledge_storm/logging_wrapper.py
class EventLog (line 10) | class EventLog:
method __init__ (line 11) | def __init__(self, event_name):
method record_start_time (line 17) | def record_start_time(self):
method record_end_time (line 22) | def record_end_time(self):
method get_total_time (line 27) | def get_total_time(self):
method get_start_time (line 32) | def get_start_time(self):
method get_end_time (line 40) | def get_end_time(self):
method add_child_event (line 48) | def add_child_event(self, child_event):
method get_child_events (line 51) | def get_child_events(self):
class LoggingWrapper (line 55) | class LoggingWrapper:
method __init__ (line 56) | def __init__(self, lm_config):
method _pipeline_stage_start (line 63) | def _pipeline_stage_start(self, pipeline_stage: str):
method _event_start (line 78) | def _event_start(self, event_name: str):
method _event_end (line 116) | def _event_end(self, event_name: str):
method _pipeline_stage_end (line 143) | def _pipeline_stage_end(self):
method add_query_count (line 155) | def add_query_count(self, count):
method log_event (line 164) | def log_event(self, event_name):
method log_pipeline_stage (line 173) | def log_pipeline_stage(self, pipeline_stage):
method dump_logging_and_reset (line 192) | def dump_logging_and_reset(self, reset_logging=True):
FILE: knowledge_storm/rm.py
class YouRM (line 13) | class YouRM(dspy.Retrieve):
method __init__ (line 14) | def __init__(self, ydc_api_key=None, k=3, is_valid_source: Callable = ...
method get_usage_and_reset (line 32) | def get_usage_and_reset(self):
method forward (line 38) | def forward(
class BingSearch (line 77) | class BingSearch(dspy.Retrieve):
method __init__ (line 78) | def __init__(
method get_usage_and_reset (line 122) | def get_usage_and_reset(self):
method forward (line 128) | def forward(
class VectorRM (line 179) | class VectorRM(dspy.Retrieve):
method __init__ (line 191) | def __init__(
method _check_collection (line 228) | def _check_collection(self):
method init_online_vector_db (line 250) | def init_online_vector_db(self, url: str, api_key: str):
method init_offline_vector_db (line 273) | def init_offline_vector_db(self, vector_store_path: str):
method get_usage_and_reset (line 291) | def get_usage_and_reset(self):
method get_vector_count (line 297) | def get_vector_count(self):
method forward (line 306) | def forward(self, query_or_queries: Union[str, List[str]], exclude_url...
class StanfordOvalArxivRM (line 340) | class StanfordOvalArxivRM(dspy.Retrieve):
method __init__ (line 343) | def __init__(self, endpoint, k=3, rerank=True):
method get_usage_and_reset (line 349) | def get_usage_and_reset(self):
method _retrieve (line 355) | def _retrieve(self, query: str):
method forward (line 387) | def forward(
class SerperRM (line 406) | class SerperRM(dspy.Retrieve):
method __init__ (line 409) | def __init__(
method serper_runner (line 466) | def serper_runner(self, query_params):
method get_usage_and_reset (line 485) | def get_usage_and_reset(self):
method forward (line 490) | def forward(self, query_or_queries: Union[str, List[str]], exclude_url...
class BraveRM (line 570) | class BraveRM(dspy.Retrieve):
method __init__ (line 571) | def __init__(
method get_usage_and_reset (line 591) | def get_usage_and_reset(self):
method forward (line 597) | def forward(
class SearXNG (line 644) | class SearXNG(dspy.Retrieve):
method __init__ (line 645) | def __init__(
method get_usage_and_reset (line 674) | def get_usage_and_reset(self):
method forward (line 679) | def forward(
class DuckDuckGoSearchRM (line 728) | class DuckDuckGoSearchRM(dspy.Retrieve):
method __init__ (line 731) | def __init__(
method get_usage_and_reset (line 783) | def get_usage_and_reset(self):
method request (line 796) | def request(self, query: str):
method forward (line 802) | def forward(
class TavilySearchRM (line 858) | class TavilySearchRM(dspy.Retrieve):
method __init__ (line 861) | def __init__(
method get_usage_and_reset (line 915) | def get_usage_and_reset(self):
method forward (line 920) | def forward(
class GoogleSearch (line 984) | class GoogleSearch(dspy.Retrieve):
method __init__ (line 985) | def __init__(
method get_usage_and_reset (line 1043) | def get_usage_and_reset(self):
method forward (line 1048) | def forward(
class AzureAISearch (line 1108) | class AzureAISearch(dspy.Retrieve):
method __init__ (line 1115) | def __init__(
method get_usage_and_reset (line 1184) | def get_usage_and_reset(self):
method forward (line 1190) | def forward(
FILE: knowledge_storm/storm_wiki/engine.py
class STORMWikiLMConfigs (line 21) | class STORMWikiLMConfigs(LMConfigs):
method __init__ (line 29) | def __init__(self):
method init_openai_model (line 38) | def init_openai_model(
method set_conv_simulator_lm (line 111) | def set_conv_simulator_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFM...
method set_question_asker_lm (line 114) | def set_question_asker_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFM...
method set_outline_gen_lm (line 117) | def set_outline_gen_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFMode...
method set_article_gen_lm (line 120) | def set_article_gen_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFMode...
method set_article_polish_lm (line 123) | def set_article_polish_lm(self, model: Union[dspy.dsp.LM, dspy.dsp.HFM...
class STORMWikiRunnerArguments (line 128) | class STORMWikiRunnerArguments:
class STORMWikiRunner (line 171) | class STORMWikiRunner(Engine):
method __init__ (line 174) | def __init__(
method run_knowledge_curation_module (line 211) | def run_knowledge_curation_module(
method run_outline_generation_module (line 237) | def run_outline_generation_module(
method run_article_generation_module (line 256) | def run_article_generation_module(
method run_article_polishing_module (line 276) | def run_article_polishing_module(
method post_run (line 290) | def post_run(self):
method _load_information_table_from_local_fs (line 312) | def _load_information_table_from_local_fs(self, information_table_loca...
method _load_outline_from_local_fs (line 320) | def _load_outline_from_local_fs(self, topic, outline_local_path):
method _load_draft_article_from_local_fs (line 326) | def _load_draft_article_from_local_fs(
method run (line 341) | def run(
FILE: knowledge_storm/storm_wiki/modules/article_generation.py
class StormArticleGenerationModule (line 15) | class StormArticleGenerationModule(ArticleGenerationModule):
method __init__ (line 21) | def __init__(
method generate_section (line 33) | def generate_section(
method generate_article (line 53) | def generate_article(
class ConvToSection (line 136) | class ConvToSection(dspy.Module):
method __init__ (line 139) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 144) | def forward(
class WriteSection (line 162) | class WriteSection(dspy.Signature):
FILE: knowledge_storm/storm_wiki/modules/article_polish.py
class StormArticlePolishingModule (line 11) | class StormArticlePolishingModule(ArticlePolishingModule):
method __init__ (line 17) | def __init__(
method polish_article (line 29) | def polish_article(
class WriteLeadSection (line 56) | class WriteLeadSection(dspy.Signature):
class PolishPage (line 68) | class PolishPage(dspy.Signature):
class PolishPageModule (line 75) | class PolishPageModule(dspy.Module):
method __init__ (line 76) | def __init__(
method forward (line 87) | def forward(self, topic: str, draft_page: str, polish_whole_page: bool...
FILE: knowledge_storm/storm_wiki/modules/callback.py
class BaseCallbackHandler (line 1) | class BaseCallbackHandler:
method on_identify_perspective_start (line 4) | def on_identify_perspective_start(self, **kwargs):
method on_identify_perspective_end (line 8) | def on_identify_perspective_end(self, perspectives: list[str], **kwargs):
method on_information_gathering_start (line 12) | def on_information_gathering_start(self, **kwargs):
method on_dialogue_turn_end (line 16) | def on_dialogue_turn_end(self, dlg_turn, **kwargs):
method on_information_gathering_end (line 20) | def on_information_gathering_end(self, **kwargs):
method on_information_organization_start (line 24) | def on_information_organization_start(self, **kwargs):
method on_direct_outline_generation_end (line 28) | def on_direct_outline_generation_end(self, outline: str, **kwargs):
method on_outline_refinement_end (line 32) | def on_outline_refinement_end(self, outline: str, **kwargs):
FILE: knowledge_storm/storm_wiki/modules/knowledge_curation.py
class ConvSimulator (line 25) | class ConvSimulator(dspy.Module):
method __init__ (line 28) | def __init__(
method forward (line 47) | def forward(
class WikiWriter (line 84) | class WikiWriter(dspy.Module):
method __init__ (line 89) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 95) | def forward(
class AskQuestion (line 128) | class AskQuestion(dspy.Signature):
class AskQuestionWithPersona (line 139) | class AskQuestionWithPersona(dspy.Signature):
class QuestionToQuery (line 154) | class QuestionToQuery(dspy.Signature):
class AnswerQuestion (line 167) | class AnswerQuestion(dspy.Signature):
class TopicExpert (line 181) | class TopicExpert(dspy.Module):
method __init__ (line 189) | def __init__(
method forward (line 204) | def forward(self, topic: str, question: str, ground_truth_url: str):
class StormKnowledgeCurationModule (line 247) | class StormKnowledgeCurationModule(KnowledgeCurationModule):
method __init__ (line 252) | def __init__(
method _get_considered_personas (line 281) | def _get_considered_personas(self, topic: str, max_num_persona) -> Lis...
method _run_conversation (line 286) | def _run_conversation(
method research (line 347) | def research(
FILE: knowledge_storm/storm_wiki/modules/outline_generation.py
class StormOutlineGenerationModule (line 11) | class StormOutlineGenerationModule(OutlineGenerationModule):
method __init__ (line 17) | def __init__(self, outline_gen_lm: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method generate_outline (line 22) | def generate_outline(
class WriteOutline (line 75) | class WriteOutline(dspy.Module):
method __init__ (line 78) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 84) | def forward(
class WritePageOutline (line 128) | class WritePageOutline(dspy.Signature):
class NaiveOutlineGen (line 140) | class NaiveOutlineGen(dspy.Module):
method __init__ (line 143) | def __init__(self):
method forward (line 147) | def forward(self, topic: str):
class WritePageOutlineFromConv (line 153) | class WritePageOutlineFromConv(dspy.Signature):
FILE: knowledge_storm/storm_wiki/modules/persona_generator.py
function get_wiki_page_title_and_toc (line 10) | def get_wiki_page_title_and_toc(url):
class FindRelatedTopic (line 48) | class FindRelatedTopic(dspy.Signature):
class GenPersona (line 56) | class GenPersona(dspy.Signature):
class CreateWriterWithPersona (line 68) | class CreateWriterWithPersona(dspy.Module):
method __init__ (line 71) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method forward (line 77) | def forward(self, topic: str, draft=None):
class StormPersonaGenerator (line 114) | class StormPersonaGenerator:
method __init__ (line 131) | def __init__(self, engine: Union[dspy.dsp.LM, dspy.dsp.HFModel]):
method generate_persona (line 134) | def generate_persona(self, topic: str, max_num_persona: int = 3) -> Li...
FILE: knowledge_storm/storm_wiki/modules/retriever.py
function is_valid_wikipedia_source (line 225) | def is_valid_wikipedia_source(url):
FILE: knowledge_storm/storm_wiki/modules/storm_dataclass.py
class DialogueTurn (line 14) | class DialogueTurn:
method __init__ (line 15) | def __init__(
method log (line 34) | def log(self):
class StormInformationTable (line 48) | class StormInformationTable(InformationTable):
method __init__ (line 58) | def __init__(self, conversations=List[Tuple[str, List[DialogueTurn]]]):
method construct_url_to_info (line 66) | def construct_url_to_info(
method construct_log_dict (line 83) | def construct_log_dict(
method dump_url_to_info (line 93) | def dump_url_to_info(self, path):
method from_conversation_log_file (line 100) | def from_conversation_log_file(cls, path):
method prepare_table_for_retrieval (line 109) | def prepare_table_for_retrieval(self):
method retrieve_information (line 119) | def retrieve_information(
class StormArticle (line 148) | class StormArticle(Article):
method __init__ (line 149) | def __init__(self, topic_name):
method find_section (line 153) | def find_section(
method _merge_new_info_to_references (line 174) | def _merge_new_info_to_references(
method insert_or_create_section (line 209) | def insert_or_create_section(
method update_section (line 249) | def update_section(
method get_outline_as_list (line 301) | def get_outline_as_list(
method to_string (line 352) | def to_string(self) -> str:
method reorder_reference_index (line 374) | def reorder_reference_index(self):
method get_outline_tree (line 414) | def get_outline_tree(self):
method get_first_level_section_names (line 423) | def get_first_level_section_names(self) -> List[str]:
method from_outline_file (line 430) | def from_outline_file(cls, topic: str, file_path: str):
method from_outline_str (line 438) | def from_outline_str(cls, topic: str, outline_str: str):
method dump_outline_to_file (line 476) | def dump_outline_to_file(self, file_path):
method dump_reference_to_file (line 480) | def dump_reference_to_file(self, file_path):
method dump_article_as_plain_text (line 486) | def dump_article_as_plain_text(self, file_path):
method from_string (line 491) | def from_string(cls, topic_name: str, article_text: str, references: d...
method post_processing (line 502) | def post_processing(self):
FILE: knowledge_storm/utils.py
function truncate_filename (line 23) | def truncate_filename(filename, max_length=125):
function load_api_key (line 41) | def load_api_key(toml_file_path):
function makeStringRed (line 56) | def makeStringRed(message):
class QdrantVectorStoreManager (line 60) | class QdrantVectorStoreManager:
method _check_create_collection (line 69) | def _check_create_collection(
method _init_online_vector_db (line 103) | def _init_online_vector_db(
method _init_offline_vector_db (line 130) | def _init_offline_vector_db(
method create_or_update_vector_store (line 152) | def create_or_update_vector_store(
class ArticleTextProcessing (line 301) | class ArticleTextProcessing:
method limit_word_count_preserve_newline (line 303) | def limit_word_count_preserve_newline(input_string, max_word_count):
method remove_citations (line 337) | def remove_citations(s):
method parse_citation_indices (line 353) | def parse_citation_indices(s):
method remove_uncompleted_sentences_with_citations (line 367) | def remove_uncompleted_sentences_with_citations(text):
method clean_up_citation (line 428) | def clean_up_citation(conv):
method clean_up_outline (line 457) | def clean_up_outline(outline, topic=""):
method clean_up_section (line 506) | def clean_up_section(text):
method update_citation_index (line 541) | def update_citation_index(s, citation_map):
method parse_article_into_dict (line 553) | def parse_article_into_dict(input_string):
class FileIOHelper (line 597) | class FileIOHelper:
method dump_json (line 599) | def dump_json(obj, file_name, encoding="utf-8"):
method handle_non_serializable (line 604) | def handle_non_serializable(obj):
method load_json (line 608) | def load_json(file_name, encoding="utf-8"):
method write_str (line 613) | def write_str(s, path):
method load_str (line 618) | def load_str(path):
method dump_pickle (line 623) | def dump_pickle(obj, path):
method load_pickle (line 628) | def load_pickle(path):
class WebPageHelper (line 633) | class WebPageHelper:
method __init__ (line 639) | def __init__(
method download_webpage (line 674) | def download_webpage(self, url: str):
method urls_to_articles (line 684) | def urls_to_articles(self, urls: List[str]) -> Dict:
method urls_to_snippets (line 706) | def urls_to_snippets(self, urls: List[str]) -> Dict:
function user_input_appropriateness_check (line 714) | def user_input_appropriateness_check(user_input):
function purpose_appropriateness_check (line 769) | def purpose_appropriateness_check(user_input):
Condensed preview — 66 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (626K chars).
[
{
"path": ".github/ISSUE_TEMPLATE/bug_report.md",
"chars": 492,
"preview": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: \"[BUG]\"\nlabels: ''\nassignees: ''\n\n---\n\n**Describe "
},
{
"path": ".github/workflows/format-check.yml",
"chars": 307,
"preview": "name: Check Python formatting with Black\n\non:\n pull_request:\n branches:\n - main\n\njobs:\n lint:\n runs-on: ubu"
},
{
"path": ".github/workflows/python-package.yml",
"chars": 1471,
"preview": "name: Build and upload Python package\n\non:\n workflow_dispatch: # Allows manual triggering of the workflow\n\njobs:\n bui"
},
{
"path": ".gitignore",
"chars": 251,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# mac\n.DS_Store\n\n# Other\n.vscode\n*.tsv\n*.pt\ng"
},
{
"path": ".pre-commit-config.yaml",
"chars": 244,
"preview": "repos:\n - repo: https://github.com/psf/black\n rev: 24.8.0\n hooks:\n - id: black\n name: Format Python c"
},
{
"path": "CONTRIBUTING.md",
"chars": 2395,
"preview": "# Contributing\n\nThank you for your interest in contributing to STORM! \n\nContributions aren't just about code. Currently "
},
{
"path": "LICENSE",
"chars": 1091,
"preview": "MIT License\n\nCopyright (c) 2024 Stanford Open Virtual Assistant Lab\n\nPermission is hereby granted, free of charge, to an"
},
{
"path": "MANIFEST.in",
"chars": 58,
"preview": "include requirements.txt\ninclude LICENSE\ninclude README.md"
},
{
"path": "README.md",
"chars": 20081,
"preview": "<p align=\"center\">\n <img src=\"assets/logo.svg\" style=\"width: 25%; height: auto;\">\n</p>\n\n# STORM: Synthesis of Topic Out"
},
{
"path": "examples/costorm_examples/run_costorm_gpt.py",
"chars": 11470,
"preview": "\"\"\"\nCo-STORM pipeline powered by GPT-4o/4o-mini and Bing search engine.\nYou need to set up the following environment var"
},
{
"path": "examples/storm_examples/README.md",
"chars": 6101,
"preview": "# Examples\n\nWe host a number of example scripts for various customization of STORM (e.g., use your favorite language mod"
},
{
"path": "examples/storm_examples/helper/process_kaggle_arxiv_abstract_dataset.py",
"chars": 1126,
"preview": "\"\"\"Process `arxiv_data_210930-054931.csv` \nfrom https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts\nto a c"
},
{
"path": "examples/storm_examples/run_storm_wiki_claude.py",
"chars": 7950,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by Claude family models and You.com search engine.\nYou need to set up the following envi"
},
{
"path": "examples/storm_examples/run_storm_wiki_deepseek.py",
"chars": 9265,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by DeepSeek models and You.com or Bing search engine.\nYou need to set up the following e"
},
{
"path": "examples/storm_examples/run_storm_wiki_gemini.py",
"chars": 8163,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by Google Gemini models and search engine.\nYou need to set up the following environment "
},
{
"path": "examples/storm_examples/run_storm_wiki_gpt.py",
"chars": 9060,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by GPT-3.5/4 and You.com search engine.\nYou need to set up the following environment var"
},
{
"path": "examples/storm_examples/run_storm_wiki_gpt_with_VectorRM.py",
"chars": 10628,
"preview": "\"\"\"\nThis STORM Wiki pipeline powered by GPT-3.5/4 and local retrieval model that uses Qdrant.\nYou need to set up the fol"
},
{
"path": "examples/storm_examples/run_storm_wiki_groq.py",
"chars": 8982,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by llama3-70b-8192 hosted by Groq server and You.com search engine.\nYou need to set up t"
},
{
"path": "examples/storm_examples/run_storm_wiki_mistral.py",
"chars": 11889,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by Mistral-7B-Instruct-v0.2 hosted by VLLM server and You.com search engine.\nYou need to"
},
{
"path": "examples/storm_examples/run_storm_wiki_ollama.py",
"chars": 12002,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by local model hosted by Ollama server and You.com or Bing search engine.\nYou need to se"
},
{
"path": "examples/storm_examples/run_storm_wiki_ollama_with_searxng.py",
"chars": 8302,
"preview": "import os\nfrom argparse import ArgumentParser\n\nfrom dspy import Example\n\nfrom knowledge_storm import (\n STORMWikiRunn"
},
{
"path": "examples/storm_examples/run_storm_wiki_serper.py",
"chars": 6540,
"preview": "\"\"\"\nSTORM Wiki pipeline powered by Claude family models and serper search engine.\nYou need to set up the following envir"
},
{
"path": "frontend/demo_light/.streamlit/config.toml",
"chars": 198,
"preview": "[client]\nshowErrorDetails = false\ntoolbarMode = \"minimal\"\n\n[theme]\nprimaryColor = \"#F63366\"\nbackgroundColor = \"#FFFFFF\"\n"
},
{
"path": "frontend/demo_light/README.md",
"chars": 1669,
"preview": "# STORM Minimal User Interface\n\nThis is a minimal user interface for `STORMWikiRunner` which includes the following feat"
},
{
"path": "frontend/demo_light/demo_util.py",
"chars": 25314,
"preview": "import base64\nimport datetime\nimport json\nimport os\nimport re\nfrom typing import Optional\n\nimport markdown\nimport pytz\ni"
},
{
"path": "frontend/demo_light/pages_util/CreateNewArticle.py",
"chars": 6590,
"preview": "import os\nimport time\n\nimport demo_util\nimport streamlit as st\nfrom demo_util import (\n DemoFileIOHelper,\n DemoTex"
},
{
"path": "frontend/demo_light/pages_util/MyArticles.py",
"chars": 4683,
"preview": "import os\n\nimport demo_util\nimport streamlit as st\nfrom demo_util import DemoFileIOHelper, DemoUIHelper\nfrom streamlit_c"
},
{
"path": "frontend/demo_light/requirements.txt",
"chars": 176,
"preview": "streamlit==1.31.1\nstreamlit-card\nmarkdown\nunidecode\nextra-streamlit-components==0.1.60\nstreamlit_extras\ndeprecation==2.1"
},
{
"path": "frontend/demo_light/stoc.py",
"chars": 4712,
"preview": "\"\"\"https://github.com/arnaudmiribel/stoc\"\"\"\n\nimport re\n\nimport streamlit as st\nimport unidecode\n\nDISABLE_LINK_CSS = \"\"\"\n"
},
{
"path": "frontend/demo_light/storm.py",
"chars": 2295,
"preview": "import os\n\nscript_dir = os.path.dirname(os.path.abspath(__file__))\nwiki_root_dir = os.path.dirname(os.path.dirname(scrip"
},
{
"path": "knowledge_storm/__init__.py",
"chars": 214,
"preview": "from .storm_wiki import *\nfrom .collaborative_storm import *\nfrom .encoder import *\nfrom .interface import *\nfrom .lm im"
},
{
"path": "knowledge_storm/collaborative_storm/__init__.py",
"chars": 45,
"preview": "from .modules import *\nfrom .engine import *\n"
},
{
"path": "knowledge_storm/collaborative_storm/engine.py",
"chars": 32330,
"preview": "import dspy\nimport os\nfrom dataclasses import dataclass, field, asdict\nfrom typing import List, Union, Literal, Optional"
},
{
"path": "knowledge_storm/collaborative_storm/modules/__init__.py",
"chars": 325,
"preview": "from .article_generation import *\nfrom .grounded_question_answering import *\nfrom .grounded_question_generation import *"
},
{
"path": "knowledge_storm/collaborative_storm/modules/article_generation.py",
"chars": 5360,
"preview": "import dspy\r\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\r\nfrom typing import Set, Union\r\n\r\nfrom .col"
},
{
"path": "knowledge_storm/collaborative_storm/modules/callback.py",
"chars": 5274,
"preview": "from typing import List\nfrom ...interface import Information\n\n\nclass BaseCallbackHandler:\n \"\"\"Base callback handler t"
},
{
"path": "knowledge_storm/collaborative_storm/modules/co_storm_agents.py",
"chars": 15817,
"preview": "import dspy\nfrom itertools import zip_longest\nimport numpy as np\nfrom sklearn.metrics.pairwise import cosine_similarity\n"
},
{
"path": "knowledge_storm/collaborative_storm/modules/collaborative_storm_utils.py",
"chars": 9149,
"preview": "import dspy\r\nimport os\r\nimport re\r\nimport sys\r\nimport toml\r\nfrom typing import List, Tuple, Dict, Optional, TYPE_CHECKIN"
},
{
"path": "knowledge_storm/collaborative_storm/modules/costorm_expert_utterance_generator.py",
"chars": 7095,
"preview": "import dspy\nfrom typing import Union\n\nfrom .callback import BaseCallbackHandler\nfrom .collaborative_storm_utils import ("
},
{
"path": "knowledge_storm/collaborative_storm/modules/expert_generation.py",
"chars": 4337,
"preview": "import dspy\r\nimport re\r\nfrom typing import Union\r\n\r\n\r\nclass GenerateExpertGeneral(dspy.Signature):\r\n \"\"\"You need to s"
},
{
"path": "knowledge_storm/collaborative_storm/modules/grounded_question_answering.py",
"chars": 7594,
"preview": "import dspy\nfrom typing import Union, List\n\nfrom .callback import BaseCallbackHandler\nfrom .collaborative_storm_utils im"
},
{
"path": "knowledge_storm/collaborative_storm/modules/grounded_question_generation.py",
"chars": 5491,
"preview": "\"\"\"\r\nThis module handles question generation within the Co-STORM framework, specifically designed to support the Moderat"
},
{
"path": "knowledge_storm/collaborative_storm/modules/information_insertion_module.py",
"chars": 18794,
"preview": "import dspy\r\nimport numpy as np\r\nimport re\r\nimport traceback\r\n\r\nfrom concurrent.futures import ThreadPoolExecutor, as_co"
},
{
"path": "knowledge_storm/collaborative_storm/modules/knowledge_base_summary.py",
"chars": 1297,
"preview": "import dspy\nfrom typing import Union\nfrom ...dataclass import KnowledgeBase\n\n\nclass KnowledgeBaseSummmary(dspy.Signature"
},
{
"path": "knowledge_storm/collaborative_storm/modules/simulate_user.py",
"chars": 1525,
"preview": "import dspy\nfrom typing import List, Union\n\nfrom .collaborative_storm_utils import extract_and_remove_citations\nfrom ..."
},
{
"path": "knowledge_storm/collaborative_storm/modules/warmstart_hierarchical_chat.py",
"chars": 19437,
"preview": "\"\"\"\r\nWarm starts the Co-STORM system by conducting a background information search to establish a shared conceptual spac"
},
{
"path": "knowledge_storm/dataclass.py",
"chars": 32572,
"preview": "import dspy\nimport numpy as np\nimport re\nimport threading\nfrom typing import Set, Dict, List, Optional, Union, Tuple\n\nfr"
},
{
"path": "knowledge_storm/encoder.py",
"chars": 6812,
"preview": "import os\r\nimport numpy as np\r\n\r\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\r\nfrom typing import Lis"
},
{
"path": "knowledge_storm/interface.py",
"chars": 21021,
"preview": "import concurrent.futures\nimport dspy\nimport functools\nimport hashlib\nimport json\nimport logging\nimport time\nfrom abc im"
},
{
"path": "knowledge_storm/lm.py",
"chars": 42725,
"preview": "import backoff\nimport dspy\nimport functools\nimport logging\nimport os\nimport random\nimport requests\nimport threading\nfrom"
},
{
"path": "knowledge_storm/logging_wrapper.py",
"chars": 7707,
"preview": "from contextlib import contextmanager\nimport time\nimport pytz\nfrom datetime import datetime\n\n# Define California timezon"
},
{
"path": "knowledge_storm/rm.py",
"chars": 47638,
"preview": "import logging\nimport os\nfrom typing import Callable, Union, List\n\nimport backoff\nimport dspy\nimport requests\nfrom dsp i"
},
{
"path": "knowledge_storm/storm_wiki/__init__.py",
"chars": 45,
"preview": "from .engine import *\nfrom .modules import *\n"
},
{
"path": "knowledge_storm/storm_wiki/engine.py",
"chars": 17988,
"preview": "import json\nimport logging\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import Union, Literal, Optiona"
},
{
"path": "knowledge_storm/storm_wiki/modules/__init__.py",
"chars": 123,
"preview": "from .knowledge_curation import *\nfrom .persona_generator import *\nfrom .retriever import *\nfrom .storm_dataclass import"
},
{
"path": "knowledge_storm/storm_wiki/modules/article_generation.py",
"chars": 7385,
"preview": "import concurrent.futures\nimport copy\nimport logging\nfrom concurrent.futures import as_completed\nfrom typing import List"
},
{
"path": "knowledge_storm/storm_wiki/modules/article_polish.py",
"chars": 4752,
"preview": "import copy\nfrom typing import Union\n\nimport dspy\n\nfrom .storm_dataclass import StormArticle\nfrom ...interface import Ar"
},
{
"path": "knowledge_storm/storm_wiki/modules/callback.py",
"chars": 1219,
"preview": "class BaseCallbackHandler:\n \"\"\"Base callback handler that can be used to handle callbacks from the STORM pipeline.\"\"\""
},
{
"path": "knowledge_storm/storm_wiki/modules/knowledge_curation.py",
"chars": 16316,
"preview": "import concurrent.futures\nimport logging\nimport os\nfrom concurrent.futures import as_completed\nfrom typing import Union,"
},
{
"path": "knowledge_storm/storm_wiki/modules/outline_generation.py",
"chars": 7352,
"preview": "from typing import Union, Optional, Tuple\n\nimport dspy\n\nfrom .callback import BaseCallbackHandler\nfrom .storm_dataclass "
},
{
"path": "knowledge_storm/storm_wiki/modules/persona_generator.py",
"chars": 6621,
"preview": "import logging\nimport re\nfrom typing import Union, List\n\nimport dspy\nimport requests\nfrom bs4 import BeautifulSoup\n\n\ndef"
},
{
"path": "knowledge_storm/storm_wiki/modules/retriever.py",
"chars": 4973,
"preview": "from typing import Union, List\nfrom urllib.parse import urlparse\n\nimport dspy\n\nfrom ...interface import Retriever, Infor"
},
{
"path": "knowledge_storm/storm_wiki/modules/storm_dataclass.py",
"chars": 19508,
"preview": "import copy\nimport re\nfrom collections import OrderedDict\nfrom typing import Union, Optional, Any, List, Tuple, Dict\n\nim"
},
{
"path": "knowledge_storm/utils.py",
"chars": 31781,
"preview": "import concurrent.futures\nimport dspy\nimport httpx\nimport json\nimport logging\nimport os\nimport pickle\nimport re\nimport r"
},
{
"path": "requirements.txt",
"chars": 172,
"preview": "dspy_ai==2.4.9\nwikipedia==1.4.0\nsentence-transformers\ntoml\nlangchain-text-splitters\ntrafilatura\nlangchain-huggingface\nqd"
},
{
"path": "setup.py",
"chars": 1267,
"preview": "import re\n\nfrom setuptools import setup, find_packages\n\n# Read the content of the README file\nwith open(\"README.md\", enc"
}
]
About this extraction
This page contains the full source code of the stanford-oval/storm GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 66 files (581.6 KB), approximately 128.4k tokens, and a symbol index with 633 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.